How to Determine a String's Encoding/Encryption Type

=Start=

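Background:

This post grew out of a recent case involving transformed (encrypted or encoded) data. It is a brief record of the thinking and experiments that followed, kept for future reference.

This is an entry-level form of steganography. Client-side transformations of this kind vary endlessly and cannot be enumerated exhaustively, so this is only a thought exercise. Doing data security well requires a stronger "shift left" mindset.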

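Main text:

Suggested approach:

Ciphey turned out to be only moderately useful in practice, probably because what it does differs from what I wanted. I wanted it to tell me quickly which encoding or encryption had been applied to an input string or file (to help decide what to do next), whereas by its own description it automatically decrypts, decodes, and cracks hashes.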

# Install Ciphey on macOS
brew install ciphey

# Three ways to run Ciphey

1. File input

ciphey -f encrypted.txt

2. Unqualified input

ciphey -- "Encrypted input"

3. Normal way

ciphey -t "Encrypted input"

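Reading the file in Python and checking whether its content consists of printable characters does not work well either. Some people store encoded data as strings and others as bytes. For strings the check tells you little, because the data has usually already been through base64 or a similar encoding and is therefore entirely printable; for bytes the check cannot tell you which transformation was applied.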
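
To illustrate why the printable-character check says so little, here is a minimal sketch. The helper name `is_printable_text` and the sample inputs are made up for this illustration. It shows that base64 output of arbitrary binary data passes the check just like ordinary text does, so the check cannot separate the two.

```python
import base64


def is_printable_text(data: bytes) -> bool:
    """Return True if data is ASCII and every character is printable or common whitespace."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return all(ch.isprintable() or ch in "\r\n\t" for ch in text)


raw = bytes(range(256))          # stand-in for ciphertext or other binary data
encoded = base64.b64encode(raw)  # the same data after base64 encoding

print(is_printable_text(raw))               # False
print(is_printable_text(encoded))           # True
print(is_printable_text(b"hello world\n"))  # True
```

The encoded binary blob and the plain sentence both come back as printable, which is why the check alone is not a useful signal here.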

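After some thought, a workable flow is this: first use commands such as file and wc to determine the file type, then collect metrics such as the number of base64 lines in the file, and output some labels and predictions about the file. No automatic decryption or decoding is needed at this stage. Record first and process later, because you may not be able to keep up if you process everything at collection time, and both accuracy and performance overhead could suffer.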

import subprocess
'''
Keywords in the file type reported by the file command that deserve attention:
with very long lines
Multitracker Version
data
'''
def get_file_info(filepath, info_type='file'):
    # Build the command as an argument list (no shell), so that spaces or
    # shell metacharacters in filepath cannot change what gets executed;
    # '--' keeps a path that starts with '-' from being read as an option
    if info_type.strip().lower() == 'wc':
        cmd = ['wc', '--', filepath]
    else:
        cmd = ['file', '-b', '--', filepath]

    # Run the command and return (return code, decoded output)
    try:
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except (OSError, ValueError) as e:
        print("{0} failed, reason {1}".format(cmd, str(e)))
        return -1, str(e)
    stdout_data, stderr_data = p.communicate()
    if p.returncode != 0:
        print("{0} failed, status code {1} stdout {2} stderr {3}".format(cmd, p.returncode, stdout_data, stderr_data))
        return p.returncode, stderr_data.decode(errors='replace').strip()
    return p.returncode, stdout_data.decode(errors='replace').strip()

print(get_file_info("1.txt"))
print(get_file_info("2.txt"))
print(get_file_info("3.txt"))
print(get_file_info("1.txt", "wc"))
print(get_file_info("2.txt", "wc"))
print(get_file_info("3.txt", "wc"))
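
The keywords listed in the docstring above could be turned into labels along these lines. This is only a sketch: the function name `flag_file_description` is hypothetical, and matching "data" only when it is the entire description is my own assumption (a plain substring match would also fire on descriptions such as "gzip compressed data").

```python
# Keywords taken from the docstring above; these two are matched as substrings
SUBSTRING_KEYWORDS = ("with very long lines", "Multitracker Version")


def flag_file_description(description):
    """Return the watched keywords found in one `file -b` description string."""
    desc = description.strip()
    hits = [kw for kw in SUBSTRING_KEYWORDS if kw in desc]
    # Assumption: only the bare word "data" (what `file` prints for content
    # it cannot identify) counts, not every description that mentions data
    if desc == "data":
        hits.append("data")
    return hits


print(flag_file_description("ASCII text, with very long lines"))  # ['with very long lines']
print(flag_file_description("data"))                              # ['data']
print(flag_file_description("gzip compressed data, from Unix"))   # []
```

The returned list can be stored next to the file path as a label and reviewed later.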

#!/usr/bin/env python3
# coding=utf-8

import base64
import sys

'''
a    YQ==
ab    YWI=
abc    YWJj

hello    aGVsbG8=
hello7    aGVsbG83
'''

base64_char_set = {
    'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'
    ,'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
    ,'0','1','2','3','4','5','6','7','8','9'
    ,'+','/','='
}

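# Returns 1 if astr looks like base64, otherwise 0
def is_str_base64_encode(astr):

    # Padded base64 output always has a length that is a multiple of 4
    str_len = len(astr)
    if 0 == str_len or str_len%4 != 0:
        return 0

    # In base64 output the equals sign (=) appears only at the very end: none, one, or two of them
    if astr.count('=') > 2:
        return 0
    elif '=' in astr.rstrip('='):
        return 0

    # Base64 output can contain only the characters (A-Z,a-z,0-9,+,/,=)
    # A set lookup is used instead of a regex to keep this check cheap;
    # note that passing it still does not prove the string is base64
    for x in astr:
        if x in base64_char_set:
            continue
        else:
            return 0
    #for

    # Finally try to decode in strict mode: a string that fails here is
    # definitely not base64, but ordinary words such as "abcd" also pass,
    # so a return value of 1 only means "possibly base64"
    try:
        base64.b64decode(astr, validate=True)
    except ValueError as e:
        print(e)
        return 0

    return 1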

def main():
    if len(sys.argv) < 2:
        print('usage: {0} <file>'.format(sys.argv[0]))
        sys.exit(1)
    line_count = 0
    b64_count = 0
    with open(sys.argv[1], 'rb') as fp:
        for line in fp:
            line = line.decode(errors='ignore').strip() # https://stackoverflow.com/a/50359833
            if line:
                line_count += 1
                b64_count += is_str_base64_encode(line)
            #if
        #for
        # Guard against empty files to avoid dividing by zero
        b64_line_rate = b64_count/line_count if line_count else 0.0
        print('{0}:b64_line_rate = {3}\nline_count = {1}\nb64_count = {2}\n'.format(sys.argv[1], line_count, b64_count, b64_line_rate))
    #with

if __name__ == '__main__':
    main()
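
The "record first, process later" idea could be sketched roughly as follows. Everything here is illustrative: the function name `build_record`, the label names, and the 0.9 threshold are assumptions, not values from any measurement. The function takes the `file` description and the line counts produced by the two scripts above and turns them into one JSON line that can be stored and reviewed later.

```python
import json


def build_record(path, file_desc, line_count, b64_count, threshold=0.9):
    """Combine the collected metrics into one record with coarse labels."""
    # Same guard as the script above: an empty file gets a rate of 0.0
    rate = b64_count / line_count if line_count else 0.0
    labels = []
    if rate >= threshold:  # threshold is an arbitrary illustrative cut-off
        labels.append("mostly-base64")
    if "with very long lines" in file_desc:
        labels.append("very-long-lines")
    return {
        "path": path,
        "file_desc": file_desc,
        "line_count": line_count,
        "b64_count": b64_count,
        "b64_line_rate": round(rate, 4),
        "labels": labels,
    }


record = build_record("1.txt", "ASCII text, with very long lines", 20, 19)
print(json.dumps(record, ensure_ascii=False))
```

Writing one such line per file keeps collection cheap and leaves decoding or decryption attempts to a later, offline step.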

Reference links:

Hacker Tools: Ciphey – Automatic decryption, decoding & cracking (automatically decrypts, decodes, and cracks hashes without knowing the key or password)
https://blog.intigriti.com/2021/08/11/hacker-tools-ciphey/
https://github.com/Ciphey/Ciphey

Ciphey currently supports 51 encryptions, encodings, compression methods, and hashes.
https://github.com/Ciphey/Ciphey/wiki/Supported-Ciphers

The Cyber Swiss Army Knife – a web app for encryption, encoding, compression and data analysis
https://github.com/gchq/CyberChef
https://gchq.github.io/CyberChef/

Cipher Identifier – Tool to identify/recognize the type of encryption/encoding applied to a message (more 200 ciphers/codes are detectable). Cipher identifier to quickly decrypt/decode any text.
https://www.dcode.fr/cipher-identifier

Enjoy Encoding & Decoding!
https://dencode.com/

Encryption vs. Hashing vs. Salting – What’s the Difference?
https://www.pingidentity.com/en/resources/blog/post/encryption-vs-hashing-vs-salting.html

Encryption vs Encoding vs Hashing
https://www.geeksforgeeks.org/encryption-encoding-hashing/

Cryptography with Python – Quick Guide
https://www.tutorialspoint.com/cryptography_with_python/cryptography_with_python_quick_guide.htm

How to Encrypt and Decrypt Files in Python
https://thepythoncode.com/article/encrypt-decrypt-files-symmetric-python

How to determine what type of encoding/encryption has been used?
https://security.stackexchange.com/questions/3989/how-to-determine-what-type-of-encoding-encryption-has-been-used

Test if a python string is printable
https://stackoverflow.com/questions/3636928/test-if-a-python-string-is-printable/50731077#50731077

TypeError: a bytes-like object is required, not ‘str’
https://bobbyhadz.com/blog/python-typeerror-bytes-like-object-is-required-not-str

=END=

30 November 2023

KnowledgeBase, Programing, Security, Tools

base64, Ciphey, file, Python, Tools