How to Determine the Encoding/Encryption Type of a String


=Start=

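Background:

Some thoughts and experiments prompted by a recent case involving transformed (encrypted or encoded) data. I am writing them down briefly so they are easy to find when needed later.

This is a rudimentary form of steganography. Client-side transformations of this kind vary endlessly and cannot be fully enumerated, so this is only a thought exercise. Doing data security well requires reinforcing the "shift left" mindset.

Main text:

Reference answer:

Ciphey was underwhelming in practice, probably because what it does differs from what I expected. I wanted it to quickly tell me which encoding/encryption an input string or file uses (to help decide what to do next), but according to its description it automatically decrypts, decodes, and cracks hashes.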

# Install Ciphey on macOS
brew install ciphey

# Three ways to run Ciphey

1. File input

ciphey -f encrypted.txt

2. Unqualified input

ciphey -- "Encrypted input"

3. Normal way

ciphey -t "Encrypted input"

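Reading the file in Python and checking whether the resulting string contains only printable characters is not very workable either. Some people store encoded data as strings and others as bytes. Checking a string tells you little, because it has usually already been through Base64 or a similar encoding, and with bytes the check cannot tell you anything either.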
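
To illustrate the point above, here is a minimal sketch (the helper name `looks_printable` is hypothetical and not part of the original scripts). It uses random bytes as a stand-in for encrypted data and shows that, once Base64-encoded, the result passes a printable-character check just like ordinary text does.

```python
import base64
import os
import string

PRINTABLE = set(string.printable)

def looks_printable(text):
    """Return True if every character of text is in string.printable."""
    return all(ch in PRINTABLE for ch in text)

raw = os.urandom(32)                              # stand-in for encrypted/binary data
encoded = base64.b64encode(raw).decode('ascii')   # what typically ends up stored

print(looks_printable(encoded))        # Base64 output only uses printable characters
print(looks_printable('plain text'))   # ordinary text passes the same check
```

Both calls give the same answer, so the check cannot separate encoded ciphertext from plain text.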

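After some thought, a workflow that basically works is: first use commands such as file/wc to determine the file type, then compute statistics such as the number of Base64 lines in the file, and output some labels and predictions about the file. There is no need to decrypt or decode automatically. Record first and process later, because processing on the spot may not keep up, and both accuracy and performance overhead may be unacceptable.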

import subprocess

'''
Pay attention when the file type reported by the file command contains any of these keywords:
with very long lines
Multitracker Version
data
'''

def get_file_info(filepath, info_type='file'):
    # Build the command as an argument list so the path is never interpreted by a shell
    if info_type.strip().lower() == 'wc':
        cmd = ['wc', filepath]
    else:
        cmd = ['file', '-b', filepath]

    # Run the command and return (return code, output)
    try:
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except (OSError, ValueError) as e:
        print("{0} failed, reason {1}".format(cmd, str(e)))
        return -1, str(e)
    stdout_data, stderr_data = p.communicate()
    if p.returncode != 0:
        print("{0} failed, status code {1} stdout {2} stderr {3}".format(cmd, p.returncode, stdout_data, stderr_data))
        return p.returncode, stderr_data.decode(errors='replace')
    return p.returncode, stdout_data.decode(errors='replace').strip()

print(get_file_info("1.txt"))
print(get_file_info("2.txt"))
print(get_file_info("3.txt"))
print(get_file_info("1.txt", "wc"))
print(get_file_info("2.txt", "wc"))
print(get_file_info("3.txt", "wc"))
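
The keywords listed in the docstring above could be turned into labels roughly as follows. This is only a sketch: the function name `tag_file_output` is hypothetical, and treating `data` as a match only when it is the entire description is my own assumption, made so that outputs such as "gzip compressed data" are not flagged too.

```python
# Substrings from the watch list in the docstring above
WATCH_SUBSTRINGS = ('with very long lines', 'Multitracker Version')

def tag_file_output(file_output):
    """Return the watch-list keywords found in one line of `file -b` output."""
    text = file_output.strip()
    tags = [kw for kw in WATCH_SUBSTRINGS if kw in text]
    # Assumption: plain "data" only counts when it is the whole description,
    # otherwise e.g. "gzip compressed data" would be flagged as well.
    if text == 'data':
        tags.append('data')
    return tags

print(tag_file_output('ASCII text, with very long lines'))
print(tag_file_output('data'))
print(tag_file_output('gzip compressed data'))
```
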
#!/usr/bin/env python3
# coding=utf-8

import base64
import sys

'''
Sample inputs and their Base64 encodings:
a    YQ==
ab    YWI=
abc    YWJj

hello    aGVsbG8=
hello7    aGVsbG83
'''

base64_char_set = set(
    'abcdefghijklmnopqrstuvwxyz'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    '0123456789'
    '+/='
)

# true->1, false->0
def is_str_base64_encode(astr):

    # Base64-encoded data always has a length that is a multiple of 4
    str_len = len(astr)
    if 0 == str_len or str_len % 4 != 0:
        return 0

    # '=' may only appear as padding at the very end: zero, one, or two of them
    body = astr.rstrip('=')
    if str_len - len(body) > 2 or '=' in body:
        return 0

    # A Base64 string can only contain A-Z, a-z, 0-9, '+', '/' and '='.
    # A set lookup is used here instead of a regex to keep the check cheap.
    # Passing this check alone does not prove the string is Base64.
    for x in astr:
        if x not in base64_char_set:
            return 0

    # Finally, try to decode: a successful strict decode is the most reliable signal
    try:
        base64.b64decode(astr, validate=True)
    except ValueError as e:
        print(e)
        return 0

    return 1

def main():
    line_count = 0
    b64_count = 0
    with open(sys.argv[1], 'rb') as fp:
        for line in fp:
            line = line.decode(errors='ignore').strip() # https://stackoverflow.com/a/50359833
            if line:
                line_count += 1
                b64_count += is_str_base64_encode(line)

    # Guard against division by zero when the file has no non-empty lines
    b64_line_rate = b64_count / line_count if line_count else 0.0
    print('{0}:b64_line_rate = {3}\nline_count = {1}\nb64_count = {2}\n'.format(sys.argv[1], line_count, b64_count, b64_line_rate))

if __name__ == '__main__':
    main()
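
The length and padding rules checked above follow from Base64 packing every 3 input bytes into 4 output characters. A small sketch that reproduces the sample pairs from the docstring (the helper name `padding_for` is hypothetical):

```python
import base64

def padding_for(data):
    """Number of '=' characters Base64 appends for the given bytes."""
    return (3 - len(data) % 3) % 3

for sample in (b'a', b'ab', b'abc', b'hello', b'hello7'):
    encoded = base64.b64encode(sample).decode('ascii')
    # The encoded length is always a multiple of 4;
    # the padding depends only on len(sample) % 3.
    print(sample, encoded, len(encoded), padding_for(sample))
```
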
Reference links:

Hacker Tools: Ciphey – Automatic decryption, decoding & cracking (automatically decrypts, decodes, and cracks hashes without knowing the key or password)
https://blog.intigriti.com/2021/08/11/hacker-tools-ciphey/
https://github.com/Ciphey/Ciphey

Ciphey currently supports 51 encryptions, encodings, compression methods, and hashes.
https://github.com/Ciphey/Ciphey/wiki/Supported-Ciphers

The Cyber Swiss Army Knife – a web app for encryption, encoding, compression and data analysis
https://github.com/gchq/CyberChef
https://gchq.github.io/CyberChef/

Cipher Identifier – Tool to identify/recognize the type of encryption/encoding applied to a message (more 200 ciphers/codes are detectable). Cipher identifier to quickly decrypt/decode any text.
https://www.dcode.fr/cipher-identifier

Enjoy Encoding & Decoding!
https://dencode.com/

Encryption vs. Hashing vs. Salting – What’s the Difference?
https://www.pingidentity.com/en/resources/blog/post/encryption-vs-hashing-vs-salting.html

Encryption vs Encoding vs Hashing
https://www.geeksforgeeks.org/encryption-encoding-hashing/

Cryptography with Python – Quick Guide
https://www.tutorialspoint.com/cryptography_with_python/cryptography_with_python_quick_guide.htm

How to Encrypt and Decrypt Files in Python
https://thepythoncode.com/article/encrypt-decrypt-files-symmetric-python

How to determine what type of encoding/encryption has been used?
https://security.stackexchange.com/questions/3989/how-to-determine-what-type-of-encoding-encryption-has-been-used

Test if a python string is printable
https://stackoverflow.com/questions/3636928/test-if-a-python-string-is-printable/50731077#50731077

TypeError: a bytes-like object is required, not ‘str’
https://bobbyhadz.com/blog/python-typeerror-bytes-like-object-is-required-not-str

=END=


4 comments on “How to Determine the Encoding/Encryption Type of a String”

  1. How to determine what type of encoding/encryption has been used?
    https://security.stackexchange.com/questions/3989/how-to-determine-what-type-of-encoding-encryption-has-been-used
    `
    Summary: reverse engineering, plus experience-based testing and guessing

    ==

    Question:
    Is there a way to find what type of encryption/encoding is being used? For example, I am testing a web application which stores the password in the database in an encrypted format (WeJcFMQ/8+8QJ/w0hHh+0g==). How do I determine what hashing or encryption is being used?

    Answer 1:
    Your example string (WeJcFMQ/8+8QJ/w0hHh+0g==) is Base64 encoding for a sequence of 16 bytes, which do not look like meaningful ASCII or UTF-8. If this is a value stored for password verification (i.e. not really an “encrypted” password, rather a “hashed” password) then this is probably the result of a hash function computed over the password; the one classical hash function with a 128-bit output is MD5. But it could be about anything.

    The “normal” way to know that is to look at the application code. Application code is incarnated in a tangible, fat way (executable files on a server, source code somewhere…) which is not, and cannot be, as much protected as a secret key can. So reverse engineering is the “way to go”.

    Barring reverse engineering, you can make a few experiments to try to make educated guesses:

    * If the same user “changes” his password but reuses the same, does the stored value changes ? If yes, then part of the value is probably a randomized “salt” or IV (assuming symmetric encryption).

    * Assuming that the value is deterministic from the password for a given user, if two users choose the same password, does it result in the same stored value ? If no, then the user name is probably part of the computation. You may want to try to compute MD5(“username:password”) or other similar variants, to see if you get a match.

    * Is the password length limited ? Namely, if you set a 40-character password and cannot successfully authenticate by typing only the first 39 characters, then this means that all characters are important, and this implies that this really is password hashing, not encryption (the stored value is used to verify a password, but the password cannot be recovered from the stored value alone).

    ==
    Thanks for the inputs.. Pls tell me more about how you confirmed its a Base64 encoding for a sequence of 16 bytes. Regarding your experiments, Yes, this is a value stored for password verification. 1) if a user changes password, then the stored value changes too.. 2) if two users choose same password, the stored value is the same 3) password length is not limited.

    @Learner: any sequence of 24 characters, such that the first 22 are letters, digits, ‘+’ or ‘/’, and the last two are ‘=’ signs, is a valid Base64 encoding of a 128-bit value. And any 128-bit value, when encoded with Base64, yields such a sequence.

    ==

    Answer 2:

    Generally speaking, using experience to make educated guesses is how these things are done.
    `
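
The Base64 observation and the MD5("username:password") experiment from the answer above can be sketched as follows. The function name `matching_constructions`, the list of candidate constructions, and the credentials `alice` / `secret` are all hypothetical and only there to show the shape of the test. A real application may use something entirely different.

```python
import base64
import hashlib

stored = 'WeJcFMQ/8+8QJ/w0hHh+0g=='      # sample value from the question
raw = base64.b64decode(stored, validate=True)
print(len(raw), raw.hex())               # 16 bytes = 128 bits, the MD5 digest size

def matching_constructions(stored_digest, username, password):
    """Return the names of candidate constructions whose MD5 equals stored_digest."""
    candidates = {
        'md5(password)': password,
        'md5(username:password)': username + ':' + password,
        'md5(password+username)': password + username,
    }
    return [name for name, text in candidates.items()
            if hashlib.md5(text.encode('utf-8')).digest() == stored_digest]

# Hypothetical credentials, only to show the call shape
print(matching_constructions(raw, 'alice', 'secret'))
```

An empty result only rules out the listed candidates. It says nothing about salts or other hash functions.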

  2. How can I detect if hashes are salted? [duplicate]
    https://security.stackexchange.com/questions/105438/how-can-i-detect-if-hashes-are-salted
    `
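    Summary:
    Without prior knowledge of the hash used, and with no source code or binary available to reverse engineer, you are basically left with guesswork. Choose carefully, because picking the wrong hash can waste a lot of time.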

    ==
    question:
    Is it possible to detect hash function of a hash if I don’t have access to PHP code? I know that if a hash is some kind of MD5, but I don’t know if there is salt etc.

    answer(s):

    Some tools make a educated guess regarding the encryption and salt type but there are numerous types of encryption schemes, some so closely related that the hashes nearly looks the same.

    Searched around and found some interesting tools to find the encryption type and they can be broken down into two categories namely with source / binary available and without any source binary.

    Finding the encryption type through reverse engineering can be achieved via tools such as:

    http://www.autistici.org/ratsoul/iss.html – A plugin for immunity debugger that identifies common encryption or encoding functions / structures etc.
    http://aluigi.altervista.org/mytoolz.htm#signsrch – is the binary version of the immunity plugin version
    http://www.hexblog.com/?p=27 – a plugin for OllyDbg to determine the type of encryption
    https://www.hex-rays.com/products/ida/tech/flirt/index.shtml – a plugin for IDA Pro to determine standard called libraries, could be used to identify encryption libraries

    Then there is the “educated” guess script:

    http://code.google.com/p/hash-identifier/ is a script that compares various attributes such as length, contained char types etc to produce a possible hash type used. Seems to be included in Backtrack5 standard.

    And the websites that allow for manual verification such as:

    http://www.insidepro.com/hashes.php – Allows you to enter a password and compare the hash to your example hash
    http://forum.insidepro.com/viewtopic.php?t=8225 – Lists various encrypted hashes to allow for a manual comparison

    So basically it seems that without prior knowledge of the hash used, with no source / binary available to reverse engineer, you are basically left with serious guess work. And to add a little pressure, choose carefully since choosing the wrong hash can lead to a LOT of wasted time!
    `
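
The "educated guess" approach described above (comparing length and character types) might look roughly like this. It is a sketch of the idea, not the actual hash-identifier script. The function name `guess_hash_types` and the short candidate table are my own choices, and the table is far from complete.

```python
import string

# Digest sizes, in hex characters, for a few well-known hash functions
HEX_LENGTH_CANDIDATES = {
    32: ['MD5', 'MD4', 'NTLM'],
    40: ['SHA-1', 'RIPEMD-160'],
    64: ['SHA-256', 'SHA3-256'],
    128: ['SHA-512', 'SHA3-512'],
}

def guess_hash_types(value):
    """Return candidate hash names for a hex string, or [] if nothing fits."""
    value = value.strip()
    if not value or any(ch not in string.hexdigits for ch in value):
        return []
    return HEX_LENGTH_CANDIDATES.get(len(value), [])

# MD5 of the empty string, used here as a known 32-character example
print(guess_hash_types('d41d8cd98f00b204e9800998ecf8427e'))
```

Several algorithms share each length, which is exactly why this stays a guess: the length narrows the field but cannot reveal salts or the exact function.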

  3. Linux file command classifying files
    https://unix.stackexchange.com/questions/151008/linux-file-command-classifying-files
    `
    1. The filesystem tests are based on examining the return from a stat(2) system call.

    2. The magic number tests are used to check for files with data in particular fixed formats.

    3. The language tests look for particular strings (cf names.h) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program.
    `

  4. $ man file
    `
    file tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests. The first test that succeeds causes the file type to be printed.

    The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually “binary” or non-printable). Exceptions are well-known file formats (core files, tar archives) that are known to contain binary data. When modifying magic files or the program itself, make sure to preserve these keywords. Users depend on knowing that all the readable files in a directory have the word “text” printed. Don’t do as Berkeley did and change “shell commands text” to “shell script”.

    The filesystem tests are based on examining the return from a stat(2) system call. The program checks to see if the file is empty, or if it’s some sort of special file. Any known file types appropriate to the system you are running on (sockets, symbolic links, or named pipes (FIFOs) on those systems that implement them) are intuited if they are defined in the system header file <sys/stat.h>.

    The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) a.out file, whose format is defined in <elf.h>, <a.out.h> and possibly <exec.h> in the standard include directory. These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic number” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. The information identifying these files is read from the compiled magic file /usr/share/file/magic.mgc, or the files in the directory /usr/share/file/magic if the compiled file does not exist.

    If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as “text” because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only “character data” because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.

    Once file has determined the character set used in a text-type file, it will attempt to determine in what language the file is written. The language tests look for particular strings (cf. <names.h>) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program. These tests are less reliable than the previous two groups, so they are performed last. The language test routines also test for some miscellany (such as tar(1) archives, JSON files).

    Any file that cannot be identified as having been written in any of the character sets listed above is simply said to be “data”.

    `

    file (command)
    https://en.wikipedia.org/wiki/File_(command)
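
As a rough illustration of the order described in the man page (magic test first, then a text test, otherwise "data"), here is a much-simplified sketch. The function name `sniff` and the five signatures are my own selection. The real `file` command reads its patterns from the magic database and handles many more character sets than UTF-8.

```python
# A handful of well-known signatures; the real magic database has thousands
MAGIC_SIGNATURES = [
    (b'\x7fELF', 'ELF executable'),
    (b'\x89PNG\r\n\x1a\n', 'PNG image'),
    (b'\x1f\x8b', 'gzip compressed data'),
    (b'PK\x03\x04', 'Zip archive'),
    (b'%PDF-', 'PDF document'),
]

def sniff(header):
    """Classify the first bytes of a file: magic test, then text test, else 'data'."""
    if not header:
        return 'empty'
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    # Simplified text test: valid UTF-8 made of printable characters and common whitespace
    try:
        text = header.decode('utf-8')
    except UnicodeDecodeError:
        return 'data'
    if all(ch.isprintable() or ch in '\t\r\n' for ch in text):
        return 'text'
    return 'data'

print(sniff(b'\x1f\x8b\x08\x00'))
print(sniff(b'hello world\n'))
print(sniff(b'\x00\x01\x02\xff'))
```

To use it on a file, read the first few hundred bytes in binary mode and pass them in, for example `sniff(open(path, 'rb').read(512))`. A header cut in the middle of a multi-byte character would be reported as "data" by this sketch.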
