Decompressing gzip files/content with Python

How do you identify a gzip file?

http://search.aol.com/aol/search?q=how+to+identify+gzip+file

Use the file's magic number to identify its type: a gzip file starts with the two bytes 1f 8b.
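As a minimal sketch of the magic-number check, the helper below compares the first two bytes of a buffer against the gzip signature. The names `GZIP_MAGIC` and `looks_like_gzip` are hypothetical and chosen only for illustration.

```python
import gzip

# First two bytes of every gzip member (the gzip magic number).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the buffer starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


sample = gzip.compress(b"hello")
print(looks_like_gzip(sample))    # True
print(looks_like_gzip(b"hello"))  # False
```

A matching signature only suggests gzip; the data can still be truncated or corrupt, so decompression errors should be handled as well.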

How do you view/print binary data?

In the end I used the binascii module:
https://docs.python.org/2/library/binascii.html

Example:

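import binascii
import zlib

# 'aaa' is a gzip-compressed file; read it as raw bytes (Python 3 syntax)
with open('aaa', 'rb') as fp:
    con_0 = fp.read()

print(con_0[0], con_0[1])           # first two bytes as integers: 31 139 for gzip
print(binascii.hexlify(con_0[:2]))  # the same bytes as hex: b'1f8b' for gzip
# print(len(con_0))

# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
decompressed_data = zlib.decompress(con_0, 16 + zlib.MAX_WBITS)
print(decompressed_data[:2])
# print(len(decompressed_data))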
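The second argument to `zlib.decompress` selects the container format that zlib expects. The sketch below, which assumes Python 3, compares the gzip-only setting used above with the auto-detecting setting and with the `gzip` module. The helper name `decompress_auto` is hypothetical.

```python
import gzip
import zlib


def decompress_auto(data: bytes) -> bytes:
    """Decompress data wrapped in either a gzip or a zlib container."""
    # 32 + MAX_WBITS: let zlib detect a gzip or zlib header automatically.
    return zlib.decompress(data, 32 + zlib.MAX_WBITS)


payload = b"example payload " * 4
gz_blob = gzip.compress(payload)  # gzip container
zl_blob = zlib.compress(payload)  # zlib container

# 16 + MAX_WBITS accepts only the gzip container.
assert zlib.decompress(gz_blob, 16 + zlib.MAX_WBITS) == payload
# The auto-detecting variant handles both containers.
assert decompress_auto(gz_blob) == payload
assert decompress_auto(zl_blob) == payload
# The gzip module gives the same result for gzip data.
assert gzip.decompress(gz_blob) == payload
```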

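In the code I have been writing recently to reassemble packets into HTTP streams/sessions, the recovered files sometimes came out garbled. After some debugging, I found the cause: the data in the Response had been compressed (which compression algorithm is used depends on what the Request and Response negotiate):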

Request:
...
Accept: text/html, text/plain, text/css, text/sgml, */*;q=0.01
Accept-Encoding: gzip, compress, bzip2 # decompression that may be needed
...

Response:
...
Content-Type: text/html
Content-Type: text/xml; charset=UTF-8
Content-Type: application/octet-stream # octet stream; just pass it to binascii.hexlify
...
Content-Encoding: gzip # the file looks garbled (its content is gzip-encoded)
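One way to act on the Content-Encoding header is to pick a decoder from its value. The following is only a sketch under assumptions: `decode_body` is a hypothetical helper, it takes the already reassembled response body as bytes, and it covers just the encodings that appear in the headers above.

```python
import bz2
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Undo the Content-Encoding applied to a fully received response body."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body  # not compressed
    if encoding in ("gzip", "x-gzip"):
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "bzip2":
        return bz2.decompress(body)
    # "compress" (LZW) and anything else is left unsupported in this sketch.
    raise ValueError("unsupported Content-Encoding: %r" % content_encoding)


html = b"<html><body>hello</body></html>"
assert decode_body(gzip.compress(html), "gzip") == html
assert decode_body(bz2.compress(html), "bzip2") == html
assert decode_body(html) == html
```

Raising on an unknown encoding keeps compressed bytes from being silently written out as if they were the original content.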

References:

Search keywords:

Using Python’s gzip and StringIO to compress data
How can I decompress a gzip stream with zlib
how to uncompress zlib data in python

22 December 2014

admin

KnowledgeBase, Programing

binascii, gzip, Python

4 comments on “Decompressing gzip files/content with Python”

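Lester said:

2016-11-15 17:15

Hello, I want to add support for gzip-compressed HTTP transfers, which means decompressing the data once I have determined that it is compressed. Should I handle this starting from the HTTP packets, or should I decompress the file after it has been fully received?

- admin said:
  
  2016-11-21 17:19
  
  Sorry, I was busy earlier and did not reply in time.
  When I dealt with gzip-compressed files before, I generally received everything first and then decompressed it, rather than processing it as a stream. Later I Googled it and found that SO has an answer on how to handle it as a stream: http://stackoverflow.com/questions/2695152/in-python-how-do-i-decode-gzip-encoding You may find it useful.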
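For readers following this thread, stream-style decompression can be sketched with `zlib.decompressobj`, which accepts compressed data piece by piece. This is an illustrative sketch and not the code from the linked answer; `iter_decompressed` is a hypothetical name, and the chunks stand in for body fragments as they arrive.

```python
import gzip
import zlib


def iter_decompressed(chunks):
    """Yield decompressed bytes as gzip-compressed chunks arrive."""
    # 16 + MAX_WBITS: expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decoder.decompress(chunk)
        if out:
            yield out
    tail = decoder.flush()  # emit anything still buffered
    if tail:
        yield tail


original = b"streamed body " * 1000
compressed = gzip.compress(original)
# Simulate the body arriving in 64-byte pieces.
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
assert b"".join(iter_decompressed(pieces)) == original
```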
a-z said:

2017-03-13 11:17

A list of file headers (magic headers) for various file formats
http://garykessler.net/library/file_sigs.html
abc said:

2023-11-24 19:23

“file” command confusing C & C++
https://unix.stackexchange.com/questions/140238/file-command-confusing-c-c
`
From man page of file command,

file command actually performs 3 tests on determining the file type.

1. First test
The filesystem tests are based on examining the return from a stat(2) system call.

2. Second test
The magic number tests are used to check for files with data in particular fixed formats.

3. Third test
The language tests look for particular strings (cf names.h) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program.

The output of the file command is generally based on the result of any of the tests that succeeds.
`
