How to Determine the Encoding of a String

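The title of this post promises more than it can deliver: a program cannot reliably determine which encoding a given string uses (a single string may contain characters from several encodings, which is hard even for a person to judge, let alone a simple program). What follows therefore only makes a rough guess at the encoding and then, if needed, tentatively decodes the string with the encoding we guessed.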

1. Handling in Python

Search keywords:

http://search.aol.com/aol/search?q=use+python+to+determine+the+string+encoding

'''
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:
'''
# Python 2 only: the unicode type and the print statement do not exist in Python 3.
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

'''
Here is a small snippet to help you guess the encoding. It distinguishes latin1 from utf8 quite well. It converts a byte string to a unicode string.
Attention: the order of encoding_guess_list is important. Example: "latin1" always succeeds, so it must come last.
'''
encoding_guess_list = ['utf8', 'latin1']

def try_unicode(string, errors='strict'):
    # Already decoded: nothing to do.
    if isinstance(string, unicode):
        return string
    assert isinstance(string, str), repr(string)
    # Try each candidate in order and return the first successful decode.
    for enc in encoding_guess_list:
        try:
            return string.decode(enc, errors)
        except UnicodeError:
            continue
    raise UnicodeError('Failed to convert %r' % string)

def test_try_unicode():
    # u'\xfc' is LATIN SMALL LETTER U WITH DIAERESIS; the escape avoids needing a source encoding declaration.
    for start, should in [
        ('\xfc', u'\xfc'),      # latin1 bytes
        ('\xc3\xbc', u'\xfc'),  # utf8 bytes
        ('\xbb', u'\xbb'), # postgres/psycopg2 latin1: RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
        ]:
        result = try_unicode(start, errors='strict')
        if not result == should:
            raise Exception(u'Error: start=%r should=%r result=%r' % (start, should, result))
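
The snippet above targets Python 2. For readers on Python 3, where the input would be a bytes object, the same try-each-candidate idea might look roughly like the sketch below. The names guess_decode and ENCODING_GUESSES are made up for this illustration and do not come from the quoted answer; returning the encoding alongside the text is also just one possible design.

```python
# Candidate encodings, tried in order. latin-1 accepts any byte sequence,
# so it has to come last or the candidates after it would never be tried.
ENCODING_GUESSES = ["utf-8", "latin-1"]


def guess_decode(data, encodings=ENCODING_GUESSES):
    """Return (text, encoding) for the first candidate that decodes data cleanly.

    Hypothetical helper: a guess, not a reliable detector.
    """
    if isinstance(data, str):
        return data, None  # already text, nothing to decode
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected bytes or str, got %r" % type(data))
    for enc in encodings:
        try:
            return bytes(data).decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (encodings, data))


if __name__ == "__main__":
    print(guess_decode(b"\xc3\xbc"))  # valid UTF-8
    print(guess_decode(b"\xfc"))      # invalid UTF-8, falls through to latin-1
```

As with the original, a successful decode only means the bytes are valid in that encoding, not that the guess is what the author of the data intended.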

Some reference links:

2. Handling in PHP

Reference: How to detect whether a string is UTF-8 encoded in PHP

To be continued...

21 January 2015

admin

KnowledgeBase, Other, Tools

encoding, Unicode

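5 comments on "How to Determine the Encoding of a String"

a-z said:

2017-02-11 09:03

The root cause of Python encoding errors
https://mp.weixin.qq.com/s/6SW7qYWUypxSDHIKNGT43A
`
Before we can fully understand how character encodings relate to Python, we need to get some basic concepts straight. We meet some of them every day, and even use them, without necessarily understanding them: byte, character, character set, character code, character encoding.

Python 2 character types
In Python 2 there are two string-related data types, str and unicode, both of which inherit from basestring. The bytes in a str can be in any encoding, such as ASCII, UTF-8, or GBK.
Converting between str and unicode
Causes of UnicodeXXXError errors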
`

hi said:

2018-12-03 21:11

Python: converting unicode to Chinese and changing the default encoding
https://www.cnblogs.com/technologylife/p/6071787.html
`
In [50]: s1 = u'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'

In [51]: type(s1)
Out[51]: unicode

In [52]: s1
Out[52]: u'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'

In [53]: print s1
人生苦短，py是岸

In [54]: s2 = r'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'

In [55]: type(s2)
Out[55]: str

In [56]: s2
Out[56]: '\\u4eba\\u751f\\u82e6\\u77ed\\uff0cpy\\u662f\\u5cb8'

In [57]: print s2
\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8

In [58]: s2 = s2.decode('unicode_escape')

In [59]: type(s2)
Out[59]: unicode

In [60]: s2
Out[60]: u'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'

In [61]: print s2
人生苦短，py是岸
`
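
The session above is Python 2, where str has a decode method. In Python 3, str has no decode, so one possible equivalent is to encode the escaped text to bytes first and then apply the unicode_escape codec. This is only a sketch of that idea; the variable names are chosen for illustration, and it assumes the escaped text is pure ASCII, as it is here.

```python
# Text containing literal backslash-u escapes (a raw string, so Python
# does not interpret the escapes when reading the source).
escaped = r"\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8"

# unicode_escape decodes bytes to str, so the ASCII-only text is
# encoded to bytes first.
decoded = escaped.encode("ascii").decode("unicode_escape")

# The reverse direction: turn the decoded text back into escapes.
re_escaped = decoded.encode("unicode_escape").decode("ascii")

if __name__ == "__main__":
    print(decoded)
    print(re_escaped)
```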

Python: converting unicode to Chinese and Chinese to unicode
https://blog.csdn.net/monkey7777/article/details/52689422

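hi said:

2018-12-03 21:11

Python 2 and Python 3: escapes and encoding (\u, \x)
https://blog.csdn.net/lanchunhui/article/details/53119136
`
\x simply means hexadecimal; followed by two hex digits, it denotes a single byte.
`
Decoding data that begins with \x into Chinese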
https://www.cnblogs.com/xiaoqi/p/5101795.html
`
In [62]: print "\xE5\x85\x84\xE5\xBC\x9F\xE9\x9A\xBE\xE5\xBD\x93 \xE6\x9D\x9C\xE6\xAD\x8C".decode('utf-8')
兄弟难当 杜歌

In [63]: "\xE5\x85\x84\xE5\xBC\x9F\xE9\x9A\xBE\xE5\xBD\x93 \xE6\x9D\x9C\xE6\xAD\x8C".decode('utf-8')
Out[63]: u'\u5144\u5f1f\u96be\u5f53 \u675c\u6b4c'
`

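hi said:

2019-05-17 12:11

All about character encoding
https://mp.weixin.qq.com/s/zKysQ–tJASxvBDFJJ5saw
`
1. Binary and bytes
2. Standard ASCII
3. Extended ASCII character sets
4. GB2312
5. GBK
6. GB18030
7. Unicode
8. UTF, UTF-8, UTF-16
9. What "锟斤拷��" is
10. ICU
11. De facto standards
Character set: Unicode
Byte encoding: UTF-8
Internationalization: ICU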
`

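hi said:

2024-09-05 17:42

Five ways to check for Chinese characters in Python, with examples
https://www.jb51.net/python/290637ks9.htm
`
1. Use Python's built-in ord(): ord() converts a character to its Unicode code point, and you then check whether that value falls within the range of Chinese characters.

2. Use Python's built-in unicodedata library: if 'CJK' in unicodedata.name(char): return True

3. Use a regular expression: [^\u4e00-\u9fa5] matches every non-Chinese character, while [^\x00-\xff] matches every double-byte character, including Chinese characters and symbols.

4. Use a Chinese character set: if b'\xb0\xa1' <= word.encode('gb2312') <= b'\xd7\xf9': return True

5. Use a third-party library: for example, the xpinyin library can convert a string to pinyin and determine whether the string consists of Chinese characters.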
`
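
The first four methods listed above can be sketched with the standard library alone, for Python 3. The function names below are invented for this illustration and are not taken from the linked article. The name check is slightly stricter than the quoted 'CJK' substring test, and the GB2312 check reuses the byte range quoted above, so characters that GB2312 encodes outside that range would be rejected.

```python
import re
import unicodedata

# Method 3: the same code point range as the regular expression above.
HAN_RE = re.compile(r"[\u4e00-\u9fa5]")


def is_han_by_codepoint(ch):
    """Method 1: the code point lies in the range U+4E00..U+9FA5."""
    return 0x4E00 <= ord(ch) <= 0x9FA5


def is_han_by_name(ch):
    """Method 2: the Unicode name starts with 'CJK UNIFIED IDEOGRAPH'."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def contains_han(text):
    """Method 3: at least one character matches the regular expression."""
    return HAN_RE.search(text) is not None


def is_han_by_gb2312(ch):
    """Method 4: the GB2312 bytes fall in the range quoted above."""
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return False  # not representable in GB2312 at all
    return len(raw) == 2 and b"\xb0\xa1" <= raw <= b"\xd7\xf9"


if __name__ == "__main__":
    for ch in ["中", "a", "，"]:
        print(repr(ch), is_han_by_codepoint(ch), is_han_by_name(ch),
              is_han_by_gb2312(ch))
    print(contains_han("py是岸"))
```

Each check answers a slightly different question, so they can disagree on rare characters; which one fits depends on what "Chinese character" needs to mean in a given program.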
