Bytes and Strings in Python


=Start=

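Background:

A summary of the small points about bytes and strings in Python that I have run into frequently of late, kept here for quick reference and study.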

Main text:

Reference answers:
The bytes and str types in Python 3

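Arguably the most important new feature of Python 3 is a much cleaner distinction between text and binary data. Text is always Unicode and is represented by the str type, while binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly, which is what makes the distinction so clear. You cannot concatenate a string and a bytes object, you cannot search for a string inside a bytes object (or vice versa), and you cannot pass a string to a function that expects bytes (or vice versa). This is a good thing.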
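As a minimal sketch of the rules above (Python 3 only; the variable names are just for illustration), mixing the two types raises TypeError rather than guessing an encoding, and an explicit conversion makes the same operations legal:

```python
text = "abc"     # str: Unicode text
data = b"abc"    # bytes: raw binary data

# Concatenating str and bytes raises TypeError instead of guessing an encoding.
try:
    text + data
except TypeError as exc:
    concat_error = exc

# Searching for a str inside a bytes object is rejected in the same way.
try:
    text in data
except TypeError as exc:
    search_error = exc

# Values of the two types never compare equal, even if they look alike.
same = (text == data)

# Converting explicitly on one side makes both operations work.
joined = text + data.decode("utf-8")
found = text.encode("utf-8") in data
```

The explicit decode() or encode() call is the point where you state which encoding the data is in.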

A string can be encoded into a bytes object with encode(), and a bytes object can be decoded back into a string with decode().

# Python 3 interactive session
>>> website = 'https://ixyzero.com/blog/'
>>> type(website)
<class 'str'>
>>> website
'https://ixyzero.com/blog/'

# Convert a str to bytes with the .encode() method
>>> website_bytes_utf8 = website.encode(encoding="utf-8")
>>> type(website_bytes_utf8)
<class 'bytes'>
>>> website_bytes_utf8
b'https://ixyzero.com/blog/'

# Convert bytes to a str with the .decode() method (the encoding defaults to UTF-8)
>>> website_string = website_bytes_utf8.decode()
>>> type(website_string)
<class 'str'>
>>> website_string
'https://ixyzero.com/blog/'
>>>

&

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'

How do you get the "byte length" of a string in Python? (python get string byte length)

def utf8len(s):
    return len(s.encode('utf-8'))

&

# getsizeof(object, default) -> int
# Return the size of object in bytes.
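# This returns the in-memory size of the Python object in bytes, which is not the result we want here; the value also differs across Python versions and platforms.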
import sys
sys.getsizeof(s)

>>> len("hello".encode("utf8"))
5
>>> len("你好".encode("utf8"))
6
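The byte length also depends on which encoding you choose. A small sketch for Python 3, where byte_len is a hypothetical generalization of utf8len above and not something from the original post:

```python
def byte_len(s, encoding="utf-8"):
    """Return the number of bytes `s` occupies once encoded with `encoding`."""
    return len(s.encode(encoding))

text = "你好"
char_count = len(text)                       # 2: len() on a str counts characters
utf8_bytes = byte_len(text)                  # 6: each of these characters takes 3 bytes in UTF-8
gbk_bytes = byte_len(text, "gbk")            # 4: 2 bytes per character in GBK
utf16_bytes = byte_len(text, "utf-16-le")    # 4: 2 bytes per character, no byte order mark
```

Note that the plain "utf-16" codec prepends a 2-byte byte order mark, so it would report 6 for the same string.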

####
Python 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>>
>>> utf8len('你好')
6
>>> utf8len('hello')
5
>>> sys.getsizeof('你好')
43
>>>
>>> sys.getsizeof('hello')
42
>>>

####
Python 3.6.5 (default, Apr 10 2018, 20:17:30)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof('你好')
78
>>> sys.getsizeof('hello')
54
>>>
>>> utf8len('你好')
6
>>> utf8len('hello')
5
>>>

=END=


3 comments on "Bytes and Strings in Python"

  1. Python 3’s f-Strings: An Improved String Formatting Syntax (Guide)
    https://realpython.com/python-f-strings/
    `
    * “Old-school” String Formatting in Python
    ____* Option #1: %-formatting
    ____* Option #2: str.format()
    * f-Strings: A New and Improved Way to Format Strings in Python
    ____* Simple Syntax
    ____* Arbitrary Expressions
    ____* Multiline f-Strings
    ____* Speed
    * Python f-Strings: The Pesky Details
    ____* Quotation Marks
    ____* Dictionaries
    ____* Braces
    ____* Backslashes
    ____* Inline Comments
    * Go Forth and Format!
    * Further Reading
    `

    An Overview of Python f-string Formatting
    https://blog.csdn.net/sunxb10/article/details/81036693
    `
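    f-strings, also known as formatted string literals, are a string formatting method introduced in Python 3.6. The feature originates from PEP 498 (Literal String Interpolation), and its main goal is to make string formatting simpler. Syntactically, an f-string is a string literal prefixed with f or F (f'xxx' or F'xxx'), with curly braces {} marking the replacement fields. An f-string is not really a string constant; it is an expression that is evaluated at run time.

    f-strings are no less capable than traditional %-formatting and str.format(), they perform better than both, and they are more concise to use, so for Python 3.6 and later f-strings are the recommended way to format strings.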
    `
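A short sketch comparing the three styles side by side (requires Python 3.6+; the sample values are made up for illustration):

```python
name = "ixyzero"
count = 3

# The three styles produce the same text.
percent_style = "%s has %d comments" % (name, count)
format_style = "{} has {} comments".format(name, count)
f_style = f"{name} has {count} comments"

# Replacement fields are expressions evaluated at run time...
doubled = f"{count * 2}"
# ...and accept the same format specs as str.format().
padded = f"{count:03d}"
```

Because the names appear inline in the f-string, there is no separate argument list to keep in sync with the placeholders.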

    The Magic of f-strings
    https://zhuanlan.zhihu.com/p/62774871

    How to Add New Line in Python f-strings
    https://towardsdatascience.com/how-to-add-new-line-in-python-f-strings-7b4ccc605f4a
    `
    Essentially, you have three options;
    The first is to define a new line as a string variable and reference that variable in f-string curly braces.
    The second workaround is to use os.linesep that returns the new line character
    and the final approach is to use chr(10) that corresponds to the Unicode new line character.

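    In short, defining a string variable whose value is the newline character and then referencing it inside the f-string is the most convenient of the three.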
    `
    https://stackoverflow.com/questions/44780357/how-to-use-newline-n-in-f-string-to-format-output-in-python-3-6
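The three options quoted above can be sketched as follows (the items list is made-up sample data). The workaround is needed because, before Python 3.12, a backslash was not allowed inside the {} expression part of an f-string; a backslash in the literal part was always fine:

```python
import os

items = ["alpha", "beta", "gamma"]

# Option 1: keep the newline in a variable and reference it inside the braces.
newline = "\n"
via_variable = f"Items:{newline}{newline.join(items)}"

# Option 2: os.linesep is the platform line separator ("\n" on POSIX, "\r\n" on Windows).
via_linesep = f"Items:{os.linesep}{os.linesep.join(items)}"

# Option 3: chr(10) is the line feed character, the same as "\n".
via_chr = f"Items:{chr(10)}{chr(10).join(items)}"

print(via_variable)
```

Options 1 and 3 always give identical strings; option 2 differs on platforms where os.linesep is not "\n", which is one reason the variable approach is usually the safer choice.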

  2. UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
    https://www.w3docs.com/snippets/python/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-start-byte.html
    `
    byte_string = b'\xa5'
    text = byte_string.decode('utf8', errors='ignore')
    print('done')
    print(text) # prints nothing

    byte_string = b'\xa5'
    text = byte_string.decode('utf8', errors='replace')
    print('done')
    print(text) #� (U+FFFD, the official REPLACEMENT CHARACTER)
    `
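Both errors='ignore' and errors='replace' lose information. If the bytes may simply be in a different encoding, one alternative is to try a list of candidates in order. The helper below is a hypothetical sketch, not from the linked page, and its default candidate list is an assumption you would adapt to your own data:

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode anyway, marking bad bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (replace)"

text, used = decode_with_fallback(b"\xa5")

# errors="backslashreplace" keeps undecodable bytes visible as escape sequences.
escaped = b"\xa5".decode("utf-8", errors="backslashreplace")
```

Since latin-1 maps every byte to a character it never fails, so it only makes sense as the last candidate, and a successful decode does not prove that the guess was right.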

  3. Test if a python string is printable
    https://stackoverflow.com/questions/3636928/test-if-a-python-string-is-printable/50731077#50731077
    `
    >>> hello = 'Hello World!'
    >>> bell = chr(7)
    >>> import string
    >>> all(c in string.printable for c in hello)
    True
    >>> all(c in string.printable for c in bell)
    False

    >>> printset = set(string.printable)
    >>> helloset = set(hello)
    >>> bellset = set(bell)
    >>> helloset
    set(['!', ' ', 'e', 'd', 'H', 'l', 'o', 'r', 'W'])
    >>> helloset.issubset(printset)
    True
    >>> set(bell).issubset(printset)
    False

    import string
    printset = set(string.printable)
    isprintable = set(yourstring).issubset(printset)
    `

    Python String isprintable() Method
    https://www.w3schools.com/python/ref_string_isprintable.asp
    `
    txt = "Hello! Are you #1?"
    x = txt.isprintable()
    print(x)

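    In Python, you can use the string method isprintable() to check whether a string contains non-printable characters. It returns True if the string contains no non-printable characters, and False otherwise.
    Note that isprintable() only checks whether the characters are printable, not whether they are ASCII. To check that a string contains only ASCII characters, use isascii() (available since Python 3.7).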
    `
    https://docs.python.org/3/library/string.html#string.printable
    `
    string.printable
    String of ASCII characters which are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
    `
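The two checks above do not agree on every input, which this small Python 3 sketch illustrates (is_ascii_printable is a hypothetical name wrapping the string.printable test from the Stack Overflow answer):

```python
import string

def is_ascii_printable(s):
    """True if every character of `s` is in string.printable (ASCII only)."""
    return set(s).issubset(string.printable)

samples = ["Hello World!", "line1\nline2", "你好", chr(7)]

# Map each sample to (str.isprintable() result, string.printable result).
results = {s: (s.isprintable(), is_ascii_printable(s)) for s in samples}
```

A newline counts as printable for string.printable (it is part of string.whitespace) but not for str.isprintable(), while non-ASCII text such as "你好" passes isprintable() but fails the string.printable test.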
