Python Knowledge Notes, Part 1

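I've been using Python for a lot of data processing lately and have hit quite a few pitfalls along the way, so it seems worth writing them down. The outline comes first; the content will be filled in gradually: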

1. Timing the execution of a Python script;

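When the processing is heavy and the raw data is large, I often need to measure how long a Python script takes so that I can estimate the time the whole project will need. I hadn't run into this requirement before; when I did over the past few days, I searched online and found the following common approaches: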

Using Python's time module

import time
start_time = time.time()
main()
print("--- %s seconds ---" % (time.time() - start_time))
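For repeated use, the same idea can be wrapped in a small context manager. This is only a minimal sketch, assuming Python 3: `timed` and `work` are hypothetical names, and time.perf_counter() is swapped in for time.time() because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("--- %s: %.6f seconds ---" % (label, elapsed))

def work():
    # Hypothetical stand-in for the real main(): sum the first 100000 integers.
    return sum(range(100000))

timings = {}
with timed("work", timings):
    total = work()
```

Because the timing happens in a finally clause, the elapsed time is still reported if the block raises.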

For small code snippets, use Python's timeit module

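>>> import timeit
# create the timer
>>> t2 = timeit.Timer('x=range(1000)')
# run it (1,000,000 executions by default) and show the total time in seconds
>>> t2.timeit()
10.620039563513103

# create the timer; the second argument is setup code that runs once
>>> t1 = timeit.Timer('sum(x)', 'x = (i for i in range(1000))')
# show the time (x is a generator, so it is exhausted after the first run)
>>> t1.timeit()
0.1881566039438201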
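A variant that may give more representative numbers is sketched below, assuming Python 3. It builds a real list in the setup string so that every run sums the same 1000 elements, and it passes an explicit number instead of relying on the default; the repetition counts are arbitrary choices.

```python
import timeit

# Build a real list in setup so every run sums the same 1000 elements.
timer = timeit.Timer('sum(x)', setup='x = list(range(1000))')

runs = 1000
total_seconds = timer.timeit(number=runs)   # total time for `runs` executions
per_call = total_seconds / runs

# repeat() returns one total per round; the minimum is usually the least noisy.
best = min(timer.repeat(repeat=3, number=runs)) / runs
print("per call: %.3e s, best of 3: %.3e s" % (per_call, best))
```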

Usage in IPython:

In [4]: %timeit y=map(lambda x:x**10, range(32))
10000 loops, best of 3: 51.9 us per loop

Using IPython's %run magic command

In [5]: %run -t py_deal_with_results.py

%run has special flags for timing the execution of your scripts (-t), or for running them under the control of either Python’s pdb debugger (-d) or profiler (-p).

2. Time complexity of common operations in Python;

list operations

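Internally, a list is implemented as an array. Its largest costs come from growing beyond the currently allocated size (every element has to be moved) and from inserting or deleting near the beginning (every element after that position has to be moved). If you need to insert/delete at both ends, consider using collections.deque instead of list.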

Operation	Average Case	Amortized Worst Case
Copy	O(n)	O(n)
Append[1]	O(1)	O(1)
Insert	O(n)	O(n)
Get Item	O(1)	O(1)
Set Item	O(1)	O(1)
Delete Item	O(n)	O(n)
Iteration	O(n)	O(n)
Get Slice	O(k)	O(k)
Del Slice	O(n)	O(n)
Set Slice	O(k+n)	O(k+n)
Extend[1]	O(k)	O(k)
Sort	O(n log n)	O(n log n)
Multiply	O(nk)	O(nk)
x in s	O(n)
min(s), max(s)	O(n)
Get Length	O(1)	O(1)
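The deque advice can be checked with a rough measurement. The sketch below, assuming Python 3, empties a list and a deque from the front; the function names and the size n are arbitrary choices, and the timings will vary by machine.

```python
from collections import deque
import timeit

def drain_list(n):
    items = list(range(n))
    while items:
        items.pop(0)        # O(n): every remaining element shifts left
    return len(items)

def drain_deque(n):
    items = deque(range(n))
    while items:
        items.popleft()     # O(1): nothing has to shift
    return len(items)

n = 20000
list_time = timeit.timeit(lambda: drain_list(n), number=1)
deque_time = timeit.timeit(lambda: drain_deque(n), number=1)
print("list.pop(0): %.4f s, deque.popleft(): %.4f s" % (list_time, deque_time))

# Both ends of a deque are cheap to modify.
d = deque([1, 2, 3])
d.appendleft(0)
d.append(4)
```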

set operations

Operation	Average case	Worst Case
x in s	O(1)	O(n)
Union s|t	O(len(s)+len(t))
Intersection s&t	O(min(len(s), len(t)))	O(len(s) * len(t))
Difference s-t	O(len(s))
s.difference_update(t)	O(len(t))
Symmetric Difference s^t	O(len(s))	O(len(s) * len(t))
s.symmetric_difference_update(t)	O(len(t))	O(len(t) * len(s))

The implementation of set is very similar to that of dict.

dict operations

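The average-case figures here assume that the hash function Python uses makes collisions very unlikely and that the keys being tested are evenly distributed.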

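Python has a fast way to build a dictionary, the {} literal. It does not change the algorithmic complexity, but it can noticeably affect the constant factor, that is, how quickly a typical program finishes.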

Operation	Average Case	Amortized Worst Case
Copy[2]	O(n)	O(n)
Get Item	O(1)	O(n)
Set Item[1]	O(1)	O(n)
Delete Item	O(1)	O(n)
Iteration[2]	O(n)	O(n)

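As the tables above show, with a list you should avoid expensive operations such as insert() and deleting items (del, remove(), pop(i)) wherever possible, because they have to shift elements and cost O(n). When you need to check whether an element already exists (a lookup), prefer dict/set, where the average time complexity is only O(1).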
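As a small illustration of the lookup advice, here is a sketch (assuming Python 3; `unique_in_order` is a hypothetical helper name) that removes duplicates while preserving order, using a set for the "have I seen this already?" test instead of scanning the result list.

```python
def unique_in_order(items):
    """Return items with duplicates removed, keeping first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:     # average O(1), versus O(n) for `item in result`
            seen.add(item)
            result.append(item)
    return result

deduped = unique_in_order(["a", "b", "a", "c", "b"])
print(deduped)
```

The elements have to be hashable for this to work, which is the same restriction that dict keys have.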

A comparison of the time and space efficiency of list and dict as lookup tables in Python: http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table

3. Summary of list and set operations in Python;

In Python there are four ways to add elements to a list (append(), extend(), insert(), and the + operator):

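append() adds a single element to the end of the list. It takes exactly one argument, which can be of any type, and the appended element keeps its original structure inside the list. If the element is itself a list, that list is appended as a whole; note the difference between append() and extend().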

>>> list1=['a','b']
>>> list1.append('c')
>>> list1
['a', 'b', 'c']

extend() adds each element of one list to another list, one by one. It takes exactly one argument (any iterable); extend() effectively concatenates list B onto list A.

>>> list1
['a', 'b', 'c']
>>> list1.extend('d')
>>> list1
['a', 'b', 'c', 'd']
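The difference between append() and extend() is easiest to see when the argument is itself a sequence. A short sketch, with variable names chosen only for illustration:

```python
nested = ['a', 'b']
nested.append(['c', 'd'])    # the whole list becomes one element
print(nested)

flat = ['a', 'b']
flat.extend(['c', 'd'])      # each element is added separately
print(flat)

chars = ['a']
chars.extend('xyz')          # a string is iterable, so it is split into characters
print(chars)
```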

insert() inserts one element into the list and takes two arguments (e.g. insert(1, "g")): the first is the index at which to insert, the second is the element to insert.

>>> list1
['a', 'b', 'c', 'd']
>>> list1.insert(1,'x')
>>> list1
['a', 'x', 'b', 'c', 'd']

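The + operator adds two lists and returns a new list object; note how this differs from the first three. Those three methods (append, extend, insert) add elements by modifying the original object in place and return None. Note: adding two lists has to create a new list object and therefore uses extra memory. Especially when the lists are large, avoid using "+" to add elements and prefer the in-place methods: append() for a single element, extend() for a whole list.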

>>> list1
['a', 'x', 'b', 'c', 'd']
>>> list2=['y','z']
>>> list3=list1+list2
>>> list3
['a', 'x', 'b', 'c', 'd', 'y', 'z']

Performance comparison of append() and "+":
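One rough way to make this comparison is with timeit. The following is a sketch, assuming Python 3; the function names, n, and the repetition count are arbitrary, and the absolute numbers depend on the machine.

```python
import timeit

def build_with_append(n):
    result = []
    for i in range(n):
        result.append(i)         # in place, amortized O(1) per element
    return result

def build_with_plus(n):
    result = []
    for i in range(n):
        result = result + [i]    # copies the whole list on every iteration
    return result

n = 5000
append_time = timeit.timeit(lambda: build_with_append(n), number=5)
plus_time = timeit.timeit(lambda: build_with_plus(n), number=5)
print("append: %.4f s, +: %.4f s" % (append_time, plus_time))
```

Note that `result += [i]` is not the same as `result = result + [i]`: on a list, += extends the existing object in place.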

Set operations (intersection()/difference()):

print(list(set(set1).difference(set(set2)))) # in set1 not in set2
print(list(set(set1).intersection(set(set2)))) # in both set1 and set2
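A worked example with concrete data, assuming Python 3. The IP addresses are made-up sample values, and the results are sorted only to make the output order predictable, since sets are unordered.

```python
# Made-up sample data; despite the names, set1 and set2 are plain lists here.
set1 = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
set2 = ['10.0.0.2', '10.0.0.3', '10.0.0.4']

# difference()/intersection() accept any iterable, so set2 need not be converted.
only_in_first = sorted(set(set1).difference(set2))
in_both = sorted(set(set1).intersection(set2))
in_either = sorted(set(set1) | set(set2))

print(only_in_first)
print(in_both)
print(in_either)
```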

4. Dictionary operations and related uses in Python;

def fileToDict(fileName):
    # key = first column, value = second column (expects exactly two fields per line)
    with open(fileName) as f:
        return {k.strip():v.strip() for k, v in (l.split() for l in f)}
ipAll = fileToDict('ip_with_counts.txt')
print(len(ipAll))

# key = second column, value = first column
dic = {}
for line in open('ip_with_counts.txt'):
    dic[line.split()[1].strip()] = line.split()[0].strip()

ipAll = [line for line in open('ip_with_counts.txt')]
print(len(ipAll))

top = [line.strip() for line in open('top.txt')]
for ip in top:
    if ip.split()[1].strip() in dic:
        print(ip + "\t" + dic.get(ip.split()[1].strip()))
    else:
        print(ip + "\t" + "0")

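Because dict is built on hashing, looking up an element or testing whether it exists in a dict is fast (avoid list for this). The code above reads a file into a dictionary in two ways. The first is the more Pythonic one, but it failed to read in all of the content; note that it keys on the first column while the second keys on the second column, so if first-column values repeat, later lines overwrite earlier ones, which would explain the loss. The second looks more primitive but worked best in practice. Remember: a method that solves the problem is a good method; who cares whether it looks pretty?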
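One plausible explanation for the first method dropping entries is key collisions. The sketch below assumes the file holds "count ip" pairs, which is a guess since the real data is not shown; the sample text is made up, and io.StringIO stands in for the file.

```python
import io

# Hypothetical sample in "count ip" layout; the real file is not shown.
sample = "3 10.0.0.1\n3 10.0.0.2\n7 10.0.0.3\n"

# Keyed on column 0 (the count): equal counts collide and the later line wins.
by_first = {k: v for k, v in (l.split() for l in io.StringIO(sample))}

# Keyed on column 1 (the IP): every line survives.
by_second = {}
for line in io.StringIO(sample):
    count, ip = line.split()
    by_second[ip] = count

print(len(by_first), len(by_second))
```

If the data really has this shape, the comprehension works just as well once the key and value are swapped.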

In Python, the place I use dict most is de-duplication and counting:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
####

from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
####

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz,count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
####

from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
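The Counter snippet above relies on a time_zones list that is not defined in these notes. Here is a self-contained sketch with made-up sample values, assuming Python 3:

```python
from collections import Counter

# Made-up sample data standing in for the undefined time_zones list.
time_zones = ['Asia/Shanghai', 'America/New_York', 'Asia/Shanghai',
              'Europe/London', 'Asia/Shanghai', 'America/New_York']

counts = Counter(time_zones)
top_two = counts.most_common(2)    # (value, count) pairs, highest count first
unique = sorted(counts)            # the keys are the de-duplicated values
missing = counts['Africa/Cairo']   # a missing key counts as 0, no KeyError

print(top_two)
print(unique)
```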

5. Some examples of reading files in Python;

rawData = [line.strip() for line in open('ipList.txt')]

top1000 = [line.strip() for line in open('topRate.txt').readlines()[:1000]]

import sys
for line in sys.stdin:
    line = line.strip()

for line in open('fileName.txt'):
    line = line.strip()
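The readlines()[:1000] pattern above loads the whole file before slicing. For large files, a possible alternative is itertools.islice, which stops reading after the first n lines. This is a sketch assuming Python 3; `head` is a hypothetical helper name, and the temporary file exists only to make the example self-contained.

```python
import itertools
import os
import tempfile

def head(path, n):
    """Return the first n lines of a file, stripped, without reading the rest."""
    with open(path) as f:
        return [line.strip() for line in itertools.islice(f, n)]

# Create a small throwaway file for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(''.join('line %d\n' % i for i in range(5)))
    tmp_path = tmp.name

first_three = head(tmp_path, 3)
os.remove(tmp_path)
print(first_three)
```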

6. String hashing and encoding in Python (hashlib, base64);

import hashlib

# hashlib works on bytes, so encode the string first
a = "a test string".encode("utf-8")
print('md5 = %s' % (hashlib.md5(a).hexdigest(),))
print('sha1 = %s' % (hashlib.sha1(a).hexdigest(),))
print('sha224 = %s' % (hashlib.sha224(a).hexdigest(),))
print('sha256 = %s' % (hashlib.sha256(a).hexdigest(),))
print('sha384 = %s' % (hashlib.sha384(a).hexdigest(),))
print('sha512 = %s' % (hashlib.sha512(a).hexdigest(),))

import base64
test = 'a test str'

# convert to a bytes string
bytesString = test.encode(encoding="utf-8")
print(bytesString)

# base64 encode
encodestr = base64.b64encode(bytesString)
print(encodestr)
print(encodestr.decode())

# decode
decodestr = base64.b64decode(encodestr)
print(decodestr.decode())
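The two modules do different jobs: a hash digest is one-way, while base64 is a reversible encoding, and neither is encryption. A minimal sketch, assuming Python 3; `hex_digest` and `b64_roundtrip` are hypothetical helper names.

```python
import base64
import hashlib

def hex_digest(text, algorithm="sha256"):
    """Hash a str with the named algorithm; the digest cannot be reversed."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()

def b64_roundtrip(text):
    """Encode a str to base64 and decode it back again."""
    encoded = base64.b64encode(text.encode("utf-8"))
    decoded = base64.b64decode(encoded).decode("utf-8")
    return encoded, decoded

digest = hex_digest("a test string")
encoded, decoded = b64_roundtrip("a test str")
print(digest)
print(encoded, decoded)
```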

7. Working with a MySQL database from Python;

import MySQLdb
conn = MySQLdb.connect(host="127.0.0.1",user="root",passwd="pass111",db="test",charset="utf8")

cur = conn.cursor()

cur.execute("insert into users (username,password,email) values (%s,%s,%s)",("python","123456","[email protected]"))
conn.commit()

cur.executemany("insert into users (username,password,email) values (%s,%s,%s)",(("google","111222","[email protected]"),("facebook","222333","[email protected]"),("github","333444","[email protected]"),("docker","444555","[email protected]")))
conn.commit()
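The execute/executemany/commit pattern above is the generic Python DB-API flow, so it can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module with an in-memory database; this is a stand-in, not MySQLdb. The table definition and the example.com addresses are made up, and sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the MySQL server
cur = conn.cursor()
cur.execute("create table users (username text, password text, email text)")

# Parameterized insert: values are passed separately, never formatted into the SQL.
insert_sql = "insert into users (username,password,email) values (?,?,?)"
cur.execute(insert_sql, ("python", "123456", "python@example.com"))
cur.executemany(insert_sql, [
    ("google", "111222", "google@example.com"),
    ("github", "333444", "github@example.com"),
])
conn.commit()

cur.execute("select username from users order by username")
usernames = [row[0] for row in cur.fetchall()]
conn.close()
print(usernames)
```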

Installing MySQLdb on Ubuntu:

http://zhoujianghai.iteye.com/blog/1520666

8. Generating CAPTCHAs with Python;

Reference: Making CAPTCHAs with PIL in Python

Some reference links: