Computing Text Similarity with Python
Since I will need this knowledge later on, I'm preparing for it ahead of time: how do you judge the similarity of the content returned by web pages?
With the keywords ready, start searching: http://search.aol.com/aol/search?q=use+python+to+calculate++text+similarity
I found several Python methods and libraries:
- the difflib library
- Google's diff-match-patch library
- the Levenshtein extension
- and the fancier "TF-IDF method" (I came across it in 《数学之美》 before, but I won't consider it here)
Below I mainly record how to use different Python libraries to compute the similarity between two pieces of text (what we want in the end is a percentage):
Method 1: difflib
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004
>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
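Since the end goal above is a percentage, the ratio only needs to be scaled by 100; a tiny sketch (the wrapper name similarity_percent is my own, not part of difflib):

import difflib

def similarity_percent(text_a, text_b):
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to a percentage
    return difflib.SequenceMatcher(None, text_a, text_b).ratio() * 100

print(similarity_percent('abcde', 'zbcde'))  # 80.0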
Method 2: Levenshtein
import Levenshtein failed with: ImportError: No module named Levenshtein
So I went to python-Levenshtein, downloaded the source, and installed it (there are also pre-built .exe installers at http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-levenshtein). The first install attempt failed with: error: Unable to find vcvarsall.bat. Since I actually do have VS2010 installed, the following steps got it to install normally:
1. Set the environment variable:
SET VS90COMNTOOLS=%VS100COMNTOOLS%
2. Then run the install again:
python setup.py install
After that it compiles and installs normally.
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
Method 3: FuzzyWuzzy
git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process

Simple Ratio
>>> fuzz.ratio("this is a test", "this is a test!")
96

Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100

Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100

Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
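The session above imports fuzzywuzzy's process module but never uses it; following the project README, process can also pick the best matches out of a list of candidates (the choices list below is just the README's example):

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)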
Method 4: diff-match-patch

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"
dmp = diff_match_patch.diff_match_patch()  # create a diff_match_patch object
diffs = dmp.diff_main(textA, textB)        # all 'diff' jobs start with invoking diff_main()
d_value = dmp.diff_levenshtein(diffs)      # Levenshtein distance computed from the diff list
print d_value
maxLenth = max(len(textA), len(textB))
print float(d_value)/float(maxLenth)
similarity = (1 - float(d_value)/float(maxLenth)) * 100
print similarity
The idea in the code above is likewise to compute the Levenshtein distance first and then divide it by the length of the longer of the two strings to get a similarity score (I'm not sure how this differs from just using the Levenshtein extension directly; after all, that one is written in C, so it is probably faster and more direct).
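For comparison, the same "distance divided by the longer length" normalization can be done with the python-Levenshtein extension directly; a small sketch under that assumption (note that Levenshtein.ratio() itself uses a slightly different formula, so the two numbers will not necessarily match):

import Levenshtein

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

d_value = Levenshtein.distance(textA, textB)   # plain edit distance, computed in C
similarity = (1 - float(d_value) / max(len(textA), len(textB))) * 100
print(similarity)                              # same normalization as the diff-match-patch version
print(Levenshtein.ratio(textA, textB) * 100)   # the extension's own built-in similarity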
Reference links:
- use python to calculate text similarity – AOL Search Results
- Good Python modules for fuzzy string comparison? – Stack Overflow
- c# – Text difference algorithm – Stack Overflow
- language agnostic – Algorithm to find articles with similar text – Stack Overflow
- python difflib – AOL Search Results
- python Levenshtein – AOL Search Results
- Computing string similarity with TF-IDF and Python | thesis | graus.nu
- Levenshtein distance – Wikipedia, the free encyclopedia
- TF-IDF – Wikipedia, the free encyclopedia
- https://github.com/seatgeek/fuzzywuzzy
- python – Building an HTML Diff/Patch Algorithm – Stack Overflow
- Useless Factor: Matching, diffing and merging XML
- https://code.google.com/p/google-diff-match-patch/wiki/API
5 comments on "Computing Text Similarity with Python"
Similarity computation with MinHash
https://www.biaodianfu.com/minhash.html
https://en.wikipedia.org/wiki/MinHash
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
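As a rough, self-written sketch of the MinHash idea from the links above (not code from those articles): hash every token with several seeded hash functions, keep the minimum value per seed, and estimate the Jaccard similarity of two texts as the fraction of positions where their signatures agree.

import hashlib

def minhash_signature(tokens, num_hashes=64):
    # for each "hash function" (a seed), record the minimum hash value over all tokens
    signature = []
    for seed in range(num_hashes):
        min_val = min(int(hashlib.md5(("%d:%s" % (seed, t)).encode("utf-8")).hexdigest(), 16)
                      for t in tokens)
        signature.append(min_val)
    return signature

def minhash_similarity(text_a, text_b, num_hashes=64):
    # the fraction of matching signature positions approximates the Jaccard similarity
    sig_a = minhash_signature(set(text_a.split()), num_hashes)
    sig_b = minhash_signature(set(text_b.split()), num_hashes)
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return float(matches) / num_hashes

print(minhash_similarity("the cat in the red hat", "the cat in the blue hat"))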
Article spinning: how can you technically judge whether articles are similar?
https://mp.weixin.qq.com/s/GFpVvMEn4gvcLEMyZOkEFA
Efficient similarity computation: notes on LSH, MinHash, and SimHash
https://blog.csdn.net/u011467621/article/details/49685107
A method for computing text content similarity: SimHash
https://www.biaodianfu.com/simhash.html
An introduction to three important hashes
https://blog.csdn.net/ACdreamers/article/details/45462881
- Consistent hashing
- Locality-sensitive hashing
- GeoHash
Explanation of the consistent hashing algorithm
http://www.cnblogs.com/haippy/archive/2011/12/10/2282943.html
An introduction to Locality-Sensitive Hashing (LSH)
https://blog.csdn.net/icvpr/article/details/12342159
SimHash and duplicate content detection
https://mp.weixin.qq.com/s/63ZFk3E8ESlTcX9g2paeaw
Principles and implementation of the SimHash algorithm Google uses for deduplicating massive amounts of text
https://mp.weixin.qq.com/s/XZZZENuFV8VAMekQzcL9Fw
https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
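Likewise, only as my own rough sketch of the SimHash idea these articles describe (not their implementation): hash each token to a 64-bit value, accumulate +1/-1 per bit across tokens, take the sign of each bit as the fingerprint, and compare fingerprints by Hamming distance; a small distance suggests near-duplicate texts.

import hashlib

def simhash(text, bits=64):
    # accumulate a weight vector: +1 when a token's hash has the bit set, -1 otherwise
    vector = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # the fingerprint keeps only the sign of each accumulated bit
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # number of differing bits; small distance means similar texts
    return bin(a ^ b).count("1")

print(hamming_distance(simhash("the cat in the red hat"),
                       simhash("the cat in the blue hat")))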
A Python library of string similarity detection algorithms
https://github.com/luozhouyang/python-string-similarity
Compare HTML similarity using structural and style metrics
https://github.com/matiskay/html-similarity