Computing Text Similarity with Python
Since I will need this knowledge later on, I'm preparing for it ahead of time: how do you judge the similarity of the content returned by web pages?
With the keywords ready, start searching: http://search.aol.com/aol/search?q=use+python+to+calculate++text+similarity
I found several Python methods and libraries:
- the difflib library
- Google's diff-match-patch library
- the Levenshtein extension
- and the fancier "TF-IDF method" (I came across it in 《数学之美》 before, but I won't consider it here)
Below I mainly record how to use different Python libraries to compute the similarity between two pieces of text (what we want in the end is a percentage):
Method 1: difflib
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004
>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
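Since the end goal above is a percentage, the ratio only needs to be scaled by 100; a tiny sketch (the wrapper name similarity_percent is my own, not part of difflib):

import difflib

def similarity_percent(text_a, text_b):
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to a percentage
    return difflib.SequenceMatcher(None, text_a, text_b).ratio() * 100

print(similarity_percent('abcde', 'zbcde'))  # 80.0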
Method 2: Levenshtein
import Levenshtein failed with: ImportError: No module named Levenshtein
So I went to python-Levenshtein, downloaded the source, and installed it (there are also pre-built .exe installers at http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-levenshtein). The first install attempt failed with: error: Unable to find vcvarsall.bat. Since I actually do have VS2010 installed, the following steps got it to install normally:
1. Set the environment variable:
SET VS90COMNTOOLS=%VS100COMNTOOLS%
2. Then run the install again:
python setup.py install
After that it compiles and installs normally.
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
Method 3: FuzzyWuzzy
git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process

Simple Ratio
>>> fuzz.ratio("this is a test", "this is a test!")
96

Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100

Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100

Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
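The session above imports fuzzywuzzy's process module but never uses it; following the project README, process can also pick the best matches out of a list of candidates (the choices list below is just the README's example):

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)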
Method 4: diff-match-patch

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"
dmp = diff_match_patch.diff_match_patch()  # create a diff_match_patch object
diffs = dmp.diff_main(textA, textB)        # all 'diff' jobs start with invoking diff_main()
d_value = dmp.diff_levenshtein(diffs)      # Levenshtein distance computed from the diff list
print d_value
maxLenth = max(len(textA), len(textB))
print float(d_value)/float(maxLenth)
similarity = (1 - float(d_value)/float(maxLenth)) * 100
print similarity
The idea in the code above is likewise to compute the Levenshtein distance first and then divide it by the length of the longer of the two strings to get a similarity score (I'm not sure how this differs from just using the Levenshtein extension directly; after all, that one is written in C, so it is probably faster and more direct).
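For comparison, the same "distance divided by the longer length" normalization can be done with the python-Levenshtein extension directly; a small sketch under that assumption (note that Levenshtein.ratio() itself uses a slightly different formula, so the two numbers will not necessarily match):

import Levenshtein

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

d_value = Levenshtein.distance(textA, textB)   # plain edit distance, computed in C
similarity = (1 - float(d_value) / max(len(textA), len(textB))) * 100
print(similarity)                              # same normalization as the diff-match-patch version
print(Levenshtein.ratio(textA, textB) * 100)   # the extension's own built-in similarity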
Reference links:
- use python to calculate text similarity – AOL Search Results
- Good Python modules for fuzzy string comparison? – Stack Overflow
- c# – Text difference algorithm – Stack Overflow
- language agnostic – Algorithm to find articles with similar text – Stack Overflow
- python difflib – AOL Search Results
- python Levenshtein – AOL Search Results
- Computing string similarity with TF-IDF and Python | thesis | graus.nu
- Levenshtein distance – Wikipedia, the free encyclopedia
- TF-IDF – Wikipedia, the free encyclopedia
- https://github.com/seatgeek/fuzzywuzzy
- python – Building an HTML Diff/Patch Algorithm – Stack Overflow
- Useless Factor: Matching, diffing and merging XML
- https://code.google.com/p/google-diff-match-patch/wiki/API
5 comments on "Computing Text Similarity with Python"
Similarity computation with MinHash
https://www.biaodianfu.com/minhash.html
https://en.wikipedia.org/wiki/MinHash
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
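As a rough, self-written sketch of the MinHash idea from the links above (not code from those articles): hash every token with several seeded hash functions, keep the minimum value per seed, and estimate the Jaccard similarity of two texts as the fraction of positions where their signatures agree.

import hashlib

def minhash_signature(tokens, num_hashes=64):
    # for each "hash function" (a seed), record the minimum hash value over all tokens
    signature = []
    for seed in range(num_hashes):
        min_val = min(int(hashlib.md5(("%d:%s" % (seed, t)).encode("utf-8")).hexdigest(), 16)
                      for t in tokens)
        signature.append(min_val)
    return signature

def minhash_similarity(text_a, text_b, num_hashes=64):
    # the fraction of matching signature positions approximates the Jaccard similarity
    sig_a = minhash_signature(set(text_a.split()), num_hashes)
    sig_b = minhash_signature(set(text_b.split()), num_hashes)
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return float(matches) / num_hashes

print(minhash_similarity("the cat in the red hat", "the cat in the blue hat"))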
Article spinning: how can you technically judge whether articles are similar?
https://mp.weixin.qq.com/s/GFpVvMEn4gvcLEMyZOkEFA
Efficient similarity computation: notes on LSH, MinHash, and SimHash
https://blog.csdn.net/u011467621/article/details/49685107
A method for computing text content similarity: SimHash
https://www.biaodianfu.com/simhash.html
An introduction to three important hashes
https://blog.csdn.net/ACdreamers/article/details/45462881
- Consistent hashing
- Locality-sensitive hashing
- GeoHash
Explanation of the consistent hashing algorithm
http://www.cnblogs.com/haippy/archive/2011/12/10/2282943.html
An introduction to Locality-Sensitive Hashing (LSH)
https://blog.csdn.net/icvpr/article/details/12342159
SimHash and duplicate content detection
https://mp.weixin.qq.com/s/63ZFk3E8ESlTcX9g2paeaw
Principles and implementation of the SimHash algorithm Google uses for deduplicating massive amounts of text
https://mp.weixin.qq.com/s/XZZZENuFV8VAMekQzcL9Fw
https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
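Likewise, only as my own rough sketch of the SimHash idea these articles describe (not their implementation): hash each token to a 64-bit value, accumulate +1/-1 per bit across tokens, take the sign of each bit as the fingerprint, and compare fingerprints by Hamming distance; a small distance suggests near-duplicate texts.

import hashlib

def simhash(text, bits=64):
    # accumulate a weight vector: +1 when a token's hash has the bit set, -1 otherwise
    vector = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # the fingerprint keeps only the sign of each accumulated bit
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # number of differing bits; small distance means similar texts
    return bin(a ^ b).count("1")

print(hamming_distance(simhash("the cat in the red hat"),
                       simhash("the cat in the blue hat")))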
A Python library of string similarity detection algorithms
https://github.com/luozhouyang/python-string-similarity
Compare HTML similarity using structural and style metrics
https://github.com/matiskay/html-similarity