Calculating Text Similarity with Python

I will need this knowledge later, so I am preparing in advance. The question: how do you judge the similarity of the content returned by a web page?

With the keywords ready, I started searching: http://search.aol.com/aol/search?q=use+python+to+calculate++text+similarity

I found several Python approaches and libraries. Below I record how to use the different libraries to calculate the similarity between two pieces of text (the end result should be a percentage):

Method 1: difflib

>>> import difflib

>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.80000000000000004

>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
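
Since the goal is a percentage, difflib's 0-1 ratio just needs scaling. A minimal sketch of a wrapper (the helper name similarity_percent is my own, not part of difflib):

import difflib

def similarity_percent(a, b):
    # SequenceMatcher.ratio() returns a float in [0, 1]
    return difflib.SequenceMatcher(None, a, b).ratio() * 100

print(similarity_percent('abcde', 'zbcde'))  # 80.0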

Method 2: Levenshtein

import Levenshtein failed with: ImportError: No module named Levenshtein

So I went to python-Levenshtein to download the source and install it (prebuilt .exe installers are also available at http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-levenshtein). The first install attempt failed with: error: Unable to find vcvarsall.bat. Since I actually do have VS2010 installed, the following steps made the install work:

1. Set the environment variable:

SET VS90COMNTOOLS=%VS100COMNTOOLS%

2. Then run the install again:

python setup.py install

After that it compiles and installs normally.

$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
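
To turn distance() into the percentage this post is after, one common approach is to normalize it by the longer string's length. A hedged sketch (the helper edit_similarity_percent is my own name, not part of python-Levenshtein; note it normalizes differently than Levenshtein.ratio(), so the two numbers will not always agree):

import Levenshtein

def edit_similarity_percent(a, b):
    # 100 * (1 - distance / length of the longer string)
    if not a and not b:
        return 100.0
    d = Levenshtein.distance(a, b)
    return (1 - float(d) / max(len(a), len(b))) * 100

print(edit_similarity_percent('Levenshtein', 'Lenvinsten'))   # ~63.6
print(Levenshtein.ratio('Levenshtein', 'Lenvinsten') * 100)   # different normalization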

Method 3: FuzzyWuzzy

git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process

Simple Ratio
>>> fuzz.ratio("this is a test", "this is a test!")
    96

Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
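
The process module imported above never gets used in these examples; its typical job is picking the best match out of a list of candidates. A short sketch following FuzzyWuzzy's README-style usage (the choices list here is illustrative, and the exact score can vary between versions):

from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
# extractOne() returns the best-scoring (choice, score) pair
print(process.extractOne("cowboys", choices))  # ('Dallas Cowboys', 90)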

Method 4: google-diff-match-patch

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

dmp = diff_match_patch.diff_match_patch()  # create a diff_match_patch object
diffs = dmp.diff_main(textA, textB)        # all 'diff' jobs start with diff_main()

# diff_levenshtein() reduces the diff list to an edit distance
d_value = dmp.diff_levenshtein(diffs)
print(d_value)

maxLength = max(len(textA), len(textB))
print(float(d_value) / float(maxLength))

similarity = (1 - float(d_value) / float(maxLength)) * 100
print(similarity)

The idea in the code above is the same: compute the Levenshtein distance first, then divide it by the length of the longer string to get a similarity. (I am not sure what this gains over using the Levenshtein extension directly; that one is written in C, so it is probably faster and more direct.)
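
As a quick sanity check (my own comparison, not from the original post), the diff-match-patch based percentage and Levenshtein.ratio() can be run on the same pair; since the two normalize differently, the numbers need not match:

import diff_match_patch
import Levenshtein

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

dmp = diff_match_patch.diff_match_patch()
d_value = dmp.diff_levenshtein(dmp.diff_main(textA, textB))
print((1 - float(d_value) / max(len(textA), len(textB))) * 100)
print(Levenshtein.ratio(textA, textB) * 100)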

 
