# Several Methods for Computing Sentence Similarity in NLP

• Edit distance computation
• Jaccard coefficient computation
• TF computation
• TF-IDF computation
• Word2Vec computation

## Edit Distance

Edit distance (Levenshtein distance) is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. For example, turning `string` into `setting` takes two edits:

• Step 1: insert the character `e` between `s` and `t`, giving `setring`.
• Step 2: replace `r` with `t`, giving `setting`.

```python
import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

s1 = 'string'
s2 = 'setting'
print(edit_distance(s1, s2))
```

```
2
```

The `distance` package used above can be installed with pip:

```bash
pip3 install distance
```
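Alternatively, the same distance can be computed with a short pure-Python dynamic program. This is a minimal sketch of the standard Levenshtein recurrence, useful as a cross-check if you prefer not to install a third-party package:

```python
def levenshtein(s1, s2):
    # prev[j] holds the distance between the current prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute c1 -> c2
        prev = curr
    return prev[-1]

print(levenshtein('string', 'setting'))  # 2, matching distance.levenshtein
```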

```python
import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

strings = [
    '你在干什么',
    '你在干啥子',
    '你在做什么',
    '你好啊',
    '我喜欢吃香蕉'
]

target = '你在干啥'
results = list(filter(lambda x: edit_distance(x, target) <= 2, strings))
print(results)
```

```
['你在干什么', '你在干啥子']
```

## Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets: the size of their intersection divided by the size of their union.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def jaccard_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # Insert spaces between characters
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a term-frequency matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: element-wise maximum
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard coefficient
    return 1.0 * numerator / denominator

s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(jaccard_similarity(s1, s2))
```

The vocabulary learned by the vectorizer can be inspected:

```python
cv.get_feature_names()
```

```
['么', '什', '你', '呢', '嘛', '在', '干']
```

The corresponding TF matrix:

```
[[0 0 1 1 1 1 1]
 [1 1 1 1 0 1 1]]
```

The resulting Jaccard coefficient:

```
0.5714285714285714
```
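Because no character repeats within either of these sentences, the same value also falls out of a plain set formulation, |A ∩ B| / |A ∪ B|. A minimal sketch without scikit-learn (note that the count-based version above additionally handles repeated characters, which a pure set treatment ignores):

```python
def jaccard_set(s1, s2):
    # Treat each sentence as a set of characters
    a, b = set(s1), set(s2)
    # Intersection size over union size
    return len(a & b) / len(a | b)

print(jaccard_set('你在干嘛呢', '你在干什么呢'))  # 4 shared characters / 7 total ≈ 0.571
```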

## TF Computation

```
cos θ = a · b / (|a| × |b|)
```

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from scipy.linalg import norm

def tf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # Insert spaces between characters
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a term-frequency matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two TF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(tf_similarity(s1, s2))

```
0.7302967433402214
```
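As a sanity check, the same number can be obtained with scikit-learn's built-in `cosine_similarity` helper (a sketch; `tf_cosine` is an illustrative name, not from the original):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tf_cosine(s1, s2):
    # Space-separate the characters so the tokenizer splits on single characters
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    vectors = cv.fit_transform([' '.join(s1), ' '.join(s2)]).toarray()
    # cosine_similarity returns a 2x2 matrix; entry [0, 1] compares s1 with s2
    return cosine_similarity(vectors)[0, 1]

print(tf_cosine('你在干嘛呢', '你在干什么呢'))  # ≈ 0.7303
```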

## TF-IDF Computation

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.linalg import norm

def tfidf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # Insert spaces between characters
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a TF-IDF matrix
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two TF-IDF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(tfidf_similarity(s1, s2))
```

```
[[0.         0.         0.4090901  0.4090901  0.57496187 0.4090901  0.4090901 ]
 [0.49844628 0.49844628 0.35464863 0.35464863 0.         0.35464863 0.35464863]]
```

```
0.5803329846765686
```
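The first row of the matrix above can be reproduced by hand. With TfidfVectorizer's defaults (`smooth_idf=True`, `norm='l2'`), each weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is L2-normalized. A sketch for s1, whose five characters are four shared with s2 plus the unique '嘛':

```python
import numpy as np

n = 2                                        # number of documents
idf_shared = np.log((1 + n) / (1 + 2)) + 1   # character in both sentences -> 1.0
idf_unique = np.log((1 + n) / (1 + 1)) + 1   # character only in s1 -> ~1.405

# TF is 1 for each of s1's five distinct characters
row = np.array([idf_shared] * 4 + [idf_unique])
row /= np.linalg.norm(row)  # L2-normalize the row
print(row)  # shared characters -> ~0.4090901, '嘛' -> ~0.57496187
```

The IDF weighting boosts characters unique to one sentence, which is why the TF-IDF similarity (≈ 0.580) comes out lower than the plain TF value (≈ 0.730).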

## Word2Vec Computation

Word2Vec, as the name suggests, converts each word into a vector. If you are unfamiliar with it, see: https://blog.csdn.net/itplus/article/details/37969519

Here, each sentence is segmented into words, the word vectors are averaged into a sentence vector, and similarity is the cosine between the two sentence vectors. The code below assumes a pretrained 64-dimensional binary model file.

```python
import gensim
import jieba
import numpy as np
from scipy.linalg import norm

model_file = './word2vec/news_12g_baidubaike_20g_novel_90g_embedding_64.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True)

def vector_similarity(s1, s2):
    def sentence_vector(s):
        # Segment the sentence into words and average their vectors
        words = jieba.lcut(s)
        v = np.zeros(64)
        for word in words:
            v += model[word]  # raises KeyError for out-of-vocabulary words
        v /= len(words)
        return v

    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    return np.dot(v1, v2) / (norm(v1) * norm(v2))
```
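`sentence_vector` will raise a `KeyError` for any word missing from the model's vocabulary. A common workaround (an assumption, not part of the original) is to skip unknown words when averaging; the toy embedding dict below stands in for the real pretrained model:

```python
import numpy as np

def sentence_vector_safe(words, model, dim=64):
    # Average only the words the model knows; fall back to zeros otherwise
    known = [model[w] for w in words if w in model]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy dict standing in for the pretrained KeyedVectors model (hypothetical data)
toy = {'你': np.ones(3), '好': np.full(3, 2.0)}
print(sentence_vector_safe(['你', '好', '吗'], toy, dim=3))  # [1.5 1.5 1.5]
```

`in` works here for both a plain dict and gensim's `KeyedVectors`, which supports membership tests.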

```python
s1 = '你在干嘛'
s2 = '你正做什么'
vector_similarity(s1, s2)
```

```
0.6701133967824016
```

```python
strings = [
    '你在干什么',
    '你在干啥子',
    '你在做什么',
    '你好啊',
    '我喜欢吃香蕉'
]

target = '你在干啥'

for string in strings:
    print(string, vector_similarity(string, target))
```

```
你在干什么 0.8785495016487204
```