TF-IDF: Term Frequency - Inverse Document Frequency

The point of using TF-IDF rather than raw term counts to analyze a corpus is mainly to reduce the weight of words that occur frequently but carry little information.

Term frequency matrix

The element $tf_{t,d}$ of the term frequency matrix $TF$ is the number of times term $t$ occurs in document $d$, usually normalized by the count of the most frequent term in that same document:

$$ tf_{t,d} = \frac{f_{t,d}}{\max_{t'} f_{t',d}} $$

Inverse document frequency vector

$$ idf_t = \log\frac{N}{n_t} $$

where $N$ is the total number of documents and $n_t$ is the number of documents containing term $t$.

Term frequency - inverse document frequency matrix

Each element of the $TFIDF$ matrix is the product of the two quantities above:

$$ tfidf_{t,d} = tf_{t,d} \times idf_t $$
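To make the definitions concrete, here is a quick worked example with toy numbers (not tied to the corpus below): suppose there are $N = 4$ documents, a term occurs twice in a document whose most frequent term occurs 4 times, and the term appears in 2 of the 4 documents. Then, using the natural logarithm (matching math.log in the code below):

$$ tf = \frac{2}{4} = 0.5, \qquad idf = \ln\frac{4}{2} \approx 0.693, \qquad tfidf = 0.5 \times 0.693 \approx 0.347 $$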

from stemming import porter
import math
import numpy as np

documents = ['this is the first document, this really is',
             'nothing will stop this from been the second doument, second is not a bad order',
             'I wonder if three documents would be ok as an example, example like this is stupid',
             'ok I think four documents is enough, I I I I think so.']

def TFIDF(documents):
    terms = []
    termsInDocs = dict()
    # build the vocabulary and record, for each stem, every document index
    # it appears in (with repetition, so raw counts can be recovered later);
    # note d.strip(',.') only trims punctuation at the ends of the whole
    # document, so inner tokens like 'document,' keep their punctuation
    for i, d in enumerate(documents):
        content = [porter.stem(term) for term in d.strip(',.').split()]
        for c in content:
            if c not in terms:
                termsInDocs[c] = [i]
                terms.append(c)
            else:
                termsInDocs[c].append(i)
    # raw frequency matrix F: rows are terms, columns are documents
    F = np.zeros((len(terms), len(documents)))
    for i, term in enumerate(terms):
        for doc in termsInDocs[term]:
            F[i, doc] += 1
    # normalize each column by its document's highest term count
    TF = F / F.max(0)

    # idf_t = log(N / n_t): N is the number of documents, n_t the number
    # of documents containing term t
    IDF = np.ones((len(terms), 1))
    for i in range(len(terms)):
        IDF[i] = math.log(float(len(documents)) / len(set(termsInDocs[terms[i]])))

    TF_IDF = TF * IDF
    return F, TF, IDF, TF_IDF, terms

F, TF, IDF, TF_IDF, terms = TFIDF(documents)

print(TF_IDF)

[[ 0.28768207  0.14384104  0.28768207  0.        ]
 [ 0.          0.          0.          0.        ]
 [ 0.34657359  0.34657359  0.          0.        ]
 [ 0.69314718  0.          0.          0.        ]
 [ 0.69314718  0.          0.          0.        ]
 [ 0.69314718  0.          0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          1.38629436  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.69314718  0.          0.        ]
 [ 0.          0.          0.69314718  0.69314718]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          0.69314718  0.13862944]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          0.69314718  0.13862944]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          1.38629436  0.        ]
 [ 0.          0.          0.          0.55451774]
 [ 0.          0.          0.          0.27725887]
 [ 0.          0.          0.          0.27725887]
 [ 0.          0.          0.          0.27725887]]
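With these arrays it is easy to see which stems dominate each document. Here is a minimal follow-up sketch (the top_terms helper is hypothetical, built only from the TF_IDF matrix and terms list returned above):

def top_terms(TF_IDF, terms, doc, k=3):
    # hypothetical helper: the k highest-weighted stems in column `doc`
    order = np.argsort(TF_IDF[:, doc])[::-1][:k]
    return [(terms[i], round(TF_IDF[i, doc], 3)) for i in order]

for doc in range(TF_IDF.shape[1]):
    print(doc, top_terms(TF_IDF, terms, doc))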

Scikit-learn's feature extraction module also provides TF-IDF. Note that scikit-learn's transformer expects a term-count matrix whose rows are documents and columns are terms, so the $F$ matrix obtained above has to be transposed. Also, with its default smoothing, scikit-learn computes the inverse document frequency as

$$ idf_t = \ln\frac{1 + N}{1 + n_t} + 1 $$

and its normalization differs as well: each document's TF-IDF vector is scaled to unit L2 norm.
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()

# sklearn expects documents as rows, so feed it the transposed count matrix
tfidf = transformer.fit_transform(F.transpose())
print(tfidf.todense().transpose())

[[ 0.50697202  0.16317423  0.17343408  0.        ]
 [ 0.41448285  0.13340562  0.14179372  0.10639789]
 [ 0.31310565  0.20155263  0.          0.        ]
 [ 0.39713482  0.          0.          0.        ]
 [ 0.39713482  0.          0.          0.        ]
 [ 0.39713482  0.          0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.51128791  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.25564396  0.          0.        ]
 [ 0.          0.          0.21422559  0.80374331]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.21422559  0.16074866]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.21422559  0.16074866]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.27171799  0.        ]
 [ 0.          0.          0.          0.40777859]
 [ 0.          0.          0.          0.2038893 ]
 [ 0.          0.          0.          0.2038893 ]
 [ 0.          0.          0.          0.2038893 ]]
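To confirm the formula and the normalization, the sketch below (assuming the F matrix from TFIDF() above is still in scope) reproduces scikit-learn's numbers by hand:

counts = F.transpose()                       # rows: documents, columns: terms
N = counts.shape[0]
n_t = (counts > 0).sum(axis=0)               # document frequency of each term
idf = np.log((1.0 + N) / (1.0 + n_t)) + 1.0  # smoothed idf, as in sklearn
weighted = counts * idf                      # raw counts times idf
norms = np.sqrt((weighted ** 2).sum(axis=1, keepdims=True))
manual = weighted / norms                    # L2-normalize each document row
print(np.allclose(manual, tfidf.todense()))  # prints True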