Calculating keyword importance (a TF-IDF implementation). Calculate cosine-similarity between documents using TF-IDF

Larix/TF-IDF_Tutorial

TF-IDF(Term Frequency - Inverse Document Frequency)

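Evaluate how important each word is within a document, and use those scores to extract keywords.
Calculate cosine-similarity between documents using TF-IDF. This project is developed in Python 3 and is a worked example of combining tf-idf with cosine similarity on news data.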

TF-IDF Introduction:

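TF-IDF is a statistical method for evaluating how important a term is to one document within a document collection or corpus.
A term's importance increases in proportion to the number of times it appears in the document (TF), and decreases as the term appears in more documents across the corpus, which is what the inverse document frequency (IDF) captures.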
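
The weighting described above can be sketched in plain Python. This is a minimal illustration, not this project's code: the function names (`compute_tf`, `compute_idf`, `compute_tfidf`) and the toy documents are hypothetical, and it assumes the common unsmoothed variant tf = count / document length and idf = log(N / df), which may differ from the formulas shown in the original images.

```python
import math
from collections import Counter

def compute_tf(tokens):
    """Term frequency: occurrences of each term divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

def compute_idf(tokenized_docs):
    """Inverse document frequency: log(N / df), assuming no smoothing."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def compute_tfidf(tokenized_docs):
    """Return one {term: weight} dict per document."""
    idf = compute_idf(tokenized_docs)
    return [
        {term: tf * idf[term] for term, tf in compute_tf(tokens).items()}
        for tokens in tokenized_docs
    ]

# Hypothetical, already-tokenized documents (illustration only).
docs = [
    ["stock", "market", "rises", "stock"],
    ["market", "falls"],
    ["weather", "report", "rain"],
]
weights = compute_tfidf(docs)
```

In the first toy document, "stock" appears twice and in no other document, so it receives a higher weight than "market", which also appears in the second document.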


Cosine Similarity Introduction:

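Cosine similarity is a similarity measure commonly used in information retrieval. It can be used to compute the similarity between documents,
between terms, and between a query string and a document.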
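
As a rough sketch of the measure described above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The example below assumes each document is represented as a sparse `{term: weight}` dict, for instance of TF-IDF weights; the function name `cosine_similarity` and the sample vectors are hypothetical and not taken from this project.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    shared_terms = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical weight vectors (illustration only).
doc_a = {"stock": 0.5, "market": 0.1}
doc_b = {"market": 0.2, "falls": 0.5}
doc_c = {"weather": 0.4}

sim_ab = cosine_similarity(doc_a, doc_b)  # share one term: between 0 and 1
sim_ac = cosine_similarity(doc_a, doc_c)  # share no terms: 0.0
```

The same function works for a query string against a document, provided the query is tokenized and weighted in the same way as the documents.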


IDF supplement:

[two images, not reproduced here]

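Additional notes:

The news dataset contains only about 200 articles, and word segmentation is done with jieba. Many terms appear in only a single news article; consider filtering these terms out, since they may be segmentation errors.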
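
One way to apply the filtering suggested above is to drop every term whose document frequency falls below a threshold before computing TF-IDF. This is a minimal sketch rather than the project's implementation: the function name `filter_rare_terms`, the `min_df` parameter, and the toy token lists are hypothetical. In practice the token lists would come from a segmenter such as `jieba.lcut(text)`, which is left out here so the example runs without extra dependencies.

```python
from collections import Counter

def filter_rare_terms(tokenized_docs, min_df=2):
    """Drop terms that appear in fewer than min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return [
        [t for t in tokens if doc_freq[t] >= min_df]
        for tokens in tokenized_docs
    ]

# Hypothetical, already-segmented documents (illustration only).
docs = [
    ["stock", "market", "rises"],
    ["market", "falls", "stock"],
    ["weather", "rain"],
]
filtered = filter_rare_terms(docs, min_df=2)
```

With a corpus of roughly 200 articles, `min_df=2` removes only the terms confined to a single article. Note that this also discards genuinely rare but valid words, so the threshold is a trade-off.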
