tychen5/NLP_FakeNewsDetection


Fake News Detection

Information Retrieval and Text Mining project

  • News Insight
  • News Classification
  • Text Regression

Presentation Video

https://www.youtube.com/watch?v=9PFZ0_C2Sxo&feature=share

Topic: Fake News Analysis and Insight

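  1. Collect fake news datasets from a variety of sources
  2. What information can be extracted from fake or real news?
    • What methods should be used for the analysis or comparison?
  3. What features distinguish fake news from real news?
    • How can features or keywords be extracted?
    • Sentiment words / sentiment analysis that may be useful
    • Use part of speech to identify the words that fake and real news commonly use, e.g. word clouds
    • Semantic analysis
  4. Fake news classification and rating
    • Characteristics; alerting the user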

Problem Description

https://docs.google.com/document/d/10-7H9bPJYQRMdOUdugDlWeifdpvoN9twXZGT-m1fhdc/edit?usp=sharing

Implement Report

https://docs.google.com/document/d/1I9SWihDkgXx1NCYCsY-0e_XDicAK346PqQu5wMaesd0/edit

Presentation slides

https://docs.google.com/presentation/d/1lRDR40UfcLpdRUSnfMbi6eOsR_jjxFdOcKwa8HvxHh8/edit#slide=id.p

Outline:

  • Motivation
  • What we do
  • solution insight
  • solution regression / classification

Motivation & Goal

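Motivation: Why do this? Because fake news is rampant, influences its audience, and sways public opinion during elections.

  1. The degree of fakeness of a news item
  2. What distinguishes fake news from real news
  3. The types of (fake) news
  • Compare the performance of different methods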

Solution

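  1. TF-IDF, for tagging
  2. POS (part-of-speech) tagging, e.g. openNLP, NLTK => a. which words appear most often for each part of speech in each dataset; b. a dictionary built from the overall dataset: which words to use for each part of speech
  3. Sentiment analysis, e.g. TextBlob
  4. Feature selection: keywords, class discriminative power
  5. How helpful the author and source are; the differences between categories
  6. Regression (ML and DL methods) / classification (IR methods)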
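
As a rough illustration of item 1, the sketch below computes TF-IDF weights in plain Python. It assumes the unsmoothed form tf × log(N / df); the project itself may use a library with a smoothed variant, so the exact numbers could differ. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: weight = (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized and lowercased.
toy_docs = [["election", "fraud", "shock"], ["election", "results"]]
toy_weights = tfidf(toy_docs)
```

A term that occurs in every document (here `election`) gets weight 0 under this formula, which is why TF-IDF tends to surface terms that are specific to a few documents.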

GOAL

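  1. With the dictionary size held fixed: what score does a dictionary built without regard to part of speech get, and what score does a part-of-speech-specific dictionary get? E.g. the score of a noun dictionary, or the accuracy of a verb dictionary
  2. The insights from the earlier analysis should relate to the dictionary produced at the end
  3. Degree of fakeness and classification: each of the two tasks uses the other's dataset for testing
  4. Split the timeline into three or five periods: before the election, within one week either side of the election, and after the election; track changes in topic, wording, and sentiment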
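
The comparison in item 1 needs dictionaries of equal size built with and without a part-of-speech filter. The sketch below shows one way to build them, assuming the documents are already tagged as `(word, tag)` pairs with Penn Treebank tags (the tag set NLTK's default English tagger emits). The function name `build_dictionary` and the sample data are hypothetical.

```python
from collections import Counter


def build_dictionary(tagged_docs, size, pos_prefix=None):
    """Return the `size` most frequent words, optionally limited to one POS.

    tagged_docs: list of documents, each a list of (word, tag) pairs.
    pos_prefix:  e.g. "NN" for nouns or "VB" for verbs; None keeps all words.
    """
    counts = Counter(
        word
        for doc in tagged_docs
        for word, tag in doc
        if pos_prefix is None or tag.startswith(pos_prefix)
    )
    return [word for word, _ in counts.most_common(size)]


# Hypothetical pre-tagged documents.
tagged = [
    [("president", "NN"), ("lied", "VBD"), ("president", "NN")],
    [("senate", "NN"), ("voted", "VBD"), ("lied", "VBD")],
]
noun_dict = build_dictionary(tagged, size=2, pos_prefix="NN")
verb_dict = build_dictionary(tagged, size=2, pos_prefix="VB")
overall_dict = build_dictionary(tagged, size=2)
```

Each dictionary can then feed the same classifier, so that any difference in score comes from the choice of part of speech and not from the dictionary size.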

Dataset

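  1. Ten categories in total (eight from the second dataset, two from the first): the third dataset's True and mostly True labels are merged into the first dataset's true class; the third dataset's barely-true, false, and pants-fire labels are merged into the first dataset's False class
  2. Remove punctuation and digits and convert uppercase to lowercase; keep only content (the longest attribute) and label (degree of fakeness or category)

Merged text/label dataset built from the three datasets: https://drive.google.com/drive/u/2/folders/19CER5SrMU29n3UPAkQc2hPu3HA8vyqbc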
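
The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's code: the label spellings in `LABEL_MAP`, the record layout with `content` and `label` keys, and the function names are all assumptions.

```python
import re

# Assumed spellings of the third dataset's labels, mapped onto the first
# dataset's binary classes as described above. Labels not listed here
# (for example the second dataset's categories) pass through unchanged.
LABEL_MAP = {
    "true": "true",
    "mostly-true": "true",
    "barely-true": "false",
    "false": "false",
    "pants-fire": "false",
}


def clean_text(text):
    """Lowercase, drop punctuation and digits, and collapse whitespace.

    Assumes English text: any character outside a-z is replaced by a space.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def prepare(record):
    """Keep only the cleaned content and the merged label of one record."""
    label = record["label"].strip().lower()
    return {
        "content": clean_text(record["content"]),
        "label": LABEL_MAP.get(label, label),
    }


# Hypothetical record.
sample = prepare({"content": "Breaking: 3 Senators RESIGN!!", "label": "Pants-Fire"})
```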

Method

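Currently only the news content is considered.

  1. POS for each of the ten categories and for the overall dataset https://drive.google.com/drive/folders/1C-6U9TcyUwgxzdArvAXPsnjx9yrPhxsh?usp=sharing
  2. Bar charts of sentiment analysis for the ten categories. Literature review: papers on packages for part of speech, sentiment, feature selection, classification, regression, and so on
  3. Word clouds and frequency charts for the ten categories => build an overall one and filter out the terms common to every category
  4. 3 kinds of feature selection; TF-IDF for building the overall dictionary

The bs category carries little meaning.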
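
Item 3 can be sketched as below: count term frequencies per category, then drop the terms that occur in every category before picking the top terms for a word cloud or frequency chart. Treating "common" as "present in every category" is an assumption, as are the function name `distinctive_terms` and the toy data.

```python
from collections import Counter


def distinctive_terms(docs_by_label, top_k=3):
    """Return the top_k most frequent terms per label, minus shared terms.

    docs_by_label: {label: [tokenized documents]}, with at least one label.
    A term counts as shared when it occurs in every label (an assumption).
    """
    counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in docs_by_label.items()
    }

    # Terms that occur in every category carry no category-specific signal.
    shared = set.intersection(*(set(c) for c in counts.values()))

    return {
        label: [t for t, _ in c.most_common() if t not in shared][:top_k]
        for label, c in counts.items()
    }


# Hypothetical tokenized documents for two categories.
toy = {
    "fake": [["the", "hoax", "hoax", "shock"], ["the", "hoax"]],
    "real": [["the", "report", "said"], ["the", "report"]],
}
top_terms = distinctive_terms(toy)
```

A softer filter, such as dropping terms found in more than some fraction of the categories, may work better with ten categories, since few terms will appear in all of them.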

Testing on Kaggle: https://www.kaggle.com/c/fake-news/submit

(Used to test how well the classifier and the regressor perform)

Possible Dataset:

Motivation Reference

Possible Goal:

REF

Datasets for sentiment analysis are available online.[1][2]

The following is a list of a few open source sentiment analysis tools.

  • GATE plugins
  • SEAS(gsi-upm/SEAS)
  • SAGA(gsi-upm/SAGA)
  • Stanford Sentiment Analysis Module (Deeply Moving: Deep Learning for Sentiment Analysis)
  • LingPipe (Sentiment Analysis Tutorial)
  • TextBlob (Tutorial: Quickstart)[3]
  • Opinion Finder (OpinionFinder | MPQA)
  • Clips pattern.en (pattern.en | CLiPS)

Open Source Dictionary or resources:

  • SentiWordNet
  • Bing Liu Datasets (Opinion Mining, Sentiment Analysis, Opinion Extraction)
  • General Inquirer Dataset (General Inquirer Categories)
  • MPQA Opinion Corpus (MPQA Resources)
  • WordNet-Affect (WordNet Domains)
  • SenticNet
  • Emoji Sentiment Ranking

Literature review: how others have approached the problem

Direction: text classification or degree regression

Text classification
