alemol/FELTS

Open more actions menu
 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FELTS

FELTS is a Fast Extractor for Large Term Sets. It was successfully tested with over 9.5 million distinct multiword terms composed of over 4.5 million distinct words (Wikipedia article titles for French + English + Spanish). In this particular task:

  • it extracts, from any text, all occurrences of French, English, or Spanish Wikipedia entries
  • it requires only 500 MB of RAM
  • it can process ten million words in under an hour

USE:

  • create a dictionary file with a sorted list of multiword terms (one term per line, one space between words).
  • set the DICT variable in the Makefile to your dictionary file.
  • make the hash function:
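For instance, such a dictionary file could be produced from a raw term list as sketched below. This is only an illustration, not part of FELTS: the input file raw_terms.txt and the normalization steps are assumptions.

```shell
# Hypothetical: turn a raw term list into a FELTS dictionary file.
# Lower-case everything, collapse runs of whitespace to single spaces,
# then sort and deduplicate (LC_ALL=C gives a stable byte-wise order).
printf 'New York\nsan  Francisco\nnew york\n' > raw_terms.txt
tr '[:upper:]' '[:lower:]' < raw_terms.txt \
  | sed 's/[[:space:]][[:space:]]*/ /g' \
  | LC_ALL=C sort -u > sample.dic
```

After this step, sample.dic contains one normalized term per line, sorted, with duplicates removed.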

make mph

  • start a server, e.g.:

bin/felts_server -p 11111 -d sample.dic -f sample.mph

  • extract terms, e.g.:

cat text_in.txt | sed 's/[[:space:]][[:space:]]*/ /g' | sed 's/^[[:space:]]//' | bin/felts_client localhost 11111 | sed '/^$/d' > terms_out.txt

WARNING: input text should be UTF-8, lower case, without punctuation, and words must be separated by a single space (which justifies the sequence of sed filters applied before sending the text to felts_client).
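As a concrete illustration, a normalization pipeline satisfying these constraints might look like the sketch below. Note the caveat that tr's character classes only cover ASCII in most implementations, so accented UTF-8 letters pass through unchanged and ASCII punctuation is what gets stripped; the sample sentence is an assumption for demonstration.

```shell
# Hypothetical pre-processing matching the FELTS input requirements:
# lower-case, strip (ASCII) punctuation, squeeze whitespace, trim edges.
printf 'Hello,  WORLD!  This is New York.\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:]' \
  | sed 's/[[:space:]][[:space:]]*/ /g; s/^[[:space:]]*//; s/[[:space:]]*$//'
```

The result, "hello world this is new york", is in the form felts_client expects: lower case, no punctuation, single spaces between words.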
