alemol/FELTS

Open more actions menu
 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FELTS

FELTS is a Fast Extractor for Large Term Sets. It was successfully tested with over 9.5 million distinct multiword terms composed of over 4.5 million distinct words (Wikipedia article titles for French + English + Spanish). In this particular task:

  • it extracts, from any text, all occurrences of French, English, or Spanish Wikipedia entries
  • it requires only 500 MB of RAM
  • it can process ten million words in under an hour

USE:

  • create a dictionary file with a sorted list of multiword terms (one term per line, one space between words).
  • set the DICT variable in the Makefile to your dictionary file.
  • make the hash function:
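For instance, such a dictionary file could be produced from a raw term list as sketched below. This is only an illustration, not part of FELTS: the input file raw_terms.txt and the normalization steps are assumptions.

```shell
# Hypothetical: turn a raw term list into a FELTS dictionary file.
# Lower-case everything, collapse runs of whitespace to single spaces,
# then sort and deduplicate (LC_ALL=C gives a stable byte-wise order).
printf 'New York\nsan  Francisco\nnew york\n' > raw_terms.txt
tr '[:upper:]' '[:lower:]' < raw_terms.txt \
  | sed 's/[[:space:]][[:space:]]*/ /g' \
  | LC_ALL=C sort -u > sample.dic
```

After this step, sample.dic contains one normalized term per line, sorted, with duplicates removed.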

make mph

  • start a server, e.g.:

bin/felts_server -p 11111 -d sample.dic -f sample.mph

  • extract terms, e.g.:

cat text_in.txt | sed 's/[[:space:]][[:space:]]*/ /g' | sed 's/^[[:space:]]//' | bin/felts_client localhost 11111 | sed '/^$/d' > terms_out.txt

WARNING: input text should be UTF-8, lower case, without punctuation, and words must be separated by a single space (which justifies the sequence of sed filters applied before sending the text to felts_client).
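As a concrete illustration, a normalization pipeline satisfying these constraints might look like the sketch below. Note the caveat that tr's character classes only cover ASCII in most implementations, so accented UTF-8 letters pass through unchanged and ASCII punctuation is what gets stripped; the sample sentence is an assumption for demonstration.

```shell
# Hypothetical pre-processing matching the FELTS input requirements:
# lower-case, strip (ASCII) punctuation, squeeze whitespace, trim edges.
printf 'Hello,  WORLD!  This is New York.\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:]' \
  | sed 's/[[:space:]][[:space:]]*/ /g; s/^[[:space:]]*//; s/[[:space:]]*$//'
```

The result, "hello world this is new york", is in the form felts_client expects: lower case, no punctuation, single spaces between words.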
