willf/segment: A tool to segment text based on frequencies and the Viterbi algorithm. "#TheBoyWhoLived" => ['#', 'The', 'Boy', 'Who', 'Lived']

This module segments text according to word frequency, using the Viterbi algorithm. The approach is probably due to Peter Norvig somehow.

Three sources of frequency information are provided.

One is from the Google NGram corpus, a general web corpus.

The second is from the Rovereto Twitter N-Gram Corpus, which is better for some Twitter data.

The third is from a webcrawl dataset of anchor text provided by Vinay Goel of the Internet Archive.

> from segment.segmenter import Analyzer
> e = Analyzer('en')
> e.segment("AbeLincoln")
['Abe', 'Lincoln']
> e.segment("BieberHeartsBeliebers")
['Bi', 'e', 'ber', 'Hearts', 'Be', 'lieber', 's']
> t = Analyzer('twitter')
> t.segment("BieberHeartsBeliebers")
['Bieber', 'Hearts', 'Beliebers']
> t = Analyzer('anchor')
> t.segment("wordpress&sex")
['wordpress', '&', 'sex']
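
For readers curious how frequency-based segmentation of this kind can work, here is a minimal sketch. It is not this package's code: the function `segment`, the `toy_counts` table, the `max_word_len` parameter, and the unknown-word penalty are all assumptions made for illustration. It scores each candidate split with a unigram model and uses Viterbi-style dynamic programming to keep the best-scoring split ending at every position.

```python
import math


def segment(text, counts, max_word_len=20):
    """Return the most probable split of text under a unigram model (sketch)."""
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed penalty: unknown words are unlikely, longer ones even more so.
        return math.log(1.0 / (total * 10 ** len(word)))

    n = len(text)
    # best[i] = (score of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            # Look words up in lower case, but keep the original casing in the output.
            score = best[start][0] + log_prob(text[start:end].lower())
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical counts; a real table would come from one of the corpora above.
toy_counts = {"the": 500, "boy": 50, "who": 120, "lived": 20, "#": 30}
print(segment("#TheBoyWhoLived", toy_counts))
# ['#', 'The', 'Boy', 'Who', 'Lived']
```

The choice of frequency table drives the result: a word that is missing from the table (such as "Bieber" in a general web corpus) gets the unknown-word penalty and tends to be broken into shorter known pieces, which matches the difference between the 'en' and 'twitter' outputs above.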

