Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

BuildDream18/python-word-frequency

Open more actions menu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Problem Statement:

Detailed below are the key points to the problem that needs solving:

  • There is an input file on a Unix system, and it is plain text.

  • Each line in the file contains one token (a string of arbitrary chars).

  • The input file is extremely large in terms of number of lines.

  • The program has as much RAM as needed.

  • The program may make one pass only over the data, reading each line just once.

  • At the end of the first pass, the program prints a report for each unique token that it found. It should print out the token value itself and the number of times the token was found in the data file.

  • Sort order of the output does not matter, but each unique token may appear only once in the output.

Questions to Address:

  • What algorithmic approach would you take and what data structures would be used?

  • Why is it the best (fastest) solution?

Notes:

  • This should work on a single machine.

  • There is no rhyme or reason to the distribution that governs the characteristics of the string data that you may encounter

  • Scalability is important and must be taken into account as part of the solution.

3 Approaches

  • Standard approach; Reads a large file line by line (for memory efficiency and scalability). Then checks each word against a dictionary and increments the word count if a word is present or adds a new word.
  • Efficient approach; As above but uses pre initialized python Collection defaultdict. This removes the use of exceptions for addition / increment and is faster.
  • Parallel approach; As the problem lends itself to the mapreduce approach I added a parallel method using python multiprocessing. The large text file is pre split to allow for multi processing. However, this is not an efficient implementation.

Code

  • Tests are made for each approach and are found in "test"
  • Tests also have performance benchmarking for timing comparisons.
  • I try to write in a functional style with a focus on code readability and clarity.
  • Can be run from /examples (see init.py)

Benchmarks

  • Stanard approach: 0.4 seconds
  • Efficient approach: 0.3 seconds
  • Parallel approach: 0.8 seconds

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

Morty Proxy This is a proxified and sanitized view of the page, visit original site.