Add parallel processing to OCR text extraction of full documents #124

Open
ntodd wants to merge 2 commits into documentcloud:master from ntodd:master

Conversation

@ntodd commented Dec 18, 2014

Use the GNU Parallel tool to OCR multiple pages in parallel. If Parallel is installed, a full-document extraction generates an image for each page and then spawns a tesseract process for each available core. If Parallel is not installed, or only a subset of pages is requested, the old behavior is used. This speeds up OCR processing significantly on multi-core machines.

With a bit more work, this could be leveraged by the other OCR code paths.
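
For readers who want to see the shape of the idea, here is a minimal sketch of the dispatch described above. It is not the code from this pull request: the helper names `parallel_available?` and `ocr_commands`, the page image file names, and the `language` default are all hypothetical. It assumes GNU Parallel's `{}` and `{.}` replacement strings, its `:::` argument separator, and its default of one job per CPU core, plus tesseract's `tesseract IMAGE OUTPUTBASE -l LANG` form.

```ruby
require 'shellwords'

# Hypothetical helper: true if an executable named `parallel` is on PATH.
def parallel_available?(path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    exe = File.join(dir, 'parallel')
    File.file?(exe) && File.executable?(exe)
  end
end

# Hypothetical helper: build the shell command(s) that would OCR the
# given page images. Nothing is executed here.
def ocr_commands(page_images, use_parallel:, language: 'eng')
  return [] if page_images.empty?

  lang = Shellwords.escape(language)
  if use_parallel
    files = page_images.map { |f| Shellwords.escape(f) }.join(' ')
    # One invocation; GNU Parallel runs one job per core by default.
    # {} is the input file, {.} is the same name without its extension.
    ["parallel tesseract {} {.} -l #{lang} ::: #{files}"]
  else
    # Fallback: one tesseract command per page, to be run one at a time.
    page_images.map do |f|
      base = f.sub(/\.[^.\/]+\z/, '')
      "tesseract #{Shellwords.escape(f)} #{Shellwords.escape(base)} -l #{lang}"
    end
  end
end

pages = ['doc_1.tif', 'doc_2.tif'] # assumed per-page image names
ocr_commands(pages, use_parallel: parallel_available?).each { |cmd| puts cmd }
```

Each returned string could be handed to `system` once the page images exist. Building the commands separately from running them keeps the Parallel and non-Parallel paths easy to compare and test.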

Nate Todd added 2 commits December 18, 2014 17:20
Use GNU Parallel if installed to parallelize tesseract OCR on full document text extraction. If Parallel is not installed, use previous behavior.

@deuxshaish

I like this a lot. Will test and observe; thanks for the commit.

@pickhardt

This is a great idea.
