Add option to generate hOCR output instead of raw text when performing OCR via tesseract #81

Closed
jhosteny wants to merge 2 commits into documentcloud:master from jhosteny:master

Conversation

@jhosteny

This patch forces tesseract to generate hOCR output when the --hocr option is passed. It also suppresses text cleaning. This addresses issue #80.
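
As a rough illustration of the behavior described above (not the patch itself, which lives in docsplit's Ruby code), the sketch below builds a tesseract command line and skips cleaning for hOCR output. The helper names `build_tesseract_command` and `postprocess`, the default language, and the whitespace-collapsing cleaner are all hypothetical; the only assumption about tesseract is that trailing arguments after the output base are read as config file names, with `hocr` being one of the shipped configs.

```python
def build_tesseract_command(image_path, output_base, language="eng", hocr=False):
    """Return the argv list for a tesseract run (hypothetical helper)."""
    command = ["tesseract", image_path, output_base, "-l", language]
    if hocr:
        # Assumption: tesseract treats trailing arguments as config file
        # names, and a config called "hocr" ships with the installation.
        command.append("hocr")
    return command


def postprocess(text, hocr=False, clean=lambda t: " ".join(t.split())):
    """Skip text cleaning for hOCR output so the markup is left untouched.

    The default cleaner (collapse whitespace) is only a stand-in for
    whatever cleaning the real pipeline performs.
    """
    return text if hocr else clean(text)
```

The command list could be handed to `subprocess.run` as-is; keeping it as a list avoids shell quoting issues with file names.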

@knowtheory
Member

Hey @jhosteny, have you tested this patch? As far as I'm aware, you have to pass in a config file, which this pull request doesn't supply.

@jhosteny
Author

@knowtheory, sorry for the late reply. Yes, I am using my fork with this change in a project, and no additional configuration is necessary. I'm running the latest tesseract on Ubuntu Raring. Here are the details:

tesseract 3.02.01
 leptonica-1.69
  libgif 4.1.6 : libjpeg 8b : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.7

I may have missed something, but it didn't look like there was a test that runs tesseract. If you'd rather wait until one is there, I can work on that as part of a new patch.

@jsfenfen

@knowtheory: This works for me running "Tesseract Open Source OCR Engine v3.02.02" on Ubuntu 12.04 with leptonica 1.69. I think the argument (i.e. "hocr") is actually the name of the config file to use, and I'm guessing it only works if a config file of that name is in the right place (maybe /somewhere/tessdata/configs/). The documentation isn't especially clear. The hocr file used is defined here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr and the whole set of default configs is available here: http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata/configs

For the sake of argument, would it make sense for the patch to just give the option of specifying a path to a config file? That way a more complex config file could be used, and it wouldn't be explicitly dependent on the tesseract library shipping with the default configs.
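
A minimal sketch of that idea, assuming the option takes either a config name or a path to a config file. The function names and the `tessdata_dir` parameter are hypothetical, and the `configs/` lookup simply mirrors the guess above about where named configs live; it is not taken from tesseract's documentation.

```python
import os


def config_available(config, tessdata_dir):
    """Return True if `config` is an existing file, or a named config
    found under <tessdata_dir>/configs/ (assumed layout)."""
    if os.path.isfile(config):
        return True
    return os.path.isfile(os.path.join(tessdata_dir, "configs", config))


def build_command(image_path, output_base, config=None):
    """Build a tesseract argv list, appending the config name or path
    only when one is supplied (hypothetical helper)."""
    command = ["tesseract", image_path, output_base]
    if config:
        command.append(config)
    return command
```

Checking availability up front would let the caller report a clear error instead of silently falling back to plain-text output when the default configs are not installed.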

@jhosteny
Author

Closing in favor of #92.

@jhosteny closed this on Aug 28, 2013