Fix escaping when extracting text using OCR by floehopper · Pull Request #149 · documentcloud/docsplit

floehopper · Jul 2, 2018

Previously, the output filename passed to the tesseract command was not shell-escaped. As a result, the filename was truncated and did not match the filename expected by Docsplit::TextExtractor#clean_text, which raised the following exception:

Errno::ENOENT: No such file or directory @ rb_sysopen - test/output/PDF file with spaces 'single' and "double quotes".txt
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:90:in `initialize'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:90:in `open'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:90:in `clean_text'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:80:in `extract_from_ocr'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:36:in `block in extract'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:32:in `each'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit/text_extractor.rb:32:in `extract'
/Users/jamesmead/Code/freerange/docsplit/lib/docsplit.rb:52:in `extract_text'
test/unit/test_extract_text.rb:58:in `test_name_escaping_while_extracting_text_using_ocr'
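
To illustrate the failure mode described above, here is a minimal sketch that uses Ruby's standard `Shellwords` module to compare an unescaped and an escaped command string. The helper names (`unescaped_command`, `escaped_command`) and the `page.tif` input are hypothetical and are not taken from Docsplit's source. `Shellwords.split` only approximates how a POSIX shell would tokenize the string, so nothing is actually executed.

```ruby
require 'shellwords'

# Hypothetical helper: interpolates the paths as-is, the way an
# unescaped command string would be built.
def unescaped_command(image, base)
  "tesseract #{image} #{base}"
end

# Hypothetical helper: shell-escapes each path before interpolation.
def escaped_command(image, base)
  "tesseract #{Shellwords.escape(image)} #{Shellwords.escape(base)}"
end

# Output base name modelled on the path in the stack trace (without ".txt").
base  = %q(test/output/PDF file with spaces 'single' and "double quotes")
image = 'page.tif' # assumed input name, for illustration only

# Tokenize both strings roughly the way a shell would.
unescaped_args = Shellwords.split(unescaped_command(image, base))
escaped_args   = Shellwords.split(escaped_command(image, base))

# Without escaping, the output argument stops at the first space.
puts unescaped_args[2] # prints "test/output/PDF"

# With escaping, the full name survives as a single argument.
puts escaped_args[2] == base # prints "true"
```

In the unescaped case the third argument is only the first word of the filename, which is consistent with a later attempt to open the full `.txt` path failing with `Errno::ENOENT`.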


floehopper wants to merge 1 commit into documentcloud:master from freerange:fix-escaping-when-extracting-text-using-ocr
