[PDF-1022] Improve text selection ordering for OCR's documents. - ICEsoft JIRA Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.1.2
Fix Version/s: 6.1.3
Component/s: Core/Parsing
Labels: None
Environment: OS/PRO common rendering core

Description

OCR programs do a good job of capturing text, but the layout can differ from that of a document that was typeset for print. If the page was slightly skewed when scanned, the text coordinates will reflect the skew.

Our code for detecting spaces and line breaks wasn't designed for text that slowly drifts vertically from the start of a line to its end.

This issue will capture the changes needed to improve word and line detection and word ordering.
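
To illustrate the kind of line detection a skewed scan calls for, here is a minimal sketch. It is not the ICEpdf implementation; the class, the Word type, and the half-glyph-height threshold are all assumptions chosen for illustration. The idea is to compare each word's baseline with the previous word's baseline rather than with the start of the line, so a slow vertical drift does not trigger a false line break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not ICEpdf code: groups words into lines while
// tolerating a baseline that drifts slowly across a skewed scan.
final class SkewTolerantLines {

    // Minimal word box: text, left edge x, baseline y, and glyph height.
    static final class Word {
        final String text;
        final double x, y, height;

        Word(String text, double x, double y, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    // Words are expected in reading order. A word stays on the current line
    // when its baseline is within half a glyph height (an assumed threshold)
    // of the PREVIOUS word's baseline, so small per-word drift can accumulate
    // across the line without splitting it.
    static List<List<Word>> groupIntoLines(List<Word> words) {
        List<List<Word>> lines = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        Word previous = null;
        for (Word w : words) {
            if (previous != null
                    && Math.abs(w.y - previous.y) > previous.height / 2.0) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(w);
            previous = w;
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }
}
```

With this approach, the last word of a line may sit well below the first word and still be grouped with it, as long as no single step between neighbouring words exceeds the threshold.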

Attachments

2 B 3.16_09-07-2013.pdf - 12/Sep/16 4:40 AM - 173 kB - Christoph Keimel
2 B 3.16_09-07-2016.pdf - 12/Sep/16 4:40 AM - 4.04 MB - Christoph Keimel
2 B 3.16_09-09-2016.pdf - 12/Sep/16 4:40 AM - 74 kB - Christoph Keimel

Activity

Christoph Keimel added a comment - 12/Sep/16 4:40 AM

Attached 3 documents to test the effect

Patrick Corless added a comment - 28/Sep/16 1:34 AM

A few changes have been made to how we detect spaces and vertically laid-out text (which is actually horizontal when viewed). The changes help make text selection feel more fluid and bring the extracted text closer to the on-screen representation.

The PDFs in question use a very different dialect than PDFs that are normally designed for printing. In the OCR documents, text is more or less laid out left to right, but it isn't necessarily contiguous. Words in a sentence are generally laid out left to right, but in this case several words in a sentence may not be drawn until much later in the document painting. This still results in some quirks in the text selection flow. These can, however, be corrected with the system property -Dorg.icepdf.core.views.page.text.preserveColumns=false, which enables vertical sorting of lines.

Patrick Corless added a comment - 29/Sep/16 11:03 PM

Further work has been done to compensate for these types of PDFs. The page contains a main set of text, as one would expect, but it also includes optional content that is drawn at a later time, which causes problems with our text sorting. We generally just add the optional content at the end of the page text and do regular sorting. This works fine for most documents, as the optional content represents logical blocks of text. In the documents in question, the optional content contains only one or two words, which are not part of a logical block of text. As a result, I've added code that tries to insert this text into the correct line. This significantly smooths out the text selection.
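
As a rough illustration of inserting a late-drawn word into the correct line, consider the sketch below. It is not the code that was committed; the class name, the Word type, and the tolerance parameter are hypothetical. It picks the line containing the word whose baseline is closest to the stray word, then inserts the stray word by x position, falling back to a new trailing line when nothing is close enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not ICEpdf code: places a stray word into the existing
// line it visually belongs to, instead of leaving it at the end of the page.
final class StrayWordMerger {

    // Minimal word: text, left edge x, and baseline y.
    static final class Word {
        final String text;
        final double x, y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Each line is a mutable list of words sorted left to right. The stray
    // word goes into the line holding the word with the closest baseline,
    // provided that distance is within `tolerance` (an assumed parameter);
    // otherwise it is appended as a new line of its own.
    static void insert(List<List<Word>> lines, Word stray, double tolerance) {
        List<Word> best = null;
        double bestDistance = Double.MAX_VALUE;
        for (List<Word> line : lines) {
            for (Word w : line) {
                double d = Math.abs(w.y - stray.y);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = line;
                }
            }
        }
        if (best == null || bestDistance > tolerance) {
            List<Word> fresh = new ArrayList<>();
            fresh.add(stray);
            lines.add(fresh);
            return;
        }
        // Keep the line ordered left to right.
        int index = 0;
        while (index < best.size() && best.get(index).x <= stray.x) {
            index++;
        }
        best.add(index, stray);
    }
}
```

Matching against individual word baselines rather than a single per-line value keeps the sketch usable on skewed scans, where the baseline at the start of a line can differ from the baseline at its end.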

Overall, the text selection experience has been improved, but further work will be done in the future to include the notion of a paragraph. For the time being, the following system properties should be used with the patch release.

-Dorg.icepdf.core.views.page.text.preserveColumns=false
-Dorg.icepdf.core.views.page.text.spaceFraction=1
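
For applications that cannot easily change their JVM launch flags, the same two properties can be set from code. The sketch below uses only the standard System.setProperty call; the class name is hypothetical, and it assumes the properties are read when the ICEpdf text classes are first loaded, so it would need to run before any ICEpdf class is used.

```java
// Sketch: sets the two properties from this issue programmatically instead of
// via -D flags. Assumption: ICEpdf reads them when its text classes first
// load, so apply() should be called before any ICEpdf class is touched.
final class OcrTextSelectionDefaults {

    static final String PRESERVE_COLUMNS =
            "org.icepdf.core.views.page.text.preserveColumns";
    static final String SPACE_FRACTION =
            "org.icepdf.core.views.page.text.spaceFraction";

    static void apply() {
        System.setProperty(PRESERVE_COLUMNS, "false");
        System.setProperty(SPACE_FRACTION, "1");
    }
}
```

Calling OcrTextSelectionDefaults.apply() at the very start of the application's main method should be equivalent to passing the two -D flags on the command line, under the assumption stated above.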

Patrick Corless added a comment - 29/Sep/16 11:04 PM

Marking as fixed.
