ICEpdf / PDF-1022

Improve text selection ordering for OCR'd documents.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.2
    • Fix Version/s: 6.1.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      OS/PRO common rendering core

      Description

      OCR programs do a pretty cool job at capturing text, but the layout can be a little different than in a document that was typeset for print. If the page is slightly skewed when scanned, the text coordinates will reflect the skew.

      Our code for detecting spaces and line breaks wasn't designed for text that might slowly move vertically from the start of a line to its end.
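      The drift problem can be illustrated with a minimal sketch. It is not the ICEpdf implementation: the Word class, the groupIntoLines method and the tolerance parameter are all hypothetical names chosen for this example. The idea shown is to compare each word's baseline with the previous word's baseline instead of the baseline at the start of the line, so that a small per-word drift on a skewed scan never adds up to a false line break.

```java
import java.util.ArrayList;
import java.util.List;

class SkewTolerantLines {

    /** Hypothetical positioned word; not an ICEpdf type. */
    static final class Word {
        final String text;
        final double x;
        final double y;
        final double height;

        Word(String text, double x, double y, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups words, given in reading order, into lines. A new line starts
     * only when the baseline jumps by more than tolerance * height relative
     * to the PREVIOUS word, so gradual drift along a skewed line is ignored.
     */
    static List<List<Word>> groupIntoLines(List<Word> words, double tolerance) {
        List<List<Word>> lines = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        Word previous = null;
        for (Word word : words) {
            if (previous != null
                    && Math.abs(word.y - previous.y) > tolerance * previous.height) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(word);
            previous = word;
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }
}
```

      With a tolerance of 0.5 and words 10 units high, a line whose baseline climbs 1.5 units per word stays in one piece even after the total drift exceeds 5 units, whereas a comparison against the first word of the line would split it.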

      This bug will capture the changes needed to improve word and line detection and word ordering.
      1. 2 B 3.16_09-07-2013.pdf
        173 kB
        Christoph Keimel
      2. 2 B 3.16_09-07-2016.pdf
        4.04 MB
        Christoph Keimel
      3. 2 B 3.16_09-09-2016.pdf
        74 kB
        Christoph Keimel

        Activity

        Christoph Keimel added a comment -

        Attached 3 documents to test the effect

        Patrick Corless added a comment -

        A few changes have been made to how we detect spaces and vertically laid-out text (which is actually horizontal when viewed). The changes help make text selection feel more fluid and bring the extracted text closer to the on-screen representation.

        The PDFs in question contain a very different dialect than PDFs that would normally be designed for printing. In the OCR documents, text is more or less laid out left to right, but the text isn't necessarily contiguous. Words in a sentence are generally laid out left to right, but in this case several words in a sentence may not be drawn until much later in the document painting. This still results in some quirks in the text selection flow. This can, however, be corrected with the system property -Dorg.icepdf.core.views.page.text.preserveColumns=false, which enables vertical sorting of lines.

        Patrick Corless added a comment -

        Further work has been done to compensate for these types of PDFs. The page contains a main set of text, as one would expect, but it also includes optional content that is drawn at a later time, which causes problems with our text sorting. We generally just add the optional content at the end of the page text and do regular sorting. This works fine for most documents, as the optional content represents logical blocks of text. In the documents in question, the optional content contains only one or two words which are not part of a logical block of text. As a result, I've added code that tries to insert this text into the correct line. This significantly smooths out the text selection.

        Overall, the text selection experience has been improved, but further work will be done in the future to include the notion of a paragraph. For the time being, the following system properties should be used with the patch release.
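        As a rough illustration of inserting a stray word into the correct line, here is a sketch under stated assumptions; it is not the code that went into the patch. StrayWordMerger, Word and maxDistance are hypothetical names. For each existing line, the sketch looks at the word closest in x to the stray word, compares baselines there (which keeps it tolerant of skew), and inserts the stray word in x order into the best-matching line. If no line is close enough, it falls back to appending the word at the end of the page text.

```java
import java.util.ArrayList;
import java.util.List;

class StrayWordMerger {

    /** Hypothetical positioned word; not an ICEpdf type. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the word of a non-empty line whose x is closest to the given x. */
    private static Word nearestByX(List<Word> line, double x) {
        Word nearest = line.get(0);
        for (Word candidate : line) {
            if (Math.abs(candidate.x - x) < Math.abs(nearest.x - x)) {
                nearest = candidate;
            }
        }
        return nearest;
    }

    /**
     * Inserts a late-drawn word into the line whose local baseline is closest,
     * keeping that line ordered by x. Appends a new line when no baseline lies
     * within maxDistance.
     */
    static void insert(List<List<Word>> lines, Word stray, double maxDistance) {
        List<Word> best = null;
        double bestDistance = Double.MAX_VALUE;
        for (List<Word> line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            double distance = Math.abs(nearestByX(line, stray.x).y - stray.y);
            if (distance <= maxDistance && distance < bestDistance) {
                best = line;
                bestDistance = distance;
            }
        }
        if (best == null) {
            // No plausible line: keep the old behaviour and append at the end.
            List<Word> newLine = new ArrayList<>();
            newLine.add(stray);
            lines.add(newLine);
            return;
        }
        int index = 0;
        while (index < best.size() && best.get(index).x <= stray.x) {
            index++;
        }
        best.add(index, stray);
    }
}
```

        Comparing against the nearest neighbour rather than a single per-line baseline is a design choice made for this sketch: on a skewed page the baseline at the left edge of a line can differ noticeably from the baseline at the right edge.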

        -Dorg.icepdf.core.views.page.text.preserveColumns=false
        -Dorg.icepdf.core.views.page.text.spaceFraction=1
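
        Both settings are ordinary JVM system properties, so they can be passed as -D arguments on the java command line as written above. Where the launch command cannot be changed, the same values can be set in code with System.setProperty. The sketch below uses a hypothetical helper class, TextExtractionDefaults; only the two property names and values come from this issue. It assumes the properties must be in place before any ICEpdf class reads them, so the helper should be called as early as possible, for example at the top of main.

```java
class TextExtractionDefaults {

    // Property names as given in this issue.
    static final String PRESERVE_COLUMNS =
            "org.icepdf.core.views.page.text.preserveColumns";
    static final String SPACE_FRACTION =
            "org.icepdf.core.views.page.text.spaceFraction";

    /** Applies the recommended values unless -D already supplied them. */
    static void apply() {
        setIfAbsent(PRESERVE_COLUMNS, "false");
        setIfAbsent(SPACE_FRACTION, "1");
    }

    private static void setIfAbsent(String key, String value) {
        if (System.getProperty(key) == null) {
            System.setProperty(key, value);
        }
    }
}
```

        Values given on the command line take precedence in this sketch, so a deployment can still override the defaults without a code change.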

        Patrick Corless added a comment -

        Marking as fixed.


          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 0
          • Watchers: 2
