ICEpdf / PDF-1022

Improve text selection ordering for OCR'd documents.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.2
    • Fix Version/s: 6.1.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      OS/PRO common rendering core

      Description

      OCR programs do a pretty good job of capturing text, but the layout can differ from that of a document that was typeset for print. If the page was slightly skewed when scanned, the text coordinates will reflect the skew.

      Our code for detecting spaces and line breaks wasn't designed for text that slowly moves vertically from the start of a line to its end.

      This bug will capture the changes needed to improve word and line detection and word ordering.
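The drift problem described above can be illustrated with a minimal sketch. This is not the ICEpdf implementation: the `SkewTolerantLines` class, its `Glyph` type, and the `lineTolerance` and `spaceFraction` parameters are hypothetical names chosen for illustration. The idea is to compare each glyph's baseline with the previous glyph's baseline instead of the baseline at the start of the line, so a slow vertical drift across a skewed scan does not add up to a false line break.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of skew-tolerant line and space detection; not ICEpdf code. */
class SkewTolerantLines {

    /** Minimal glyph box: left edge x, baseline y, width, height and the glyph's text. */
    static final class Glyph {
        final double x, y, width, height;
        final String text;

        Glyph(double x, double y, double width, double height, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.text = text;
        }
    }

    /**
     * Groups glyphs, assumed to be in reading order, into lines of text.
     *
     * @param lineTolerance fraction of the previous glyph's height that the
     *                      baseline may shift before a new line is started
     * @param spaceFraction fraction of the previous glyph's width that a
     *                      horizontal gap must exceed to count as a space
     */
    static List<String> toLines(List<Glyph> glyphs, double lineTolerance, double spaceFraction) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                // Compare with the previous glyph, not the first glyph of the
                // line, so gradual drift never accumulates past the threshold.
                double drift = Math.abs(g.y - previous.y);
                if (drift > lineTolerance * previous.height) {
                    lines.add(current.toString());
                    current.setLength(0);
                } else {
                    double gap = g.x - (previous.x + previous.width);
                    if (gap > spaceFraction * previous.width) {
                        current.append(' ');
                    }
                }
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

With a tolerance of half a glyph height, a line that creeps upward by a fraction of a point per glyph stays in one piece, while a jump of a full line height still starts a new line. The thresholds are guesses and would need tuning against real OCR output such as the attached PDFs.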
      1. 2 B 3.16_09-07-2013.pdf
        173 kB
        Christoph Keimel
      2. 2 B 3.16_09-07-2016.pdf
        4.04 MB
        Christoph Keimel
      3. 2 B 3.16_09-09-2016.pdf
        74 kB
        Christoph Keimel

        Activity

        Repository: ICEsoft Public SVN Repository
        Revision: #49503
        Date: Tue Nov 08 11:11:15 MST 2016
        User: patrick.corless
        Message: PDF-1022 changed how we order optional text and added code that will try and insert the optional text's words into the list of text that was provided by the page.
        Files Changed:
          MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java

        Repository: ICEsoft Public SVN Repository
        Revision: #49330
        Date: Thu Sep 29 23:08:36 MDT 2016
        User: patrick.corless
        Message: PDF-1022 changed how we order optional text and added code that will try and insert the optional text's words into the list of text that was provided by the page.
        Files Changed:
          MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java

        Repository: ICEsoft Public SVN Repository
        Revision: #49323
        Date: Wed Sep 28 00:37:45 MDT 2016
        User: patrick.corless
        Message: PDF-1022 change to detection of vertical character layout and auto space detection.
        Files Changed:
          MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/GlyphText.java
          MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/WordText.java
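The PageText commit messages above describe inserting the optional text's words into the list of text provided by the page. As a rough sketch of what such an insertion might look like, the following is an assumption and not the code from those revisions: the `OptionalWordMerger` class, its `Word` type, and the `yTolerance` parameter are hypothetical. Each extra word is placed in the existing line whose nearby baseline is closest, at the position that keeps the line ordered left to right. If no line is close enough, the word becomes a new line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of merging extra words into existing page lines; not ICEpdf code. */
class OptionalWordMerger {

    /** Minimal word: left edge x, baseline y and the word's text. */
    static final class Word {
        final double x, y;
        final String text;

        Word(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Inserts {@code extra} into the line with the closest baseline within
     * {@code yTolerance}, keeping that line ordered by x. If no line
     * qualifies, the word is appended as a new single-word line.
     */
    static void insert(List<List<Word>> lines, Word extra, double yTolerance) {
        List<Word> best = null;
        double bestDistance = yTolerance;
        for (List<Word> line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            // Measure against the horizontally nearest word so that a skewed
            // line is compared where it actually passes the new word.
            Word nearest = line.get(0);
            for (Word w : line) {
                if (Math.abs(w.x - extra.x) < Math.abs(nearest.x - extra.x)) {
                    nearest = w;
                }
            }
            double distance = Math.abs(nearest.y - extra.y);
            if (distance <= bestDistance) {
                best = line;
                bestDistance = distance;
            }
        }
        if (best == null) {
            List<Word> fresh = new ArrayList<>();
            fresh.add(extra);
            lines.add(fresh);
            return;
        }
        // Find the first word that starts to the right of the new word.
        int index = 0;
        while (index < best.size() && best.get(index).x <= extra.x) {
            index++;
        }
        best.add(index, extra);
    }
}
```

Appending unmatched words at the end is a simplification. A fuller version would also place the new line in vertical order relative to the existing lines.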

          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 0
          • Watchers: 2

            Dates

            • Created:
              Updated:
              Resolved: