ICEpdf / PDF-1269

Text extraction ordering issue.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.3, 6.3.1
    • Fix Version/s: 6.3.2
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      The file in question is interesting in that the results of text extraction change depending on the order in which the pages are extracted. Further investigation is needed.

        Activity

        Patrick Corless created issue -
        Patrick Corless made changes -
        Field Original Value New Value
        Fix Version/s 6.3.2 [ 13175 ]
        Patrick Corless added a comment -

        Pages 16 and 17 share content via an xobject that represents the small table "DESIGN FAILURE RATE (FIT)". When page 17 is loaded on its own, the content is correctly converted to the page's coordinate space and the page text extraction algorithm correctly sorts the page content.

        When pages 16 and 17 are loaded in sequence, the table is first loaded for page 16 and its text sprites are updated to that page's coordinate space. When page 17 is loaded, we have code that should update the table's xobject text sprites, but that either isn't happening or is happening incorrectly. The code wasn't written with this corner case in mind; we'd need to first back out any previously applied transform and then apply the new one.
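
        The "back out, then apply" step described above can be sketched with java.awt.geom.AffineTransform. This is only an illustration under assumptions: the class name, the rebase helper, and the translation values are hypothetical and are not taken from ICEpdf's PageText code.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Point2D;

// Hypothetical illustration only; none of these names are ICEpdf API.
final class XObjectRebaseSketch {

    // Takes a point that already has `previous` baked in and moves it
    // into the space described by `next`.
    static Point2D rebase(Point2D placed, AffineTransform previous, AffineTransform next) {
        try {
            // Back out the old xobject space to recover the local coordinates.
            Point2D local = previous.inverseTransform(placed, null);
            // Apply the new xobject space.
            return next.transform(local, null);
        } catch (NoninvertibleTransformException e) {
            throw new IllegalStateException("previous transform cannot be undone", e);
        }
    }

    public static void main(String[] args) {
        // A glyph position in the shared xobject's own space (made-up values).
        Point2D glyph = new Point2D.Double(10, 5);
        AffineTransform page16 = AffineTransform.getTranslateInstance(100, 400);
        AffineTransform page17 = AffineTransform.getTranslateInstance(50, 700);

        // Page 16 is loaded first, so the glyph is moved to (110, 405).
        Point2D onPage16 = page16.transform(glyph, null);

        // Applying page 17's transform on top of that stacks both offsets: (160, 1105).
        Point2D stacked = page17.transform(onPage16, null);

        // Undoing page 16 first gives the position page 17 would get on its own: (60, 705).
        Point2D rebased = rebase(onPage16, page16, page17);

        System.out.println("stacked: " + stacked);
        System.out.println("rebased: " + rebased);
    }
}
```

        The stacked result is the kind of misplaced coordinate that would change the sort order depending on which page was loaded first.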

        Repository: ICEsoft Public SVN Repository
        Revision: #52866
        Date: Tue Jan 08 21:04:31 MST 2019
        User: patrick.corless
        Message: PDF-1269 ensure reused xObject text is properly converted to new coordinate space.
        Files Changed:
        MODIFY /icepdf/trunk/icepdf/core/core-awt/src/main/java/org/icepdf/core/pobjects/graphics/text/PageText.java
        Patrick Corless added a comment -

        Added code to ensure we undo the previous xobject space on the text objects before applying the new space. This ensures the glyphs will be in the correct location when the sorting takes place.
        Marking as fixed.
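
        To illustrate why glyph location matters for the sort, here is a minimal sketch in which a shared text sprite remembers the space last applied to it, so re-use on a second page yields the same reading order as loading that page alone. TextSprite, ReadingOrderSketch, the comparator, and all coordinates are assumptions made for the example; the actual change lives in PageText.java and may be structured differently.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; not an ICEpdf class.
final class TextSprite {
    final String text;
    Point2D position;          // current position
    AffineTransform applied;   // space already baked into position, or null

    TextSprite(String text, double x, double y) {
        this.text = text;
        this.position = new Point2D.Double(x, y);
    }

    // Undo any previously applied space, then apply the new one.
    void moveTo(AffineTransform space) {
        try {
            if (applied != null) {
                position = applied.inverseTransform(position, null);
            }
        } catch (NoninvertibleTransformException e) {
            throw new IllegalStateException(e);
        }
        position = space.transform(position, null);
        applied = new AffineTransform(space); // defensive copy
    }
}

final class ReadingOrderSketch {
    // Assumed ordering: PDF y grows upward, so larger y sorts first, then x left to right.
    static final Comparator<TextSprite> READING_ORDER =
            Comparator.comparingDouble((TextSprite s) -> -s.position.getY())
                    .thenComparingDouble(s -> s.position.getX());

    // Places the shared xobject text on the page, then sorts everything.
    static List<String> extract(List<TextSprite> pageText, List<TextSprite> shared,
                                AffineTransform xobjectSpace) {
        for (TextSprite s : shared) {
            s.moveTo(xobjectSpace);
        }
        List<TextSprite> all = new ArrayList<>(pageText);
        all.addAll(shared);
        all.sort(READING_ORDER);
        List<String> out = new ArrayList<>();
        for (TextSprite s : all) {
            out.add(s.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<TextSprite> table = Arrays.asList(new TextSprite("TABLE", 0, 0));
        // First page: the shared table is placed near the top.
        extract(Collections.<TextSprite>emptyList(), table,
                AffineTransform.getTranslateInstance(50, 600));
        // Second page: the same sprite is re-used between a heading and a footer.
        List<TextSprite> page = Arrays.asList(
                new TextSprite("Heading", 50, 700), new TextSprite("Footer", 50, 100));
        System.out.println(extract(page, table,
                AffineTransform.getTranslateInstance(50, 400)));
    }
}
```

        Without the undo in moveTo, the table in this example would end up at y = 1000 on the second page and sort ahead of the heading.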

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 0
          • Watchers: 2
