ICEpdf / PDF-438

Extracting text from document doesn't work properly.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.2
    • Fix Version/s: 4.3.4
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      ICEpdf PRO 4.3.2, ICEpdf Viewer

      Description

      While extracting text from the attached document, I found that the line:

      "last flight (if one was defined for that flight). Regardless of the data,"

      consists of 2 LineText objects:
      1. "last flight (if one was" and
      2. "defined for that flight). Regardless of the data,".

      It looks like the space between the words "was" and "defined" is missing, so a search for the word "defined" does not find it.

      Manually adding a space between LineText objects causes a problem in a different line:

      "airport reference point latitude/longitude position shows adjacent to the".

      It consists of:
      1. "airport reference p" and
      2. "oint latitude/longitude position shows adjacent to the".

      If I put a space between them, I get "airport reference p oint latitude/longitude position shows adjacent to the" and a search for the word "point" fails.
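
      The two cases in the description can be reproduced with plain strings. The sketch below is only an illustration: it does not use the ICEpdf API, the class and method names (FragmentJoin, join, containsWord) are hypothetical, and it assumes a whole-word search. It shows that neither "always insert a space" nor "never insert a space" works for both lines.

```java
import java.util.regex.Pattern;

public class FragmentJoin {
    // Hypothetical helper: glue extracted fragments together with a separator.
    static String join(String separator, String... fragments) {
        return String.join(separator, fragments);
    }

    // Hypothetical helper: whole-word search, assumed to be what the search does.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String[] first = {"last flight (if one was",
                "defined for that flight). Regardless of the data,"};
        String[] second = {"airport reference p",
                "oint latitude/longitude position shows adjacent to the"};

        // No separator: "wasdefined" hides the word "defined".
        System.out.println(containsWord(join("", first), "defined"));   // false
        // Space separator: fixes the first line...
        System.out.println(containsWord(join(" ", first), "defined"));  // true
        // ...but breaks the second one ("p oint").
        System.out.println(containsWord(join(" ", second), "point"));   // false
        System.out.println(containsWord(join("", second), "point"));    // true
    }
}
```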
      1. example.pdf
        47 kB
        Evgheni Sadovoi

        Activity

        Evgheni Sadovoi created issue -
        Evgheni Sadovoi made changes -
        Field Original Value New Value
        Attachment example.pdf [ 14459 ]
        Evgheni Sadovoi made changes -
        Salesforce Case [5007000000MGD1l]
        Patrick Corless made changes -
        Fix Version/s 5.0 [ 10314 ]
        Patrick Corless added a comment -

        Targeting 4.3.3

        Patrick Corless made changes -
        Fix Version/s 4.3.3 [ 10333 ]
        Fix Version/s 5.0 [ 10314 ]
        Patrick Corless made changes -
        Fix Version/s 4.3.4 [ 10341 ]
        Fix Version/s 4.3.3 [ 10333 ]
        Patrick Corless added a comment -

        I've taken some time to look more closely at what is going on with the PDF in question. The PDF's text content stream is encoded a bit differently, using the "td" token mid-line. Normally "td" is used to specify a jump to the next line using the same offset as the previous line.

        That said, we have some code that tries to detect whether the y offset is large enough to justify inserting a new line character. In this case the old and new y values would be something like 542.1345 and 541.98. Visually the text looks like it is drawn at virtually the same spot, but the line feed check passes and a new line is inserted because the new value isn't the same as the last.

        The problem is really one of float precision. As a workaround I added Math.round on the y offset values to do a "softer" comparison and avoid the extra line feed being inserted. Overall the fix seems to work quite well on a variety of documents. I still need to create a new test suite for text extraction to get a better measure of any possible regression.
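
        The comparison described in this comment can be sketched as follows. This is a minimal illustration and not the actual ContentParser code: the class and method names (LineBreakCheck, isNewLineStrict, isNewLineRounded) are hypothetical, and only the sample y values and the use of Math.round come from the comment.

```java
public class LineBreakCheck {
    // Strict check (assumed shape of the old behaviour): any change in y
    // is treated as a move to a new line.
    static boolean isNewLineStrict(float previousY, float currentY) {
        return previousY != currentY;
    }

    // Softer check: round both offsets to the nearest int before comparing,
    // so sub-unit differences no longer trigger a line feed.
    static boolean isNewLineRounded(float previousY, float currentY) {
        return Math.round(previousY) != Math.round(currentY);
    }

    public static void main(String[] args) {
        float previousY = 542.1345f;
        float currentY = 541.98f;

        System.out.println(isNewLineStrict(previousY, currentY));   // true
        System.out.println(isNewLineRounded(previousY, currentY));  // false
        // A real move to another line is still detected.
        System.out.println(isNewLineRounded(542.1345f, 530.0f));    // true
    }
}
```

        Rounding is a coarse tolerance: two offsets that straddle a .5 boundary, such as 541.49 and 541.51, still round to different ints, which is one reason a regression suite for text extraction is useful.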

        Repository: ICEsoft Public SVN Repository
        Revision: #30469
        Date: Fri Aug 10 14:23:20 MDT 2012
        User: patrick.corless
        Message: PDF-438 updated text extraction new line detection to round to the nearest int. We had a few corner cases where extra line spaces were being inserted because of float number precision issues.
        Files Changed:
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/ContentParser.java
        Patrick Corless added a comment -

        Closing issue.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Evgheni Sadovoi
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: