ICEpdf / PDF-760

Improve duplicate word text extraction detection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.6_P01
    • Fix Version/s: 5.0.7
    • Component/s: Viewer RI
    • Labels:
      None
    • Environment:
      any

      Description

      A client has submitted a patch to improve the detection of duplicated words that sometimes occur in PDF documents created using Crystal Reports. The PDF in question plots out a block of text, followed by the same text plotted out again.

      We had added some experimental code that was activated with -Dorg.icepdf.core.views.page.text.trim.duplicates=true. This code tried to find duplicate text by comparing text based on a midpoint. The client has come back with an improved algorithm in which a key is generated from each word's bounds and text. Any text that has a duplicate key is trimmed.
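
      The key-based approach described above can be sketched roughly as follows. This is only an illustration, not the attached PageText.java.patch: the class DuplicateWordTrimmer, the Word type, the key format, and the rounding of bounds to whole units are all assumptions made for the example. Only the system property name comes from this issue.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the ICEpdf PageText implementation.
final class DuplicateWordTrimmer {

    // Property name as given in the issue description.
    static final String TRIM_PROPERTY =
            "org.icepdf.core.views.page.text.trim.duplicates";

    // Hypothetical stand-in for an extracted word: its text and page bounds.
    static final class Word {
        final String text;
        final Rectangle2D.Float bounds;

        Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.bounds = new Rectangle2D.Float(x, y, width, height);
        }
    }

    // True when the JVM was started with -D<TRIM_PROPERTY>=true.
    static boolean isTrimEnabled() {
        return Boolean.getBoolean(TRIM_PROPERTY);
    }

    // Builds a key from the bounds and the text. Rounding the bounds is an
    // assumption, so that near-identical placements map to the same key.
    static String keyFor(Word word) {
        Rectangle2D.Float b = word.bounds;
        return String.format(Locale.ROOT, "%d:%d:%d:%d:%s",
                Math.round(b.x), Math.round(b.y),
                Math.round(b.width), Math.round(b.height), word.text);
    }

    // Keeps the first word seen for each key, in original order,
    // and drops any later word whose key has already been seen.
    static List<Word> trimDuplicates(List<Word> words) {
        Set<String> seenKeys = new HashSet<>();
        List<Word> kept = new ArrayList<>();
        for (Word word : words) {
            if (seenKeys.add(keyFor(word))) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

      In this sketch a caller would check isTrimEnabled() before calling trimDuplicates(), so the default extraction path stays unchanged unless the property is set. Unlike a midpoint comparison, a key lookup only treats a word as a duplicate when both its position and its text match.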

      This code should work just fine going forward. We'll have to run a QA test for text extraction to be sure though.

        Activity

        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 5.0.7 [ 11470 ]
        Fix Version/s 5.1 [ 10675 ]
        Resolution Fixed [ 1 ]
        Patrick Corless made changes -
        Fix Version/s 5.1 [ 10675 ]
        Fix Version/s 5.0.7 [ 11470 ]
        Patrick Corless made changes -
        Attachment PageText.java.patch [ 17202 ]
        Patrick Corless made changes -
        Field Original Value New Value
        Fix Version/s 5.0.7 [ 11470 ]
        Patrick Corless created issue -

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            2
