ICEpdf / PDF-760

Improve duplicate word text extraction detection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.6_P01
    • Fix Version/s: 5.0.7
    • Component/s: Viewer RI
    • Labels:
      None
    • Environment:
      any

      Description

       A client has submitted a patch to improve the detection of duplicated words that sometimes occur in PDF documents created using Crystal Reports. The PDF in question plots out a block of text, followed by the same text plotted out again.

      We had added some experimental code that was activated with -Dorg.icepdf.core.views.page.text.trim.duplicates=true. That code tried to find duplicate text by comparing words based on a midpoint. The client has come back with an improved algorithm in which a key is generated from each word's bounds and text. Any word whose key has already been seen is trimmed.
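
      For illustration only, the key-based approach described above might look roughly like the sketch below. This is not the client's patch or the ICEpdf code: the DuplicateWordTrimmer class, the Word type, the keyFor helper and the choice to round the bounds to whole units are all assumptions made for the example.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateWordTrimmer {

    /** Hypothetical stand-in for an extracted word: its text plus its bounds on the page. */
    public static final class Word {
        public final String text;
        public final Rectangle2D.Float bounds;

        public Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.bounds = new Rectangle2D.Float(x, y, width, height);
        }
    }

    /**
     * Builds a key from the word's bounds and text. Rounding the bounds is an
     * assumption of this sketch, meant to absorb small floating-point differences
     * between the two plotted copies.
     */
    static String keyFor(Word w) {
        return Math.round(w.bounds.x) + ":" + Math.round(w.bounds.y) + ":"
                + Math.round(w.bounds.width) + ":" + Math.round(w.bounds.height)
                + ":" + w.text;
    }

    /** Returns the words in their original order, keeping only the first word seen for each key. */
    public static List<Word> trimDuplicates(List<Word> words) {
        Set<String> seen = new HashSet<>();
        List<Word> result = new ArrayList<>();
        for (Word w : words) {
            // Set.add returns false when the key is already present, i.e. a duplicate.
            if (seen.add(keyFor(w))) {
                result.add(w);
            }
        }
        return result;
    }
}
```

      With a key like this, the same text at a different position on the page is kept, and only a word plotted again at the same bounds is dropped.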

      This code should work just fine going forward. We'll have to run a QA test for text extraction to be sure though.

        Activity

        Patrick Corless added a comment -

        Patch has been applied and was shipped with 5.0.7.

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            2