[PDF-760] Improve duplicate word text extraction detection - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0.6_P01
Fix Version/s: 5.0.7
Component/s: Viewer RI
Labels:
None
Environment:
any

Description

A client has submitted a patch to improve the detection of duplicated words that sometimes occur in PDF documents created using Crystal Reports. The PDF in question plots out a block of text followed by the same text plotted out again.

We had added some experimental code that is activated with -Dorg.icepdf.core.views.page.text.trim.duplicates=true. This code tried to find duplicate text by comparing text based on a midpoint. The client has come back with an improved algorithm in which a key is generated from each word's bounds and text. Any text that has a duplicate key is trimmed.
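The key-based approach can be illustrated with a rough sketch. This is not the contents of PageText.java.patch: the class DuplicateWordTrimmer, the Word type, and the key format are hypothetical, and rounding the bounds to whole units is an assumption made so that near-identical plots collide. Only the system property name is taken from this issue.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateWordTrimmer {

    // Property name as given in this issue; how ICEpdf reads it internally is not shown here.
    static final String TRIM_PROPERTY = "org.icepdf.core.views.page.text.trim.duplicates";

    /** Hypothetical stand-in for an extracted word: its text plus its bounds on the page. */
    public static final class Word {
        final String text;
        final Rectangle2D.Float bounds;

        public Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.bounds = new Rectangle2D.Float(x, y, width, height);
        }
    }

    /** True when the JVM was started with -Dorg.icepdf.core.views.page.text.trim.duplicates=true. */
    public static boolean isEnabled() {
        return Boolean.getBoolean(TRIM_PROPERTY);
    }

    /** Builds a key from the word's bounds and text (bounds rounded; an assumption). */
    static String keyFor(Word w) {
        return Math.round(w.bounds.x) + ":" + Math.round(w.bounds.y) + ":"
                + Math.round(w.bounds.width) + ":" + Math.round(w.bounds.height) + ":"
                + w.text;
    }

    /** Keeps the first word seen for each key and drops later words with the same key. */
    public static List<Word> trimDuplicates(List<Word> words) {
        Set<String> seenKeys = new HashSet<>();
        List<Word> kept = new ArrayList<>();
        for (Word w : words) {
            // Set.add returns false when the key is already present, i.e. a duplicate.
            if (seenKeys.add(keyFor(w))) {
                kept.add(w);
            }
        }
        return kept;
    }
}
```

In this sketch, a word plotted twice at the same position produces the same key and is kept only once, while the same text at a different position on the page is preserved.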

This code should work just fine going forward. We'll have to run a QA test for text extraction to be sure though.

Attachments

PageText.java.patch

26/May/14 10:47 AM

2 kB

Patrick Corless

Activity

Patrick Corless added a comment - 12/Aug/14 4:00 PM

Patch has been applied and was shipped with 5.0.7.

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

2

Dates

Created:

26/May/14 8:27 AM

Updated:

01/Apr/15 3:01 PM

Resolved:

12/Aug/14 4:00 PM