[PDF-760] Improve duplicate word text extraction detection - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0.6_P01
Fix Version/s: 5.0.7
Component/s: Viewer RI
Labels: None
Environment: any

Description

A client has submitted a patch to improve the detection of duplicated words that sometimes occur in PDF documents created using Crystal Reports. The PDF in question plots out a block of text, followed by the same text plotted out again.

We had previously added some experimental code, activated with -Dorg.icepdf.core.views.page.text.trim.duplicates=true, that tried to find duplicate text by comparing text based on a midpoint. The client has come back with an improved algorithm in which a key is generated from each word's bounds and text. Any text that has a duplicate key is trimmed.
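
The key-based approach described above can be sketched roughly as follows. This is only an illustration and not the contents of PageText.java.patch: the class DuplicateWordTrimmer, the Word type, the keyFor method, and the choice to round bounds to whole units before building the key are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the names here do not come from the ICEpdf source.
class DuplicateWordTrimmer {

    /** Illustrative stand-in for an extracted word: its text plus its bounds on the page. */
    static final class Word {
        final String text;
        final float x, y, width, height;

        Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Builds a key from the word's bounds and text. Rounding the bounds is an assumption. */
    static String keyFor(Word w) {
        return Math.round(w.x) + "," + Math.round(w.y) + ","
                + Math.round(w.width) + "," + Math.round(w.height) + "|" + w.text;
    }

    /** Returns the words in their original order, dropping any whose key was already seen. */
    static List<Word> trimDuplicates(List<Word> words) {
        Set<String> seen = new HashSet<>();
        List<Word> kept = new ArrayList<>();
        for (Word w : words) {
            // Set.add returns false when the key is already present, i.e. a duplicate.
            if (seen.add(keyFor(w))) {
                kept.add(w);
            }
        }
        return kept;
    }
}
```

Because the key combines position and text, a word repeated legitimately elsewhere on the page is kept, and only a word plotted again at the same bounds is dropped.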

This code should work fine going forward, but we will need to run the text extraction QA test to be sure.

Attachments

PageText.java.patch

26/May/14 10:47 AM

2 kB

Patrick Corless

Activity

Patrick Corless created issue - 26/May/14 8:27 AM

Patrick Corless made changes - 26/May/14 10:14 AM

Field	Original Value	New Value
Fix Version/s		5.0.7 [ 11470 ]

Patrick Corless made changes - 26/May/14 10:47 AM

Attachment		PageText.java.patch [ 17202 ]

Patrick Corless made changes - 31/Jul/14 1:52 PM

Fix Version/s		5.1 [ 10675 ]
Fix Version/s	5.0.7 [ 11470 ]

Patrick Corless made changes - 12/Aug/14 4:00 PM

Status	Open [ 1 ]	Resolved [ 5 ]
Fix Version/s		5.0.7 [ 11470 ]
Fix Version/s	5.1 [ 10675 ]
Resolution		Fixed [ 1 ]

Patrick Corless made changes - 01/Apr/15 3:01 PM

Status	Resolved [ 5 ]	Closed [ 6 ]

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

2

Dates

Created:

26/May/14 8:27 AM

Updated:

01/Apr/15 3:01 PM

Resolved:

12/Aug/14 4:00 PM