[PDF-438] Extracting text from document doesn't work properly. - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.3.2
Fix Version/s: 4.3.4
Component/s: Core/Parsing
Labels:
None
Environment:
ICEpdf PRO 4.3.2, ICEpdf Viewer

Description

While extracting text from the attached document, I found that the line:

"last flight (if one was defined for that flight). Regardless of the data,"

consists of 2 LineText objects:
1. "last flight (if one was" and
2. "defined for that flight). Regardless of the data,".

It looks like the space between the words "was" and "defined" is missing, so a search for the word "defined" will not find it.

Adding a space manually between LineText objects causes a problem in a different line:

"airport reference point latitude/longitude position shows adjacent to the".

It consists of:
1. "airport reference p" and
2. "oint latitude/longitude position shows adjacent to the".

If I put a space between them, I get "airport reference p oint latitude/longitude position shows adjacent to the", and searching for the word "point" fails.
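The two cases above can be reproduced with a small sketch. It is only an illustration of why neither joining strategy works for both lines: the class name `FragmentJoinDemo` and the helpers `join` and `containsWord` are hypothetical and are not part of the ICEpdf API, and the whole-word search is assumed to behave like a regex word-boundary match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FragmentJoinDemo {
    // Hypothetical helper: concatenates extracted fragments with a separator.
    static String join(List<String> fragments, String separator) {
        return String.join(separator, fragments);
    }

    // Hypothetical whole-word search, assumed to use regex word boundaries.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList(
                "last flight (if one was",
                "defined for that flight). Regardless of the data,");
        List<String> second = Arrays.asList(
                "airport reference p",
                "oint latitude/longitude position shows adjacent to the");

        // Joining without a space breaks the first line ("wasdefined").
        System.out.println(containsWord(join(first, ""), "defined"));   // false
        System.out.println(containsWord(join(second, ""), "point"));    // true

        // Joining with a space breaks the second line ("p oint").
        System.out.println(containsWord(join(first, " "), "defined"));  // true
        System.out.println(containsWord(join(second, " "), "point"));   // false
    }
}
```

Whichever separator is chosen, one of the two lines stops matching, so the split itself has to be fixed during extraction.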

Attachments

example.pdf

24/May/12 1:42 PM

47 kB

Evgheni Sadovoi

Activity

Evgheni Sadovoi created issue - 24/May/12 1:42 PM

Evgheni Sadovoi made changes - 24/May/12 1:42 PM

Field	Original Value	New Value
Attachment		example.pdf [ 14459 ]

Evgheni Sadovoi made changes - 24/May/12 1:43 PM

Salesforce Case		[5007000000MGD1l]

Patrick Corless made changes - 25/May/12 6:05 PM

Fix Version/s		5.0 [ 10314 ]

Patrick Corless added a comment - 11/Jul/12 9:41 AM

Targeting 4.3.3

Patrick Corless made changes - 11/Jul/12 9:41 AM

Fix Version/s		4.3.3 [ 10333 ]
Fix Version/s	5.0 [ 10314 ]

Patrick Corless made changes - 25/Jul/12 3:53 PM

Fix Version/s		4.3.4 [ 10341 ]
Fix Version/s	4.3.3 [ 10333 ]

Patrick Corless added a comment - 09/Aug/12 3:18 PM

I've taken some time to look more closely at what is going on with the PDF in question. The PDF's text content stream is encoded a bit differently, using the "Td" operator mid-line. Normally "Td" is used to specify a jump to the next line, using the same offset as the previous line.

That said, we have some code that tries to detect whether the y offset is large enough to justify inserting a new line character. In this case the old and new y values would be something like 542.1345 and 541.98. Visually the text looks like it is drawn at virtually the same spot, but the line feed check passes and a new line is inserted because the new value isn't the same as the last.

The problem is really one of float precision. As a workaround I added Math.round on the y offset values to do a "softer" comparison and avoid the extra line feed being inserted. Overall the fix seems to work quite well on a variety of documents. I still need to create a new test suite for text extraction to get a better measure of any possible regression.
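A minimal sketch of the difference between the two comparisons, using the example y values from this comment. This is not the actual ContentParser code: the class name `LineBreakCheck` and both method names are hypothetical, and the exact form of the original check is an assumption based on the description above.

```java
class LineBreakCheck {
    // Assumed shape of the old check: any difference in y counts as a new line.
    static boolean isNewLineExact(float previousY, float currentY) {
        return previousY != currentY;
    }

    // "Softer" check: compare y values after rounding to the nearest int.
    static boolean isNewLineRounded(float previousY, float currentY) {
        return Math.round(previousY) != Math.round(currentY);
    }

    public static void main(String[] args) {
        float previousY = 542.1345f;
        float currentY = 541.98f;

        // The exact comparison reports a line break for the same visual line.
        System.out.println(isNewLineExact(previousY, currentY));   // true

        // Both values round to 542, so no line break is reported.
        System.out.println(isNewLineRounded(previousY, currentY)); // false
    }
}
```

Note that rounding still treats two close values as different lines when they fall on opposite sides of a .5 boundary (for example 541.49 and 541.51), which is one reason a regression test suite is useful here.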

Repository	Revision	Date	User	Message
ICEsoft Public SVN Repository	#30469	Fri Aug 10 14:23:20 MDT 2012	patrick.corless	PDF-438 updated text extraction new line detection to round to the nearest int. We had a few corner cases where extra line spaces were being inserted, because of float numbers precision issues.
				Files Changed
				MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/ContentParser.java

Patrick Corless added a comment - 10/Aug/12 3:24 PM

Closing issue.

Patrick Corless made changes - 10/Aug/12 3:24 PM

Status	Open [ 1 ]	Resolved [ 5 ]
Resolution		Fixed [ 1 ]

Patrick Corless made changes - 01/Apr/15 3:01 PM

Status	Resolved [ 5 ]	Closed [ 6 ]

People

Assignee:

Patrick Corless

Reporter:

Evgheni Sadovoi

Votes:

0

Watchers:

0

Dates

Created:

24/May/12 1:42 PM

Updated:

01/Apr/15 3:01 PM

Resolved:

10/Aug/12 3:24 PM