[PDF-424] Document text search works incorrectly in some cases - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 4.3.2
Fix Version/s: 5.0
Component/s: Core/Parsing
Labels:
None
Environment:
any

ICEsoft Forum Reference:
http://jforum.icesoft.org/JForum/posts/list/20786.page#73244

Description

In some cases text search works incorrectly. More details and examples are in the linked forum topic.

Activity

Patrick Corless added a comment - 23/Apr/12 9:14 AM

The first issue, where "text" is split into the two words "t" and "ext", is related to how the PDF is encoded.

BT
/F1 11.04 Tf
1 0 0 1 85.104 774.84 Tm
0 g
0 G
[(t)] TJ
ET
BT
1 0 0 1 88.824 774.84 Tm
[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )-3(w)8(o)-5(r)12(k)] TJ
ET

There are two text blocks defined, one for the "t" and the other for the rest of the words. Our text extractor treats each block as a separate sentence, which has worked well for us in the past. There is no notion of words or paragraphs in a content stream, so we have to look for clues that might indicate a space or line break. I can try removing our BT/line break code but I think it will do more harm than good.
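To make the trade-off concrete, here is a minimal Python sketch (not ICEpdf's actual code) of the two heuristics discussed: decoding a TJ array into text, and joining adjacent BT blocks that sit on the same baseline. The function names, the -200 kerning threshold, the 0.5 tolerance and the block widths are all assumptions for illustration; real widths would come from the font metrics.

```python
import re

# Matches a literal string "(...)" or a numeric kerning adjustment.
# Simplification: escaped or nested parentheses are not handled.
TJ_TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def decode_tj(array_body, space_threshold=-200):
    """Concatenate the strings of a TJ array.

    A large negative adjustment moves the next glyph to the right, so it is
    treated as a word space. The threshold is an assumed value.
    """
    out = []
    for text, number in TJ_TOKEN.findall(array_body):
        if number:
            if float(number) <= space_threshold:
                out.append(" ")
        else:
            out.append(text)
    return "".join(out)


def merge_blocks(blocks, tolerance=0.5):
    """Join blocks given as (x, y, width, text) in reading order.

    A block is appended to the previous one when it shares the baseline and
    starts where the previous block ended, within the given tolerance.
    """
    merged = []
    for x, y, width, text in blocks:
        if merged:
            px, py, pwidth, ptext = merged[-1]
            if y == py and abs(x - (px + pwidth)) <= tolerance:
                merged[-1] = (px, py, x + width - px, ptext + text)
                continue
        merged.append((x, y, width, text))
    return [text for _, _, _, text in merged]


second = decode_tj(
    "[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )"
    "-3(w)8(o)-5(r)12(k)]"
)
# Positions are taken from the Tm operators above; the widths 3.7 and 100.0
# are hypothetical stand-ins for values computed from font metrics.
blocks = [(85.104, 774.84, 3.7, "t"), (88.824, 774.84, 100.0, second)]
print(merge_blocks(blocks))
```

With these assumed widths the two blocks join into "text these words must work". A join rule like this is exactly the kind of change that can misfire on other documents, which is the risk noted above.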

The second PDF is interesting as there are actually no space characters (0x20); instead the end of text character (0x03) is used. So any searches using spaces won't be picked up correctly. I'm pretty sure I can tweak our word detection code to pick up on this and other white space characters. As a result of the 0x03 character, most sentences are treated as one whole word, which explains why a search for "highlight" selects the whole line.
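A minimal Python sketch of such a word-detection tweak, assuming the extracted line is already a string in which the separator appears as the character 0x03. The names `word_spans` and `search` are hypothetical, and treating the entire C0 control range as separators is an assumption, not the fix that was actually applied.

```python
import re

# A word is a run of characters that are neither whitespace nor C0 control
# characters; the control range includes ETX (0x03).
WORD = re.compile(r"[^\s\x00-\x1f]+")


def word_spans(line):
    """Return (start, end, word) for every word in the line."""
    return [(m.start(), m.end(), m.group()) for m in WORD.finditer(line)]


def search(line, term):
    """Return the spans of words equal to term, ignoring case."""
    return [(s, e) for s, e, w in word_spans(line) if w.lower() == term.lower()]


line = "please\x03highlight\x03this"
print(search(line, "highlight"))  # → [(7, 16)]
```

With 0x03 treated as a separator, the hit covers only the word itself rather than the whole line.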

Patrick Corless added a comment - 24/Aug/12 10:29 AM

I have a fix for the use case mentioned but also came across a slightly different case in PDF-438. The issue is similar in that the tokens don't aid in human readability. The interesting postscript is as follows:

(PIL)Tj
9.96 0 0 9.96 137.16 83.04 Tm
-5.963 -1.217 Td
[(O)-7(T)-6(/)-1(REF)-307(W)2(P)-10(T)]TJ

The string "PILOT/REF WPT" should be extracted, but instead we get "PIL \n OT/REF WPT". The reason for the line return is that we assume the Tm transform will insert a line return because of the changed y translation, when in fact the Tm translation is immediately trumped by the translation specified by the Td.

After spending some time in the content parser and looking at how we detect line breaks and spaces, I think we could rework the logic. The only time we should be attempting a line break would be before one of the text showing operations TJ, Tj, ', " or during the TJ processing. I'll give it a shot and see what I can come up with.
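The proposed rework can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser's real code: the operator list format, the `extract` function, the 1.0 unit tolerance and the first Tm (which positions "PIL") are all hypothetical, and the ' and " operators as well as glyph advances are left out. Tm and Td only update the text line matrix; the baseline is compared only when text is actually shown.

```python
def extract(ops, y_tolerance=1.0):
    """Build text from (operator, operands) pairs.

    A line break is emitted only at a text showing operator, and only if the
    baseline differs from the one used by the previous shown text.
    """
    a, b, c, d, e, f = 1.0, 0.0, 0.0, 1.0, 0.0, 0.0  # text line matrix
    last_y = None
    out = []
    for operator, operands in ops:
        if operator == "Tm":
            a, b, c, d, e, f = operands
        elif operator == "Td":
            # Td translates in the current text space, so the offsets are
            # scaled by the matrix that Tm set.
            tx, ty = operands
            e, f = tx * a + ty * c + e, tx * b + ty * d + f
        elif operator in ("Tj", "TJ"):
            if last_y is not None and abs(f - last_y) > y_tolerance:
                out.append("\n")
            out.append(operands[0])  # operand is the already decoded string
            last_y = f
    return "".join(out)


ops = [
    ("Tm", (9.96, 0, 0, 9.96, 50.0, 70.92)),  # hypothetical start position
    ("Tj", ("PIL",)),
    ("Tm", (9.96, 0, 0, 9.96, 137.16, 83.04)),
    ("Td", (-5.963, -1.217)),
    ("TJ", ("OT/REF WPT",)),
]
print(extract(ops))  # → PILOT/REF WPT
```

With the scaling from Tm applied, the Td moves the baseline from 83.04 down to about 70.92, so under the assumed start position both strings land on the same line and no break is inserted.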

Patrick Corless added a comment - 02/Apr/13 1:24 PM

This seems to be a one-off document. Fixing this issue will break many other documents. Closing for now. The client can reopen the issue if requested.
