Details
-
Type: Bug
-
Status: Closed
-
Priority: Major
-
Resolution: Won't Fix
-
Affects Version/s: 4.3.2
-
Fix Version/s: 5.0
-
Component/s: Core/Parsing
-
Labels: None
-
Environment: any
-
ICEsoft Forum Reference:
Description
In some cases text search works incorrectly. More details and examples are in the linked forum topic.
Activity
Aliaksei Kuliashou
created issue -
Patrick Corless
made changes -
Field | Original Value | New Value |
---|---|---|
Salesforce Case | [] | |
Component/s | Core [ 10022 ] | |
Fix Version/s | 4.3.3 [ 10333 ] |
Patrick Corless
made changes -
Salesforce Case | [] | |
Fix Version/s | 4.3.4 [ 10341 ] | |
Fix Version/s | 4.3.3 [ 10333 ] |
Migration
made changes -
Fix Version/s | 4.5 [ 10342 ] | |
Fix Version/s | 4.3.4 [ 10341 ] |
Patrick Corless
made changes -
Fix Version/s | 5.0 [ 10314 ] | |
Fix Version/s | 4.5 [ 10342 ] |
Patrick Corless
made changes -
Status | Open [ 1 ] | Resolved [ 5 ] |
Resolution | Won't Fix [ 2 ] |
Patrick Corless
made changes -
Status | Resolved [ 5 ] | Closed [ 6 ] |
The first issue, where "text" is split into two words, "t" and "ext", is related to how the PDF is encoded.
BT
/F1 11.04 Tf
1 0 0 1 85.104 774.84 Tm
0 g
0 G
[(t)] TJ
ET
BT
1 0 0 1 88.824 774.84 Tm
[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )-3(w)8(o)-5(r)12(k)] TJ
ET
There are two text blocks defined, one for the "t" and the other for the rest of the words. Our text extractor treats each block as a separate sentence, which has worked well for us in the past. There is no notion of words or paragraphs in a content stream, so we have to look for clues that might indicate a space or line break. I can try removing our BT/line break code, but I think it will do more harm than good.
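To illustrate the kind of clue a text extractor could use here, the following is a minimal Python sketch, not ICEpdf's actual code. It concatenates the string operands of each TJ array from the content stream above and then applies a hypothetical heuristic: two blocks belong to the same word when they share a baseline and their Tm origins are closer than half the font size. The function names, the `max_gap_ratio` parameter, and the 0.5 threshold are assumptions made for illustration only.

```python
import re


def tj_text(tj_array: str) -> str:
    """Concatenate the string operands of a TJ array, ignoring kerning numbers.

    Simplification: does not handle escaped or nested parentheses.
    """
    return "".join(re.findall(r"\(([^()]*)\)", tj_array))


def same_word(prev_x, prev_y, next_x, next_y, font_size, max_gap_ratio=0.5):
    """Hypothetical heuristic: two blocks form one word when they sit on the
    same baseline and their origins are closer than max_gap_ratio * font_size.

    Comparing origins is only reasonable when the first block is a single
    glyph; a real extractor would need glyph advance widths.
    """
    gap = next_x - prev_x
    return prev_y == next_y and 0 <= gap < max_gap_ratio * font_size


# Values copied from the two BT ... ET blocks in the content stream.
first = tj_text("[(t)]")
second = tj_text(
    "[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )"
    "-3(w)8(o)-5(r)12(k)]"
)

if same_word(85.104, 774.84, 88.824, 774.84, font_size=11.04):
    joined = first + second          # no break between the blocks
else:
    joined = first + " " + second    # treat the block boundary as a break

print(joined)  # text these words must work
```

The origins are only 3.72 units apart at a font size of 11.04, so this sketch merges the blocks. A threshold like this would also merge blocks that really are separate words in other documents, which is the "more harm than good" risk noted above.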
The second PDF is interesting, as there are actually no space characters (0x20); instead, the end-of-text character (0x03) is used. So any searches using spaces won't get picked up correctly. I'm pretty sure I can tweak our word detection code to pick up on this and other white space characters. As a result of the 0x03 character, most sentences are treated as a single word, which explains why a search for "highlight" selects the whole line.
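The word detection tweak could look roughly like the Python sketch below, which is an illustration and not the ICEpdf implementation. It treats 0x03 as a word separator alongside the usual white space characters. The separator set, the function name, and the sample line are all assumptions; the real extracted text from the second PDF is not shown in this issue.

```python
# Assumed separator set: the space character, the 0x03 character seen in the
# second PDF, and the other PDF white space characters (NUL, tab, line feed,
# form feed, carriage return).
WORD_SEPARATORS = frozenset({" ", "\x03", "\x00", "\t", "\n", "\x0c", "\r"})


def split_words(text, separators=WORD_SEPARATORS):
    """Split text into words, breaking on any character in `separators`."""
    words, current = [], []
    for ch in text:
        if ch in separators:
            if current:                  # close the word in progress
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                          # flush the trailing word
        words.append("".join(current))
    return words


# Hypothetical extracted line that uses 0x03 where a space would be expected.
line = "please\x03highlight\x03this"

# Splitting on spaces only reproduces the reported behaviour: one giant word.
print(split_words(line, separators={" "}) == [line])  # True

# With 0x03 treated as a separator, "highlight" becomes its own word.
print(split_words(line))  # ['please', 'highlight', 'this']
```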