The first issue, where "text" is split into two words, "t" and "ext", is related to how the PDF is encoded:
BT
/F1 11.04 Tf
1 0 0 1 85.104 774.84 Tm
0 g
0 G
[(t)] TJ
ET
BT
1 0 0 1 88.824 774.84 Tm
[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )-3(w)8(o)-5(r)12(k)] TJ
ET
There are two text blocks defined, one for the "t" and the other for the rest of the words. Our text extractor treats each block as a separate sentence, and this has worked well for us in the past. There is no notion of words or paragraphs in a content stream, so we have to look for clues that might indicate a space or line break. I can try removing our BT/line break code, but I think it will do more harm than good.
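One possible middle ground, rather than removing the BT/line break code, is to join two blocks only when they sit on the same baseline and the second starts roughly where the first ends. The sketch below is not our extractor's code. It is a minimal illustration in Python that parses the stream above with regular expressions, and the names `parse_blocks`, `merge_adjacent`, `AVG_GLYPH_EM` and `MAX_GAP_EM` are hypothetical. It guesses glyph widths as a fixed fraction of the font size, where a real implementation would read them from the font's width table.

```python
import re

# Content stream copied from the PDF discussed above.
STREAM = """BT
/F1 11.04 Tf
1 0 0 1 85.104 774.84 Tm
0 g
0 G
[(t)] TJ
ET
BT
1 0 0 1 88.824 774.84 Tm
[(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )-3(w)8(o)-5(r)12(k)] TJ
ET"""

BLOCK_RE = re.compile(r"\bBT\b(.*?)\bET\b", re.S)
TF_RE = re.compile(r"/\S+\s+([\d.]+)\s+Tf")          # font size operand of Tf
TM_RE = re.compile(r"([-\d.]+)\s+([-\d.]+)\s+Tm")    # last two Tm operands: x, y
TJ_RE = re.compile(r"\[(.*?)\]\s*TJ", re.S)          # body of a TJ array
STR_RE = re.compile(r"\(((?:\\.|[^\\)])*)\)")        # simple (string), no nesting

# Assumed values for illustration only.
AVG_GLYPH_EM = 0.5   # guessed average glyph width, as a fraction of font size
MAX_GAP_EM = 0.25    # largest gap still treated as "same word"


def parse_blocks(stream):
    """Return one dict per BT/ET block with position, font size and text."""
    blocks = []
    font_size = 0.0  # Tf persists across text objects, so carry it forward
    for match in BLOCK_RE.finditer(stream):
        body = match.group(1)
        tf = TF_RE.search(body)
        if tf:
            font_size = float(tf.group(1))
        tm = TM_RE.search(body)
        x, y = (float(tm.group(1)), float(tm.group(2))) if tm else (0.0, 0.0)
        # The small kerning numbers between strings are ignored here.
        text = "".join(
            s for arr in TJ_RE.findall(body) for s in STR_RE.findall(arr)
        )
        blocks.append({"x": x, "y": y, "size": font_size, "text": text})
    return blocks


def merge_adjacent(blocks):
    """Join a block onto the previous one when it continues the same line."""
    merged = []
    for block in blocks:
        if merged:
            prev = merged[-1]
            same_line = abs(prev["y"] - block["y"]) < 0.1 * block["size"]
            est_end = prev["x"] + len(prev["text"]) * AVG_GLYPH_EM * prev["size"]
            if same_line and block["x"] - est_end < MAX_GAP_EM * block["size"]:
                prev["text"] += block["text"]
                continue
        merged.append(dict(block))
    return merged


print([b["text"] for b in merge_adjacent(parse_blocks(STREAM))])
# prints ['text these words must work']
```

With a guessed glyph width the threshold is crude, so a heuristic like this could also join blocks that really are separate words. That is the same trade-off as removing the BT/line break code, only narrower.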
The second PDF is interesting, as there are actually no space characters (0x20); instead, the end-of-text character (0x03) is used. So any search that relies on spaces won't match correctly. Because of the 0x03 character, most sentences are treated as a single word, which explains why a search for "highlight" selects the whole line. I'm pretty sure I can tweak our word detection code to pick up on this and other white space characters.
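To illustrate the kind of tweak I have in mind, here is a minimal sketch in Python (again, not our actual word detection code) that treats 0x03 as a word separator alongside ordinary white space. The names `EXTRA_SEPARATORS`, `find_words` and `search_word` are hypothetical, and the sample string is made up to mimic the second PDF.

```python
import re

# Extra characters to treat as word separators in addition to ordinary
# white space. 0x03 (ETX) is what the second PDF uses in place of a space;
# other characters can be appended to this string if they turn up.
EXTRA_SEPARATORS = "\x03"

# A word is a run of characters that are neither white space nor separators.
WORD_RE = re.compile("[^\\s" + re.escape(EXTRA_SEPARATORS) + "]+")


def find_words(text):
    """Return (start, end, word) for each word in the extracted text."""
    return [(m.start(), m.end(), m.group()) for m in WORD_RE.finditer(text)]


def search_word(text, query):
    """Return the (start, end) offsets of whole-word matches for query."""
    query = query.lower()
    return [(s, e) for s, e, w in find_words(text) if w.lower() == query]


# Made-up sample: words separated by 0x03 instead of spaces.
line = "Please\x03highlight\x03this\x03word"

# Splitting on ordinary white space sees the whole line as one word.
print(len(line.split()))               # prints 1

# With 0x03 as a separator only the requested word is selected.
print(search_word(line, "highlight"))  # prints [(7, 16)]
```

The offsets returned by `search_word` cover only the matched word, so the highlight would no longer span the whole line.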