ICEpdf / PDF-424

Document text search works incorrectly in some cases

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 4.3.2
    • Fix Version/s: 5.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      In some cases text search works incorrectly. You can see more details and examples in the linked forum topic.

        Activity

        Patrick Corless added a comment -

        The first issue, where "text" is split into the two words "t" and "ext", is related to how the PDF is encoded.

        BT
        /F1 11.04 Tf
        1 0 0 1 85.104 774.84 Tm
        0 g
        0 G
        [(t)] TJ
        ET
        BT
        1 0 0 1 88.824 774.84 Tm
        [(ex)-3(t)9( )-4(thes)10(e )7(wo)-7(rd)4(s)11( m)-7(u)3(s)11(t )-3(w)8(o)-5(r)12(k)] TJ
        ET

        There are two text blocks defined, one for the "t" and the other for the rest of the words. Our text extractor treats each block as a separate sentence, and this has worked well for us in the past. There is no notion of words or paragraphs in a content stream, so we have to look for clues that might indicate a space or line break. I can try removing our BT/line break code, but I think it will do more harm than good.
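
To illustrate the kind of clue mentioned above, here is a minimal sketch of a gap-based heuristic for deciding whether two BT/ET blocks belong to the same word. It is not ICEpdf's extractor: the class `TextBlockJoiner`, its `Block` type, and the threshold constants are hypothetical, and the end position of each block is assumed to be known from the font's glyph widths, which are not part of the excerpt.

```java
import java.util.List;

/** Hypothetical sketch of a block-joining heuristic; not ICEpdf's actual code. */
final class TextBlockJoiner {

    /** Assumed threshold: a gap wider than this fraction of the font size counts as a space. */
    static final double SPACE_FRACTION = 0.2;
    /** Assumed threshold: a baseline shift larger than this fraction of the font size is a new line. */
    static final double LINE_FRACTION = 0.5;

    /** One BT/ET block reduced to the values the heuristic needs. */
    static final class Block {
        final String text;
        final double startX;    // x translation taken from the block's Tm
        final double endX;      // x after the last glyph (requires font widths)
        final double baselineY; // y translation taken from the block's Tm
        final double fontSize;  // size operand of Tf

        Block(String text, double startX, double endX, double baselineY, double fontSize) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Joins blocks, inserting a newline, a space, or nothing between neighbours. */
    static String join(List<Block> blocks) {
        StringBuilder out = new StringBuilder();
        Block prev = null;
        for (Block b : blocks) {
            if (prev != null) {
                boolean sameLine =
                        Math.abs(b.baselineY - prev.baselineY) < LINE_FRACTION * prev.fontSize;
                double gap = b.startX - prev.endX;
                if (!sameLine) {
                    out.append('\n');
                } else if (gap > SPACE_FRACTION * prev.fontSize) {
                    out.append(' ');
                }
                // Otherwise the blocks touch, so the word simply continues.
            }
            out.append(b.text);
            prev = b;
        }
        return out.toString();
    }
}
```

With the two blocks from the stream above, and assuming the "t" glyph ends at x = 88.824 where the second block starts, `join` would return "text these words must work" rather than splitting off the "t". Blocks that run right to left or overlap are not handled by this sketch.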

        The second PDF is interesting as there are actually no space characters (0x20); instead the end-of-text character (0x03) is used. So any searches using spaces won't be picked up correctly. I'm pretty sure I can tweak our word detection code to pick up on this and other white space characters. As a result of the 0x03 character, most sentences are treated as a single word, which explains why a search for "highlight" selects the whole line.
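
A minimal sketch of the kind of word detection tweak described above, assuming extracted text arrives as a Java `String`. The class `WordBreaks` and its methods are hypothetical and not part of the ICEpdf API. Note that `Character.isWhitespace` does not treat 0x03 or the non-breaking space as whitespace, so both are checked explicitly here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of word splitting that tolerates 0x03 separators. */
final class WordBreaks {

    private static final char END_OF_TEXT = 0x03;     // used as a separator in the second PDF
    private static final char NO_BREAK_SPACE = 0xA0;  // assumed to be worth treating as a break too

    /** Returns true if the character should end the current word. */
    static boolean isWordBreak(char c) {
        return Character.isWhitespace(c) || c == END_OF_TEXT || c == NO_BREAK_SPACE;
    }

    /** Splits a line of extracted text into words, dropping the separators. */
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (isWordBreak(c)) {
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With this check in place, a line whose words are separated only by 0x03 splits into individual words, so a search for "highlight" can match just that word instead of the whole line.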

        Patrick Corless added a comment -

        I have a fix for the use case mentioned but also came across a slightly different case in PDF-438. The issue is similar in that the tokens don't aid human readability. The interesting content stream fragment is as follows:

        (PIL)Tj
        9.96 0 0 9.96 137.16 83.04 Tm
        -5.963 -1.217 Td
        [(O)-7(T)-6(/)-1(REF)-307(W)2(P)-10(T)]TJ

        The string "PILOT/REF WPT" should be extracted, but instead we get "PIL \n OT/REF WPT". The reason for the line return is that we assume the Tm transform will insert a line return because of the changed y translation. In fact, the position set by the Tm is immediately changed by the translation specified by the Td.

        After spending some time in the content parser and looking at how we detect line breaks and spaces, I think we could rework the logic. The only time we should be attempting a line break would be before one of the text showing operations TJ, Tj, ', " or during the TJ processing. I'll give it a shot and see what I can come up with.
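
The reworked logic could look roughly like the sketch below: positioning operators only update the text line matrix, and the line-break decision is deferred until text is actually shown. The class `TextPosition` and its method names are hypothetical and not ICEpdf's parser; the tolerance value is an assumption, and the ' and " operators (which move to the next line themselves) are not modelled.

```java
/** Hypothetical sketch: decide on line breaks only when text is shown. */
final class TextPosition {

    // Text line matrix [a b c d e f], as set by Tm and adjusted by Td.
    private double a = 1, b = 0, c = 0, d = 1, e = 0, f = 0;
    private double lastShownY = Double.NaN; // baseline of the previous Tj/TJ, if any
    private final double tolerance;         // assumed threshold in user-space units

    TextPosition(double tolerance) {
        this.tolerance = tolerance;
    }

    /** Tm: replaces the matrix. No line-break decision is made here. */
    void tm(double a, double b, double c, double d, double e, double f) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.e = e; this.f = f;
    }

    /** Td: translates by (tx, ty) in text space, relative to the current matrix. */
    void td(double tx, double ty) {
        e = tx * a + ty * c + e;
        f = tx * b + ty * d + f;
    }

    double x() { return e; }

    double y() { return f; }

    /** Call before Tj/TJ: returns true if the baseline moved since the last shown text. */
    boolean show() {
        boolean lineBreak = !Double.isNaN(lastShownY)
                && Math.abs(f - lastShownY) > tolerance;
        lastShownY = f;
        return lineBreak;
    }
}
```

For the fragment above, calling `tm(9.96, 0, 0, 9.96, 137.16, 83.04)` followed by `td(-5.963, -1.217)` leaves the baseline at roughly y = 70.92 rather than 83.04, because the Td offset is scaled by the matrix that Tm just set. The baseline used for (PIL) is not part of the excerpt; if it was also about 70.92, `show()` reports no line break and "PILOT/REF WPT" stays on one line.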

        Patrick Corless added a comment -

        This seems to be a one-off document. Fixing this issue will break many other documents. Closing for now. The client can reopen the issue if needed.


          People

          • Assignee: Patrick Corless
          • Reporter: Aliaksei Kuliashou
          • Votes: 0
          • Watchers: 3