ICEpdf / PDF-853

DocumentSearchController.searchPage returns 0 matches for search text that exists in the document

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.1.1
    • Fix Version/s: 5.1.2
    • Component/s: API
    • Labels: None
    • Environment: Windows 7

      Description

      Hello,
      I regularly use the searchPage method of DocumentSearchController, and it generally works fine.
      Unfortunately, it fails in the following case:

      Searching for "Article 45 of the Constitution" returns 0 matches.
      I tried both the API (i.e. DocumentSearchController.searchPage) and the Search tab of the viewer.
      The phrase exists in the document, and Acrobat Reader finds several occurrences.

      To rule out an issue with the CMap, I checked the CIDs and Unicode values of a page containing this phrase (p. 239),
      and they seem correct:
      cid=0x41:unic=A:0x41
      cid=0x72:unic=r:0x72
      cid=0x74:unic=t:0x74
      cid=0x69:unic=i:0x69
      cid=0x63:unic=c:0x63
      cid=0x6c:unic=l:0x6c
      cid=0x65:unic=e:0x65
      cid=0x20:unic= :0x20
      cid=0x34:unic=4:0x34
      cid=0x35:unic=5:0x35
      cid=0x20:unic= :0x20
      cid=0x6f:unic=o:0x6f
      cid=0x66:unic=f:0x66
      cid=0x20:unic= :0x20
      cid=0x74:unic=t:0x74
      cid=0x68:unic=h:0x68
      cid=0x65:unic=e:0x65
      cid=0x20:unic= :0x20
      cid=0x43:unic=C:0x43
      cid=0x6f:unic=o:0x6f
      cid=0x6e:unic=n:0x6e
      cid=0x73:unic=s:0x73
      cid=0x74:unic=t:0x74
      cid=0x69:unic=i:0x69
      cid=0x74:unic=t:0x74
      cid=0x75:unic=u:0x75
      cid=0x74:unic=t:0x74
      cid=0x69:unic=i:0x69
      cid=0x6f:unic=o:0x6f
      cid=0x6e:unic=n:0x6e
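
      The dump above can be checked mechanically. The following is a minimal sketch, not part of ICEpdf: the class name CidDumpCheck and its decode method are hypothetical, and the array simply repeats the trailing hex value of each dump line. It rebuilds the string from the code points and compares it with the search phrase.

```java
final class CidDumpCheck {
    // Code points copied from the trailing hex field of each dump line, in order.
    static final int[] UNICODE_VALUES = {
        0x41, 0x72, 0x74, 0x69, 0x63, 0x6c, 0x65, 0x20,
        0x34, 0x35, 0x20, 0x6f, 0x66, 0x20, 0x74, 0x68, 0x65, 0x20,
        0x43, 0x6f, 0x6e, 0x73, 0x74, 0x69, 0x74, 0x75, 0x74, 0x69, 0x6f, 0x6e
    };

    // Builds a String from the code points using the standard String(int[], int, int) constructor.
    static String decode(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String decoded = decode(UNICODE_VALUES);
        System.out.println(decoded);
        // Compare against the phrase that the search fails to find.
        System.out.println(decoded.equals("Article 45 of the Constitution"));
    }
}
```

      Since every cid equals its Unicode value in the dump, a match here supports the conclusion that the character mapping is not the cause.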

      Moreover, the phrase is not split over multiple lines.
      The PDF is available at http://www.itu.int/dms_pub/itu-s/oth/02/02/S02020000244501PDFE.pdf

      I am really puzzled, thank you very much for your help.

        Activity

        Patrick Corless added a comment -

        Thanks for posting this issue. We're currently working with a customer on improving our text extraction ordering. In the document in question, an extra space is being added to a word. A fix for this should be available soon.

        Patrick Corless added a comment -

        The issue is related to how DocumentSearchControllerImpl splits the search phrase into words. A change made as part of PDF-745 introduced the issue, but the JIRA contains no reference to what the new code was for. Removing that code for now, as the search now behaves more like the word detection algorithm used for text extraction.

        Olivier Chuzel added a comment -

        Thank you very much for your quick fix, I am eager to use it.

        Olivier Chuzel added a comment -

        "Article 45 of the Constitution" is found with version 5.1.2, but there is a regression: 'No. 5.33' is no longer found, although searching for 'No. 5.' returns occurrences of No. 5.33.
        Shall I create a new issue?
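
        One way such a symptom could arise is a mismatch between how the search phrase and the page text are split into tokens. The sketch below is purely illustrative and is not ICEpdf's implementation: the class TokenMismatchSketch, its methods, and both splitting rules are assumptions. It shows that if the page side splits at every '.', while the search side keeps a '.' between two digits inside the number, then 'No. 5.' matches but 'No. 5.33' does not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TokenMismatchSketch {
    // Hypothetical splitter: letters and digits form words, every other
    // character is its own token. With keepDecimals, a '.' that sits
    // between two digits stays inside the surrounding number.
    static List<String> split(String text, boolean keepDecimals) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean decimalPoint = keepDecimals && c == '.'
                    && i > 0 && i + 1 < text.length()
                    && Character.isDigit(text.charAt(i - 1))
                    && Character.isDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || decimalPoint) {
                word.append(c);
            } else {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // The page is always split at every separator (assumed page-side rule);
    // the search term uses the rule selected by keepDecimalsInTerm.
    static boolean found(String pageText, String term, boolean keepDecimalsInTerm) {
        List<String> pageTokens = split(pageText, false);
        List<String> termTokens = split(term, keepDecimalsInTerm);
        return Collections.indexOfSubList(pageTokens, termTokens) >= 0;
    }

    public static void main(String[] args) {
        String page = "see No. 5.33 of";
        System.out.println(found(page, "No. 5.", true));    // mismatched rules, prefix term
        System.out.println(found(page, "No. 5.33", true));  // mismatched rules, full term
        System.out.println(found(page, "No. 5.33", false)); // same rule on both sides
    }
}
```

        In this sketch the match returns only when both sides apply the same splitting rule, which is the general point; whether this is the actual cause in 5.1.2 is not established here.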


          People

          • Assignee: Patrick Corless
          • Reporter: Olivier Chuzel
          • Votes: 0
          • Watchers: 2