ICEpdf / PDF-992

Search not working with provided PDF file

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.1
    • Fix Version/s: 6.2
    • Component/s: Core/Rendering
    • Labels:
      None
    • Environment:
      All

      Description

      With the provided PDF file, searching for terms that appear in the PDF returns no results. The same search in Adobe returns the correct results.

      For example, searching for "Table" returns no results even though the word clearly appears in the PDF content multiple times.

        Activity

        Arran Mccullough created issue -
        Slawomir Mikula added a comment -

        Could you first analyze what the real reason for the search failure is? It could perhaps be corrected during PDF generation, but we need to know the real cause of this issue.

        Patrick Corless added a comment -

        The PDF has a strange style of encoding. It uses a negative font size as well as a text scale, which I think is causing our text extraction code to incorrectly interpret the spaces between characters. I've made a small tweak to the font engine and the words are now correctly spaced for searching, but there is still an issue with the location of the highlight box: it is slightly off center with respect to the underlying text.

        This might take a while to figure out, as it affects how we fundamentally lay out and draw text.
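
        To illustrate how a negative font size could plausibly upset space detection, here is a minimal sketch. It is not ICEpdf's actual heuristic: the class name, the method names and the 0.2 ratio are all assumptions made for illustration. The idea is that a threshold derived from a signed font size becomes negative, so even a zero gap between glyphs looks like a space, while comparing magnitudes avoids that.

```java
// Hypothetical sketch, not ICEpdf code. SPACE_FRACTION is an assumed ratio.
final class SpaceHeuristic {
    // Assumed: a gap wider than 20% of the font size counts as a word space.
    static final double SPACE_FRACTION = 0.2;

    // Naive check using the signed font size. With fontSize < 0 the
    // threshold is negative, so any non-negative gap is reported as a space.
    static boolean naiveIsSpace(double gap, double fontSize) {
        return gap > fontSize * SPACE_FRACTION;
    }

    // Sign-independent check: compare magnitudes, so the result is the same
    // whether the text advances left-to-right or right-to-left.
    static boolean isSpace(double gap, double fontSize) {
        return Math.abs(gap) > Math.abs(fontSize) * SPACE_FRACTION;
    }
}
```

        With a font size of -10, the naive check treats a gap of 0 as a space, which would split a word into single characters, whereas the magnitude-based check does not.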

        Slawomir Mikula added a comment -

        Can you provide some additional technical information about this PDF document (exact font size, text scale)? I'll work with our customer to send a bug report to the software manufacturer (WS-CAD software). Of course there is no ETA for an answer, but it would be best to correct this on their side. Thanks.

        For now, I've created a workaround for this kind of file: extract the page text, remove all whitespace, and search through the result. I'm only using this for headless search. With this workaround I can only search continuous text, but right now it's all I can do.
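
        A minimal sketch of the workaround described above, assuming the page text has already been extracted into a String. The class and method names are hypothetical and the extraction step itself is left out. Whitespace is removed from both the page text and the search term before a plain substring match.

```java
// Hypothetical sketch of the whitespace-stripping workaround; the page text
// is assumed to have been extracted elsewhere and passed in as a String.
final class WhitespaceInsensitiveSearch {

    // Returns the input with every whitespace character removed.
    static String stripWhitespace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Case-sensitive match that ignores all whitespace on both sides.
    static boolean containsIgnoringWhitespace(String pageText, String term) {
        String needle = stripWhitespace(term);
        if (needle.isEmpty()) {
            return false; // an empty term would otherwise match everything
        }
        return stripWhitespace(pageText).contains(needle);
    }
}
```

        Because word boundaries are lost, this approach can only match continuous text and may report hits that span two words, for example "table" inside "not able".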

        Patrick Corless added a comment -

        Are your searches only being done in a headless environment, or is that just for your workaround? The fix I have for the font engine fixes the search algorithm, but there are still some issues with painting the search highlight in the correct location.

        Slawomir Mikula added a comment -

        Automatic search is done in headless mode. We use it to find items from the system structure and map them to pages in the generated documentation. We have also integrated the Swing viewer for this documentation, so our customer can search the document manually through the GUI. Headless search is our primary concern. If the highlight location is somewhat offset, I think the customer can live with it.
        Anyway, if you could provide some example data from the PDF structure that is "wrong" (e.g. negative font size), I can raise an issue with WS-CAD and ask them to correct the document.

        Patrick Corless made changes -
        Field Original Value New Value
        Fix Version/s 6.2 [ 13090 ]
        Patrick Corless added a comment -

        This is a very unusual PDF with a text layout that we haven't seen before, or that no one has noticed. When creating the bounding box we always assume the box should be created with the x,y at the lower left, but in this PDF the font size is negative, which creates a right-to-left layout that is then mirrored by the current graphics state (gs) transform. As a result we create the bounding box incorrectly and everything goes sideways from there.
        I've added some code to detect the negative layout so that we create the correct bounds. Everything works as expected afterwards.
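
        The kind of correction described above can be sketched as follows. This is not the actual ICEpdf change, which is not shown in this issue. The class and method names are hypothetical, and it assumes the glyph extent arrives as an origin plus a signed width and height, where a negative font size yields negative extents. The sketch moves the origin back to the lower-left corner and makes the dimensions positive.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical sketch, not ICEpdf code: normalizes a glyph box whose
// width and/or height may be negative because of a negative font size.
final class GlyphBounds {

    // (x, y) is the text origin; width and height are signed extents.
    // The result always has its origin at the lower-left corner and
    // non-negative dimensions.
    static Rectangle2D.Double normalized(double x, double y,
                                         double width, double height) {
        double minX = Math.min(x, x + width);
        double minY = Math.min(y, y + height);
        return new Rectangle2D.Double(minX, minY,
                Math.abs(width), Math.abs(height));
    }
}
```

        For example, an origin of (100, 50) with extents of -12 by -10 normalizes to a box at (88, 40) that is 12 wide and 10 high, while a box with positive extents is returned unchanged.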

        Patrick Corless added a comment -

        Marking as fixed.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Arran Mccullough
          • Votes:
            1
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved: