ICEpdf / PDF-689

PDF text extraction failed, never ending

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.4
    • Fix Version/s: 5.0.5
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any, 5.0.4 professional version

      Description

      Using the sample PageTextExtraction class, text extraction fails on page 78 (0-based index) of the attached file.

      To be precise, the extraction never ends.
      I can't see what is happening because I don't have the source code for NContentParser.parseTextBlocks.
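
Until the hang itself is fixed, a caller can at least avoid blocking forever by running the extraction on a worker thread with a timeout. The following is a minimal sketch using only java.util.concurrent; the class name ExtractionGuard and the method callWithTimeout are hypothetical helpers, not part of ICEpdf. Note that the timeout only unblocks the caller: a parser stuck in a busy loop that never checks for interruption keeps running on the (daemon) worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of ICEpdf.
public class ExtractionGuard {

    /**
     * Runs the task on a daemon worker thread and returns its result,
     * or the fallback value if the task does not finish within timeoutMs.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMs, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "page-text-extraction");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the task checks for interruption.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed; surface the cause instead of hiding it.
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With ICEpdf on the classpath, the task would presumably be something like `() -> document.getPageText(78)`, where `document` is an already opened `org.icepdf.core.pobjects.Document`; pages that time out could then be logged and skipped.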

        Activity

        Felix Nicolas added a comment -

        Hmm, I am trying to find a smaller file, as mine is 16 MB.

        Felix Nicolas added a comment -

        While waiting for a sample document (smaller than 10 MB),
        here is a Java thread dump for this issue.

        Thread 25274: (state = IN_JAVA)
         - org.icepdf.core.util.content.b.a(byte[], int, int) @bci=2105 (Compiled frame; information may be imprecise)
         - org.icepdf.core.util.content.c.a(byte[], int, int) @bci=3 (Compiled frame)
         - org.icepdf.core.util.content.a.c() @bci=89 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=82 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=45 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=102 (Compiled frame)
         - org.icepdf.core.util.content.NContentParser.parseTextBlocks(byte[][]) @bci=189 (Compiled frame)
         - org.icepdf.core.pobjects.Page.getText() @bci=147, line=1449 (Interpreted frame)
         - org.icepdf.core.pobjects.Document.getPageText(int) @bci=27, line=1120 (Interpreted frame)
        
        Felix Nicolas added a comment -

        FYI, it works when using the Document#getPageViewText method.

        Patrick Corless added a comment -

        Thanks Felix, we'll take a look at it. The code path is slightly different for the two methods. We'll get a fix for this as soon as we get a sample. You can also email product.support@icesoft.com if the file is too large to attach to the case.

        Felix Nicolas added a comment -

        Hi Patrick, I dropped a test file here: https://dl.dropboxusercontent.com/u/2853503/123456.pdf

        The problem occurs on page 79 (78 with 0-based indexing).
        I just tested with your applet by running a search on the whole document;
        it still hangs on page 79 (it has been running for more than 15 minutes).

        Thanks,
        Nicolas

        Patrick Corless added a comment -

        The text extraction parser is a cut-down version of the full page content parser, intended to be faster because things like data and clipping parsing are ignored. In this particular file there is an embedded image on the 79th page whose character stream puts the parser into an infinite loop.

        I've touched up the text extraction code to properly parse inline images, so that it no longer chokes on the " character in this case, or on any other operator that might appear in the image stream. Inline images are very small and aren't used much anymore, so I'm not too worried about any performance loss.
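
To illustrate the idea behind the fix: in a PDF content stream an inline image has the form `BI <dictionary> ID <raw bytes> EI`, and the raw bytes may happen to look like operators (for example `"`, which is a text-showing operator). A text-only parser therefore has to jump over the whole image rather than tokenize its data. The sketch below shows one simple way to do that; the class InlineImageSkipper and its methods are hypothetical and are not the actual ICEpdf code. It uses a whitespace-delimited `EI` heuristic, which can be fooled by image data that itself contains such a sequence.

```java
// Hypothetical sketch, not the ICEpdf implementation.
public class InlineImageSkipper {

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE
    private static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // True if the two-character operator starts at pos and is delimited
    // by whitespace or by the start/end of the stream.
    private static boolean isOperatorAt(byte[] c, int pos, char first, char second) {
        if (pos < 0 || pos + 1 >= c.length) return false;
        if (c[pos] != first || c[pos + 1] != second) return false;
        boolean startOk = pos == 0 || isWhitespace(c[pos - 1]);
        boolean endOk = pos + 2 == c.length || isWhitespace(c[pos + 2]);
        return startOk && endOk;
    }

    /**
     * Given the offset of a BI operator, returns the offset just past the
     * matching EI. Returns c.length when the image is not terminated, so the
     * caller always moves forward and cannot loop on the same bytes.
     */
    public static int skipInlineImage(byte[] c, int biOffset) {
        int i = biOffset + 2;
        // Skip the image dictionary up to the ID operator.
        while (i < c.length && !isOperatorAt(c, i, 'I', 'D')) i++;
        i += 3; // "ID" plus the single whitespace byte before the data
        // Scan the raw image bytes for the terminating EI operator.
        while (i < c.length && !isOperatorAt(c, i, 'E', 'I')) i++;
        return Math.min(i + 2, c.length);
    }
}
```

The key design point is that the returned offset is always greater than the starting offset and never exceeds the stream length, so even a malformed or truncated image cannot keep the parser in place.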

        Patrick Corless added a comment -

        Checked in fix on the trunk and 5.0.1 branch.

        Felix Nicolas added a comment -

        Great, you have fixed it!
        Could you tell me when version 5.0.4 will be released? (It was previously announced for November 1st.)


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Felix Nicolas
          • Votes:
            0
            Watchers:
            2
