ICEpdf / PDF-689

PDF text extraction failed, never ending

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.4
    • Fix Version/s: 5.0.5
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any, 5.0.4 professional version
    • Salesforce Case Reference:

      Description

      Using the sample PageTextExtraction class, text extraction fails on page 78 (0-based index) of the attached file.

      To be precise, the extraction never ends.
      I can't see what is happening because I don't have the source code for NContentParser.parseTextBlocks.

        Activity

        Felix Nicolas created issue -
        Felix Nicolas added a comment -

        Hmm, I am trying to find a smaller file, as mine weighs 16 MB.

        Felix Nicolas added a comment -

        While waiting for a sample document (smaller than 10 MB),
        here is a Java thread dump for this issue.

        Thread 25274: (state = IN_JAVA)
         - org.icepdf.core.util.content.b.a(byte[], int, int) @bci=2105 (Compiled frame; information may be imprecise)
         - org.icepdf.core.util.content.c.a(byte[], int, int) @bci=3 (Compiled frame)
         - org.icepdf.core.util.content.a.c() @bci=89 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=82 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=45 (Compiled frame)
         - org.icepdf.core.util.content.a.a() @bci=102 (Compiled frame)
         - org.icepdf.core.util.content.NContentParser.parseTextBlocks(byte[][]) @bci=189 (Compiled frame)
         - org.icepdf.core.pobjects.Page.getText() @bci=147, line=1449 (Interpreted frame)
         - org.icepdf.core.pobjects.Document.getPageText(int) @bci=27, line=1120 (Interpreted frame)
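
The dump above shows the calling thread stuck inside the parser. Until a fixed build is available, a caller could bound the extraction with a timeout. The following is only a minimal sketch using the standard java.util.concurrent API: the class name ExtractionTimeoutGuard and the method runWithTimeout are hypothetical, and the Callable stands in for a call such as Document.getPageText(int), which is not invoked here. Note that cancelling a task only interrupts it; a parser loop that never checks for interruption keeps spinning, which is why the worker is a daemon thread.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of ICEpdf.
final class ExtractionTimeoutGuard {

    /**
     * Runs the task on a daemon thread and returns its result, or an empty
     * Optional if the task has not finished within timeoutMillis.
     */
    public static Optional<String> runWithTimeout(Callable<String> task, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "page-text-extraction");
            worker.setDaemon(true); // a stuck parser thread must not keep the JVM alive
            return worker;
        });
        Future<String> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // only effective if the task reacts to interruption
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the hanging extraction call; assumed for illustration only.
        Callable<String> stuckExtraction = () -> {
            Thread.sleep(60_000);
            return "never reached";
        };
        Optional<String> text = runWithTimeout(stuckExtraction, 500);
        System.out.println(text.isPresent() ? text.get() : "extraction timed out, skipping page");
    }
}
```

A guard like this only lets the caller skip the offending page; it does not address the underlying parser loop.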
        
        Felix Nicolas added a comment -

        FYI, it works using Document#getPageViewText method

        Patrick Corless added a comment -

        Thanks Felix, we'll take a look at it. The code path is slightly different for the two methods. We'll get a fix for this as soon as we get a sample. You can also email product.support@icesoft.com if the file is too large to attach to the case.

        Felix Nicolas added a comment -

        Hi Patrick, I dropped a test file here: https://dl.dropboxusercontent.com/u/2853503/123456.pdf

        The problem happens on page 79 (index 78, 0-based).
        I just tested with your applet and ran a search on the whole document;
        it still hangs on page 79 (running for more than 15 min).

        Thanks,
        Nicolas

        Patrick Corless added a comment -

        The text extraction parser is a cut-down version of the full page content parser, intended to be faster because things like data and clipping parsing are ignored. In this particular file there is an embedded image on the 79th page that contains a character stream that puts the parser into an infinite loop.

        I've touched up the text extraction code to properly parse inline images, so it no longer chokes on the " character in this case, or on any other operator that might appear in the image stream. Inline images are very small and aren't used much anymore, so I'm not too worried about any performance loss.
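
To illustrate the idea of stepping over inline image data before looking for text operators, here is a minimal sketch. It is not the actual ICEpdf fix: the class InlineImageSkipper and its methods are hypothetical. It relies on the PDF content stream syntax in which an inline image is written as BI, a dictionary, ID, the raw image bytes, and EI. It also makes simplifying assumptions: keywords are delimited by white space, BI does not appear inside a string literal, and the image bytes do not themselves contain a white-space-delimited EI.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ICEpdf.
final class InlineImageSkipper {

    // PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // True if the two-byte keyword starts at pos and is delimited by
    // white space or by the start/end of the stream.
    static boolean isKeywordAt(byte[] data, int pos, char c1, char c2) {
        if (pos < 0 || pos + 1 >= data.length) return false;
        if (data[pos] != c1 || data[pos + 1] != c2) return false;
        boolean startOk = pos == 0 || isWhitespace(data[pos - 1]);
        boolean endOk = pos + 2 == data.length || isWhitespace(data[pos + 2]);
        return startOk && endOk;
    }

    // Returns the index just past the closing EI, or -1 if the image is unterminated.
    static int findEndOfInlineImage(byte[] content, int from) {
        int i = from;
        while (i < content.length && !isKeywordAt(content, i, 'I', 'D')) i++;
        if (i >= content.length) return -1;
        i += 3; // skip "ID" and the single white-space byte before the image data
        while (i < content.length) {
            if (isKeywordAt(content, i, 'E', 'I')) return i + 2;
            i++;
        }
        return -1;
    }

    // Copies the content stream, leaving out every BI ... ID ... EI segment.
    public static byte[] stripInlineImages(byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(content.length);
        int i = 0;
        while (i < content.length) {
            if (isKeywordAt(content, i, 'B', 'I')) {
                int end = findEndOfInlineImage(content, i + 2);
                if (end < 0) break; // unterminated image: stop instead of looping
                i = end;
                continue;
            }
            out.write(content[i]);
            i++; // every iteration advances, so the scan always terminates
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String stream = "BT (Hello) Tj ET BI /W 1 /H 1 /BPC 8 /CS /G ID \"x\" EI BT (World) Tj ET";
        byte[] cleaned = stripInlineImages(stream.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(cleaned, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the design is that bytes between ID and EI are never interpreted as operators, so a stray " inside the image data cannot be mistaken for the text-showing operator of the same name.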

        Patrick Corless made changes -
        Fix Version/s: set to 5.0.5 [ 11373 ]
        Patrick Corless added a comment -

        Checked in fix on the trunk and 5.0.1 branch.

        Patrick Corless made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Resolution: set to Fixed [ 1 ]
        Felix Nicolas added a comment -

        Cool, you have fixed it!
        Can you tell me when version 5.0.4 will be released? (It was previously announced for November 1st.)

        Judy Guglielmin made changes -
        Salesforce Case Reference: set to 5007000000ZDvw0AAD
        Patrick Corless made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Commits (ICEsoft Public SVN Repository)

        • Revision #50156, Wed Jan 18 12:37:57 MST 2017, patrick.corless
          Message: PDF-689 removed alpha set on common fill as it trumps any previous blending modes.
          MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/core/src/org/icepdf/core/util/content/AbstractContentParser.java
        • Revision #50157, Wed Jan 18 12:38:10 MST 2017, patrick.corless
          Message: PDF-689 removed alpha set on common fill as it trumps any previous blending modes.
          MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/content/AbstractContentParser.java
        • Revision #50158, Wed Jan 18 12:40:31 MST 2017, patrick.corless
          Message: PDF-689 added test for cid font pick, if so we default back to /helvetica, as we aren't doing the work to setup the cmap correctly for display.
          MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/core/src/org/icepdf/core/pobjects/acroform/VariableTextFieldDictionary.java
        • Revision #50159, Wed Jan 18 12:40:41 MST 2017, patrick.corless
          Message: PDF-689 added test for cid font pick, if so we default back to /helvetica, as we aren't doing the work to setup the cmap correctly for display.
          MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/acroform/VariableTextFieldDictionary.java

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Felix Nicolas
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: