ICEpdf / PDF-1173

Document.getPageText hangs on some files

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.2.4
    • Fix Version/s: None
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      Windows 10 64-bit, Java 1.8.0_131-b11

      Description

      Hello,

      When I try to extract text from some PDFs, the call to Document.getPageText() hangs.

      It looks like an infinite loop in the parser code. I found the following similar tickets reported and fixed in the past:
      http://jira.icesoft.org/browse/PDF-846
      http://jira.icesoft.org/browse/PDF-689

      Here is the thread dump:

      "reactor1" #13 prio=5 os_prio=0 tid=0x000000001d488800 nid=0xe1c runnable [0x000000002163e000]
         java.lang.Thread.State: RUNNABLE
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.a.a(Unknown Source)
      at org.icepdf.core.util.content.NContentParser.parseText(Unknown Source)
      at org.icepdf.core.util.content.NContentParser.parseTextBlocks(Unknown Source)
      at org.icepdf.core.pobjects.Page.getText(Page.java:1571)
      - locked <0x0000000080025d28> (a org.icepdf.core.pobjects.Page)
      at org.icepdf.core.pobjects.Document.getPageText(Document.java:1174)

      Unfortunately, I can't provide a sample document, because it contains very sensitive data. But I'm happy to run any tests locally to debug the issue and report the results to you, if that's possible.

      Thanks.

        Activity

        Patrick Corless added a comment -

        I suspect this will fail given the stack trace, but can you try calling document.getPageViewText() or loading the PDF in the viewer RI?

        Igor R added a comment -

        Hi Patrick,

        I tried to run the following test code:

        Document doc = new Document();
        doc.setFile(pdfFilePath);
        for (int page = 0; page < doc.getNumberOfPages(); page++) {
           System.out.println(" ---- Getting page " + page + " --- ");
           PageText pageText = doc.getPageViewText(page);
        }
        

        Here is the output:

        Jul 07, 2017 11:03:48 AM org.icepdf.core.pobjects.Document setInputStream
        WARNING: Cross reference deferred loading failed, will fall back to linear reading.
        Jul 07, 2017 11:03:48 AM org.icepdf.core.pobjects.Catalog <clinit>
        INFO: ICEsoft ICEpdf Core 6.2.4 
        Jul 07, 2017 11:03:48 AM org.icepdf.core.pobjects.CrossReference addXRefStreamEntries
        SEVERE: Error parsing xRef stream entries.
        java.io.EOFException
        	at org.icepdf.core.util.Utils.readIntWithVaryingBytesBE(Utils.java:98)
        	at org.icepdf.core.pobjects.CrossReference.addXRefStreamEntries(CrossReference.java:199)
        	at org.icepdf.core.util.Parser.getObject(Parser.java:297)
        	at org.icepdf.core.pobjects.Document.loadDocumentViaLinearTraversal(Document.java:622)
        	at org.icepdf.core.pobjects.Document.setInputStream(Document.java:466)
        	at org.icepdf.core.pobjects.Document.setByteArray(Document.java:361)
        	at org.icepdf.core.pobjects.Document.setFile(Document.java:214)
        	at test.IcePdfTest.main(IcePdfTest.java:19)
        
        Jul 07, 2017 11:03:48 AM org.icepdf.core.pobjects.fonts.nfont.Font <clinit>
        INFO: ICEsoft ICEpdf Pro 6.2.4 
        Jul 07, 2017 11:03:48 AM org.icepdf.core.pobjects.ImageStream <clinit>
        INFO: Levigo JBIG2 image library was found on classpath
         ---- Getting page 0 --- 
         ---- Getting page 1 --- 
         ---- Getting page 2 --- 
        

        And here is the thread dump:

        "main" #1 prio=5 os_prio=0 tid=0x00000000031b9000 nid=0x1c80 runnable [0x000000000316e000]
           java.lang.Thread.State: RUNNABLE
        	at java.util.HashMap.put(HashMap.java:611)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.NContentParser.parseText(Unknown Source)
        	at org.icepdf.core.util.content.NContentParser.parse(Unknown Source)
        	at org.icepdf.core.pobjects.Page.init(Page.java:399)
        	- locked <0x000000071df3cdd8> (a org.icepdf.core.pobjects.Page)
        	at org.icepdf.core.pobjects.Page.getViewText(Page.java:1509)
        	at org.icepdf.core.pobjects.Document.getPageViewText(Document.java:1194)
        	at test.IcePdfTest.main(IcePdfTest.java:23)
        

        When I open it in the viewer RI it shows 3 blank pages and I can see it eating 10% CPU. Here is the thread dump for the viewer:

        "ICEpdf-thread-pool" #32 daemon prio=5 os_prio=0 tid=0x0000000029b1b800 nid=0x120c runnable [0x0000000038ade000]
           java.lang.Thread.State: RUNNABLE
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.a.a(Unknown Source)
        	at org.icepdf.core.util.content.NContentParser.parseText(Unknown Source)
        	at org.icepdf.core.util.content.NContentParser.parse(Unknown Source)
        	at org.icepdf.core.pobjects.Page.init(Page.java:399)
        	- locked <0x0000000718a043f0> (a org.icepdf.core.pobjects.Page)
        	at org.icepdf.ri.common.views.AbstractPageViewComponent$PageImageCaptureTask.call(AbstractPageViewComponent.java:410)
        

        I hope this helps.

        Patrick Corless added a comment -

        We are getting very close to a 6.3 release and have done quite a bit of work around ensuring that documents loaded via a linear traversal have the correct references. Without a test case I can't offer much more support just from the logs.
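
        Until a sample document can be shared, one way to narrow the hang down locally might be to run each page extraction under a time limit and record which pages do not finish. The following is a minimal sketch using only java.util.concurrent. The names PageTimeoutProbe and finishesWithin are hypothetical, and the tasks in main are stand-ins for a call such as doc.getPageText(page) from the test code earlier in this thread. Note that cancelling a Future only interrupts the worker thread; a parser loop that never checks the interrupt flag would keep running, which is why the worker is created as a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PageTimeoutProbe {

    /**
     * Runs the task on a separate daemon thread and waits at most timeoutMillis.
     * Returns true if the task finished (normally or with an exception) in time,
     * false if it was still running when the time limit expired.
     */
    public static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "page-probe");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a busy loop may ignore it
            return false;
        } catch (ExecutionException e) {
            // The task ended with an exception, so it did not hang.
            System.err.println("Task failed: " + e.getCause());
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in tasks. In a real run, assumed usage would be:
        //   finishesWithin(() -> doc.getPageText(page), 30_000)
        boolean fast = finishesWithin(() -> "page text", 1000);
        boolean slow = finishesWithin(() -> {
            while (true) {
                Thread.sleep(50); // simulates work that never completes
            }
        }, 200);
        System.out.println("fast task finished in time: " + fast);
        System.out.println("sleeping task finished in time: " + slow);
    }
}
```

        Calling finishesWithin once per page index and logging the pages that return false would at least identify which page triggers the loop, without the document itself having to leave the reporter's machine.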


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Igor R
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated: