Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.2.4
Fix Version/s: None
Component/s: Core/Parsing
Labels: None
Environment: Windows 10 64-bit, Java 1.8.0_131-b11
Description
Hello,
When I try to extract text from some PDFs, the call to Document.getPageText() never returns.
It looks like an infinite loop in the parser code. I found the following similar tickets, which were reported and fixed in the past:
http://jira.icesoft.org/browse/PDF-846
http://jira.icesoft.org/browse/PDF-689
Here is the thread dump:
"reactor1" #13 prio=5 os_prio=0 tid=0x000000001d488800 nid=0xe1c runnable [0x000000002163e000]
java.lang.Thread.State: RUNNABLE
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.a.a(Unknown Source)
at org.icepdf.core.util.content.NContentParser.parseText(Unknown Source)
at org.icepdf.core.util.content.NContentParser.parseTextBlocks(Unknown Source)
at org.icepdf.core.pobjects.Page.getText(Page.java:1571)
- locked <0x0000000080025d28> (a org.icepdf.core.pobjects.Page)
at org.icepdf.core.pobjects.Document.getPageText(Document.java:1174)
Unfortunately, I can't provide a sample document, because it's very sensitive data. But I'm happy to run any tests to debug the issue locally and report the results to you, if it's possible.
Thanks.
I suspect this will also fail given the stack trace, but can you try calling document.getPageViewText() or loading the PDF in the viewer RI?
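For running these checks locally without hanging the calling application, one option is to wrap each extraction call in a timeout. The following is only a minimal sketch using the standard java.util.concurrent classes. The class name TimedCall and the method callWithTimeout are hypothetical helpers and not part of ICEpdf. The task runs on a daemon thread, so a parser loop that ignores interruption will not keep the JVM alive, although the thread itself keeps spinning until the process exits.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of ICEpdf.
final class TimedCall {

    private TimedCall() {
    }

    /**
     * Runs the task on a daemon thread and waits at most timeoutMillis for it.
     * Returns Optional.empty() if the task did not finish in time
     * (a task that returns null is also reported as empty).
     */
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "timed-call");
            // Daemon thread: a task stuck in a loop cannot block JVM shutdown.
            thread.setDaemon(true);
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Requests interruption; code that never checks the interrupt
            // flag will keep running in the background.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // Surface the task's own failure as an unchecked exception.
            throw new RuntimeException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Assuming `document` is an already opened org.icepdf.core.pobjects.Document and `pageIndex` is the page under test, a call such as `TimedCall.callWithTimeout(() -> document.getPageText(pageIndex), 30_000)` returns an empty Optional for a page that hangs, which helps narrow down the affected pages without sharing the file. The thread dump shows the Page monitor held during parsing, so a timed-out page should not be touched again in the same process.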