Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 6.1.3
- Fix Version/s: 6.2
- Component/s: API, Core/Parsing
- Labels: None
- Environment: any
- ICEsoft Forum Reference:
Description
A community member is migrating from 4.x to 6.x and has run up against a few regressions in the expected results of the page text extraction calls. I've done a little digging around, and it appears that the document.getPageText() method call does not execute the same extraction algorithms as the page.getPageText() call.
This bug is a placeholder to review the text extraction API and make sure the non-visual page extraction calls use the same sorting calls as the visual page extraction calls.
I've reviewed our code and things seem to be in order. The sorting and formatting take place in the PageText call ArrayList<LineText> getPageLines(). The Document and Page calls getPageText() and getPageViewText() work as the javadoc suggests: they change the parser configuration, and getPageText() can be a lot faster for straight-up extraction with no page image capture.
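To illustrate why both extraction paths should return the same ordering, here is a minimal, self-contained toy model of the arrangement described above. It is not ICEpdf code: `ToyLine`, `ToyPageText`, `extract`, `joined`, and the `captureShapes` flag are hypothetical names invented for this sketch. The only idea taken from the review is that sorting lives in a single `getPageLines()` accessor, so the parser configuration cannot change line order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ExtractionSketch {

    /** Hypothetical line of text with its position in default PDF user space. */
    static final class ToyLine {
        final double x;
        final double y;
        final String text;

        ToyLine(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Hypothetical container; the only place where ordering is applied. */
    static final class ToyPageText {
        private final List<ToyLine> lines = new ArrayList<>();
        final boolean shapesCaptured;

        ToyPageText(boolean shapesCaptured) {
            this.shapesCaptured = shapesCaptured;
        }

        void add(ToyLine line) {
            lines.add(line);
        }

        /** Returns a sorted copy: top to bottom, then left to right. */
        ArrayList<ToyLine> getPageLines() {
            ArrayList<ToyLine> sorted = new ArrayList<>(lines);
            // Default user space has its origin at the bottom left,
            // so a larger y value is higher on the page.
            sorted.sort(Comparator.comparingDouble((ToyLine l) -> l.y)
                    .reversed()
                    .thenComparingDouble(l -> l.x));
            return sorted;
        }
    }

    /** Hypothetical parser entry point; the flag changes what else is captured, not the text. */
    static ToyPageText extract(List<ToyLine> contentStreamOrder, boolean captureShapes) {
        ToyPageText pageText = new ToyPageText(captureShapes);
        for (ToyLine line : contentStreamOrder) {
            pageText.add(line);
        }
        return pageText;
    }

    /** Joins the sorted lines with newlines so two extractions can be compared. */
    static String joined(ToyPageText pageText) {
        StringBuilder sb = new StringBuilder();
        for (ToyLine line : pageText.getPageLines()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(line.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Lines arrive in content-stream order, which need not match reading order.
        List<ToyLine> raw = Arrays.asList(
                new ToyLine(72, 100, "footer"),
                new ToyLine(72, 700, "title"),
                new ToyLine(300, 400, "right column"),
                new ToyLine(72, 400, "left column"));

        String textOnly = joined(extract(raw, false));
        String withShapes = joined(extract(raw, true));

        System.out.println(textOnly);
        System.out.println("same order: " + textOnly.equals(withShapes));
    }
}
```

In this model a regression of the kind reported could only appear if a caller read the raw line list instead of going through `getPageLines()`, which is the sort of thing worth checking when comparing 4.x and 6.x output.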
I've also touched up the viewer RI text extraction calls and the extraction examples to use the fontProperties manager, which speeds up the start time of the examples.