ICEpdf / PDF-1073

Consolidate Page text extraction sorting calls

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.3
    • Fix Version/s: 6.2
    • Component/s: API, Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      A community member is migrating from 4.x to 6.x and has run into a few regressions in the expected results of the page text extraction calls. After a little digging, it appears that the document.getPageText() method call does not execute the same extraction algorithms as the page.getPageText() call.

      This bug is a placeholder to review the text extraction API and make sure the non-visual page extraction calls use the same sorting calls as the visual page extraction calls.

        Activity

        Patrick Corless created issue -
        Patrick Corless made changes -
        Field Original Value New Value
        Fix Version/s 6.1.4 [ 13090 ]
        Patrick Corless added a comment -

        I've reviewed our code and things seem to be in order. The sorting and formatting take place in the PageText call ArrayList<LineText> getPageLines(). The Document and Page calls getPageText() and getPageViewText() work as the javadoc suggests; that is, they change the parser config, and getPageText() can be a lot faster for straight-up extraction with no page image capture.

        I've also touched up the viewer RI text extraction calls and the extraction examples to use the fontProperties manager to speed up the start time of the examples.
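
To illustrate the arrangement described in this comment, here is a minimal sketch in which two extraction entry points delegate to a single method that owns the sorting, so the two paths cannot drift apart. Every name in it (SharedSortSketch, Word, PageTextSketch, extractTextOnly, extractForView) is hypothetical and stands in for the real classes; it is not ICEpdf's implementation, and the row grouping by exact y value is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedSortSketch {

    /** Hypothetical positioned word; not an ICEpdf type. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical page text container that owns the ordering logic. */
    static final class PageTextSketch {
        private final List<Word> words = new ArrayList<>();

        void add(String text, double x, double y) {
            words.add(new Word(text, x, y));
        }

        /** The only place where sorting happens: top to bottom, then left to right. */
        List<String> getPageLines() {
            List<Word> sorted = new ArrayList<>(words);
            sorted.sort((a, b) -> {
                // Larger y is treated as higher on the page (bottom-left origin).
                int byRow = Double.compare(b.y, a.y);
                return byRow != 0 ? byRow : Double.compare(a.x, b.x);
            });
            List<String> lines = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            Double currentY = null;
            for (Word w : sorted) {
                // Start a new line whenever the y position changes.
                if (currentY != null && Double.compare(currentY, w.y) != 0) {
                    lines.add(current.toString());
                    current = new StringBuilder();
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(w.text);
                currentY = w.y;
            }
            if (current.length() > 0) {
                lines.add(current.toString());
            }
            return lines;
        }
    }

    /** Stand-in for the faster, text-only extraction path. */
    static List<String> extractTextOnly(PageTextSketch pageText) {
        return pageText.getPageLines();
    }

    /** Stand-in for the path that would also prepare the page for rendering. */
    static List<String> extractForView(PageTextSketch pageText) {
        // Rendering-related work would go here; ordering is still delegated.
        return pageText.getPageLines();
    }

    public static void main(String[] args) {
        PageTextSketch pageText = new PageTextSketch();
        pageText.add("world", 60, 700);
        pageText.add("Second", 10, 680);
        pageText.add("Hello", 10, 700);
        pageText.add("line", 60, 680);
        System.out.println(extractTextOnly(pageText));
        System.out.println(extractForView(pageText));
    }
}
```

Because neither entry point sorts on its own, a regression test only needs to assert that both return the same lines for the same input.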

        Patrick Corless added a comment -

        Marking as fixed.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #50004 Thu Jan 12 09:40:45 MST 2017 patrick.corless PDF-1073 touched up text extraction examples to use pageview or page text appropriately.
        Files Changed
        MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/examples/extraction/PageTextExtraction.java
        MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/examples/extraction/PageImageExtraction.java
        MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/examples/extraction/PageMetaDataExtraction.java
        MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/viewer/src/org/icepdf/ri/util/TextExtractionTask.java
        MODIFY /icepdf/branches/icepdf-6.1.0/icepdf/examples/search/SearchController.java
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #50005 Thu Jan 12 09:41:25 MST 2017 patrick.corless PDF-1073 touched up text extraction examples to use pageview or page text appropriately.
        Files Changed
        MODIFY /icepdf/trunk/icepdf/examples/search/SearchController.java
        MODIFY /icepdf/trunk/icepdf/examples/extraction/PageTextExtraction.java
        MODIFY /icepdf/trunk/icepdf/viewer/src/org/icepdf/ri/util/TextExtractionTask.java
        MODIFY /icepdf/trunk/icepdf/examples/extraction/PageMetaDataExtraction.java
        MODIFY /icepdf/trunk/icepdf/examples/extraction/PageImageExtraction.java
        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            1