ICEpdf / PDF-854

Copied text from PDF is pasted incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.1.1
    • Fix Version/s: 5.1.2
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      All

      Description

      When text is copied or extracted from a PDF file in the viewer, the pasted text does not keep the layout shown in the viewer: each character ends up on its own line. For example:

      Viewer text: 31.12.2013

      Pasted text:
       3
       1
      .
       1
       2
      .
       2
       0
       1
      3

        Activity

        Arran Mccullough created issue -
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #43990 Mon Feb 02 14:01:57 MST 2015 patrick.corless PDF-854 updates to text extraction ordering
        Files Changed
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/TextSprite.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/util/content/AbstractContentParser.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/WordText.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/LinePositionComparator.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/AbstractText.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/GlyphText.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/viewer/src/org/icepdf/ri/util/TextExtractionTask.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/annotations/FreeTextAnnotation.java
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/WordPositionComparator.java
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #43991 Mon Feb 02 14:08:09 MST 2015 patrick.corless PDF-854 changes to text extraction formating.
        Files Changed
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/TextSprite.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/annotations/FreeTextAnnotation.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/GlyphText.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/LinePositionComparator.java
        MODIFY /icepdf/trunk/icepdf/viewer/src/org/icepdf/ri/util/TextExtractionTask.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/annotations/ChoiceWidgetAnnotation.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/annotations/TextWidgetAnnotation.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/content/AbstractContentParser.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/WordPositionComparator.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/WordText.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/AbstractText.java
        Patrick Corless made changes -
        Field Original Value New Value
        Fix Version/s 5.1.2 [ 11872 ]
        Patrick Corless added a comment -

        The PDF in question contains a landscape page view. For some reason the text is laid out using a portrait layout. As a result the glyph coordinates advance along the y-axis instead of the usual x-axis, which explains why our page text extraction algorithm breaks down. I've added a fix which looks for the negative y shear value in the Tm matrix, which is responsible for the rotation.
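
        A minimal sketch of the kind of check described above, not the actual ICEpdf change. It assumes the Tm matrix [a b c d e f] is held in a java.awt.geom.AffineTransform, where getShearY() returns the b entry. The class and method names (TmRotationCheck, hasNegativeYShear, advanceCoordinate) are hypothetical.

```java
import java.awt.geom.AffineTransform;

final class TmRotationCheck {

    // Hypothetical helper: true when the text matrix has a negative y shear,
    // i.e. the "b" entry of the PDF matrix [a b c d e f] is below zero.
    static boolean hasNegativeYShear(AffineTransform tm) {
        return tm.getShearY() < 0;
    }

    // Assumption for illustration: when the text is rotated this way, successive
    // glyphs advance along y, so y (not x) is the coordinate to order glyphs by.
    static double advanceCoordinate(AffineTransform tm, double x, double y) {
        return hasNegativeYShear(tm) ? y : x;
    }

    public static void main(String[] args) {
        // AffineTransform(m00, m10, m01, m11, m02, m12) maps to PDF (a, b, c, d, e, f).
        AffineTransform upright = new AffineTransform(1.0, 0.0, 0.0, 1.0, 72.0, 700.0);
        AffineTransform rotated = new AffineTransform(0.0, -1.0, 1.0, 0.0, 72.0, 700.0);

        System.out.println(hasNegativeYShear(upright)); // false
        System.out.println(hasNegativeYShear(rotated)); // true
    }
}
```

        With a check like this, the extraction code can switch the axis it sorts glyphs on instead of treating every glyph as the start of a new line.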

        Patrick Corless added a comment -

        I've reworked the new-line detection code to take a few more units of measure into consideration before creating a new line of text. This seems to fix text extraction for the document in question. Triple-clicking on a line now selects all the text that visually makes up that line.
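
        A rough sketch of a tolerance check before starting a new line, under stated assumptions: glyph bounds are given as Rectangle2D, and a new line starts only when the vertical shift exceeds half the glyph height. The 0.5 factor and the names (LineBreakTolerance, startsNewLine, groupIntoLines) are illustrative and are not taken from PageText.java.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class LineBreakTolerance {

    // Assumed tolerance: fraction of the glyph height a glyph may shift
    // vertically and still count as part of the current line.
    static final double TOLERANCE = 0.5;

    // True when the vertical shift between two consecutive glyphs is larger
    // than the tolerance allows.
    static boolean startsNewLine(Rectangle2D previous, Rectangle2D current) {
        double shift = Math.abs(current.getY() - previous.getY());
        double limit = Math.max(previous.getHeight(), current.getHeight()) * TOLERANCE;
        return shift > limit;
    }

    // Groups glyph bounds, in content-stream order, into lines.
    static List<List<Rectangle2D>> groupIntoLines(List<Rectangle2D> glyphs) {
        List<List<Rectangle2D>> lines = new ArrayList<>();
        List<Rectangle2D> line = new ArrayList<>();
        Rectangle2D previous = null;
        for (Rectangle2D glyph : glyphs) {
            if (previous != null && startsNewLine(previous, glyph)) {
                lines.add(line);
                line = new ArrayList<>();
            }
            line.add(glyph);
            previous = glyph;
        }
        if (!line.isEmpty()) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Rectangle2D> glyphs = new ArrayList<>();
        glyphs.add(new Rectangle2D.Double(0, 100.0, 5, 10));
        glyphs.add(new Rectangle2D.Double(5, 101.0, 5, 10));  // small jitter, same line
        glyphs.add(new Rectangle2D.Double(10, 100.5, 5, 10)); // small jitter, same line
        glyphs.add(new Rectangle2D.Double(0, 80.0, 5, 10));   // large shift, new line
        System.out.println(groupIntoLines(glyphs).size()); // 2
    }
}
```

        Without the tolerance, any small difference in y between consecutive glyphs would start a new line, which is one way output like the pasted text in the description can arise.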

        Repository Revision Date User Message
        ICEsoft Public SVN Repository #44087 Thu Feb 19 10:07:30 MST 2015 patrick.corless PDF-854 addition of a tolerance check before creating a new line.
        Files Changed
        MODIFY /icepdf/branches/icepdf-5.0.1/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #44092 Thu Feb 19 11:44:15 MST 2015 patrick.corless PDF-854 addition of a tolerance check before creating a new line.
        Files Changed
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/text/PageText.java
        Patrick Corless added a comment -

        Marking as resolved.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Arran Mccullough
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: