ICEpdf / PDF-1003

GlyphText.getUnicode() gives wrong results for embedded-subset fonts with custom encoding, when FontEngine is ON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1.2
    • Fix Version/s: 6.1.3
    • Component/s: Font Engine
    • Labels:
      None
    • Environment:
      ICEpdf 6.1.2 Pro
      Java SE 8u92 x64
      Windows 10 x64

      Description

      h4. Summary
      When FontEngine is ON, GlyphText.getUnicode() does not correctly apply the custom encoding specified by an embedded font subset. The problem shows up when working with LineText/WordText in the document, or when copying text from the RI Viewer.

      h4. Steps to reproduce
      # Launch ICEpdf Pro RI Viewer
      # Open the attached PDF file
      # Make a text selection (e.g. in the "Briefing Strip" section of the document)
      # Copy selected text from RI Viewer, and paste it to a text editor

      h4. Expected behavior
      The copied text should contain the same characters as the displayed PDF.

      h4. Actual behavior
      * With FontEngine OFF, correct decoding is observed: http://i.imgur.com/cVgmdNI.png
      * With FontEngine ON, copied text is garbled (i.e. custom encoding is not translated to Unicode correctly): http://i.imgur.com/Njbbid6.png

        Activity

        Matvei Stefarov added a comment - edited

        I don't know how FontEngine works under the hood, but stepping through AbstractContentParser.drawString(...) in a debugger shows that all the necessary information is present at the time GlyphText objects are constructed. For example, the first embedded font in the attached PDF is used for the blue text ("Front Range") on the middle-right side of the page:

        • displayText contains CIDs, in the font's custom encoding: \u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\u0004\u0009\u000A
        • textState.font appears to contain an Encoding object named "WinAnsi+diffs" that maps these CIDs to standard symbol names:
          0 -> ".notdef"
          1 -> "F"
          2 -> "r"
          3 -> "o"
          4 -> "n"
          5 -> "t"
          6 -> "space"
          7 -> "R"
          8 -> "a"
          9 -> "g"
          10 -> "e"
          11 -> "space"
          

          ...so the CIDs can be mapped to: ["F", "r", "o", "n", "t", "space", "R", "a", "n", "g", "e"]

        • The OFont Encoding implementation (org.icepdf.core.pobjects.fonts.ofont.Encoding) has a mapping from these symbol names to UVs, from which it should be trivial to get to a Unicode string. I'm guessing that NFont's Encoding implementation has something similar.

        So, given that all this information is already available to the code, it seems that FontEngine should be able to work with the custom encoding without too much trouble.
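
To make the lookup chain in this comment concrete, here is a minimal, standalone Java sketch. It is not ICEpdf code: the class GlyphNameDecoder, the DIFFS array and the NAME_TO_UNICODE map are hypothetical names, and the name-to-Unicode step is reduced to the few glyph names that appear in the example (a real implementation would consult a full glyph list). The character codes are written as plain numbers rather than Unicode escapes, because the Java compiler would turn an escape for code 10 into a line break inside the source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of ICEpdf: decodes character codes through a
// code -> glyph name table and then a glyph name -> Unicode lookup.
class GlyphNameDecoder {

    // Code -> glyph name, copied from the "WinAnsi+diffs" listing in the comment.
    static final String[] DIFFS = {
        ".notdef", "F", "r", "o", "n", "t", "space", "R", "a", "g", "e", "space"
    };

    // Tiny stand-in for a glyph list; single-letter names map to themselves,
    // so only multi-character names need an entry here.
    static final Map<String, Character> NAME_TO_UNICODE = new HashMap<>();
    static {
        NAME_TO_UNICODE.put("space", ' ');
    }

    // Translates each code in displayText to a Unicode character.
    static String decode(String displayText) {
        StringBuilder out = new StringBuilder();
        for (char code : displayText.toCharArray()) {
            String name = code < DIFFS.length ? DIFFS[code] : ".notdef";
            Character mapped = NAME_TO_UNICODE.get(name);
            if (mapped != null) {
                out.append(mapped);
            } else if (name.length() == 1) {
                out.append(name.charAt(0));
            } else {
                out.append('\uFFFD'); // unknown glyph name, e.g. ".notdef"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The eleven codes from displayText in the comment.
        char[] codes = {1, 2, 3, 4, 5, 6, 7, 8, 4, 9, 10};
        System.out.println(decode(new String(codes)));
    }
}
```

Running main prints Front Range for the eleven codes listed in the comment. Code 0 (".notdef") has no Unicode value in this sketch and decodes to U+FFFD.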

        Matvei Stefarov added a comment -

        Note that this issue may be related to PDF-936 ("ToUnicode conversion errors for text extraction") or PDF-616 ("Missing unicode value when extracting text"), since both of those issues involve embedded custom-encoding font subsets.

        Patrick Corless added a comment -

        This appears to be a regression introduced in PDF-722. The basic problem there was that for some TrueType fonts the encoding information specified by the document doesn't always match the font. As a result, we can get an encoding that doesn't match the font's glyph map and end up with a PDF rendering that doesn't look right. In this particular case, the encoding data is fine and matches what is in the font. In such a case we normally generate a toUnicode table from the encoding data for text extraction purposes, but the fix for PDF-722 prevented this map from being correctly assigned.
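
The table generation mentioned in this comment can be pictured with a short standalone sketch. Everything in it is hypothetical: ToUnicodeTableSketch, buildToUnicode and the sample data are not ICEpdf names, and this is not the actual fix. It only shows a code-to-Unicode table being derived from code-to-glyph-name encoding data, which is the kind of table text extraction would consult.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from ICEpdf.
class ToUnicodeTableSketch {

    // Builds a code -> Unicode string table from a code -> glyph name array.
    // Codes whose glyph name has no known Unicode value are left out.
    static Map<Integer, String> buildToUnicode(String[] codeToName,
                                               Map<String, String> nameToUnicode) {
        Map<Integer, String> table = new TreeMap<>();
        for (int code = 0; code < codeToName.length; code++) {
            String unicode = nameToUnicode.get(codeToName[code]);
            if (unicode != null) {
                table.put(code, unicode);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Assumed sample data, shaped like the encoding in the earlier comment.
        Map<String, String> names = new HashMap<>();
        names.put("F", "F");
        names.put("r", "r");
        names.put("space", " ");
        String[] encoding = {".notdef", "F", "r", "space"};
        System.out.println(buildToUnicode(encoding, names));
    }
}
```

One design point this suggests, as an assumption rather than a description of the real code: whether an encoding is trusted for choosing glyphs during rendering and whether it is used to build the extraction table can be treated as separate decisions, so that a rendering workaround does not disable text extraction.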

        Patrick Corless added a comment -

        Marking as fixed.


          People

          • Assignee: Patrick Corless
          • Reporter: Matvei Stefarov
          • Votes: 0
          • Watchers: 2
