[PDF-1003] GlyphText.getUnicode() gives wrong results for embedded-subset fonts with custom encoding, when FontEngine is ON - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.1.2
Fix Version/s: 6.1.3
Component/s: Font Engine
Labels:
None
Environment:
ICEpdf 6.1.2 Pro
Java SE 8u92 x64
Windows 10 x64

Description

h4. Summary
When FontEngine is ON, GlyphText.getUnicode() does not correctly apply the custom encoding specified by an embedded font subset. The problem shows up when working with LineText/WordText in the document, or when copying text from the RI Viewer.

h4. Steps to reproduce
# Launch ICEpdf Pro RI Viewer
# Open the attached PDF file
# Make a text selection (e.g. in the "Briefing Strip" section of the document)
# Copy selected text from RI Viewer, and paste it to a text editor

h4. Expected behavior
The copied text should contain the same characters as the displayed PDF.

h4. Actual behavior
* With FontEngine OFF, correct decoding is observed: http://i.imgur.com/cVgmdNI.png
* With FontEngine ON, copied text is garbled (i.e. custom encoding is not translated to Unicode correctly): http://i.imgur.com/Njbbid6.png

Attachments

KDEN_Rwy8.pdf

21/Jun/16 2:33 PM

53 kB

Matvei Stefarov

Activity

Matvei Stefarov added a comment - 21/Jun/16 4:07 PM - edited

I don't know how FontEngine works under the hood, but stepping through AbstractContentParser.drawString(...) in a debugger shows that all the necessary information is present when the GlyphText objects are constructed. For example, the first embedded font in the attached PDF is used for the blue text ("Front Range") on the middle-right side of the page:

displayText contains CIDs, in the font's custom encoding: \u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\u0004\u0009\u000A
textState.font appears to contain an Encoding object named "WinAnsi+diffs" that maps these CIDs to standard symbol names:
```
0 -> ".notdef"
1 -> "F"
2 -> "r"
3 -> "o"
4 -> "n"
5 -> "t"
6 -> "space"
7 -> "R"
8 -> "a"
9 -> "g"
10 -> "e"
11 -> "space"
```
...so the CIDs can be mapped to: ["F", "r", "o", "n", "t", "space", "R", "a", "n", "g", "e"]
The OFont Encoding implementation (org.icepdf.core.pobjects.fonts.ofont.Encoding) has a mapping from these symbol names to Unicode values (UVs), from which it should be trivial to get a Unicode string. I'm guessing that NFont's Encoding implementation has something similar.

So, given that all this information is already available to the code, it seems that FontEngine should be able to work with the custom encoding without too much trouble.
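
As a minimal sketch of the lookup chain described in this comment (character code, then glyph name, then Unicode), the following self-contained Java example decodes the "Front Range" codes. The class and method names (`CustomEncodingSketch`, `buildToUnicode`, `decode`) and the tiny glyph-name lookup are hypothetical and are not ICEpdf API. A real implementation would presumably consult a full glyph list and the font's own encoding objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in ICEpdf.
final class CustomEncodingSketch {

    // Code -> glyph name table, copied from the "WinAnsi+diffs" dump in the ticket.
    static final String[] DIFFERENCES = {
        ".notdef", "F", "r", "o", "n", "t", "space", "R", "a", "g", "e", "space"
    };

    // Placeholder emitted for codes that have no Unicode mapping.
    static final String REPLACEMENT = String.valueOf((char) 0xFFFD);

    // Assumed stand-in for a real glyph list: it only understands "space"
    // and single-letter glyph names, which is enough for this example.
    static String glyphNameToUnicode(String glyphName) {
        if (glyphName.equals("space")) {
            return " ";
        }
        if (glyphName.length() == 1) {
            return glyphName;
        }
        return null; // e.g. ".notdef" has no Unicode value
    }

    // Builds a code -> Unicode table from the code -> glyph name table.
    static Map<Integer, String> buildToUnicode(String[] differences) {
        Map<Integer, String> toUnicode = new HashMap<>();
        for (int code = 0; code < differences.length; code++) {
            String unicode = glyphNameToUnicode(differences[code]);
            if (unicode != null) {
                toUnicode.put(code, unicode);
            }
        }
        return toUnicode;
    }

    // Translates a sequence of character codes into a Unicode string.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.append(toUnicode.getOrDefault(code, REPLACEMENT));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The codes found in displayText for the blue "Front Range" label.
        int[] codes = {1, 2, 3, 4, 5, 6, 7, 8, 4, 9, 10};
        System.out.println(decode(codes, buildToUnicode(DIFFERENCES)));
    }
}
```

The codes are passed as an `int[]` rather than as a string literal of Unicode escapes, because the Java compiler translates such escapes before parsing and the line-feed escape would break the literal.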

Matvei Stefarov added a comment - 21/Jun/16 4:19 PM

Note that this issue may be related to PDF-936 ("ToUnicode conversion errors for text extraction") or PDF-616 ("Missing unicode value when extracting text"), since both of those issues involve embedded custom-encoding font subsets.

Patrick Corless added a comment - 28/Oct/16 1:58 PM

This appears to be a regression introduced in PDF-722. The basic problem there was that, for some TrueType fonts, the encoding information specified by the document doesn't always match the font. As a result, we can get an encoding that doesn't match the glyph map of the font and end up with a PDF rendering that doesn't look right. In this particular case the encoding data is fine and matches what is in the font. However, in such a case we normally generate a toUnicode table out of the encoding data for text extraction purposes. The fix for PDF-722 prevented this map from being correctly assigned.

Patrick Corless added a comment - 28/Oct/16 2:04 PM

Marking as fixed.
