Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.0
- Fix Version/s: 3.1
- Component/s: Core/Parsing
- Labels: None
- Environment: any
- ICEsoft Forum Reference:
Description
The following fix was suggested by Pedro Rivera:
public String toUnicode(String displayText) {
...
...
//char c = c1;//getCharDiff(c1); //<== Comment this out
char c = getCharDiff(c1); //<== Put this in
...
...
}
This fix is valid for all but CID font types. When we have a CID font, we have to look for a ToUnicode character map (CMap) for the specified character IDs. In that case we can't use the Differences specified by the encoding, as they are no longer applicable. CID fonts are special in that the character ID is completely arbitrary and has no meaning except to the font program, which knows which glyph it should draw for a given CID. CID fonts make font substitution impossible unless the PDF encoder also included a ToUnicode CMap. The ToUnicode CMap maps each CID to a valid Unicode value, which would normally be used for text extraction, but we can also try to use it to display the correct content.
There is a long story behind why the getCharDiff function was commented out, but after reading Pedro's post it became immediately clear to me when we could use it and why we had problems with it in the past. Thanks Pedro!
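The split described above (Differences for simple fonts, a ToUnicode CMap for CID fonts) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was checked in: the class name `ToUnicodeSketch`, the `cidFont` flag, and the two maps standing in for the Differences array and the ToUnicode CMap are all hypothetical, and passing an unmapped CID through unchanged is an assumed fallback.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names and data structures are assumptions, not the actual font classes. */
class ToUnicodeSketch {

    // Assumed flag: true when the font is a CID font.
    private final boolean cidFont;
    // Assumed stand-in for the encoding's Differences: character code -> replacement character.
    private final Map<Character, Character> differences;
    // Assumed stand-in for the ToUnicode CMap: CID -> Unicode text.
    private final Map<Character, String> toUnicodeCMap;

    ToUnicodeSketch(boolean cidFont,
                    Map<Character, Character> differences,
                    Map<Character, String> toUnicodeCMap) {
        this.cidFont = cidFont;
        this.differences = differences;
        this.toUnicodeCMap = toUnicodeCMap;
    }

    // Simple fonts: remap through Differences, or keep the code if there is no entry.
    char getCharDiff(char c1) {
        Character mapped = differences.get(c1);
        return mapped != null ? mapped : c1;
    }

    String toUnicode(String displayText) {
        StringBuilder result = new StringBuilder(displayText.length());
        for (int i = 0; i < displayText.length(); i++) {
            char c1 = displayText.charAt(i);
            if (cidFont) {
                // CID fonts: Differences do not apply; consult the ToUnicode CMap.
                String mapped = toUnicodeCMap.get(c1);
                // Assumed fallback: pass the CID through unchanged when no mapping exists.
                result.append(mapped != null ? mapped : String.valueOf(c1));
            } else {
                // All other fonts: apply the suggested getCharDiff fix.
                result.append(getCharDiff(c1));
            }
        }
        return result.toString();
    }
}
```

In this sketch a simple font with a Differences entry for code 1 would remap that code and leave other characters alone, while a CID font given the same Differences map ignores it and uses only its CMap entries.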
Added code in svn check-in -r 19367.