Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 6.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      PRO

      Description

      The attached file combines CIDFontType2 Asian and Roman glyphs. For some reason the encoding is not being applied correctly and the wrong glyphs are rendered.

      Further investigation is needed but this should be fixable.
      1. assian_test.pdf, 1.09 MB, attached by Patrick Corless

        Activity

        Patrick Corless added a comment -

        sample file
        Patrick Corless added a comment -

        I've taken a closer look at this issue and it comes down to how Name objects encode hex digits. For example, the font names in question are encoded as follows:

        #b7#bd#d5#fd#b3#ac#b4#d6#ba#da_GBK+ZEMJ7y-1

        Each #XX represents a 2-digit hexadecimal code. The current code parses the hex value into an integer and inserts the resulting character code into the string. There doesn't seem to be anything wrong with this approach, but Java Strings don't treat the result as Unicode.

        I have workaround code that formats the #xx hex into a standard Java Unicode escape, for example #b7 = \u00b7. However, I don't know if this is what the end user expects.
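
        For reference, here is a minimal standalone sketch of the two ways those escapes can be read. The class and variable names are purely illustrative (not part of ICEpdf), and it assumes the "_GBK" suffix in the sample name really does mark a GBK byte sequence and that the JRE ships the GBK charset:

        import java.nio.charset.Charset;

        public class NameHexDemo {
            public static void main(String[] args) {
                // The #XX escapes from the sample name, collected as raw bytes.
                byte[] raw = {(byte) 0xb7, (byte) 0xbd, (byte) 0xd5, (byte) 0xfd,
                              (byte) 0xb3, (byte) 0xac, (byte) 0xb4, (byte) 0xd6,
                              (byte) 0xba, (byte) 0xda};

                // Current approach: each byte becomes its own char, e.g. 0xb7 -> '·',
                // which is why the name does not come out looking like CJK text.
                StringBuilder perByte = new StringBuilder();
                for (byte b : raw) {
                    perByte.append((char) (b & 0xff));
                }
                System.out.println(perByte);

                // Alternative reading (assumption): decode the whole escape
                // sequence as GBK bytes, hinted at by the "_GBK" tag in the name.
                System.out.println(new String(raw, Charset.forName("GBK")));
            }
        }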

        The class org.icepdf.core.pobjects.Name would be updated as follows:

        /**
         * Utility method converting Name object hex notation to ASCII. For
         * example #41 should be represented as 'A'. The hex format will always
         * be #XX where XX is a 2-digit hex value. The spec says that # can't be
         * used in a string but I guess we'll see.
         *
         * @param name PDF name object string to be checked for hex codes.
         * @return full ASCII encoded name string.
         */
        private String convertHexChars(StringBuilder name) {
            // we need to search for an instance of # and try and convert to hex
            try {
                for (int i = 0; i < name.length(); i++) {
                    if (name.charAt(i) == HEX_CHAR) {
                        // grab the two hex digits before deleting them, then
                        // replace the #XX escape with the converted text.
                        String hex = name.substring(i + 1, i + 3);
                        name.delete(i, i + 3);
                        name.insert(i, convert(hex));
                    }
                }
            } catch (Throwable e) {
                logger.warning("Error parsing hexadecimal characters.");
                // we are going to bail on any exception and just return the
                // original string.
                return name.toString();
            }
            return name.toString();
        }

        /**
         * Converts a hex string to a formatted Unicode escape string.
         *
         * @param hex 2-digit hex number.
         * @return Unicode escape text in the form \u00XX.
         */
        private String convert(String hex) {
            StringBuilder output = new StringBuilder();
            output.append("\\u"); // standard unicode escape prefix.
            // pad with zeros so the escape is always four hex digits long.
            for (int j = 0, max = 4 - hex.length(); j < max; j++) {
                output.append("0");
            }
            output.append(hex.toLowerCase());
            return output.toString();
        }

        Any feedback on this potential workaround would be appreciated. If it's a valid fix I can add it to the core code base.
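
        For context on that open question, here is a tiny standalone check (the class name is hypothetical, not part of the patch) of what convert("b7") would produce: six characters of escape text, \u00b7, rather than the single decoded character U+00B7.

        public class ConvertDemo {
            public static void main(String[] args) {
                String hex = "b7";
                StringBuilder output = new StringBuilder();
                output.append("\\u"); // literal backslash + u, as in the proposed convert()
                // zero-pad so the escape always has four hex digits
                for (int j = 0, max = 4 - hex.length(); j < max; j++) {
                    output.append("0");
                }
                output.append(hex.toLowerCase());
                System.out.println(output);          // prints \u00b7
                System.out.println(output.length()); // 6, i.e. escape text, not one character
            }
        }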

        Patrick Corless added a comment -

        I've applied the name parsing change, but the document in question still has a mix of embedded and non-embedded CID fonts, so getting it to fully render will be difficult without the fonts used to encode it.

        Marking the issue as won't fix for now.

        Patrick Corless added a comment -

        After a bunch of work we are now correctly rendering most Japanese and Chinese based documents regardless of whether the fonts are embedded or not. The operating system still needs to have fonts that can render the Unicode characters.
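
        On that last point, java.awt.Font.canDisplayUpTo is a quick way to sanity check whether an installed font can actually show the characters. A minimal sketch follows; the font name and sample text are assumptions for illustration only:

        import java.awt.Font;

        public class CjkFontCheck {
            public static void main(String[] args) {
                // Sample CJK text; any string from the rendered document would do.
                String sample = "\u65e5\u672c\u8a9e \u4e2d\u6587";
                // Hypothetical font choice; substitute whatever the OS actually provides.
                Font font = new Font("SansSerif", Font.PLAIN, 12);
                // canDisplayUpTo returns -1 when the font can render every character,
                // otherwise the index of the first character it cannot display.
                int firstBad = font.canDisplayUpTo(sample);
                System.out.println(firstBad == -1
                        ? "Font can display the sample text"
                        : "Font cannot display character at index " + firstBad);
            }
        }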


          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
            • Updated:
            • Resolved: