[PDF-288] Asian font encoding issue - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.2
Fix Version/s: 6.0
Component/s: Core/Parsing
Labels:
None
Environment:
PRO

ICEsoft Forum Reference:
http://www.icefaces.org/JForum/posts/list/18960.page
Workaround Exists:

Yes

Description

The attached file combines CIDFontType2 Asian and Roman glyphs. For some reason the encoding is not being applied correctly, so the wrong glyphs are rendered.

Further investigation is needed but this should be fixable.

Attachments

assian_test.pdf	08/Apr/11 7:53 AM	1.09 MB	Patrick Corless

Activity

Patrick Corless created issue - 08/Apr/11 7:50 AM

Patrick Corless added a comment - 08/Apr/11 7:53 AM

sample file

Patrick Corless made changes - 08/Apr/11 7:53 AM

Field	Original Value	New Value
Attachment		assian_test.pdf [ 13083 ]

Patrick Corless added a comment - 18/Aug/11 10:49 AM

I've taken a closer look at this issue and it comes down to how name objects encode hex digits. For example, the font names in question are encoded as follows:

#b7#bd#d5#fd#b3#ac#b4#d6#ba#da_GBK+ZEMJ7y-1

Where each #XX represents a 2-digit hexadecimal code. The current code parses the hex format into an integer and inserts the resulting character code into the string. There doesn't seem to be anything wrong with this approach, but Java Strings don't treat them as unicode.

I have workaround code that formats the #xx hex into standard Java Unicode notation, for example #b7 = \u00b7. However, I don't know if this is what the end user is expecting.
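To make the difference between the two behaviours concrete, here is a minimal standalone sketch. It is not the ICEpdf code: the class NameHexDemo and both method names are hypothetical and exist only for illustration. One method turns the two hex digits into the character with that code, the other formats them as escape text.

```java
public class NameHexDemo {

    // Current behaviour as described above: parse the two hex digits and
    // return the char that has that numeric code.
    static char toChar(String hex) {
        return (char) Integer.parseInt(hex, 16);
    }

    // Workaround behaviour: return escape text made of a backslash, the
    // letter u and four lowercase hex digits (six characters in total).
    static String toEscapeText(String hex) {
        return String.format("\\u%04x", Integer.parseInt(hex, 16));
    }

    public static void main(String[] args) {
        // Numeric code of the decoded character: 183 (0xb7).
        System.out.println((int) toChar("b7"));
        // The escape text itself, not the character it names.
        System.out.println(toEscapeText("b7"));
        // Length of the escape text: 6.
        System.out.println(toEscapeText("b7").length());
    }
}
```

The first form yields a single char per #XX sequence, while the second yields readable text that only describes the code, so the two results are not interchangeable as lookup keys.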

The class org.icepdf.core.pobjects.Name would be updated as follows:

/**
 * Utility method converting Name object hex notation to ascii. For
 * example #41 should be represented as 'A'. The hex format will always
 * be #XX where XX is a 2 digit hex value. The spec says that # can't be
 * used in a string but I guess we'll see.
 *
 * @param name PDF name object string to be checked for hex codes.
 * @return full ascii encoded name string.
 */
private String convertHexChars(StringBuilder name) {
    // we need to search for an instance of # and try and convert to hex
    try {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) == HEX_CHAR) {
                // read the two hex digits before removing the #XX sequence,
                // then insert the formatted unicode text in its place.
                String hex = name.substring(i + 1, i + 3);
                name.delete(i, i + 3);
                name.insert(i, convert(hex));
            }
        }
    } catch (Throwable e) {
        logger.warning("Error parsing hexadecimal characters.");
        // we are going to bail on any exception and just return the original
        // string.
        return name.toString();
    }
    return name.toString();
}

/**
 * Converts a hex string to a formatted unicode string.
 *
 * @param hex 2-digit hex number.
 * @return the hex digits, zero padded to four places, prefixed with a backslash and u.
 */
private String convert(String hex) {
    StringBuilder output = new StringBuilder();
    output.append("\\u"); // standard unicode format.
    for (int j = 0, max = 4 - hex.length(); j < max; j++) {
        output.append("0");
    }
    output.append(hex.toLowerCase());
    return output.toString();
}

Any feedback on this potential workaround would be appreciated. If it's a valid fix I can add it to the core code base.
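One alternative worth weighing is to treat each #XX sequence as a raw byte and decode the whole name with a character set. This is a sketch only and not what ICEpdf does. It assumes the bytes in the example name are GBK encoded, which the _GBK suffix hints at but does not prove. The class NameBytesDemo and the helper decodeName are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class NameBytesDemo {

    // Hypothetical helper: each #XX sequence becomes one raw byte, every
    // other character is copied through as a single byte, and the result
    // is decoded with the supplied charset.
    static String decodeName(String raw, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '#' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            // Characters outside #XX sequences are expected to be ASCII.
            bytes.write(c);
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String raw = "#b7#bd#d5#fd#b3#ac#b4#d6#ba#da_GBK+ZEMJ7y-1";
        // Assumption: the name bytes are GBK. The charset is an optional
        // part of some Java runtimes, so check before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println(decodeName(raw, Charset.forName("GBK")));
        }
    }
}
```

The trade-off is that the charset has to be known or guessed per font name, whereas the escape-text workaround needs no such knowledge.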

Patrick Corless made changes - 18/Aug/11 10:49 AM

Workaround Exists		[Yes]
Salesforce Case		[]
Fix Version/s		4.3 [ 10266 ]
Fix Version/s	4.2.2 [ 10265 ]

Repository	Revision	Date	User	Message
ICEsoft Public SVN Repository	#27164	Thu Jan 12 08:53:12 MST 2012	patrick.corless	~~PDF-288~~ updated name class to convert the "#Hex" notation to unicode, so font names can be more easily read.
				Files Changed
				MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/Name.java

Patrick Corless added a comment - 12/Jan/12 10:59 AM

I've applied the name parsing change, but the document in question still has a mix of embedded and non-embedded CID fonts, so getting it to render fully will be difficult without the fonts used to encode it.

Marking issue as won't fix for now.

Patrick Corless made changes - 12/Jan/12 10:59 AM

Status	Open [ 1 ]	Resolved [ 5 ]
Resolution		Won't Fix [ 2 ]

Ken Fyten made changes - 29/Mar/12 11:42 AM

Status	Resolved [ 5 ]	Closed [ 6 ]

Patrick Corless made changes - 01/Apr/15 1:54 PM

Resolution	Won't Fix [ 2 ]
Status	Closed [ 6 ]	Reopened [ 4 ]

Patrick Corless made changes - 01/Apr/15 1:54 PM

Fix Version/s		5.2 [ 10970 ]
Fix Version/s	4.3 [ 10266 ]

Patrick Corless added a comment - 01/Apr/15 1:58 PM

After a bunch of work we are now correctly rendering most Japanese and Chinese documents, regardless of whether the fonts are embedded. The operating system still needs to have fonts that can render the unicode characters.

Patrick Corless made changes - 01/Apr/15 1:58 PM

Status	Reopened [ 4 ]	Resolved [ 5 ]
Resolution		Fixed [ 1 ]

Patrick Corless made changes - 01/Apr/15 3:01 PM

Status	Resolved [ 5 ]	Closed [ 6 ]

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

1

Dates

Created:

08/Apr/11 7:50 AM

Updated:

01/Apr/15 3:01 PM

Resolved:

01/Apr/15 1:58 PM