[PDF-167] OFont does not apply character diff values on text extraction - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0
Fix Version/s: 4.0.1
Component/s: Core/Parsing
Labels:
None
Environment:
OS rendering core

ICEsoft Forum Reference:
http://www.icefaces.org/JForum/posts/list/0/16298.page
Workaround Description:
See the comments in the main issue tracker thread.

Description

The forum poster noticed that the text extracted from a PDF document was missing some punctuation. I dug a little deeper and found that we were not correctly applying the character encoding when a toUnicode CMap does not exist.

The following is a quick patch for the problem but will not be the final fix. I will update the interface code to make sure we handle this in a more generic manner.

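Step One.
Create a new method in org.icepdf.core.pobjects.fonts.ofont.OFont.java

public char toUnicode(char c1) {
    // With no toUnicode CMap, apply the font's character diff values first.
    char c = toUnicode == null ? getCharDiff(c1) : c1;
    c = getCMapping(c);

    // Retry in the 0xF000 range if the font cannot display the character.
    if (!awtFont.canDisplay(c)) {
        c |= 0xF000;
    }
    // Fall back to an alternate symbol as a last resort.
    if (!awtFont.canDisplay(c)) {
        c = findAlternateSymbol(c);
    }
    return c;
}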
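To make the order of operations in the patch easier to follow, here is a minimal, self-contained sketch of the same idea. It is not ICEpdf code: the class name CharMappingSketch, the differences map, the hasToUnicode flag and the canDisplay predicate are all hypothetical stand-ins for the font's real state, and the getCMapping and findAlternateSymbol steps from the patch are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

public class CharMappingSketch {
    // Hypothetical stand-in for the font's diff values: character code -> replacement.
    private final Map<Character, Character> differences = new HashMap<>();
    // Hypothetical flag: true when the font carries a toUnicode CMap.
    private final boolean hasToUnicode;
    // Hypothetical stand-in for awtFont.canDisplay(c).
    private final IntPredicate canDisplay;

    public CharMappingSketch(boolean hasToUnicode, IntPredicate canDisplay) {
        this.hasToUnicode = hasToUnicode;
        this.canDisplay = canDisplay;
    }

    public void putDifference(char code, char replacement) {
        differences.put(code, replacement);
    }

    public char toUnicode(char code) {
        char c = code;
        // Apply the diff values only when no toUnicode CMap is available.
        if (!hasToUnicode) {
            Character diff = differences.get(code);
            if (diff != null) {
                c = diff;
            }
        }
        // If the font cannot display the result, retry in the 0xF000 range,
        // mirroring the "c |= 0xF000" step in the patch.
        if (!canDisplay.test(c)) {
            c = (char) (c | 0xF000);
        }
        return c;
    }
}
```

With a diff entry registered for code 0x92 and no toUnicode CMap, toUnicode((char) 0x92) returns the registered replacement; with a toUnicode CMap present, the code is passed through unchanged.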

Step Two.
Call the new toUnicode method from the content parser. Just after 'charValue' is defined in drawString(..), around line 2160, add the following code.

if (textState.currentfont instanceof org.icepdf.core.pobjects.fonts.ofont.OFont) {
    charValue = ((org.icepdf.core.pobjects.fonts.ofont.OFont)
            textState.currentfont).toUnicode(unmodifiedDisplayText.charAt(i));
}

Once again, this is not an official patch, just a workaround.
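For illustration only, the sketch below shows one way the instanceof check in Step Two could be avoided if every font type exposed its own conversion through a shared interface. The names ExtractionLoopSketch, UnicodeMapper and extract are hypothetical and are not taken from the ICEpdf source; this is not necessarily how the final fix was implemented.

```java
public class ExtractionLoopSketch {
    /** Hypothetical interface: each font type supplies its own code-to-Unicode conversion. */
    public interface UnicodeMapper {
        char toUnicode(char code);
    }

    /** Converts every raw character code in the string through the font's mapper. */
    public static String extract(String rawCodes, UnicodeMapper font) {
        StringBuilder out = new StringBuilder(rawCodes.length());
        for (int i = 0; i < rawCodes.length(); i++) {
            // No instanceof check: the caller does not need to know the font type.
            out.append(font.toUnicode(rawCodes.charAt(i)));
        }
        return out.toString();
    }
}
```

A font with diff values would implement toUnicode along the lines of Step One, while a font that needs no remapping could simply return the code it was given.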

Activity

Patrick Corless created issue - 16/Mar/10 10:31 AM

Patrick Corless made changes - 16/Mar/10 10:31 AM

Field	Original Value	New Value
Salesforce Case		[]
Fix Version/s		4.0.1 [ 10228 ]
Affects Version/s		4.0 [ 10222 ]

Repository	Revision	Date	User	Message
ICEsoft Public SVN Repository	#20973	Tue Mar 16 11:00:52 MDT 2010	patrick.corless	PDF-167 - fixed encoding issue which prevented some types of OS font text from being extracted correctly.
				Files Changed
				MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/ContentParser.java
				MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/fonts/FontFile.java
				MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/fonts/ofont/OFont.java

Patrick Corless added a comment - 31/Mar/10 10:41 AM

Issue appears to be resolved.

Patrick Corless made changes - 31/Mar/10 10:41 AM

Status	Open [ 1 ]	Resolved [ 5 ]
Resolution		Fixed [ 1 ]

Ken Fyten made changes - 29/Mar/12 11:56 AM

Status	Resolved [ 5 ]	Closed [ 6 ]

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

1 Vote for this issue

Watchers:

0

Dates

Created:

16/Mar/10 10:31 AM

Updated:

29/Mar/12 11:56 AM

Resolved:

31/Mar/10 10:41 AM