[PDF-167] OFont does not apply character diff values on text extraction - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0
Fix Version/s: 4.0.1
Component/s: Core/Parsing
Labels:
None
Environment:
OS rendering core

ICEsoft Forum Reference:
http://www.icefaces.org/JForum/posts/list/0/16298.page
Workaround Description:
See the comments in the main issue tracker thread.

Description

The forum poster noticed that the text extracted from a PDF document was missing some punctuation. I dug a little deeper and found that we were not correctly applying the character encoding when a toUnicode CMap does not exist.

The following is a quick patch for the problem, but it will not be the final fix. I will update the interface code to make sure we handle this in a more generic manner.

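Step 1
Create a new method in org.icepdf.core.pobjects.fonts.ofont.OFont.java:

public char toUnicode(char c1) {
    // Apply the encoding differences only when the font has no toUnicode CMap.
    char c = toUnicode == null ? getCharDiff(c1) : c1;
    c = getCMapping(c);

    // If the character cannot be displayed, try it shifted into the 0xF000 range.
    if (!awtFont.canDisplay(c)) {
        c |= 0xF000;
    }
    // Still not displayable: fall back to an alternate symbol.
    if (!awtFont.canDisplay(c)) {
        c = findAlternateSymbol(c);
    }
    return c;
}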

Step 2
Call the new toUnicode method from the content parser. Just after 'charValue' is defined in drawString(..), around line 2160, add the following code:

if (textState.currentfont instanceof org.icepdf.core.pobjects.fonts.ofont.OFont) {
    charValue = ((org.icepdf.core.pobjects.fonts.ofont.OFont)
            textState.currentfont).toUnicode(unmodifiedDisplayText.charAt(i));
}

Once again, this is not an official patch, just a workaround.
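
To illustrate the lookup order the patch introduces, here is a minimal, self-contained Java sketch. It is not ICEpdf code: the class CharDiffSketch, its two maps, and the sample character codes are hypothetical stand-ins for the font's encoding differences and its toUnicode CMap. It mirrors only the first two lines of the patched method in simplified form and leaves out the glyph fallback (canDisplay / findAlternateSymbol).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a font object; not part of ICEpdf.
class CharDiffSketch {
    // Stand-in for the encoding differences: character code -> character.
    private final Map<Character, Character> charDiff;
    // Stand-in for a toUnicode CMap; null means the font does not define one.
    private final Map<Character, Character> toUnicodeCMap;

    CharDiffSketch(Map<Character, Character> charDiff,
                   Map<Character, Character> toUnicodeCMap) {
        this.charDiff = charDiff;
        this.toUnicodeCMap = toUnicodeCMap;
    }

    // Without a toUnicode CMap, apply the differences; otherwise use the CMap.
    char toUnicode(char code) {
        if (toUnicodeCMap == null) {
            return charDiff.getOrDefault(code, code);
        }
        return toUnicodeCMap.getOrDefault(code, code);
    }

    // Map every raw character code of a string, as text extraction would.
    String extract(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            sb.append(toUnicode(raw.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: codes 1 and 2 are remapped to punctuation.
        Map<Character, Character> diffs = new HashMap<>();
        diffs.put((char) 1, '.');
        diffs.put((char) 2, ',');

        CharDiffSketch font = new CharDiffSketch(diffs, null);
        System.out.println(font.extract("Hello\u0002 world\u0001"));
    }
}
```

With the assumed sample data, codes 1 and 2 stand for punctuation that only the differences mapping knows about. If that mapping is skipped when no toUnicode CMap is present, those codes pass through unchanged, which matches the missing punctuation described above.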

Activity

Patrick Corless added a comment - 31/Mar/10 10:41 AM

The issue appears to be resolved.

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

1

Watchers:

0

Dates

Created:

16/Mar/10 10:31 AM

Updated:

29/Mar/12 11:56 AM

Resolved:

31/Mar/10 10:41 AM