Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 4.0
- Fix Version/s: 4.0.1
- Component/s: Core/Parsing
- Labels: None
- Environment: OS rendering core
- ICEsoft Forum Reference:
- Workaround Description: See the comments in the main issue tracker thread.
Description
The forum poster noticed that the text extracted from a PDF document was missing some punctuation. I dug a little deeper and found that we were not correctly applying the character encoding when a toUnicode CMap does not exist.
The following is a quick patch for the problem but will not be the final fix. I will update the interface code to make sure we handle this in a more generic manner.
Step 1.
Create a new method in org.icepdf.core.pobjects.fonts.ofont.OFont.java
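    // Map a raw character code to a character the font can display.
    public char toUnicode(char c1) {
        // Apply the encoding differences only when there is no toUnicode CMap.
        char c = toUnicode == null ? getCharDiff(c1) : c1;
        c = getCMapping(c);
        // If the font cannot display the character, try the 0xF000 range.
        if (!awtFont.canDisplay(c)) {
            c |= 0xF000;
        }
        // If it still cannot be displayed, fall back to an alternate symbol.
        if (!awtFont.canDisplay(c)) {
            c = findAlternateSymbol(c);
        }
        return c;
    }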
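The fallback order in the method above can be tried in isolation with the following sketch. It is not ICEpdf code: the class name, the two maps, the `canDisplay` predicate and the `replacement` parameter are all hypothetical stand-ins for OFont's internal state (getCharDiff, getCMapping, awtFont.canDisplay and findAlternateSymbol). The 0xF000 step is presumably there because symbol fonts often keep their glyphs in the Unicode Private Use Area starting at U+F000.

```java
import java.util.Map;
import java.util.function.IntPredicate;

public class ToUnicodeFallbackSketch {

    /**
     * Hypothetical stand-in for the patched method. All parameters after
     * 'code' replace state that the real font object would hold itself.
     */
    static char toUnicode(char code,
                          boolean hasToUnicodeCMap,
                          Map<Character, Character> differences,
                          Map<Character, Character> cmap,
                          IntPredicate canDisplay,
                          char replacement) {
        // 1. Apply the encoding differences only when no ToUnicode CMap exists.
        char c = hasToUnicodeCMap ? code : differences.getOrDefault(code, code);
        // 2. Apply the font's own character mapping (identity if unmapped).
        c = cmap.getOrDefault(c, c);
        // 3. Not displayable: retry in the Private Use Area range at 0xF000.
        if (!canDisplay.test(c)) {
            c |= 0xF000;
        }
        // 4. Still not displayable: substitute (assumed stand-in for
        //    findAlternateSymbol, whose real behaviour is not shown here).
        if (!canDisplay.test(c)) {
            c = replacement;
        }
        return c;
    }

    public static void main(String[] args) {
        // Illustrative data only: a straight apostrophe remapped to a curly quote.
        Map<Character, Character> differences = Map.of('\'', '\u2019');
        Map<Character, Character> cmap = Map.of();
        char result = toUnicode('\'', false, differences, cmap, c -> true, '?');
        System.out.printf("U+%04X%n", (int) result);
    }
}
```

Passing the displayability check in as a predicate keeps the sketch independent of which fonts happen to be installed; in the patch itself that role is played by awtFont.canDisplay.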
Step 2.
Call the new toUnicode method from the content parser. Just after 'charValue' is defined in drawString(..), around line 2160, add the following code.
    if (textState.currentfont instanceof org.icepdf.core.pobjects.fonts.ofont.OFont) {
        charValue = ((org.icepdf.core.pobjects.fonts.ofont.OFont) textState.currentfont)
                .toUnicode(unmodifiedDisplayText.charAt(i));
    }
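The effect of the guard at the call site can be sketched as follows. The types here (FontLike, RemappingFont, PlainFont) and the extractText method are hypothetical and only mirror the shape of the patch: characters are remapped when the current font is the one type that offers toUnicode, and pass through unchanged for every other font.

```java
public class DrawStringDispatchSketch {

    /** Hypothetical minimal font type; stands in for the parser's current font. */
    interface FontLike { }

    /** Hypothetical stand-in for OFont: the only type that exposes the remap. */
    static final class RemappingFont implements FontLike {
        char toUnicode(char c) {
            // Illustrative remap only: straight apostrophe to a curly quote.
            return c == '\'' ? '\u2019' : c;
        }
    }

    /** Any other font type; its characters are left untouched. */
    static final class PlainFont implements FontLike { }

    static String extractText(FontLike currentFont, String unmodifiedDisplayText) {
        StringBuilder out = new StringBuilder(unmodifiedDisplayText.length());
        for (int i = 0; i < unmodifiedDisplayText.length(); i++) {
            char charValue = unmodifiedDisplayText.charAt(i);
            // Same guard as the patch: remap only when the font supports it.
            if (currentFont instanceof RemappingFont) {
                charValue = ((RemappingFont) currentFont).toUnicode(charValue);
            }
            out.append(charValue);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText(new RemappingFont(), "it's"));
        System.out.println(extractText(new PlainFont(), "it's"));
    }
}
```

The instanceof check is what makes this a workaround rather than a generic fix: the parser has to know about one concrete font class instead of calling a method on a shared font interface.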
Once again, this is not an official patch, just a workaround.
            
Activity
Issue appears to be resolved.