ICEpdf / PDF-167

OFont does not apply character diff values on text extraction

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.0.1
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      OS rendering core

      Description

      A forum poster noticed that the text extracted from a PDF document was missing some punctuation. I dug a little deeper and found that we were not correctly applying the character encoding when a toUnicode cmap does not exist.

      The following is a quick patch for the problem, but it will not be the final fix. I will update the interface code to make sure we handle this in a more generic manner.

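      Step One.
      Create a new method in org.icepdf.core.pobjects.fonts.ofont.OFont.java:

      public char toUnicode(char c1) {
          // With no toUnicode cmap, apply the font's character diff values first.
          char c = toUnicode == null ? getCharDiff(c1) : c1;
          c = getCMapping(c);

          // If the font cannot display the character, try it with the 0xF000 bits set.
          if (!awtFont.canDisplay(c)) {
              c |= 0xF000;
          }
          // Still not displayable: fall back to an alternate symbol.
          if (!awtFont.canDisplay(c)) {
              c = findAlternateSymbol(c);
          }
          return c;
      }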
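      To illustrate the first line of that method in isolation, here is a minimal, self-contained sketch. It is not ICEpdf code: the class CharDiffSketch, the resolve method, the differences map (standing in for whatever getCharDiff consults) and the hasToUnicodeCMap flag are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; none of these names exist in ICEpdf.
class CharDiffSketch {

    // Returns the character to extract for a raw character code.
    // 'differences' is an assumed stand-in for the font's character diff table.
    static char resolve(char code, Map<Character, Character> differences,
                        boolean hasToUnicodeCMap) {
        if (hasToUnicodeCMap) {
            // A toUnicode cmap exists, so the diff table is not consulted here.
            return code;
        }
        // No toUnicode cmap: apply the diff value, or keep the code if unmapped.
        return differences.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        Map<Character, Character> differences = new HashMap<>();
        differences.put((char) 0x01, ','); // assumed example entry
        System.out.println(resolve((char) 0x01, differences, false));
    }
}
```

      With the diff table skipped, the raw code 0x01 would reach the extracted text unchanged, which is consistent with the missing punctuation described above.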

      Step Two.
      Call the new toUnicode method from the content parser. In drawString(..), around line 2160, add the following code just after 'charValue' is defined:

      if (textState.currentfont instanceof org.icepdf.core.pobjects.fonts.ofont.OFont) {
          charValue = ((org.icepdf.core.pobjects.fonts.ofont.OFont)
                  textState.currentfont).toUnicode(unmodifiedDisplayText.charAt(i));
      }
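      The shape of that call site can be sketched as a small runnable example. This is only an approximation under stated assumptions: ExtractionSketch, its Font interface and the DiffFont class are hypothetical stand-ins for the real ContentParser and OFont types, and the loop is a simplified guess at how drawString(..) walks the string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone illustration; these are not ICEpdf classes.
class ExtractionSketch {

    // Assumed stand-in for the general font type held by the text state.
    interface Font {}

    // Assumed stand-in for OFont: a font that can remap character codes.
    static class DiffFont implements Font {
        private final Map<Character, Character> differences = new HashMap<>();

        DiffFont map(char code, char unicode) {
            differences.put(code, unicode);
            return this;
        }

        char toUnicode(char code) {
            return differences.getOrDefault(code, code);
        }
    }

    // Builds the extracted text, remapping characters only for DiffFont.
    static String extract(String unmodifiedDisplayText, Font currentFont) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < unmodifiedDisplayText.length(); i++) {
            char charValue = unmodifiedDisplayText.charAt(i);
            if (currentFont instanceof DiffFont) {
                charValue = ((DiffFont) currentFont).toUnicode(charValue);
            }
            out.append(charValue);
        }
        return out.toString();
    }
}
```

      The instanceof check keeps the change local to one font type, which fits a workaround; a more generic fix would presumably move toUnicode onto the shared font interface so the parser needs no cast.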

      Once again, this is not an official patch, just a workaround.

        Activity

        Patrick Corless created issue -
        Patrick Corless made changes -
        Field                Original Value    New Value
        Salesforce Case                        []
        Fix Version/s                          4.0.1 [ 10228 ]
        Affects Version/s                      4.0 [ 10222 ]
        Repository: ICEsoft Public SVN Repository
        Revision: #20973
        Date: Tue Mar 16 11:00:52 MDT 2010
        User: patrick.corless
        Message: PDF-167 - fixed encoding issue which prevent some types of OS font text from being extract correctly.
        Files Changed
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/ContentParser.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/fonts/FontFile.java
        MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/fonts/ofont/OFont.java
        Patrick Corless added a comment -

        Issue appears to be resolved.
        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Ken Fyten made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            1
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: