ICEpdf / PDF-167

OFont does not apply character diff values on text extraction

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.0.1
    • Component/s: Core/Parsing
    • Labels: None
    • Environment: OS rendering core

      Description

      A forum poster noticed that the text extracted from a PDF document was missing some punctuation. I dug a little deeper and found that we were not correctly applying the character encoding when a toUnicode CMap does not exist.

      The following is a quick patch for the problem, but it will not be the final fix. I will update the interface code to make sure we handle this in a more generic manner.

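      Step One.
      Create a new method in org.icepdf.core.pobjects.fonts.ofont.OFont.java:

      public char toUnicode(char c1) {
          // Without a toUnicode CMap, apply the font's character diff value first.
          char c = toUnicode == null ? getCharDiff(c1) : c1;
          c = getCMapping(c);

          // If the font cannot display the character, try the private-use range at 0xF000.
          if (!awtFont.canDisplay(c)) {
              c |= 0xF000;
          }
          // Still not displayable: fall back to an alternate symbol.
          if (!awtFont.canDisplay(c)) {
              c = findAlternateSymbol(c);
          }
          return c;
      }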
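The lookup order in Step One can be illustrated outside ICEpdf with a small, self-contained sketch. Everything here is an assumption made for illustration: the class CharDiffFallbackSketch, its methods, and the use of plain maps for the toUnicode table and the diff values are hypothetical and are not ICEpdf APIs. The sketch also leaves out the getCMapping and canDisplay steps and only shows the rule that diff values apply when no toUnicode table exists.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the lookup order in the patch; not ICEpdf code. */
final class CharDiffFallbackSketch {

    /**
     * Resolves one raw character code to the character reported by text extraction.
     *
     * @param code          raw character code from the content stream
     * @param toUnicodeCMap code-to-Unicode table, or null when the font has none
     * @param differences   code-to-character overrides taken from the font encoding
     */
    static char resolve(char code,
                        Map<Character, Character> toUnicodeCMap,
                        Map<Character, Character> differences) {
        if (toUnicodeCMap != null) {
            // A toUnicode table is present, so the diff values are not consulted.
            return toUnicodeCMap.getOrDefault(code, code);
        }
        // No toUnicode table: fall back to the diff values.
        return differences.getOrDefault(code, code);
    }

    /** Applies resolve(..) to every character of the raw string. */
    static String extract(String raw,
                          Map<Character, Character> toUnicodeCMap,
                          Map<Character, Character> differences) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            out.append(resolve(raw.charAt(i), toUnicodeCMap, differences));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: code 39 is remapped to a right single quote.
        Map<Character, Character> differences = new HashMap<>();
        differences.put((char) 39, '\u2019');
        System.out.println(extract("it's", null, differences));
    }
}
```

Passing null for the table mirrors the toUnicode == null check in the patch. Passing an empty, non-null table leaves code 39 unchanged, because the diff values are skipped whenever a table exists.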

      Step Two.
      Call the new toUnicode method from the content parser. Just after 'charValue' is defined in drawString(..), around line 2160, add the following code:

      if (textState.currentfont instanceof org.icepdf.core.pobjects.fonts.ofont.OFont) {
          charValue = ((org.icepdf.core.pobjects.fonts.ofont.OFont)
                  textState.currentfont).toUnicode(unmodifiedDisplayText.charAt(i));
      }

      Once again, this is not an official patch, just a workaround.
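As a rough idea of what handling this "in a more generic manner" could look like, the sketch below moves the mapping behind a font interface so the parser loop no longer needs an instanceof check. This is only one possible shape and not the fix that shipped: ExtractableFont, PlainFontSketch, DiffFontSketch and ParserLoopSketch are hypothetical names, and the single remapped code (39 to a right single quote) is assumed sample data.

```java
/** Hypothetical font abstraction; the real ICEpdf font interface may differ. */
interface ExtractableFont {
    /** Maps a raw character code to the character used for text extraction. */
    default char toUnicode(char code) {
        return code; // Fonts with no special handling pass the code through.
    }
}

/** A font that relies on the pass-through default. */
final class PlainFontSketch implements ExtractableFont {
}

/** A font that applies an assumed diff value for code 39. */
final class DiffFontSketch implements ExtractableFont {
    @Override
    public char toUnicode(char code) {
        return code == 39 ? '\u2019' : code;
    }
}

/** Stand-in for the per-character loop in the content parser. */
final class ParserLoopSketch {
    static String extract(ExtractableFont font, String displayText) {
        StringBuilder out = new StringBuilder(displayText.length());
        for (int i = 0; i < displayText.length(); i++) {
            // Every font answers the same call, so no type check is needed here.
            out.append(font.toUnicode(displayText.charAt(i)));
        }
        return out.toString();
    }
}
```

With this arrangement the parser treats all fonts uniformly, and each font class decides for itself whether diff values apply.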

          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 1
          • Watchers: 0
