ICEpdf / PDF-356

Error extracting CID font Unicode data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 4.2.2
    • Fix Version/s: 4.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      pro

      Description

      The file in question was sent in by a client. Page 9 of the PDF contains some Asian characters backed by a CID font with no Unicode data. When the client uses the OS X Preview tool, the Asian text is extracted correctly. However, when ICEpdf is used, the extracted text is not correctly encoded as Unicode.

      I've taken one pass at this issue. The CID font in question has no Unicode data associated with it, so I can only assume the original CIDs are valid Unicode values. For the first run of Asian characters the CIDs are:

      5797, 3388, 2694, 3879, 618, 1186, 5242, 1625 etc.
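
      One quick way to test the assumption that these CIDs are already Unicode values is to read each one as a code point and see where it lands. The sketch below is only an illustration, not ICEpdf code; the helper name `in_cjk_unified` is made up for this example, and it checks only the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is a coarse test.

```python
import unicodedata

# CIDs quoted above for the first run of characters.
cids = [5797, 3388, 2694, 3879, 618, 1186, 5242, 1625]

# CJK Unified Ideographs block; other CJK-related blocks exist,
# so this is only a coarse sanity check.
CJK_START, CJK_END = 0x4E00, 0x9FFF


def in_cjk_unified(code_point):
    """Return True if the code point lies in U+4E00..U+9FFF."""
    return CJK_START <= code_point <= CJK_END


for cid in cids:
    # Look up the Unicode character name, if the code point has one.
    name = unicodedata.name(chr(cid), "<unnamed>")
    print(f"CID {cid} -> U+{cid:04X} {name} cjk={in_cjk_unified(cid)}")

# CIDs that would be CJK ideographs if read directly as code points.
cjk_hits = [cid for cid in cids if in_cjk_unified(cid)]
```

      None of the listed values reaches U+4E00 (19968), so read directly as code points they would not be CJK ideographs. That suggests the CIDs are indexes into a character collection rather than Unicode values.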

      The CID-to-GID mapping is:
      595 -> 1
      601 -> 2
      618 -> 3
      660 -> 4
      752 -> 5
      1186 -> 6
      1393 -> 7
      1625 -> 8
      1954 -> 9
      2694 -> 10
      3388 -> 11
      3543 -> 12
      3879 -> 13
      4469 -> 14
      5242 -> 15
      5797 -> 16
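
      To make the table concrete, the sketch below applies it to the CIDs listed earlier. The variable names are chosen for illustration and do not come from ICEpdf.

```python
# CID -> GID table copied from the description above.
cid_to_gid = {
    595: 1, 601: 2, 618: 3, 660: 4,
    752: 5, 1186: 6, 1393: 7, 1625: 8,
    1954: 9, 2694: 10, 3388: 11, 3543: 12,
    3879: 13, 4469: 14, 5242: 15, 5797: 16,
}

# First run of CIDs from page 9, in the order they appear.
cids = [5797, 3388, 2694, 3879, 618, 1186, 5242, 1625]

# Resolve each CID to the glyph index used for rendering.
gids = [cid_to_gid[cid] for cid in cids]
print(gids)
```

      The GIDs only select glyph outlines inside the embedded font, which is enough to draw the page but says nothing about which Unicode character each glyph represents.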

      When we output one of these codes via text extraction we don't do any special byte handling for UTF-8, which I think is causing the output issue. More debugging is needed to get to the bottom of this one.
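
      As a rough illustration of what UTF-8 byte handling involves, the sketch below encodes one of the codes as a code point and compares it with a naive single-byte write. This is an assumption about the kind of failure involved, not a trace of what ICEpdf actually does; `encode_code_point` is a hypothetical helper.

```python
def encode_code_point(code_point):
    """Encode a single Unicode code point as UTF-8 bytes."""
    return chr(code_point).encode("utf-8")


code = 5797  # first CID from the description, read as a code point

# Proper UTF-8: any value above 0x7F needs more than one byte.
encoded = encode_code_point(code)

# Naive output that keeps only the low byte loses information.
truncated = bytes([code & 0xFF])

print(encoded.hex(), len(encoded))
print(truncated.hex(), len(truncated))
```

      Even with correct encoding, the output is only meaningful if the code being encoded really is the character's Unicode value.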

        Activity

        Patrick Corless added a comment -

        I've spent some time carefully looking over this PDF. It turns out there is no easy way to determine the correct Unicode values for the glyphs in question. The font is CID-based and has no Unicode data associated with it. As a result of the missing Unicode data, ICEpdf, like Acrobat, can't extract meaningful Unicode text.

        The document was created using Mac OS X Quartz, so I suspect the OS X viewer knows the original ToUnicode map for LSungLight that was used to encode the text in question. There isn't much we can do to correct this on our end.


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            0
