ICEpdf / PDF-356

Error extracting CID font Unicode data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 4.2.2
    • Fix Version/s: 4.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      pro

      Description

      The file in question was sent in by a client. Page 9 of the PDF contains some Asian characters backed by a CID font with no Unicode data. When the client uses the OS X Preview tool, the Asian text is extracted correctly. However, when ICEpdf is used, the extracted text is not correctly encoded as Unicode.

      I've taken one pass at this issue. The CID font in question has no Unicode data associated with it, so I can only assume the original CIDs are valid Unicode values. For the first run of Asian characters the CIDs are:

      5797, 3388, 2694, 3879, 618, 1186, 5242, 1625 etc.

      The CID-to-GID mapping is:
      595 -> 1
      601 -> 2
      618 -> 3
      660 -> 4
      752 -> 5
      1186 -> 6
      1393 -> 7
      1625 -> 8
      1954 -> 9
      2694 -> 10
      3388 -> 11
      3543 -> 12
      3879 -> 13
      4469 -> 14
      5242 -> 15
      5797 -> 16
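
The lookup described by the table above can be sketched in Java as follows. This is only an illustration: the class and method names (`CidToGidExample`, `buildMap`, `gidFor`) are hypothetical and are not part of the ICEpdf API, and the fallback to glyph 0 (`.notdef`) for unmapped CIDs is an assumption about how a missing entry would be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not ICEpdf code: holds the CID -> GID pairs from this report.
class CidToGidExample {

    // CID -> GID pairs copied from the table in the description.
    static final int[][] PAIRS = {
        {595, 1}, {601, 2}, {618, 3}, {660, 4},
        {752, 5}, {1186, 6}, {1393, 7}, {1625, 8},
        {1954, 9}, {2694, 10}, {3388, 11}, {3543, 12},
        {3879, 13}, {4469, 14}, {5242, 15}, {5797, 16}
    };

    // Builds an insertion-ordered map from CID to GID.
    static Map<Integer, Integer> buildMap() {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        for (int[] pair : PAIRS) {
            map.put(pair[0], pair[1]);
        }
        return map;
    }

    // Returns the GID for a CID, or 0 (.notdef) when the CID is unmapped (assumed fallback).
    static int gidFor(Map<Integer, Integer> map, int cid) {
        return map.getOrDefault(cid, 0);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = buildMap();
        // The first CIDs reported for the Asian text on page 9.
        int[] cids = {5797, 3388, 2694, 3879, 618, 1186, 5242, 1625};
        for (int cid : cids) {
            System.out.println(cid + " -> " + gidFor(map, cid));
        }
    }
}
```

Note that the GID only selects a glyph for rendering; it carries no Unicode information by itself, which is why text extraction needs some other source for the character codes.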

      When we output one of these codes via text extraction we don't do any special byte handling for UTF-8, which I think is causing the output issue. More debugging is needed to get to the bottom of this one.
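
A minimal sketch of the explicit UTF-8 handling hypothesised above, written in Java. It assumes, as the description does, that a CID can be treated directly as a Unicode code point; that assumption is unverified for this font. The class and method names (`CidAsUnicodeExample`, `cidToString`, `toUtf8`) are hypothetical and are not ICEpdf's extraction code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not ICEpdf code.
class CidAsUnicodeExample {

    // Interprets the CID as a Unicode code point (the assumption under test).
    static String cidToString(int cid) {
        if (!Character.isValidCodePoint(cid)) {
            throw new IllegalArgumentException("Not a valid code point: " + cid);
        }
        return new String(Character.toChars(cid));
    }

    // Encodes with an explicit charset instead of relying on the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] cids = {5797, 3388, 2694, 3879, 618, 1186, 5242, 1625};
        for (int cid : cids) {
            StringBuilder hex = new StringBuilder();
            for (byte b : toUtf8(cidToString(cid))) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println("CID " + cid + " -> U+"
                    + String.format("%04X", cid) + " -> " + hex.toString().trim());
        }
    }
}
```

Comparing the resulting characters with what OS X Preview extracts for the same page would show whether the CIDs really are Unicode values or whether a separate CID-to-Unicode mapping is needed.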


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
