Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 4.2.2
- Fix Version/s: 4.3
- Component/s: Core/Parsing
- Labels: None
- Environment: pro
Description
The file in question was sent in by a client. Page 9 of the PDF contains some Asian characters backed by a CID font with no Unicode data. When the client uses the OSX PDF preview tool, the Asian text is extracted correctly. However, when ICEpdf is used, the extracted text is not correctly encoded as Unicode.
I've taken one pass at this issue. The CID font in question has no Unicode data associated with it, so I can only assume the original CIDs are valid Unicode values. For the first list of Asian characters the CIDs are:
5797, 3388, 2694, 3879, 618, 1186, 5242, 1625 etc.
The CID-to-GID mapping is:
595 -> 1
601 -> 2
618 -> 3
660 -> 4
752 -> 5
1186 -> 6
1393 -> 7
1625 -> 8
1954 -> 9
2694 -> 10
3388 -> 11
3543 -> 12
3879 -> 13
4469 -> 14
5242 -> 15
5797 -> 16
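For reference, the mapping above can be expressed as a small lookup table. This is only a sketch for reasoning about the data in this ticket, not ICEpdf code: the names `CID_TO_GID`, `SAMPLE_CIDS`, `cids_to_gids` and `is_sequential_by_cid` are made up for illustration, and the fallback to GID 0 for an unmapped CID is an assumption.

```python
# CID -> GID table copied from the issue description above.
CID_TO_GID = {
    595: 1, 601: 2, 618: 3, 660: 4, 752: 5, 1186: 6, 1393: 7, 1625: 8,
    1954: 9, 2694: 10, 3388: 11, 3543: 12, 3879: 13, 4469: 14, 5242: 15,
    5797: 16,
}

# First CIDs of the Asian text on page 9, in the order listed in the issue.
SAMPLE_CIDS = [5797, 3388, 2694, 3879, 618, 1186, 5242, 1625]


def cids_to_gids(cids, table=CID_TO_GID):
    """Map each CID to its glyph ID.

    Assumption: a CID missing from the table falls back to GID 0.
    """
    return [table.get(cid, 0) for cid in cids]


def is_sequential_by_cid(table):
    """Return True if the GIDs are 1..n when the CIDs are sorted ascending."""
    gids_in_cid_order = [table[cid] for cid in sorted(table)]
    return gids_in_cid_order == list(range(1, len(table) + 1))


if __name__ == "__main__":
    print(cids_to_gids(SAMPLE_CIDS))         # [16, 11, 10, 13, 3, 6, 15, 8]
    print(is_sequential_by_cid(CID_TO_GID))  # True
```

The glyph IDs are assigned in ascending CID order, so the table says which glyph to draw for each CID but carries no information about which Unicode character that glyph represents.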
When we output one of these codes via text extraction, we don't do any special byte handling for UTF-8, which I think is causing the output issue. More debugging is needed to get to the bottom of this one.
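As a quick way to test the hypothesis above, the sketch below treats each CID as a Unicode code point (which is only the assumption stated in this ticket, not a confirmed fact about the font) and encodes it as UTF-8. The helper names `cid_as_utf8` and `in_cjk_unified` are hypothetical, and the range check covers only the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
# First CIDs of the Asian text on page 9, as listed in the issue.
CIDS = [5797, 3388, 2694, 3879, 618, 1186, 5242, 1625]


def cid_as_utf8(cid: int) -> bytes:
    """Encode a CID as UTF-8, ASSUMING the CID is itself a Unicode code point."""
    return chr(cid).encode("utf-8")


def in_cjk_unified(code_point: int) -> bool:
    """True if the code point lies in the main CJK Unified Ideographs block."""
    return 0x4E00 <= code_point <= 0x9FFF


if __name__ == "__main__":
    for cid in CIDS:
        encoded = cid_as_utf8(cid)
        print(
            f"CID {cid} -> U+{cid:04X} -> {encoded.hex()} "
            f"({len(encoded)} bytes, CJK block: {in_cjk_unified(cid)})"
        )
```

Every one of these values needs two or three bytes in UTF-8, so writing them out as single bytes would corrupt the output. Read directly as code points, however, none of the sample CIDs falls in the U+4E00 to U+9FFF range (5797 is U+16A5, in the Runic block), so the byte handling may not be the only thing to look at.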
Activity
Patrick Corless created issue.
Patrick Corless made changes:
Field | Original Value | New Value |
---|---|---|
Status | Open [ 1 ] | Resolved [ 5 ] |
Resolution | | Won't Fix [ 2 ] |
Ken Fyten made changes:
Field | Original Value | New Value |
---|---|---|
Status | Resolved [ 5 ] | Closed [ 6 ] |