Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 6.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      PRO

      Description

      The attached file combines CIDFontType2 Asian and Roman glyphs. For some reason the encoding is not being applied correctly and the wrong glyphs are rendered.

      Further investigation is needed but this should be fixable.
      1. assian_test.pdf, 1.09 MB, attached by Patrick Corless

        Activity

        Patrick Corless added a comment -

        sample file
        Patrick Corless added a comment -

        I've taken a closer look at this issue and it comes down to how Name objects encode hex digits. For example, the font names in question are encoded as follows:

        #b7#bd#d5#fd#b3#ac#b4#d6#ba#da_GBK+ZEMJ7y-1

        Each #XX represents a 2-digit hexadecimal code. The current code parses the hex value into an integer and inserts the resulting character code into the string. There doesn't seem to be anything wrong with this approach, but Java Strings don't treat the result as Unicode.

        I have workaround code that formats the #xx hex into a standard Java Unicode escape, for example #b7 = \u00b7. However, I don't know if this is what the end user expects.
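
        For reference, here is a minimal standalone sketch of the two ways those escapes can be read. The class and variable names are purely illustrative (not part of ICEpdf), and it assumes the "_GBK" suffix in the sample name really does mark a GBK byte sequence and that the JRE ships the GBK charset:

        import java.nio.charset.Charset;

        public class NameHexDemo {
            public static void main(String[] args) {
                // The #XX escapes from the sample name, collected as raw bytes.
                byte[] raw = {(byte) 0xb7, (byte) 0xbd, (byte) 0xd5, (byte) 0xfd,
                              (byte) 0xb3, (byte) 0xac, (byte) 0xb4, (byte) 0xd6,
                              (byte) 0xba, (byte) 0xda};

                // Current approach: each byte becomes its own char, e.g. 0xb7 -> '·',
                // which is why the name does not come out looking like CJK text.
                StringBuilder perByte = new StringBuilder();
                for (byte b : raw) {
                    perByte.append((char) (b & 0xff));
                }
                System.out.println(perByte);

                // Alternative reading (assumption): decode the whole escape
                // sequence as GBK bytes, hinted at by the "_GBK" tag in the name.
                System.out.println(new String(raw, Charset.forName("GBK")));
            }
        }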

        The class org.icepdf.core.pobjects.Name would be updated as follows:

        /**
         * Utility method converting Name object hex notation to ASCII. For
         * example #41 should be represented as 'A'. The hex format will always
         * be #XX where XX is a 2-digit hex value. The spec says that # can't be
         * used in a string but I guess we'll see.
         *
         * @param name PDF name object string to be checked for hex codes.
         * @return full ASCII encoded name string.
         */
        private String convertHexChars(StringBuilder name) {
            // we need to search for an instance of # and try and convert to hex
            try {
                for (int i = 0; i < name.length(); i++) {
                    if (name.charAt(i) == HEX_CHAR) {
                        // grab the two hex digits before deleting them, then
                        // replace the #XX escape with the converted text.
                        String hex = name.substring(i + 1, i + 3);
                        name.delete(i, i + 3);
                        name.insert(i, convert(hex));
                    }
                }
            } catch (Throwable e) {
                logger.warning("Error parsing hexadecimal characters.");
                // we are going to bail on any exception and just return the
                // original string.
                return name.toString();
            }
            return name.toString();
        }

        /**
         * Converts a hex string to a formatted Unicode escape string.
         *
         * @param hex 2-digit hex number.
         * @return Unicode escape text in the form \u00XX.
         */
        private String convert(String hex) {
            StringBuilder output = new StringBuilder();
            output.append("\\u"); // standard unicode escape prefix.
            // pad with zeros so the escape is always four hex digits long.
            for (int j = 0, max = 4 - hex.length(); j < max; j++) {
                output.append("0");
            }
            output.append(hex.toLowerCase());
            return output.toString();
        }

        Any feedback on this potential workaround would be appreciated. If it's a valid fix I can add it to the core code base.
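
        For context on that open question, here is a tiny standalone check (the class name is hypothetical, not part of the patch) of what convert("b7") would produce: six characters of escape text, \u00b7, rather than the single decoded character U+00B7.

        public class ConvertDemo {
            public static void main(String[] args) {
                String hex = "b7";
                StringBuilder output = new StringBuilder();
                output.append("\\u"); // literal backslash + u, as in the proposed convert()
                // zero-pad so the escape always has four hex digits
                for (int j = 0, max = 4 - hex.length(); j < max; j++) {
                    output.append("0");
                }
                output.append(hex.toLowerCase());
                System.out.println(output);          // prints \u00b7
                System.out.println(output.length()); // 6, i.e. escape text, not one character
            }
        }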

        Patrick Corless added a comment -

        I've applied the name parsing change, but the document in question still has a mix of embedded and non-embedded CID fonts, so getting it to fully render will be difficult without the fonts used to encode it.

        Marking the issue as won't fix for now.

        Patrick Corless added a comment -

        After a bunch of work we are now correctly rendering most Japanese and Chinese based documents regardless of whether the fonts are embedded or not. The operating system still needs to have fonts that can render the Unicode characters.
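
        On that last point, java.awt.Font.canDisplayUpTo is a quick way to sanity check whether an installed font can actually show the characters. A minimal sketch follows; the font name and sample text are assumptions for illustration only:

        import java.awt.Font;

        public class CjkFontCheck {
            public static void main(String[] args) {
                // Sample CJK text; any string from the rendered document would do.
                String sample = "\u65e5\u672c\u8a9e \u4e2d\u6587";
                // Hypothetical font choice; substitute whatever the OS actually provides.
                Font font = new Font("SansSerif", Font.PLAIN, 12);
                // canDisplayUpTo returns -1 when the font can render every character,
                // otherwise the index of the first character it cannot display.
                int firstBad = font.canDisplayUpTo(sample);
                System.out.println(firstBad == -1
                        ? "Font can display the sample text"
                        : "Font cannot display character at index " + firstBad);
            }
        }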


          People

          • Assignee: Patrick Corless
          • Reporter: Patrick Corless
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
            • Updated:
            • Resolved: