ICEpdf / PDF-1271

Text extraction rotation issues

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.3.1
    • Fix Version/s: 6.3.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      A user has reported that extracted text comes out vertically, one character per line:
      h
      e
      l
      l
      o
      instead of hello.

      We've seen this issue in the past and have some corrective code to detect and adjust for the shift. Further investigation is needed.

        Activity

        Matthias Göbel added a comment -

        Hello Patrick,
        You may be interested in our investigation results for this issue.
        I suspect a problem with how the ‘Tm’ operator inside the COS stream interacts with the page's rotation attribute, especially when values in the text matrix are negative.
        Please let me know if you have any hints or if I can help.
        Kind regards
        Matthias

        Patrick Corless added a comment -

        The page rotation generally just states if the page should be rotated in the viewer. In this particular case the page is rotated from portrait to landscape.

        As you suggested, the Tm operator is suspect: in one case it is 0 1 -1 -0 241.44 126.231 Tm, which rotates the text by 90 degrees. We have code that tries to detect this type of encoding when sorting the document text for extraction. I'll need to experiment with it a bit more, but I hope to have something soon.
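
To illustrate the kind of detection discussed in this comment, here is a minimal Java sketch. It is not ICEpdf's actual code: the class `TextMatrixRotation` and its methods are hypothetical names made up for this example. It assumes the six Tm operands `a b c d e f` form the standard PDF text matrix, so the baseline direction is the vector (a, b), and that rotating glyph origins back by that angle is enough to restore left-to-right reading order before sorting.

```java
final class TextMatrixRotation {

    /** Parses the six numeric operands of an "a b c d e f Tm" operator string. */
    public static double[] parseTm(String operator) {
        String[] tokens = operator.trim().split("\\s+");
        if (tokens.length != 7 || !tokens[6].equals("Tm")) {
            throw new IllegalArgumentException("Expected 'a b c d e f Tm': " + operator);
        }
        double[] m = new double[6];
        for (int i = 0; i < 6; i++) {
            m[i] = Double.parseDouble(tokens[i]); // "-0" parses to -0.0, which is fine
        }
        return m;
    }

    /** Baseline rotation in whole degrees, normalised to [0, 360). */
    public static int rotationDegrees(double a, double b) {
        int rounded = (int) Math.round(Math.toDegrees(Math.atan2(b, a)));
        return ((rounded % 360) + 360) % 360;
    }

    /**
     * Rotates a glyph origin by the negative of the baseline angle so that
     * text drawn along a rotated baseline lines up on a horizontal line again.
     */
    public static double[] toReadingOrderSpace(double x, double y, int rotationDegrees) {
        double theta = Math.toRadians(-rotationDegrees);
        double cos = Math.cos(theta);
        double sin = Math.sin(theta);
        return new double[] { x * cos - y * sin, x * sin + y * cos };
    }

    public static void main(String[] args) {
        double[] m = parseTm("0 1 -1 -0 241.44 126.231 Tm");
        int rotation = rotationDegrees(m[0], m[1]);
        System.out.println("rotation = " + rotation); // prints "rotation = 90"

        // Two consecutive glyph origins on the rotated baseline (advance value is assumed).
        double[] first = toReadingOrderSpace(m[4], m[5], rotation);
        double[] second = toReadingOrderSpace(m[4], m[5] + 6.0, rotation);
        System.out.println("same line: " + (Math.abs(first[1] - second[1]) < 1e-9));
        System.out.println("left to right: " + (second[0] > first[0]));
    }
}
```

With the matrix from this ticket, consecutive glyphs share an x coordinate and differ only in y, which is why a sorter that groups lines by y emits one character per line. After the inverse rotation they share a y coordinate and advance in x, so the usual line-then-column sort should apply again. Whether the page rotation attribute also has to be folded into the angle is an open question here and is not handled by the sketch.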


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            2
            Watchers:
            4

            Dates

            • Created:
              Updated: