ICEpdf / PDF-13

Add support for multi-threaded document stream loading/parsing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 5.0.0 alpha1, 5.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      ICEpdf

      Description

      Note from customer:

      Please note that the test program uses two threads to convert two identical PDF files. If I use a single thread to convert the two PDF files sequentially, the comparison succeeds (the image files are the same). If I save the converted images and open them in Microsoft Paint, I see that one of the images is missing a chart.
      1. Test.java
        2 kB
        Tyler Johnson

        Activity

        Tyler Johnson created issue -
        Tyler Johnson made changes -
        Field Original Value New Value
        Attachment Test.java [ 11707 ]
        Patrick Corless made changes -
        Salesforce Case []
        Fix Version/s 3.1 [ 10181 ]
        Patrick Corless added a comment -

        I've been testing this issue with a similar application that tries to initialize or extract text using multiple threads. After several days of testing and debugging I think I have found a couple of hot spots.

        The first area of concern is the thread access mechanism in the implementations of SeekableInput. There are two implementations, RandomAccessFileInputStream and SeekableByteArrayInputStream. Both use a wait/notify mechanism for starting and ending thread access. On a micro scale this locking works pretty well, but it becomes problematic when dealing with more than one input stream per thread. For example:

        Font.init (Thread 1)

        • start thread 1 access
          font.getFontDescriptor (new input stream)
        • start thread 1 access
        • close thread 1 access
              • This is where thread 2 can gain access, leaving the initial thread 1 access hanging.

        The problem seems to happen more often than not in the SequenceInputStream when parsing through a page content stream that has more than one content stream. When SequenceInputStream closes one stream and goes on to the next, it makes it possible for another thread to jump in, which usually causes problems.

        Something that is not clear to me is where we use the SeekableInput, because it inherits from InputStream and so it is hard to find all the access points. I'm pretty sure there are still some SeekableInput usages that don't use startThreadAccess and endThreadAccess and thus mess things up.

        There is one last problem that seems to show up due to shared PDF resources. For example, it is possible for a Font object to initialize some other object from the document file, which causes the current thread to lose access to the stream. At that point a new thread gets access and tries to initialize the same font, resulting in a deadlock: the initial font object is stuck waiting for thread access and the second thread is stuck waiting for access to the font init method.
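
The nested start/close sequence above can be sketched with a reentrant lock, which is the direction the later "experimental reentrant lock for file reads" commit on this issue points at. This is only an illustration under stated assumptions: `ThreadAccessGuard`, `beginAccess`, `endAccess` and `tryAccess` are hypothetical names, not ICEpdf's API, and the real SeekableInput classes are not used. The idea is that an inner close only decrements a hold count, so a second thread cannot get in until the outermost access ends.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the start/end thread access pair on a stream. */
final class ThreadAccessGuard {
    private final ReentrantLock lock = new ReentrantLock();

    void beginAccess() { lock.lock(); }        // re-entry by the owning thread just bumps the hold count
    void endAccess() { lock.unlock(); }        // released for other threads only when the count reaches 0
    boolean tryAccess() { return lock.tryLock(); }
    int holdCount() { return lock.getHoldCount(); }
}

final class NestedAccessSketch {
    /** Mimics Font.init opening a second stream (the font descriptor) mid-parse. */
    static int nestedDepthSeen(ThreadAccessGuard guard) {
        guard.beginAccess();                   // outer access: Font.init
        try {
            guard.beginAccess();               // inner access: getFontDescriptor
            try {
                return guard.holdCount();      // both levels are held by this thread
            } finally {
                guard.endAccess();             // inner close: outer access is still held
            }
        } finally {
            guard.endAccess();                 // outer close: lock becomes available
        }
    }

    /** Returns true if a second thread is kept out after the inner close. */
    static boolean secondThreadBlockedAfterInnerClose() {
        ThreadAccessGuard guard = new ThreadAccessGuard();
        AtomicBoolean acquired = new AtomicBoolean(false);
        guard.beginAccess();                   // outer access
        try {
            guard.beginAccess();               // inner access
            guard.endAccess();                 // inner close
            Thread other = new Thread(() -> {
                if (guard.tryAccess()) {       // "thread 2" attempting to jump in
                    acquired.set(true);
                    guard.endAccess();
                }
            });
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !acquired.get();
        } finally {
            guard.endAccess();                 // outer close
        }
    }
}
```

With a plain wait/notify flag the inner close would hand the stream to the waiting thread; here the second thread's attempt fails until the outer access is released.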

        Patrick Corless added a comment -

        As I mentioned, I didn't get too far on solving the problem. Interestingly enough, if a PDF file doesn't use shared fonts and has only one content stream per page, we can use as many threads as we want to parse the file. I'm curious whether another, higher-level thread access system could be used to make sure that we don't lose access to an input stream in the middle of an object parse.

        I'm going to check in my modifications as they did fix a couple of smaller problems. Hopefully we can look at this again for the next major release.

        Repository Revision Date User Message
        ICEsoft Public SVN Repository #19338 Mon Oct 05 06:13:53 MDT 2009 patrick.corless PDF-13 - updated initialization logic so that shared resource initialization is synchronized, also added some Thread.interupted logic to help with respsonsiveness.
        Files Changed
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/MemoryManager.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/ContentParser.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/PageTree.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/Catalog.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/Form.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/Page.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/ShadingType1Pattern.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/Library.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/ShadingType3Pattern.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/graphics/ShadingType2Pattern.java
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #19339 Mon Oct 05 06:15:03 MDT 2009 patrick.corless PDF-13 - stream thread access changes to try and help with multithread stream access.
        Files Changed
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/pobjects/ObjectStream.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/io/SeekableByteArrayInputStream.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/io/RandomAccessFileInputStream.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/Utils.java
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/util/LazyObjectLoader.java
        Patrick Corless made changes -
        Fix Version/s 3.2 [ 10212 ]
        Fix Version/s 3.1 [ 10181 ]
        Ken Fyten made changes -
        Fix Version/s 4.0 - Beta [ 10212 ]
        Ken Fyten made changes -
        Summary Stream loading issue with multiple threads Add support for multi-threaded document stream loading/parsing
        Issue Type Bug [ 1 ] New Feature [ 2 ]
        Salesforce Case []
        Patrick Corless made changes -
        Salesforce Case []
        Fix Version/s 5.0 [ 10314 ]
        Patrick Corless added a comment -

        A client has sent in a test case; I will take a quick look at this issue for 4.3.4.

        Patrick Corless made changes -
        Salesforce Case []
        Fix Version/s 4.3.4 [ 10341 ]
        Fix Version/s 5.0 [ 10314 ]
        Migration made changes -
        Fix Version/s 4.5 [ 10342 ]
        Fix Version/s 4.3.4 [ 10341 ]
        Patrick Corless made changes -
        Fix Version/s 5.0 [ 10314 ]
        Fix Version/s 4.5 [ 10342 ]
        Patrick Corless added a comment -

        As part of the major refactoring for PDF-319, the handling of stream content has been reworked. When a stream is parsed, the uncompressed stream bytes are stored and the file lock is released. The Stream is then available for processing on a new thread. This work was also coupled with the removal of the memory manager in favor of a Soft and Weak Reference model. From early testing, three threads seem to be optimal for processing a document. With any more, the file locks tend to slow things down on a 4-core machine; newer systems may be able to do more.
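
A rough sketch of the pattern described above, under loud assumptions: `BufferedStreamSketch`, `copyStreamBytes`, `parse` and `parseAll` are hypothetical names, a byte array stands in for the PDF file, and summing bytes stands in for real content parsing. It only shows the shape of the idea: hold the file lock just long enough to copy a stream's bytes, then parse the copy on a worker thread without the lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

final class BufferedStreamSketch {
    private final byte[] file;                               // stands in for the file on disk
    private final ReentrantLock fileLock = new ReentrantLock();

    BufferedStreamSketch(byte[] file) { this.file = file; }

    /** Copies one stream's bytes while holding the file lock, then releases it. */
    byte[] copyStreamBytes(int offset, int length) {
        fileLock.lock();
        try {
            return Arrays.copyOfRange(file, offset, offset + length);
        } finally {
            fileLock.unlock();
        }
    }

    /** Placeholder "parse": works only on the private copy, so no lock is needed. */
    static int parse(byte[] streamBytes) {
        int sum = 0;
        for (byte b : streamBytes) sum += b;
        return sum;
    }

    /** Splits the file into fixed-size chunks and parses each one on a small pool. */
    static List<Integer> parseAll(byte[] file, int chunk, int threads) {
        BufferedStreamSketch source = new BufferedStreamSketch(file);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int offset = 0; offset < file.length; offset += chunk) {
                final int start = offset;
                final int length = Math.min(chunk, file.length - offset);
                Callable<Integer> task = () -> parse(source.copyStreamBytes(start, length));
                pending.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) results.add(f.get());   // results stay in submit order
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `BufferedStreamSketch.parseAll(bytes, 2, 3)` runs the placeholder parse on three worker threads, matching the thread count mentioned in the comment; the right pool size for a real document is an empirical question.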

        Another interesting refactor was to move image loading onto a new worker thread to avoid pausing the content parse while the image is decoded. The new ImageProxy will, however, block if paint is called before the image is fully decoded. So it is not a true image proxy, but overall parse speed has been greatly improved for some image-heavy documents.
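
The blocking behaviour described above can be approximated with a `Future`. This is a minimal sketch, not the actual ImageProxy class: `LazyImageSketch`, `pixelsForPaint` and `demo` are hypothetical names, an `int[]` stands in for decoded image data, and the 50 ms sleep simulates a slow decode.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class LazyImageSketch {
    private final Future<int[]> pixels;

    /** Decoding starts on the worker immediately; the caller (the parser) does not wait. */
    LazyImageSketch(ExecutorService decoder, Callable<int[]> decodeTask) {
        this.pixels = decoder.submit(decodeTask);
    }

    boolean isDecoded() { return pixels.isDone(); }

    /** Called at paint time: blocks only if the decode has not finished yet. */
    int[] pixelsForPaint() {
        try {
            return pixels.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Small usage example: submit a slow decode, keep going, then paint. */
    static int[] demo() {
        ExecutorService decoder = Executors.newSingleThreadExecutor();
        try {
            Callable<int[]> slowDecode = () -> {
                Thread.sleep(50);                  // simulated decode time
                return new int[] {1, 2, 3};
            };
            LazyImageSketch image = new LazyImageSketch(decoder, slowDecode);
            // ... content parsing would continue here without waiting ...
            return image.pixelsForPaint();         // waits for the decode if necessary
        } finally {
            decoder.shutdown();
        }
    }
}
```

A proxy that never blocks would instead paint a placeholder and repaint once `isDecoded()` turns true; the sketch follows the blocking variant the comment describes.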

        Hopefully in the future we can take the rework a step further using NIO and avoid having to manage the file marker one thread at a time.

        Patrick Corless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Repository Revision Date User Message
        ICEsoft Public SVN Repository #32746 Tue Dec 11 14:54:08 MST 2012 patrick.corless PDF-13 addition of experimental reentrant lock for file reads.
        Files Changed
        Commit graph MODIFY /icepdf/trunk/icepdf/core/src/org/icepdf/core/io/RandomAccessFileInputStream.java
        Patrick Corless made changes -
        Fix Version/s 5.0.0 alpha1 [ 10676 ]
        Patrick Corless made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Tyler Johnson
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: