Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.1
- Fix Version/s: 4.0 - Beta, 4.0
- Component/s: Core/Parsing
- Labels: None
- Environment: All
Description
Relevant issues:
1. Linearised file becoming unlinearised. We don't support linearisation, and we'd probably want to avoid re-writing the entire file.
2. If existing file uses cross reference streams, should appended section use cross reference stream or xref table? There are also hybrid files that use both.
3. If existing file uses object streams, should new objects go into an object stream (which allows for encryption), or simply write objects directly?
4. When new objects are created, they have to be assigned object numbers, which is something you take from the existing xref table, since it tells you what the next number to use is.
5. Are we only adding objects, or will we be editing and deleting them? This affects the xref writing.
6. If the original document was encrypted, the updated document must also be encrypted.
7. Will the save feature be a standard or pro feature?
A. Know how to write a trailer
B. Know how to write a xref table, and possibly cross reference stream
C. Possibly know how to write an object stream
D. Know how to write whichever object is being added or modified
E. Know how to write a Page object
Activity
Annotation borders are mutable, might be a good place to start.
Responses:
1. Don't have to de-linearise file, as it's up to linearisation reader support to detect incremental updates
2. Only need to use cross reference streams if using object streams. Refer to #3.
3. Only need to use object streams if stream encrypting objects. Refer to #6.
4. Will add an API on Document for getting new Reference objects for newly created objects. Deleted entries get their generation number bumped in the xref table, and are marked 'f' for free. Modified objects don't get a new object number or generation number. Generation numbers are only used for re-using object numbers that have previously been freed (deleted). Acrobat doesn't re-use object numbers, so we won't either.
5. We will be adding objects, modifying them, and deleting them.
6. We currently don't support reading Crypt filtered streams, let alone writing them. So we'll only rely on String encryption, and won't do stream encryption. No object streams -> no reference streams.
7. Save will be pro feature, like font engine
A. Know how to write a trailer
B. Know how to write a xref table, but don't have to write a cross reference stream
C. Don't have to write an object stream
D. Know how to write whichever object is being added or modified -> Annotation, Annots array
E. Know how to write a Page object
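The object-number allocation rule from response #4 can be sketched as follows. The class and method names here are illustrative stand-ins, not the actual ICEpdf API; the only assumption baked in is that the trailer's /Size (one greater than the highest object number in use) seeds the next number, and freed numbers are never re-used, matching Acrobat's behaviour.

```java
// Illustrative sketch of Reference allocation for newly created objects.
public class ReferenceAllocator {
    private int nextObjectNumber;

    public ReferenceAllocator(int trailerSize) {
        // /Size from the previous trailer is also the next free object number.
        this.nextObjectNumber = trailerSize;
    }

    /** New objects always start at generation 0; freed numbers are never re-used. */
    public int[] newReference() {
        return new int[] { nextObjectNumber++, 0 };
    }

    public static void main(String[] args) {
        ReferenceAllocator alloc = new ReferenceAllocator(12); // previous /Size was 12
        int[] r1 = alloc.newReference();
        int[] r2 = alloc.newReference();
        System.out.println(r1[0] + " " + r1[1] + ", " + r2[0] + " " + r2[1]);
        // prints "12 0, 13 0"
    }
}
```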
Issues:
8. Add API to page for adding, deleting, editing objects. Should this be general in nature, or just specific to Annotations? If we continue adding support for other object types, how much will that API expand? What about objects shared between Page objects? Does that indicate we should do this on the document?
9. Should we try to merge in modified objects into a pseudo xref object, so they'll behave exactly as existing objects? Or just track the objects in proprietary data-structures?
10. What if user adds Annotation, and uses Reference generator from #4, then deletes it. How will we handle that object number?
11. If an entry in an Annotation object is an indirect object, do we assume we have to write it out too? Obviously yes for new annotations, but what about edited Annotations? Do we try to detect what has changed? If deleting an Annotation do we delete the constituent parts? What if they're shared? Do we try to detect shared objects?
It looks like there are two emerging approaches for how we'll manage Annotations:
The first, and preferred due to the perceived lower effort and greater simplicity, will be to just track the Annotation objects in the RI and Page, until the moment when we save, at which point the Annotations may be assigned the required object numbers, and the implications of deletions can be resolved. As such, the Annotations will be a special case, and be specially treated by Page as not necessarily existing in the Library.
The second approach will be to make use of an incremental update pseudo xref table and possibly modify the lazy object loader or library. This way the system will treat the modified Annotation objects exactly like pre-existing Annotations from the PDF file. The downside of this is that multiple edits, deletions of newly added annotations, or any kind of undo operations, will have to be mirrored throughout more of the system, in the Library, and the pseudo xref, and Page.
Responses:
8. API should be annotation-specific for this release. We can create a more general purpose API if required in the future.
9. For now, assume the strategy of special-casing Annotations, and hope to avoid the pseudo xref strategy.
10. If we only assign object numbers as we save, then this won't be an issue, but if we assign them as we create the objects, which may then be further modified or deleted, then try to treat them the same as pre-existing annotations that got deleted. Don't worry about reclaiming or otherwise optimizing object number usage for initial impl, likely isn't a large issue.
11. When deleting annotations, free them at the top level and orphan/leak any sub-objects, since they may be shared. Don't try to detect sharing. For edits, we'll treat those as deletions and additions. Otherwise we'd have to do a recursive diff on the old and new Annotation. The RI will just be doing adds and removes.
As part of PDF-74 I've created a StateManager class that is an instance variable of Document. This new class holds PObjects that have been changed as the result of an annotation addition or deletion. The main idea of this class is that its cache can be iterated over to get a list of objects that should be added to the new document body during a file save.
I still have to flush out some bugs with the manager but I think it's a good starting point for keeping track of document changes.
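The StateManager idea described above can be sketched roughly as below. This is a simplified stand-in, not the actual ICEpdf class: it just demonstrates the change cache that is iterated at save time, kept sorted by object number so output is deterministic (which also addresses sub-issue B.1 later on).

```java
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified sketch of a per-Document cache of changed objects.
public class StateManagerSketch {
    // object number -> changed object (added, modified, or freed)
    private final SortedMap<Integer, Object> changes = new TreeMap<>();

    public void addChange(int objectNumber, Object pobject) {
        changes.put(objectNumber, pobject);
    }

    public boolean isChanged() {
        return !changes.isEmpty();
    }

    /** Objects to append to the document body at save time,
        sorted by object number so the output file is deterministic. */
    public Iterator<Map.Entry<Integer, Object>> iterator() {
        return changes.entrySet().iterator();
    }
}
```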
For doing D, E of writing out the Annotation and Page objects, that requires being able to generically write out objects, dictionaries, and PDF primitives. Here are some sub-issues for that:
DE.1. If a dictionary value or array element value is a real number, should I get the value as a float, then Float.toString() that, or get it as a double and Double.toString() it, or just call toString() right on the Number, to ensure no intermediary conversion?
DE.2. The spec says to not output exponential notation, but maybe those toString() methods do that. Any idea how to enforce no scientific or engineering notation with real numbers?
DE.3. If a String has Unicode values in it, should I output that FEFF kind of string? Is that a literal or hex string? How should I handle unicode strings?
DE.4. The SecurityManager handling of string encryption seems to require a Reference. Is that for the parent dictionary of the string, or does that mean that only indirect strings, that have a reference to them, can be encrypted?
DE.5. Can there be Unicode characters in Names, that might have to be escaped? Are there any issues of escaping with Names? Or can I just truncate the String chars to 8 bit bytes?
DE.6. When you mark a pre-existing Annotation as deleted, are you modifying the generation number?
DE.7. If I come across a java.awt.Rectangle, what conversions are necessary to write it out as an array? I think it's: [minX, minY, maxX, maxY]
DE.8. I believe that the AnnotationFactory, and any other Annotation creation code should be using Name objects, not Strings, for the Type and Subtype fields, and any other similar fields
Responses:
DE.1 and DE.2 : Just use it as Number, and do a Number.toString() for now.
DE.3 : The problem is that the octal notation really just handles 8 bit characters, above the 7 bits that ASCII covers. I had to find out how to handle Unicode characters beyond 8 bits, and see how 8 bit accented characters are supposed to be handled.
On further investigation, I found that:
The dictionary string handling is a bit different. Instead of simply being UTF-8 and that being that, they have their own character set, PDFDocEncoding, that's similar to Latin1, and then they have a 16 bit Unicode one, and you can choose which to use. So I basically have to find out how to get Java to output Strings as each type, and possibly have some detection algorithm to default to the PDFDocEncoding one, but switch to the 16 bit Unicode one when there are characters that aren't handled by it.
What I'm worried about though is when we originally parse the file, and build the dictionary strings, then presumably Java is building those strings assuming that the bytes being fed to it are the platform character set, and not PDFDocEncoding. Since the windows code page character set is only subtly different from Latin1, and PDFDocEncoding is probably pretty close, then we probably wouldn't have run into this as being a parser issue, especially since this is dictionary strings, not content stream strings, so we probably don't show them anywhere.
So, to verify that, I'll probably have to find an example of a dictionary string that we'd display, and use that for testing. Hopefully something in Annotation. Maybe the username of the Annotation creator, or some such thing. That way, when we open a PDF with annotations, edit it, save it, and reopen it later, the round-tripping of the funky characters will be preserved, and not get corrupted with each edit cycle.
DE.4 : Use the containing object's Reference for the SecurityManager encryption.
DE.5 : Name strings are a UTF-8 string, which is then hex escaped for byte 35 and any bytes outside of 33-126.
DE.6 : The StateManager is not altering the generation number, leaving this to the xref writer.
DE.7 : The Annotation creation code will now have this as an array that I simply write out without conversion.
DE.8 : The Annotation creation code now uses Names, as appropriate, instead of strings.
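The Name escaping rule from the DE.5 response can be sketched as below. The method name is hypothetical; the logic follows exactly what the response states: UTF-8 encode the name, then hex-escape byte 35 ('#') and any byte outside the printable 33..126 range.

```java
import java.nio.charset.StandardCharsets;

// Sketch of Name serialization per the DE.5 response.
public class NameWriter {
    public static String writeName(String name) {
        StringBuilder out = new StringBuilder("/");
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // mask to avoid sign extension on bytes >= 0x80
            if (v == 35 || v < 33 || v > 126) {
                out.append(String.format("#%02X", v)); // hex escape as #XX
            } else {
                out.append((char) v);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeName("Annots"));     // prints "/Annots"
        System.out.println(writeName("A#B"));        // prints "/A#23B"
        System.out.println(writeName("Space Name")); // prints "/Space#20Name"
    }
}
```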
Responses:
A. Know how to write a trailer
According to the spec, the new trailer should include all the values from the previous trailer, except the Prev and Size fields should be updated. Prev has to point to where the previous cross reference is, and Size may be larger than the previous Size, to accommodate newly used object numbers, for new objects being appended to the PDF.
A complicated issue that I ran into was what to do when a PDF is slightly corrupted, and so was loaded via linear traversal. If we edit it, and add an incremental update, some questions come to mind. Should we update it at all? How do we populate the Prev value since the file offset locations are probably invalid? How do we intentionally corrupt the new trailer, so that the parser will do a linear traversal of the new PDF, since we can't really do lazy loading? We decided to allow updates, not link back with Prev, and set the startxref file offset to -1 to force linear traversal. This has to be tested though, with both ICEpdf and Acrobat.
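The trailer rules above, including the linear-traversal decision, can be sketched like this. The class is a hypothetical stand-in (ICEpdf's actual writer differs); the flag names and offsets are illustrative. The one deliberate quirk is writing -1 after startxref to force linear traversal when the source file's offsets can't be trusted.

```java
// Sketch of incremental-update trailer output as decided above.
public class TrailerWriter {
    public static String writeTrailer(long prevXrefOffset, int newSize,
                                      long newXrefOffset, boolean forceLinearTraversal) {
        StringBuilder sb = new StringBuilder();
        sb.append("trailer\n<<\n");
        // Size may grow to accommodate newly appended objects.
        sb.append("/Size ").append(newSize).append('\n');
        if (!forceLinearTraversal) {
            // Only link back when the previous file offsets are trustworthy.
            sb.append("/Prev ").append(prevXrefOffset).append('\n');
        }
        sb.append(">>\nstartxref\n");
        // -1 deliberately breaks lazy loading so the parser falls back
        // to linear traversal of the updated file.
        sb.append(forceLinearTraversal ? -1L : newXrefOffset);
        sb.append("\n%%EOF\n");
        return sb.toString();
    }
}
```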
B. Know how to write a xref table
The only real question that had to be figured out, from reading the spec, was how to handle the deleted item chaining. Everything else was pretty intuitive. It did bring up some subtasks, that would be necessary to handle the xref outputting with proper deleted item handling:
B.1. Have StateManager.iterator() return PObjects sorted by object number, so objects aren't random in file. Not necessary, but will help debugging.
B.2. Have Entry list use insertion sorting by object number. Very necessary, to facilitate the following steps.
B.3. Create fake deleted Entry items for object numbers beyond Prev's Size, that are gaps in our real Entry object number sequences.
B.4. Iterate over Entry list in reverse, so deleted Entry items point to the next deleted Entry. Track first deleted Entry object number for our object number zero entry, that will point to it.
B.5. Here, iterate over Entry list, so can do xref subsections properly. Don't forget to start with object number zero.
B.6. Have an empty newline after, so the xref parser will know there are no more subsections.
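Steps B.2 through B.4, the deleted-item chaining, can be sketched as below. The Entry class is a stand-in, not an ICEpdf type: entries are kept sorted by object number, then iterated in reverse so each free entry points at the next free one, with the chain terminating back at object number zero, which heads the free list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of xref free-list chaining (sub-issues B.2-B.4).
public class XrefFreeChain {
    static class Entry {
        int objectNumber;
        boolean free;
        int nextFreeObjectNumber; // meaningful only when free
        long offset;              // meaningful only when in use
        Entry(int objectNumber, boolean free) {
            this.objectNumber = objectNumber;
            this.free = free;
        }
    }

    /** Entries must already be sorted by object number (B.2).
        Returns the first free object number, which the object-zero
        entry must point to. */
    public static int chainFreeEntries(List<Entry> entries) {
        int nextFree = 0; // the free list terminates by circling back to object 0
        for (int i = entries.size() - 1; i >= 0; i--) { // reverse iteration (B.4)
            Entry e = entries.get(i);
            if (e.free) {
                e.nextFreeObjectNumber = nextFree;
                nextFree = e.objectNumber;
            }
        }
        return nextFree;
    }
}
```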
Remaining TODO:
B. Have to do these.
6. Still have to handle encryption.
DE.9. Have to add support for hex string outputting. Look into HexStringObject.
Finished B, writing the xref table. Ran into several bugs, where non-dictionary top-level objects were being mis-output, and null values were being output as the literal string (null).
Committed the bulk of the incremental update functionality. There are still some straggler issues though.
Some corner cases to make sure we test:
- Incremental update with linear traversal PDF. Make sure still does linear traversal, and edits show up.
- Reals not outputting in scientific/engineering notation
- What is output from existing LiteralStringObject, HexStringObject?
- What is output from existing literal and hex strings? Did they all become LSO, HSO?
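For the reals corner case above (DE.2's no-scientific-notation requirement), one possible approach, and this is an assumption, not necessarily what ICEpdf does, is BigDecimal.toPlainString(), since Double.toString() switches to exponent form for very small or very large magnitudes.

```java
import java.math.BigDecimal;

// One candidate way to keep real numbers out of scientific notation.
public class RealWriter {
    public static String writeReal(double value) {
        // Double.toString(0.0000001) yields "1.0E-7";
        // BigDecimal.toPlainString() never emits an exponent.
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(writeReal(1.5));       // prints "1.5"
        System.out.println(writeReal(0.0000001)); // no 'E' in the output
    }
}
```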
Subversion 19871
icepdf/core/src/org/icepdf/core/pobjects/annotations/BorderStyle.java
icepdf/core/src/org/icepdf/core/application/Capabilities.java
icepdf/core/src/org/icepdf/core/io/CountingOutputStream.java
icepdf/core/src/org/icepdf/core/pobjects/Document.java
icepdf-pro/font-engine/src/org/icepdf/core/util/IncrementalUpdater.java
icepdf/core/src/org/icepdf/core/util/LazyObjectLoader.java
icepdf/core/src/org/icepdf/core/pobjects/PObject.java
icepdf/core/src/org/icepdf/core/pobjects/PTrailer.java
icepdf/core/src/org/icepdf/core/pobjects/StateManager.java
icepdf/viewer/src/org/icepdf/ri/common/SwingController.java
Support writing out null values.
Subversion 22002
icepdf-pro/font-engine/src/org/icepdf/core/util/IncrementalUpdater.java
Encryption issue for action dictionary string writes.
Analysis of string encryption that we've agreed to:
There are 6 cases for strings being written out: three string types (LiteralStringObject, HexStringObject, and java.lang.String) times two scenarios (the document was not encrypted, or it was encrypted).
When parsed from a PDF, both LiteralStringObject and HexStringObject hold the raw original bytes from the PDF. That means there are 8 bits of information in each char.
Let me go on a tangent for a second. I was previously worried about us corrupting the strings, as we read in bytes and convert to String, and who knows which platform specific character encoding is used. I'm not worried anymore. In our Parser we're not using any String constructor that takes a byte[], so it doesn't apply. We're using StringBuffer, and doing InputStream (Reader would cause problems too) reads, and casting to char. There's no inadvertent sign extension, since InputStream.read() gives an int holding only the byte of data. When we go back to bytes, we're just casting back. So, bit-wise nothing's being corrupted. There's still the issue of user input strings having beyond ASCII characters, that will need to be handled, but right now we're only allowing editing of URIs, which only allow a subset of ASCII characters.
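The round-trip claim in that tangent can be demonstrated directly. This is a standalone illustration, not ICEpdf parser code: InputStream.read() returns an int in 0..255 (or -1 at EOF), so casting to char and later truncating back to byte loses nothing bit-wise.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates byte -> char -> byte round-tripping without corruption.
public class ByteCharRoundTrip {
    public static byte[] roundTrip(byte[] raw) throws IOException {
        InputStream in = new ByteArrayInputStream(raw);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b); // read() yields 0..255, so no sign extension
        }
        byte[] back = new byte[sb.length()];
        for (int i = 0; i < back.length; i++) {
            back[i] = (byte) sb.charAt(i); // truncate back to 8 bits
        }
        return back;
    }
}
```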
Ok, back to encryption. In the unencrypted scenario, the LiteralStringObject holds the actual 7 bit ASCII string characters. Any 8 bit values are octal escaped. In the encrypted scenario, it holds 8 bit binary values. But there's no way to know when a LiteralStringObject is encrypted or not, since there are no flags per LSO. When we need to access the unencrypted values, we just decrypt it. If the PDF is encrypted, every LSO is encrypted. If the PDF is not encrypted, there's no key to decrypt, so we pass through the underlying data. Either way, the LSO already holds whatever values, as bytes, that the IncrementalUpdater should just write out, without trying to process.
For HexStringObject, it's the same story, it either contains ASCII characters (because they're hex, so they're 0-9, a-f, A-F), or the 8 bit binary encrypted data. Likewise there's no way to know if one specific HSO is encrypted, you just assume that if the PDF is encrypted, the HSO is too. Either way, those raw bytes are what the IncrementalUpdater should output.
In those two scenarios it's clear, if page/annotation/action/destination editing or creating involves making a LiteralStringObject or a HexStringObject, from a java.lang.String returned from a swing editor component, then it needs to do the proper escaping (LSOs do octal escaping of 8 bit values, HSOs do hex escaping of all values), then encrypt those bytes, and store that in the LSO or HSO. Then the IncrementalUpdater can find them in the dictionaries, and simply write out those bytes, with no processing.
That leaves java.lang.String objects, found in the dictionary. The Parser would only make those as tokens for the stack, not for bona fide strings in an object. So our editing would be what's making them. We'll just make sure that our editing doesn't make java.lang.String objects, and just makes LiteralStringObjects, as described above. For now, the IncrementalUpdater will fail fast on java.lang.String values, to help us discover if we're incorrectly creating and storing them.
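The LSO escaping rule mentioned above (octal escaping of 8 bit values) can be sketched as below. This is an illustrative helper, not the actual LiteralStringObject code; it passes printable 7-bit characters through (escaping the parenthesis and backslash delimiters) and octal-escapes everything else. Per the analysis, encryption would happen before this escaping, when the LSO is constructed.

```java
// Sketch of literal-string escaping for PDF output.
public class LiteralStringEscaper {
    public static String escape(byte[] raw) {
        StringBuilder sb = new StringBuilder("(");
        for (byte b : raw) {
            int v = b & 0xFF;
            if (v == '(' || v == ')' || v == '\\') {
                sb.append('\\').append((char) v);   // escape string delimiters
            } else if (v >= 32 && v <= 126) {
                sb.append((char) v);                // printable 7-bit passthrough
            } else {
                sb.append('\\').append(String.format("%03o", v)); // octal escape
            }
        }
        return sb.append(')').toString();
    }
}
```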
When editing pre-existing objects, and when creating new objects, found out that it's a best practice to have all dictionary derived objects be indirect references.
For example, when altering an Annotation, if it has a direct Action, that has to become an indirect Action. The reason is because of how we do encryption. Let's say, in an encrypted PDF, there's a LinkAnnotation with a direct URIAction, and the user edits the URI. As the code stands, the new LiteralStringObject will get its Reference from the URIAction. But that will be null, since it's direct. Encryption will likely fail, or decryption will fail, since the Parser will actually wire the reference correctly, from the top-level annotation object.
Of course, there are other ways of dealing with that specific example, such as:
1. If the string was indirect, only changing that, and carrying forward the reference
2. If the string was direct, but the action indirect, finding the annotation's reference
But that's more complicated, and misses the point that if we just adhere to the best practice, then we'll skip over any similar issues.
Properly write out encrypted PDFs with encrypted strings. Fail fast when writing java.lang.String, so we can debug why there are java.lang.String objects in our data structures.
Subversion 19907
icepdf\core\src\org\icepdf\core\application\Capabilities.java
icepdf\core\src\org\icepdf\core\pobjects\Document.java
icepdf\core\src\org\icepdf\core\pobjects\StateManager.java
Subversion 22029
icepdf-pro\font-engine\src\org\icepdf\core\util\IncrementalUpdater.java
Fix encryption's conversion between bytes and chars.
Subversion 19914
C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\HexStringObject.java
C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\LiteralStringObject.java
C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\pobjects\security\StandardEncryption.java
C:\Documents and Settings\Mark Collette\IdeaProjects\ICEpdf3\core\src\org\icepdf\core\util\Utils.java
Have to have something edited, to have something to save.