Details
-
Type: Improvement
-
Status: Closed
-
Priority: Major
-
Resolution: Fixed
-
Affects Version/s: 4.3
-
Fix Version/s: 5.0.0 alpha1, 5.0.0 beta1, 5.0
-
Component/s: Core/Parsing
-
Labels: None
-
Environment: any
Description
When building the PostScript calculator for type 4 function support I did quite a bit of research into parsing techniques. The end result was a relatively quick parsing engine. Once this work was completed I started working on a new PDF content parser system using the same techniques. In theory the new parser should be on the order of 50x faster than the current one.
The ContentParser in ICEpdf is tightly coupled with the generic Parser class. The Parser class feeds the ContentParser tokens for processing. This Parser is multipurpose, handling both stream and dictionary parsing as well as providing tokens in a page content stream. The main problem here is that content stream operator tokens are returned as strings from the Parser, and the ContentParser then uses .equals to find and execute the matching command. There are 90-plus operator tokens, which is a lot of string comparison that we could be doing more efficiently.
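To illustrate the kind of change being described, here is a minimal sketch of one way to avoid the chain of .equals calls: pack the raw operator bytes into an int and switch on it. This is not the actual ICEpdf code. The class name OperatorDispatch, the OP_* constants and the small operator subset are all made up for the example, and it assumes operators are at most three bytes long.

```java
final class OperatorDispatch {
    // Hypothetical operator ids, invented for this sketch.
    static final int OP_UNKNOWN = 0;
    static final int OP_BT = 1;
    static final int OP_ET = 2;
    static final int OP_Tj = 3;
    static final int OP_TJ = 4;
    static final int OP_re = 5;
    static final int OP_f = 6;

    /** Packs up to three operator bytes into a single int key. */
    static int pack(byte[] buf, int offset, int length) {
        int key = 0;
        for (int i = 0; i < length; i++) {
            key = (key << 8) | (buf[offset + i] & 0xFF);
        }
        return key;
    }

    /**
     * Maps the operator bytes to an id with one switch instead of a
     * sequence of String.equals comparisons. Only a handful of operators
     * are listed; a real table would cover all of them.
     */
    static int lookup(byte[] buf, int offset, int length) {
        if (length < 1 || length > 3) {
            return OP_UNKNOWN; // assumption: operators are 1 to 3 bytes
        }
        switch (pack(buf, offset, length)) {
            case ('B' << 8) | 'T': return OP_BT;
            case ('E' << 8) | 'T': return OP_ET;
            case ('T' << 8) | 'j': return OP_Tj;
            case ('T' << 8) | 'J': return OP_TJ;
            case ('r' << 8) | 'e': return OP_re;
            case 'f':              return OP_f;
            default:               return OP_UNKNOWN;
        }
    }
}
```

With this shape no String is allocated per operator, and the content parser can switch on the returned id to run the command.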
One further problem with the Parser class is that it assumes that a content stream is always well formed and that operators, names and numbers will always be whitespace separated. This is not the case, and a new setup should be able to determine tokens even if spaces are not present.
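As a rough sketch of what tokenizing without relying on whitespace could look like, the example below ends a token at either a whitespace character or a PDF delimiter character. The class ContentTokenizer is hypothetical and is not the ICEpdf implementation. It works on a String for brevity, and it only treats literal strings, names and single delimiter characters specially; hex strings, dictionaries and comments are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ContentTokenizer {
    /** PDF whitespace: NUL, tab, line feed, form feed, carriage return, space. */
    static boolean isWhitespace(int c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    /** PDF delimiter characters, which end a token even without whitespace. */
    static boolean isDelimiter(int c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = s.length();
        while (i < n) {
            char c = s.charAt(i);
            if (isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '(') {
                // Literal string: track nested parentheses, skip escaped chars.
                int depth = 0;
                do {
                    char d = s.charAt(i++);
                    if (d == '\\') i++;
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                } while (depth > 0 && i < n);
            } else if (c == '/') {
                // Name: the slash plus regular characters that follow it.
                i++;
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) i++;
            } else if (isDelimiter(c)) {
                // Any other delimiter is emitted as a one-character token here.
                i++;
            } else {
                // Number or operator: runs until whitespace or a delimiter.
                while (i < n && !isWhitespace(s.charAt(i)) && !isDelimiter(s.charAt(i))) i++;
            }
            tokens.add(s.substring(start, Math.min(i, n)));
        }
        return tokens;
    }
}
```

For example, ContentTokenizer.tokenize("/F1 12 Tf[(A)]TJ") splits the run "Tf[(A)]TJ" into separate tokens even though it contains no spaces.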
I've already done quite a bit of work on this. I will likely create a 4.3 branch and use the trunk to start checking in work for this optimization.
Activity
Patrick Corless
created issue -
Patrick Corless
made changes -
Field | Original Value | New Value |
---|---|---|
Salesforce Case | [] | |
Fix Version/s | | 5.0 [ 10314 ] |
Patrick Corless
made changes -
Field | Original Value | New Value |
---|---|---|
Status | Open [ 1 ] | Resolved [ 5 ] |
Fix Version/s | | 5.0.0 beta1 [ 10677 ] |
Fix Version/s | | 5.0.0 alpha1 [ 10676 ] |
Resolution | | Fixed [ 1 ] |
Patrick Corless
made changes -
Field | Original Value | New Value |
---|---|---|
Status | Resolved [ 5 ] | Closed [ 6 ] |