Commit graph

135 commits

Author SHA1 Message Date
d90eaf6f1c Add Box instances to allow handling some exceptions in monad and converting them to Traversable accessible from the data part of the type 2020-02-27 17:22:12 +01:00
99014ff30d Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it 2020-02-26 22:13:29 +01:00
f4df4aab22 Found a nicer formulation that doesn't require transitivity or index agglomeration and swapped argument of w for more reusability with at / atAll 2020-02-26 17:19:22 +01:00
bdbc5f7351 Generalize the Box instances on containers from the particular cases of Document/Layers and Layers/Objects and move them to PDF.Box 2020-02-25 17:36:54 +01:00
30fece6537 Notice the 'edit' I exported earlier could be reused to simplify the w implementation of the proof that Box is a transitive relation 2020-02-25 09:27:56 +01:00
1a70f2972b Expose Box index flags in PDF and PDF.Layer 2020-02-24 21:37:09 +01:00
67faa06ea2 Lift unused restriction on MonadFail for AllObjects instance of Box Layer 2020-02-24 21:36:31 +01:00
83a63d4b02 Implement Box instance from Layer to Object, either all at once or indexed by an ObjectId 2020-02-24 17:29:22 +01:00
85ee8519c4 Implement Box instances from Document to Layers and EOLStyle 2020-02-24 17:28:17 +01:00
e607f9cd37 Implement transitivity instance, extract a part of modifyAt as a convenient 'edit' function useful elsewhere and present a right-infix version of (,) to allow writing the nested tuple indexes more conveniently 2020-02-24 17:27:37 +01:00
a9252b129a Start a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down there 2020-02-23 22:24:59 +01:00
71e62ee732 Add IDs to Instructions so that they can be selected in a given Content (and modified one day…) 2020-02-23 22:21:09 +01:00
160999a7d7 A small renaming for more clarity and because I thought «update» could be needed for a function name but after all maybe not but it's still better that way 2020-02-23 22:17:09 +01:00
36b1782464 Follow previous renaming for a local variable in Navigation for more clarity 2020-02-23 22:15:52 +01:00
bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) 2020-02-17 15:29:59 +01:00
6096a1a237 Simplify navigations by centering everything on Objects to avoid needing to many conversion tools between DirectObject / Object / Dictionary 2020-02-15 13:51:24 +01:00
23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access 2020-02-15 10:25:09 +01:00
b916ab5206 Just noticed Streams are a kind of Dictionary too, since they have a header 2020-02-15 10:23:32 +01:00
4a6dbda7d3 Move Error type from Pages to Navigation as a candidate for MonadFail required by PDFContent defined there 2020-02-15 10:22:42 +01:00
923d1800b0 Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read 2020-02-14 18:02:40 +01:00
1c457d71d8 Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec 2020-02-14 18:00:12 +01:00
a72d76e229 Add unit tests to make sure I'm not breaking things too much 2020-02-14 17:58:03 +01:00
919f640443 Merge branch 'extract-text' into navigation 2020-02-12 17:35:56 +01:00
32f9866106 Use peek to improve directObject parser avoiding a large <|> disjunction 2020-02-12 17:34:27 +01:00
eb4d76002c Finish the split of Navigation out of Page, generalize the use of MonadFail with a custom Error monad (~= Either String) 2020-02-11 22:41:46 +01:00
af994cb50c WIP: in the process of migrating to Object.Navigation in Pages, still unsure how to manage simple Content parsing and efficient font loading (+ giving a way to edit Contents) 2020-02-11 17:59:15 +01:00
704d7a7fcf It turns out Output.concat wasn't necessary, OBuilder seems already is a Monoid so mconcat works (that fact was used in the very implementation of concat…) 2020-02-11 17:36:29 +01:00
11647eb4eb Implement output for Content streams 2020-02-11 17:26:47 +01:00
aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell 2020-02-11 08:29:08 +01:00
e77bbbcda9 WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing 2020-02-10 17:43:04 +01:00
195446e653 Allow resources with no /Font field, they won't cause any problem as long as no call to Tf (to load a font) is made 2020-02-10 17:41:44 +01:00
9f1b1afafe Implement Text rendering from parsed Content 2020-02-10 10:54:44 +01:00
20466c4f13 WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation 2020-02-09 22:42:57 +01:00
325250383a Add support for fonts and implement MacRomanEncoding 2020-02-08 08:15:32 +01:00
c48ab22808 Forgot some useless parentheses when playing with operator precedences 2020-02-04 17:05:15 +01:00
a2b66ac6d6 Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property 2020-02-04 17:04:42 +01:00
cefb08ee50 Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map 2019-11-30 21:46:22 +01:00
afbbcbffc5 Finish implementing the new stack-based call parser 2019-11-30 12:39:40 +01:00
bac08446dd WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions 2019-11-29 17:42:57 +01:00
f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
08a9717b3a Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead) 2019-11-29 11:48:28 +01:00
42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword 2019-11-27 18:01:19 +01:00
c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue 2019-10-14 10:17:15 +02:00
a96e36ec5a Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges 2019-10-14 10:17:15 +02:00
d07c286f8e Clean exported ByteString custom functions 2019-10-14 10:17:15 +02:00
7a15113285 Try and re-implement string decoding — compiles but now fails to decode any string 2019-10-14 10:17:15 +02:00
36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
b8ca7281aa Fix parsing errors forgetting to make sure there's a space after special operator arguments like names and stringObjects 2019-10-14 10:17:15 +02:00
32efdcdd6b Try and fix stuff by generalizing a signature to ease debugging and add parenthesis which I think should have been here all along 2019-10-14 10:17:15 +02:00
3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
0374b72920 Finish implementing reading, still bugs to investigate 2019-10-14 10:17:15 +02:00
1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings 2019-10-14 10:17:15 +02:00
98d029c4d4 In complete debug, more or less implemented CMap parsing but apparently it uses UTF16 ?! 2019-10-14 10:17:15 +02:00
c349d9b4c2 Don't trust serializer, they have nothing todo with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
f9e5683bf4 WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile) 2019-10-14 10:17:15 +02:00
b8eb9e6856 Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec 2019-10-14 10:17:15 +02:00
66d315b7fe Reflect the distinction between eval and run from State monad into the Parser module 2019-10-14 10:17:15 +02:00
51db57ec67 Ugly commit, breaks everything, still trying to figure a grammar for text 2019-10-14 10:17:15 +02:00
6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions 2019-09-22 01:40:39 +02:00
3a39c75e6a Stop requiring an empty line between subsections in a xref section 2019-09-22 01:37:28 +02:00
29c5823f34 Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file 2019-09-22 01:34:17 +02:00
699f830a45 Simplify XRef structure, clarify integer types and remove nextLine 2019-09-20 22:39:14 +02:00
264b0dc92b Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found 2019-05-31 15:08:54 +02:00
9dac275f68 Keep comment-opening '%' along with the comment and support empty lines 2019-05-31 15:07:41 +02:00
85e4eb9273 Fix bypassed error message for lines + add one for occurrences 2019-05-31 15:06:20 +02:00
11cb6504d7 Go strict ByteStrings with attoparsec 2019-05-24 10:48:09 +02:00
0daa03d958 Remove commented out dead code 2019-05-21 09:07:37 +02:00
b60f337cc4 First useable version 2019-05-18 11:09:03 +02:00
5614a25048 Generate valid PDF 2019-05-18 09:01:13 +02:00
0336baa687 Fix output implementation with dynamic XRefs 2019-05-17 16:14:06 +02:00
e23618da68 Implement output 2019-05-16 22:41:14 +02:00
088637b2c0 Compat stuff for Monoid / Semigroup 2019-05-16 21:40:19 +02:00
645466024a Starting to implement output with String builder 2019-05-16 17:04:45 +02:00
9b2f890227 Boyer-Moore is canceled, implement the rest of parsing with naive search 2019-05-16 11:01:50 +02:00
fc41f815a3 Broken state : trying to implement Boyer-Moore for fast-forwarding to the end of a section 2019-05-15 19:13:35 +02:00
379a821550 Fix bugs preventing the objects from loading 2019-05-15 15:03:55 +02:00
44508a204c Reuse Parser type in PDF.Body (and generalize the type of the comment parser) 2019-05-15 09:04:17 +02:00
91292d6401 Implement retrieving objects in the body of the document and use it to populate the structure previously parsed 2019-05-14 18:42:11 +02:00
8043f84da8 Cut PDF module in two, implement basic parsing up to reading XRef table 2019-05-13 18:22:05 +02:00
6eacb55fc4 Fix bug preventing startXref to be found for files with a single byte EOL encoding 2019-05-13 11:34:15 +02:00
c036334b6f Prototype successfully parsing (only last) startxref 2019-05-13 08:05:28 +02:00