Commit Graph

78 Commits

SHA1 Message Date
6096a1a237 Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary 2020-02-15 13:51:24 +01:00
23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access 2020-02-15 10:25:09 +01:00
b916ab5206 Just noticed Streams are a kind of Dictionary too, since they have a header 2020-02-15 10:23:32 +01:00
4a6dbda7d3 Move Error type from Pages to Navigation as a candidate for MonadFail required by PDFContent defined there 2020-02-15 10:22:42 +01:00
923d1800b0 Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read 2020-02-14 18:02:40 +01:00
1c457d71d8 Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec 2020-02-14 18:00:12 +01:00
a72d76e229 Add unit tests to make sure I'm not breaking things too much 2020-02-14 17:58:03 +01:00
919f640443 Merge branch 'extract-text' into navigation 2020-02-12 17:35:56 +01:00
ae938acc02 Merge branch 'main' into extract-text 2020-02-12 17:34:56 +01:00
32f9866106 Use peek to improve directObject parser avoiding a large <|> disjunction 2020-02-12 17:34:27 +01:00
eb4d76002c Finish the split of Navigation out of Page, generalize the use of MonadFail with a custom Error monad (~= Either String) 2020-02-11 22:41:46 +01:00
af994cb50c WIP: in the process of migrating to Object.Navigation in Pages, still unsure how to manage simple Content parsing and efficient font loading (+ giving a way to edit Contents) 2020-02-11 17:59:15 +01:00
704d7a7fcf It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…) 2020-02-11 17:36:29 +01:00
11647eb4eb Implement output for Content streams 2020-02-11 17:26:47 +01:00
aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell 2020-02-11 08:29:08 +01:00
e77bbbcda9 WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing 2020-02-10 17:43:04 +01:00
195446e653 Allow resources with no /Font field, they won't cause any problem as long as no call to Tf (to load a font) is made 2020-02-10 17:41:44 +01:00
9f1b1afafe Implement Text rendering from parsed Content 2020-02-10 10:54:44 +01:00
20466c4f13 WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation 2020-02-09 22:42:57 +01:00
325250383a Add support for fonts and implement MacRomanEncoding 2020-02-08 08:15:32 +01:00
c48ab22808 Forgot some useless parentheses when playing with operator precedences 2020-02-04 17:05:15 +01:00
a2b66ac6d6 Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property 2020-02-04 17:04:42 +01:00
cefb08ee50 Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map 2019-11-30 21:46:22 +01:00
afbbcbffc5 Finish implementing the new stack-based call parser 2019-11-30 12:39:40 +01:00
8373bd1ea0 Removing +x permission on getText source that shouldn't ever have been set 2019-11-29 19:07:54 +01:00
bac08446dd WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions 2019-11-29 17:42:57 +01:00
7eca875900 Improve getObj example to catch non-existing ObjectId and default to listing existing ObjectIds when none is provided 2019-11-29 11:53:08 +01:00
f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
08a9717b3a Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead) 2019-11-29 11:48:28 +01:00
42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword 2019-11-27 18:01:19 +01:00
c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue 2019-10-14 10:17:15 +02:00
a96e36ec5a Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges 2019-10-14 10:17:15 +02:00
d07c286f8e Clean exported ByteString custom functions 2019-10-14 10:17:15 +02:00
7a15113285 Try and re-implement string decoding — compiles but now fails to decode any string 2019-10-14 10:17:15 +02:00
36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
b8ca7281aa Fix parsing errors caused by forgetting to make sure there's a space after special operator arguments like names and stringObjects 2019-10-14 10:17:15 +02:00
32efdcdd6b Try and fix stuff by generalizing a signature to ease debugging and add parentheses which I think should have been here all along 2019-10-14 10:17:15 +02:00
3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
0374b72920 Finish implementing reading, still bugs to investigate 2019-10-14 10:17:15 +02:00
1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings 2019-10-14 10:17:15 +02:00
98d029c4d4 In complete debug, more or less implemented CMap parsing but apparently it uses UTF16 ?! 2019-10-14 10:17:15 +02:00
c349d9b4c2 Don't trust serializers, they have nothing to do with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
f9e5683bf4 WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile) 2019-10-14 10:17:15 +02:00
b8eb9e6856 Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec 2019-10-14 10:17:15 +02:00
66d315b7fe Reflect the distinction between eval and run from State monad into the Parser module 2019-10-14 10:17:15 +02:00
51db57ec67 Ugly commit, breaks everything, still trying to figure a grammar for text 2019-10-14 10:17:15 +02:00
6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00