Hufflepdf

Author	SHA1	Message	Date
Tissevert	d90eaf6f1c	Add Box instances to allow handling some exceptions in monad and converting them to Traversable accessible from the data part of the type	2020-02-27 17:22:12 +01:00
Tissevert	99014ff30d	Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it	2020-02-26 22:13:29 +01:00
Tissevert	f4df4aab22	Found a nicer formulation that doesn't require transitivity or index agglomeration and swapped argument of w for more reusability with at / atAll	2020-02-26 17:19:22 +01:00
Tissevert	bdbc5f7351	Generalize the Box instances on containers from the particular cases of Document/Layers and Layers/Objects and move them to PDF.Box	2020-02-25 17:36:54 +01:00
Tissevert	30fece6537	Notice the 'edit' I exported earlier could be reused to simplify the w implementation of the proof that Box is a transitive relation	2020-02-25 09:27:56 +01:00
Tissevert	1a70f2972b	Expose Box index flags in PDF and PDF.Layer	2020-02-24 21:37:09 +01:00
Tissevert	67faa06ea2	Lift unused restriction on MonadFail for AllObjects instance of Box Layer	2020-02-24 21:36:31 +01:00
Tissevert	83a63d4b02	Implement Box instance from Layer to Object, either all at once or indexed by an ObjectId	2020-02-24 17:29:22 +01:00
Tissevert	85ee8519c4	Implement Box instances from Document to Layers and EOLStyle	2020-02-24 17:28:17 +01:00
Tissevert	e607f9cd37	Implement transitivity instance, extract a part of modifyAt as a convenient 'edit' function useful elsewhere and present a right-infix version of (,) to allow writing the nested tuple indexes more conveniently	2020-02-24 17:27:37 +01:00
Tissevert	a9252b129a	Start a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down there	2020-02-23 22:24:59 +01:00
Tissevert	71e62ee732	Add IDs to Instructions so that they can be selected in a given Content (and modified one day…)	2020-02-23 22:21:09 +01:00
Tissevert	160999a7d7	A small renaming for more clarity and because I thought «update» could be needed for a function name but after all maybe not but it's still better that way	2020-02-23 22:17:09 +01:00
Tissevert	36b1782464	Follow previous renaming for a local variable in Navigation for more clarity	2020-02-23 22:15:52 +01:00
Tissevert	bcf2e05bfb	Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already)	2020-02-17 15:29:59 +01:00
Tissevert	6096a1a237	Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary	2020-02-15 13:51:24 +01:00
Tissevert	23186100a8	Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access	2020-02-15 10:25:09 +01:00
Tissevert	b916ab5206	Just noticed Streams are a kind of Dictionary too, since they have a header	2020-02-15 10:23:32 +01:00
Tissevert	4a6dbda7d3	Move Error type from Pages to Navigation as a candidate for MonadFail required by PDFContent defined there	2020-02-15 10:22:42 +01:00
Tissevert	923d1800b0	Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read	2020-02-14 18:02:40 +01:00
Tissevert	1c457d71d8	Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec	2020-02-14 18:00:12 +01:00
Tissevert	a72d76e229	Add unit tests to make sure I'm not breaking things too much	2020-02-14 17:58:03 +01:00
Tissevert	919f640443	Merge branch 'extract-text' into navigation	2020-02-12 17:35:56 +01:00
Tissevert	32f9866106	Use peek to improve directObject parser avoiding a large <|> disjunction	2020-02-12 17:34:27 +01:00
Tissevert	eb4d76002c	Finish the split of Navigation out of Page, generalize the use of MonadFail with a custom Error monad (~= Either String)	2020-02-11 22:41:46 +01:00
Tissevert	af994cb50c	WIP: in the process of migrating to Object.Navigation in Pages, still unsure how to manage simple Content parsing and efficient font loading (+ giving a way to edit Contents)	2020-02-11 17:59:15 +01:00
Tissevert	704d7a7fcf	It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…)	2020-02-11 17:36:29 +01:00
Tissevert	11647eb4eb	Implement output for Content streams	2020-02-11 17:26:47 +01:00
Tissevert	aed7af376a	WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell	2020-02-11 08:29:08 +01:00
Tissevert	e77bbbcda9	WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing	2020-02-10 17:43:04 +01:00
Tissevert	195446e653	Allow resources with no /Font field, they won't cause any problem as long as no call to Tf (to load a font) is made	2020-02-10 17:41:44 +01:00
Tissevert	9f1b1afafe	Implement Text rendering from parsed Content	2020-02-10 10:54:44 +01:00
Tissevert	20466c4f13	WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation	2020-02-09 22:42:57 +01:00
Tissevert	325250383a	Add support for fonts and implement MacRomanEncoding	2020-02-08 08:15:32 +01:00
Tissevert	c48ab22808	Forgot some useless parentheses when playing with operator precedences	2020-02-04 17:05:15 +01:00
Tissevert	a2b66ac6d6	Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property	2020-02-04 17:04:42 +01:00
Tissevert	cefb08ee50	Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map	2019-11-30 21:46:22 +01:00
Tissevert	afbbcbffc5	Finish implementing the new stack-based call parser	2019-11-30 12:39:40 +01:00
Tissevert	bac08446dd	WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions	2019-11-29 17:42:57 +01:00
Tissevert	f9f799c59b	Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1)	2019-11-29 11:51:35 +01:00
Tissevert	08a9717b3a	Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead)	2019-11-29 11:48:28 +01:00
Tissevert	42a02808c1	Merge branch 'main' into extract-text	2019-11-27 18:05:47 +01:00
Tissevert	380c1e439b	Fix a bug preventing Hufflepdf from reading objects with a ' ' after the `obj` keyword	2019-11-27 18:01:19 +01:00
Tissevert	c9f050e64b	Remove deprecated debug script and forgotten comments to bypass the selective export of Text module	2019-10-14 10:17:15 +02:00
Tissevert	3a3e1533b4	Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue	2019-10-14 10:17:15 +02:00
Tissevert	a96e36ec5a	Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges	2019-10-14 10:17:15 +02:00
Tissevert	d07c286f8e	Clean exported ByteString custom functions	2019-10-14 10:17:15 +02:00
Tissevert	7a15113285	Try and re-implement string decoding — compiles but now fails to decode any string	2019-10-14 10:17:15 +02:00
Tissevert	36d7f9b819	Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary	2019-10-14 10:17:15 +02:00
Tissevert	b8ca7281aa	Fix parsing errors forgetting to make sure there's a space after special operator arguments like names and stringObjects	2019-10-14 10:17:15 +02:00
Tissevert	32efdcdd6b	Try and fix stuff by generalizing a signature to ease debugging and add parentheses which I think should have been here all along	2019-10-14 10:17:15 +02:00
Tissevert	3b59fd0c61	Separate CMap and Text in two distinct modules	2019-10-14 10:17:15 +02:00
Tissevert	0374b72920	Finish implementing reading, still bugs to investigate	2019-10-14 10:17:15 +02:00
Tissevert	1dd22c3889	Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings	2019-10-14 10:17:15 +02:00
Tissevert	98d029c4d4	In complete debug, more or less implemented CMap parsing but apparently it uses UTF16 ?!	2019-10-14 10:17:15 +02:00
Tissevert	c349d9b4c2	Don't trust serializers, they have nothing to do with a reasonable binary encoding	2019-10-14 10:17:15 +02:00
Tissevert	e7484ef536	Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps	2019-10-14 10:17:15 +02:00
Tissevert	f9e5683bf4	WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile)	2019-10-14 10:17:15 +02:00
Tissevert	b8eb9e6856	Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec	2019-10-14 10:17:15 +02:00
Tissevert	66d315b7fe	Reflect the distinction between eval and run from State monad into the Parser module	2019-10-14 10:17:15 +02:00
Tissevert	51db57ec67	Ugly commit, breaks everything, still trying to figure out a grammar for text	2019-10-14 10:17:15 +02:00
Tissevert	6f3c159ea7	Adding a module to implement text reading and a demo program to go with it	2019-10-14 10:17:15 +02:00
Tissevert	68f90d20e2	Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions	2019-09-22 01:40:39 +02:00
Tissevert	3a39c75e6a	Stop requiring an empty line between subsections in a xref section	2019-09-22 01:37:28 +02:00
Tissevert	29c5823f34	Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file	2019-09-22 01:34:17 +02:00
Tissevert	699f830a45	Simplify XRef structure, clarify integer types and remove nextLine	2019-09-20 22:39:14 +02:00
Tissevert	264b0dc92b	Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found	2019-05-31 15:08:54 +02:00
Tissevert	9dac275f68	Keep comment-opening '%' along with the comment and support empty lines	2019-05-31 15:07:41 +02:00
Tissevert	85e4eb9273	Fix bypassed error message for lines + add one for occurrences	2019-05-31 15:06:20 +02:00
Tissevert	11cb6504d7	Go strict ByteStrings with attoparsec	2019-05-24 10:48:09 +02:00
Tissevert	0daa03d958	Remove commented out dead code	2019-05-21 09:07:37 +02:00
Tissevert	b60f337cc4	First useable version	2019-05-18 11:09:03 +02:00
Tissevert	5614a25048	Generate valid PDF	2019-05-18 09:01:13 +02:00
Tissevert	0336baa687	Fix output implementation with dynamic XRefs	2019-05-17 16:14:06 +02:00
Tissevert	e23618da68	Implement output	2019-05-16 22:41:14 +02:00
Tissevert	088637b2c0	Compat stuff for Monoid / Semigroup	2019-05-16 21:40:19 +02:00
Tissevert	645466024a	Starting to implement output with String builder	2019-05-16 17:04:45 +02:00
Tissevert	9b2f890227	Boyer-Moore is canceled, implement the rest of parsing with naive search	2019-05-16 11:01:50 +02:00
Tissevert	fc41f815a3	Broken state : trying to implement Boyer-Moore for fast-forwarding to the end of a section	2019-05-15 19:13:35 +02:00
Tissevert	379a821550	Fix bugs preventing the objects from loading	2019-05-15 15:03:55 +02:00
Tissevert	44508a204c	Reuse Parser type in PDF.Body (and generalize the type of the comment parser)	2019-05-15 09:04:17 +02:00
Tissevert	91292d6401	Implement retrieving objects in the body of the document and use it to populate the structure previously parsed	2019-05-14 18:42:11 +02:00
Tissevert	8043f84da8	Cut PDF module in two, implement basic parsing up to reading XRef table	2019-05-13 18:22:05 +02:00
Tissevert	6eacb55fc4	Fix bug preventing startXref from being found for files with a single byte EOL encoding	2019-05-13 11:34:15 +02:00
Tissevert	c036334b6f	Prototype successfully parsing (only last) startxref	2019-05-13 08:05:28 +02:00

135 commits