Hufflepdf

Author	SHA1	Message	Date
Tissevert	a72d76e229	Add unit tests to make sure I'm not breaking things too much	2020-02-14 17:58:03 +01:00
Tissevert	919f640443	Merge branch 'extract-text' into navigation	2020-02-12 17:35:56 +01:00
Tissevert	32f9866106	Use peek to improve directObject parser avoiding a large <|> disjunction	2020-02-12 17:34:27 +01:00
Tissevert	704d7a7fcf	It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…)	2020-02-11 17:36:29 +01:00
Tissevert	aed7af376a	WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell	2020-02-11 08:29:08 +01:00
Tissevert	e77bbbcda9	WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing	2020-02-10 17:43:04 +01:00
Tissevert	42a02808c1	Merge branch 'main' into extract-text	2019-11-27 18:05:47 +01:00
Tissevert	380c1e439b	Fix a bug preventing Hufflepdf from reading objects with a ' ' after the `obj` keyword	2019-11-27 18:01:19 +01:00
Tissevert	3a3e1533b4	Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue	2019-10-14 10:17:15 +02:00
Tissevert	d07c286f8e	Clean exported ByteString custom functions	2019-10-14 10:17:15 +02:00
Tissevert	36d7f9b819	Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary	2019-10-14 10:17:15 +02:00
Tissevert	3b59fd0c61	Separate CMap and Text in two distinct modules	2019-10-14 10:17:15 +02:00
Tissevert	c349d9b4c2	Don't trust serializers, they have nothing to do with a reasonable binary encoding	2019-10-14 10:17:15 +02:00
Tissevert	e7484ef536	Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps	2019-10-14 10:17:15 +02:00
Tissevert	b8eb9e6856	Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec	2019-10-14 10:17:15 +02:00
Tissevert	51db57ec67	Ugly commit, breaks everything, still trying to figure out a grammar for text	2019-10-14 10:17:15 +02:00
Tissevert	6f3c159ea7	Adding a module to implement text reading and a demo program to go with it	2019-10-14 10:17:15 +02:00
Tissevert	3a39c75e6a	Stop requiring an empty line between subsections in an xref section	2019-09-22 01:37:28 +02:00
Tissevert	29c5823f34	Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file	2019-09-22 01:34:17 +02:00
Tissevert	699f830a45	Simplify XRef structure, clarify integer types and remove nextLine	2019-09-20 22:39:14 +02:00
Tissevert	264b0dc92b	Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found	2019-05-31 15:08:54 +02:00
Tissevert	9dac275f68	Keep comment-opening '%' along with the comment and support empty lines	2019-05-31 15:07:41 +02:00
Tissevert	85e4eb9273	Fix bypassed error message for lines + add one for occurrences	2019-05-31 15:06:20 +02:00
Tissevert	11cb6504d7	Go strict ByteStrings with attoparsec	2019-05-24 10:48:09 +02:00
Tissevert	5614a25048	Generate valid PDF	2019-05-18 09:01:13 +02:00
Tissevert	0336baa687	Fix output implementation with dynamic XRefs	2019-05-17 16:14:06 +02:00
Tissevert	e23618da68	Implement output	2019-05-16 22:41:14 +02:00
Tissevert	645466024a	Starting to implement output with String builder	2019-05-16 17:04:45 +02:00
Tissevert	9b2f890227	Boyer-Moore is canceled, implement the rest of parsing with naive search	2019-05-16 11:01:50 +02:00
Tissevert	fc41f815a3	Broken state: trying to implement Boyer-Moore for fast-forwarding to the end of a section	2019-05-15 19:13:35 +02:00
Tissevert	379a821550	Fix bugs preventing the objects from loading	2019-05-15 15:03:55 +02:00
Tissevert	44508a204c	Reuse Parser type in PDF.Body (and generalize the type of the comment parser)	2019-05-15 09:04:17 +02:00
Tissevert	91292d6401	Implement retrieving objects in the body of the document and use it to populate the structure previously parsed	2019-05-14 18:42:11 +02:00
Tissevert	8043f84da8	Cut PDF module in two, implement basic parsing up to reading XRef table	2019-05-13 18:22:05 +02:00

34 commits