Hufflepdf

Author	SHA1	Message	Date
Tissevert	d9f69014a0	Make a couple improvements in performance + add an example script to extract pages from a PDF	2020-05-28 18:54:15 +02:00
Tissevert	729e312f90	Actually, the spec calls 'catalog' what we call 'origin' — use 'catalog' for more clarity in regard to the spec	2020-03-19 10:27:29 +01:00
Tissevert	11640c8465	Replace 'cacheFonts' by more versatile 'withFonts' inspired by 'withResources' that avoids having to declare an inline function to capture the 'layer' argument and pass it twice	2020-03-19 10:27:29 +01:00
Tissevert	ba7dd6a690	Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath	2020-03-19 10:27:29 +01:00
Tissevert	5027b079eb	Include page numbers in chunks label, needed for long documents with many pages	2020-03-19 10:27:28 +01:00
Tissevert	5722dd1a04	Use IntMap for all Maps on Ids	2020-03-19 10:27:28 +01:00
Tissevert	f31e9eb38b	Generalize Ids out of Content to handle Object Ids too	2020-03-19 10:27:21 +01:00
Tissevert	0f857c457d	Use a defined monadic stack in Pages to lift the MonadReader ambiguity and allow finishing to reimplement getText demo	2020-03-14 16:57:16 +01:00
Tissevert	40475a3093	Clean unneeded stuff separating the monadic type constraint from the actual monad stack used, one more step towards MonadFail -> MonadError	2020-03-14 16:55:34 +01:00
Tissevert	5b8d951516	WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages	2020-03-11 18:55:18 +01:00
Tissevert	3b1a5152e4	Try connecting all the Box instance in the getText demo, try to encode pages contents with a simple assoc list	2020-03-10 22:57:11 +01:00
Tissevert	2b9abc24b6	Add a separate instance for Raw streams that doesn't try to decode them	2020-03-04 18:31:30 +01:00
Tissevert	309f6ed461	Actually re-implement getText with the simpler Box instance	2020-03-04 18:19:10 +01:00
Tissevert	cb257fc07e	Rename function for clarity : actually it's doing just what w StreamContent does, but without checking the headers to re-zlib-encode the stream content	2020-02-27 17:30:42 +01:00
Tissevert	99014ff30d	Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it	2020-02-26 22:13:29 +01:00
Tissevert	bcf2e05bfb	Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already)	2020-02-17 15:29:59 +01:00
Tissevert	6096a1a237	Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary	2020-02-15 13:51:24 +01:00
Tissevert	23186100a8	Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access	2020-02-15 10:25:09 +01:00
Tissevert	ae938acc02	Merge branch 'main' into extract-text	2020-02-12 17:34:56 +01:00
Tissevert	325250383a	Add support for fonts and implement MacRomanEncoding	2020-02-08 08:15:32 +01:00
Tissevert	8373bd1ea0	Removing +x permission on getText source that shouldn't ever have been set	2019-11-29 19:07:54 +01:00
Tissevert	7eca875900	Improve getObj example to catch non-existing ObjectId and default to listing existing ObjectIds when none is provided	2019-11-29 11:53:08 +01:00
Tissevert	f9f799c59b	Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1)	2019-11-29 11:51:35 +01:00
Tissevert	c9f050e64b	Remove deprecated debug script and forgotten comments to bypass the selective export of Text module	2019-10-14 10:17:15 +02:00
Tissevert	3a3e1533b4	Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue	2019-10-14 10:17:15 +02:00
Tissevert	36d7f9b819	Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary	2019-10-14 10:17:15 +02:00
Tissevert	3b59fd0c61	Separate CMap and Text in two distinct modules	2019-10-14 10:17:15 +02:00
Tissevert	0374b72920	Finish implementing reading, still bugs to investigate	2019-10-14 10:17:15 +02:00
Tissevert	e7484ef536	Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps	2019-10-14 10:17:15 +02:00
Tissevert	f9e5683bf4	WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile)	2019-10-14 10:17:15 +02:00
Tissevert	6f3c159ea7	Adding a module to implement text reading and a demo program to go with it	2019-10-14 10:17:15 +02:00
Tissevert	68f90d20e2	Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions	2019-09-22 01:40:39 +02:00
Tissevert	9ab010de61	Add two example programs to show how the lib can be used	2019-09-20 22:42:17 +02:00

33 commits