|
d9f69014a0
|
Make a couple improvements in performance + add an example script to extract pages from a PDF
|
2020-05-28 18:54:15 +02:00 |
|
|
729e312f90
|
Actually, the spec calls 'catalog' what we call 'origin' — use 'catalog' for more clarity in regard to the spec
|
2020-03-19 10:27:29 +01:00 |
|
|
11640c8465
|
Replace 'cacheFonts' with the more versatile 'withFonts', inspired by 'withResources', which avoids having to declare an inline function to capture the 'layer' argument and pass it twice
|
2020-03-19 10:27:29 +01:00 |
|
|
ba7dd6a690
|
Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath
|
2020-03-19 10:27:29 +01:00 |
|
|
5027b079eb
|
Include page numbers in chunks label, needed for long documents with many pages
|
2020-03-19 10:27:28 +01:00 |
|
|
5722dd1a04
|
Use IntMap for all Maps on Ids
|
2020-03-19 10:27:28 +01:00 |
|
|
f31e9eb38b
|
Generalize Ids out of Content to handle Object Ids too
|
2020-03-19 10:27:21 +01:00 |
|
|
0f857c457d
|
Use a defined monadic stack in Pages to lift the MonadReader ambiguity and allow finishing to reimplement getText demo
|
2020-03-14 16:57:16 +01:00 |
|
|
40475a3093
|
Clean unneeded stuff separating the monadic type constraint from the actual monad stack used, one more step towards MonadFail -> MonadError
|
2020-03-14 16:55:34 +01:00 |
|
|
5b8d951516
|
WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages
|
2020-03-11 18:55:18 +01:00 |
|
|
3b1a5152e4
|
Try connecting all the Box instances in the getText demo, try to encode pages contents with a simple assoc list
|
2020-03-10 22:57:11 +01:00 |
|
|
2b9abc24b6
|
Add a separate instance for Raw streams that doesn't try to decode them
|
2020-03-04 18:31:30 +01:00 |
|
|
309f6ed461
|
Actually re-implement getText with the simpler Box instance
|
2020-03-04 18:19:10 +01:00 |
|
|
cb257fc07e
|
Rename function for clarity: actually it's doing just what w StreamContent does, but without checking the headers to re-zlib-encode the stream content
|
2020-02-27 17:30:42 +01:00 |
|
|
99014ff30d
|
Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it
|
2020-02-26 22:13:29 +01:00 |
|
|
bcf2e05bfb
|
Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already)
|
2020-02-17 15:29:59 +01:00 |
|
|
6096a1a237
|
Simplify navigation by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary
|
2020-02-15 13:51:24 +01:00 |
|
|
23186100a8
|
Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access
|
2020-02-15 10:25:09 +01:00 |
|
|
ae938acc02
|
Merge branch 'main' into extract-text
|
2020-02-12 17:34:56 +01:00 |
|
|
325250383a
|
Add support for fonts and implement MacRomanEncoding
|
2020-02-08 08:15:32 +01:00 |
|
|
8373bd1ea0
|
Removing +x permission on getText source that shouldn't ever have been set
|
2019-11-29 19:07:54 +01:00 |
|
|
7eca875900
|
Improve getObj example to catch non-existent ObjectId and default to listing existing ObjectIds when none is provided
|
2019-11-29 11:53:08 +01:00 |
|
|
f9f799c59b
|
Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1)
|
2019-11-29 11:51:35 +01:00 |
|
|
c9f050e64b
|
Remove deprecated debug script and forgotten comments to bypass the selective export of Text module
|
2019-10-14 10:17:15 +02:00 |
|
|
3a3e1533b4
|
Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue
|
2019-10-14 10:17:15 +02:00 |
|
|
36d7f9b819
|
Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary
|
2019-10-14 10:17:15 +02:00 |
|
|
3b59fd0c61
|
Separate CMap and Text in two distinct modules
|
2019-10-14 10:17:15 +02:00 |
|
|
0374b72920
|
Finish implementing reading, still bugs to investigate
|
2019-10-14 10:17:15 +02:00 |
|
|
e7484ef536
|
Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps
|
2019-10-14 10:17:15 +02:00 |
|
|
f9e5683bf4
|
WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile)
|
2019-10-14 10:17:15 +02:00 |
|
|
6f3c159ea7
|
Adding a module to implement text reading and a demo program to go with it
|
2019-10-14 10:17:15 +02:00 |
|
|
68f90d20e2
|
Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions
|
2019-09-22 01:40:39 +02:00 |
|
|
9ab010de61
|
Add two example programs to show how the lib can be used
|
2019-09-20 22:42:17 +02:00 |
|