6a7e9e9595Expose a toList function similar to Data.Map in module Id for IdMap and allow mapping over keys and values in a monadic construct like mapM
extract-text
Tissevert2020-06-03 15:15:07 +0200
d9f69014a0Make a couple improvements in performance + add an example script to extract pages from a PDFTissevert2020-05-28 18:54:15 +0200
f6664683c7Once again something that should never have been committedTissevert2020-03-20 09:34:53 +0100
c491e8a70cForgot to remove deprecated source fileTissevert2020-03-19 12:53:35 +0100
09bd706748Export Content operators, needed to write filters like revealTissevert2020-03-19 09:05:06 +0100
729e312f90Actually, the spec calls 'catalog' what we call 'origin' — use 'catalog' for more clarity in regard to the specTissevert2020-03-18 15:11:28 +0100
6d265633e4Export Instructions constructor from PDF.Content, used by revealTissevert2020-03-18 15:08:26 +0100
1eb1c23053Found a nicer way to handle the too long IndirectObjCoordinates for Object NavigationTissevert2020-03-18 15:07:50 +0100
44125f75a6The orphan instance for MonadState s m => MonadReader s m really can't be used, so replace it with a mere function that runs an operation on a ReaderT into the monad State, allowing to borrow operations on MonadReader in a MonadState contextTissevert2020-03-18 15:06:20 +0100
c8a5e2b191Wait, CachedFonts are indexed by Id Object so it could be an IdMap actuallyTissevert2020-03-17 16:39:06 +0100
11640c8465Replace 'cacheFonts' with the more versatile 'withFonts', inspired by 'withResources', which avoids having to declare an inline function to capture the 'layer' argument and pass it twiceTissevert2020-03-17 16:29:46 +0100
e94a09b3ecAdd a Traversable instance for IdMap, needed in reveal and useful in general to be able to use atAllTissevert2020-03-17 15:25:12 +0100
ba7dd6a690Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneathTissevert2020-03-17 10:33:47 +0100
d21e14f9a4Hey, zlib isn't needed anymore for getText since all decoding is done directly in the Box instance for StreamsTissevert2020-03-17 08:59:07 +0100
a1c2fbf110Add an alias to Id to lift type ambiguities like 'chunk' in PDF.Content.TextTissevert2020-03-17 08:47:19 +0100
24630a04a1Implement 'w' for Pages Box instancesTissevert2020-03-17 08:46:37 +0100
ee5e7500a8Implement 'w' for Box m Chunks Content (Indexed Text)Tissevert2020-03-17 08:45:18 +0100
d8aec5bf80Add Box instance for IdMap a b, remove restriction on new keys in the Map instance since it's not really needed and could be better implemented like in OrderedMap by first using 'r'Tissevert2020-03-17 08:43:54 +0100
25e2823c75Generalize register to all IdMap a b, since it's gonna be needed by Indexed Text tooTissevert2020-03-17 08:39:29 +0100
5027b079ebInclude page numbers in chunks label, needed for long documents with many pagesTissevert2020-03-17 08:36:02 +0100
f31e9eb38bGeneralize Ids out of Content to handle Object Ids tooTissevert2020-03-14 22:30:28 +0100
0f857c457dUse a defined monadic stack in Pages to lift the MonadReader ambiguity and allow finishing to reimplement getText demoTissevert2020-03-14 16:57:16 +0100
40475a3093Clean unneeded stuff separating the monadic type constraint from the actual monad stack used, one more step towards MonadFail -> MonadErrorTissevert2020-03-14 16:55:05 +0100
a9d3e5d326Clean unused dependencies from Map + use a more defined Monad for the Box Chunks instance, hoping we will be able to clear the whole stack someday and stop requiring that RoContext type, unboxing and reboxing the FontSet for no goodTissevert2020-03-14 16:27:56 +0100
f2a99e1fd2Reorder module PDF.Body in alphabetical orderTissevert2020-03-14 16:25:26 +0100
5bf2b08fa9Try replacing general monadic type constraint by a definite monad stackTissevert2020-03-11 22:35:19 +0100
5b8d951516WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messagesTissevert2020-03-11 18:55:18 +0100
d3f1b97f3aReplace the fake instance of Box for Content over Indexed Text with the true one using renderTextTissevert2020-03-11 18:53:41 +0100
10f8c711daImplement set and mapi on OrderedMap for convenience and to write a Box instance over OrderedMap like the one over MapTissevert2020-03-11 18:51:49 +0100
b6c1f670efGeneralize the search for FlateDecode (there can be several filters in an array)Tissevert2020-03-11 10:47:52 +0100
3b1a5152e4Try connecting all the Box instance in the getText demo, try to encode pages contents with a simple assoc listTissevert2020-03-10 22:57:11 +0100
a04adff1d2Prepare real instance of Box using renderTextTissevert2020-03-10 22:55:16 +0100
103037ffb2Fix mistake in arity of operator "Tissevert2020-03-10 22:53:27 +0100
dce10ae63aKeep Page as only a reference object keeping the ObjectId explicit so we can modify the actual objects one day, write an OrderedMap data structure to helpTissevert2020-03-08 22:18:47 +0100
f2986da96dSimplify Content abstracting over MonadParser for no reason and provide instead a parse that's in MonadFail to avoid having to handle Either outsideTissevert2020-03-08 22:16:23 +0100
0ade9cc2f5Implement proper text formatting into PDF instructions using the new encode feature available in FontsTissevert2020-03-08 00:04:18 +0100
457f1755e6Prepare storing the reverse mapping for CMaps, divided by length to be able to implement encoding with a reasonable complexityTissevert2020-03-08 00:02:24 +0100
ca40d2df76Don't use (!?) operator that doesn't exist before containers 0.5.9 for maximum compatibilityTissevert2020-03-08 00:00:24 +0100
44bc898ed3Generalize the Indexed type to handle both arbitrary Content instructions and text-related ones that can be viewed as text chunksTissevert2020-03-06 19:21:16 +0100
1ec47c5d07Update Font type to cover both encoding and decoding — WIP for CMap, but complete though not tested yet for MacRoman encodingTissevert2020-03-06 19:19:53 +0100
6e245189fdAdd a simple Box instance that exposes IndexedInstructions within a ContentTissevert2020-03-05 17:44:38 +0100
90348c57d6Disable text rendering and font loading from the Page abstraction, this code will have to be moved into a separate Box instanceTissevert2020-03-05 17:40:58 +0100
50ac0692b2Implement r for access by PageNumber and clean the mess a bitTissevert2020-03-05 10:09:09 +0100
2b9abc24b6Add a separate instance for Raw streams that don't try to decode themTissevert2020-03-04 18:31:30 +0100
309f6ed461Actually re-implement getText with the simpler Box instanceTissevert2020-03-04 18:19:10 +0100
93c9863426Remove accidentally committed trailing space on a lineTissevert2020-03-04 18:14:54 +0100
7cef65d799Fixed vicious bug introduced by 6096a1a237 (since follow is now automatic for references, it's not called explicitly but should be in the case of 'several' Content, which is an array of references, each of which should be expanded). TODO: add a unit test for thatTissevert2020-03-04 18:14:33 +0100
d288ecf0acStart reimplementing getAll as a Box instance and try to separate the various monad run stepsTissevert2020-03-03 18:17:44 +0100
3b3eeef218Maybe we need a MonadState s m => MonadReader s m instance some day ?Tissevert2020-03-03 18:16:49 +0100
2c02e44adfExport the PDFContent monadic type used in PDF.PagesTissevert2020-03-03 18:16:12 +0100
9ce1a48030Optimistically prepare the instance declaration for Pages that should replace get / getAll, not really getting out of the MonadTissevert2020-02-28 18:15:40 +0100
4969c6442eSimple String aliasing to prepare the day when we'll be able to have more complex Component than just PDF Names (and access elements in an array)Tissevert2020-02-28 18:14:27 +0100
cb257fc07eRename function for clarity : actually it's doing just what w StreamContent does, but without checking the headers to re-zlib-encode the stream contentTissevert2020-02-27 17:30:42 +0100
d90eaf6f1cAdd Box instances to allow handling some exceptions in monad and converting them to Traversable accessible from the data part of the typeTissevert2020-02-27 17:22:12 +0100
99014ff30dRecognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at itTissevert2020-02-26 22:13:29 +0100
f4df4aab22Found a nicer formulation that doesn't require transitivity or index agglomeration and swapped argument of w for more reusability with at / atAllTissevert2020-02-26 17:14:43 +0100
bdbc5f7351Generalize the Box instances on containers from the particular cases of Document/Layers and Layers/Objects and move them to PDF.BoxTissevert2020-02-25 17:36:54 +0100
30fece6537Notice the 'edit' I exported earlier could be reused to simplify the w implementation of the proof that Box is a transitive relationTissevert2020-02-24 21:39:02 +0100
1a70f2972bExpose Box index flags in PDF and PDF.LayerTissevert2020-02-24 21:37:09 +0100
67faa06ea2Lift unused restriction on MonadFail for AllObjects instance of Box LayerTissevert2020-02-24 21:36:31 +0100
83a63d4b02Implement Box instance from Layer to Object, either all at once or indexed by an ObjectIdTissevert2020-02-24 17:29:22 +0100
85ee8519c4Implement Box instances from Document to Layers and EOLStyleTissevert2020-02-24 17:28:17 +0100
e607f9cd37Implement transitivity instance, extract a part of modifyAt as a convenient 'edit' function useful elsewhere and present a right-infix version of (,) to allow writing the nested tuple indexes more convenientlyTissevert2020-02-24 17:27:37 +0100
a9252b129aStart a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down thereTissevert2020-02-23 22:24:59 +0100
71e62ee732Add IDs to Instructions so that they can be selected in a given Content (and modified one day…)Tissevert2020-02-23 22:21:09 +0100
160999a7d7A small renaming for more clarity and because I thought «update» could be needed for a function name but after all maybe not but it's still better that wayTissevert2020-02-23 22:17:09 +0100
36b1782464Follow previous renaming for a local variable in Navigation for more clarityTissevert2020-02-23 22:15:52 +0100
bcf2e05bfbMove Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already)Tissevert2020-02-17 15:29:59 +0100
6096a1a237Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / DictionaryTissevert2020-02-15 13:51:24 +0100
23186100a8Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId accessTissevert2020-02-15 10:25:09 +0100
b916ab5206Just noticed Streams are a kind of Dictionary too, since they have a headerTissevert2020-02-15 10:23:32 +0100
4a6dbda7d3Move Error type from Pages to Navigation as a candidate for MonadFail required by PDFContent defined thereTissevert2020-02-15 10:22:42 +0100
923d1800b0Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to readTissevert2020-02-14 18:02:40 +0100
1c457d71d8Fix the reading of Hexadecimal string objects detected by running the tests implemented from the specTissevert2020-02-14 18:00:12 +0100
a72d76e229Add unit tests to make sure I'm not breaking things too muchTissevert2020-02-14 11:53:05 +0100
e429dbf946Trying to apply the same technique used for directObjects (apparently it ruins performance)
withHints
Tissevert2020-02-13 16:21:44 +0100
1a25307c8cMinor but strict improvement : remove the general implementation of <?> for Alternative
cutFail
Tissevert2020-02-12 17:48:26 +0100
919f640443Merge branch 'extract-text' into navigationTissevert2020-02-12 17:35:56 +0100
ae938acc02Merge branch 'main' into extract-textTissevert2020-02-12 17:34:56 +0100
32f9866106Use peek to improve directObject parser avoiding a large <|> disjunctionTissevert2020-02-12 17:34:27 +0100
eb4d76002cFinish the split of Navigation out of Page, generalize the use of MonadFail with a custom Error monad (~= Either String)Tissevert2020-02-11 22:41:46 +0100
af994cb50cWIP: in the process of migrating to Object.Navigation in Pages, still unsure how to manage simple Content parsing and efficient font loading (+ giving a way to edit Contents)Tissevert2020-02-11 17:59:15 +0100
704d7a7fcfIt turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…)Tissevert2020-02-11 17:35:35 +0100
11647eb4ebImplement output for Content streamsTissevert2020-02-11 17:26:47 +0100
aed7af376aWIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hellTissevert2020-02-11 08:29:08 +0100
e77bbbcda9WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsingTissevert2020-02-10 17:43:04 +0100
195446e653Allow resources with no /Font field, they won't cause any problem as long as no call to Tf (to load a font) is madeTissevert2020-02-10 17:41:44 +0100
9f1b1afafeImplement Text rendering from parsed ContentTissevert2020-02-10 10:54:44 +0100
20466c4f13WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigationTissevert2020-02-09 22:42:57 +0100
325250383aAdd support for fonts and implement MacRomanEncodingTissevert2020-02-08 08:15:32 +0100
dd6bfd90bdUsing toEnum to convert from Int to Int ? Surely a left-over from some time when it was a different type
fonts
Tissevert2020-02-08 08:10:32 +0100
03fbbc3a96Why did I implement this overly complicated lift by hand again ?Tissevert2020-02-07 13:08:10 +0100
fe055150a3Migrate to Text to represent page contents and get rid of encoding concerns too earlyTissevert2020-02-07 10:03:17 +0100
57996749c6Fix loose parser not making sure endOfInput is reached; add two families of operators and simplify the «Show» instance with a dedicated function to allow deleting lines of uninteresting codeTissevert2020-02-06 16:54:27 +0100
3f6b0651f3Expose the endOfLine parser through MonadParser to allow enforcing reaching the end of input in page parserTissevert2020-02-06 16:53:06 +0100
ecfd682b34Simplify functions exposed (all part of the MonadParser class)Tissevert2020-02-06 16:52:22 +0100