Commit graph

  • 6a7e9e9595 Expose a toList function similar to Data.Map in module Id for IdMap and allow mapping over keys and values in a monadic construct like mapM extract-text Tissevert 2020-06-03 15:15:07 +0200
  • d9f69014a0 Make a couple improvements in performance + add an example script to extract pages from a PDF Tissevert 2020-05-28 18:54:15 +0200
  • f6664683c7 Once again something that should never have been committed Tissevert 2020-03-20 09:34:53 +0100
  • c491e8a70c Forgot to remove deprecated source file Tissevert 2020-03-19 12:53:35 +0100
  • 09bd706748 Export Content operators, needed to write filters like reveal Tissevert 2020-03-19 09:05:06 +0100
  • 729e312f90 Actually, the spec calls 'catalog' what we call 'origin' — use 'catalog' for more clarity in regard to the spec Tissevert 2020-03-18 15:11:28 +0100
  • 6d265633e4 Export Instructions constructor from PDF.Content, used by reveal Tissevert 2020-03-18 15:08:26 +0100
  • 1eb1c23053 Found a nicer way to handle the too long IndirectObjCoordinates for Object Navigation Tissevert 2020-03-18 15:07:50 +0100
  • 44125f75a6 The orphan instance for MonadState s m => MonadReader s m really can't be used, so replace it with a mere function that runs an operation on a ReaderT into the monad State, allowing to borrow operations on MonadReader in a MonadState context Tissevert 2020-03-18 15:06:20 +0100
  • c8a5e2b191 Wait, CachedFonts are indexed by Id Object so it could be an IdMap actually Tissevert 2020-03-17 16:39:06 +0100
  • 11640c8465 Replace 'cacheFonts' by more versatile 'withFonts' inspired by 'withResources' that avoids having to declare an inline function to capture the 'layer' argument and pass it twice Tissevert 2020-03-17 16:29:46 +0100
  • e94a09b3ec Add a Traversable instance for IdMap, needed in reveal and useful in general to be able to use atAll Tissevert 2020-03-17 15:25:12 +0100
  • ba7dd6a690 Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath Tissevert 2020-03-17 10:33:47 +0100
  • d21e14f9a4 Hey, zlib isn't needed anymore for getText since all decoding is done directly in the Box instance for Streams Tissevert 2020-03-17 08:59:07 +0100
  • a1c2fbf110 Add an alias to Id to lift type ambiguities like 'chunk' in PDF.Content.Text Tissevert 2020-03-17 08:47:19 +0100
  • 24630a04a1 Implement 'w' for Pages Box instances Tissevert 2020-03-17 08:46:37 +0100
  • ee5e7500a8 Implement 'w' for Box m Chunks Content (Indexed Text) Tissevert 2020-03-17 08:45:18 +0100
  • d8aec5bf80 Add Box instance for IdMap a b, remove restriction on new keys in the Map instance since it's not really needed and could be better implemented like in OrderedMap by first using 'r' Tissevert 2020-03-17 08:43:54 +0100
  • 25e2823c75 Generalize register to all IdMap a b, since it's gonna be needed by Indexed Text too Tissevert 2020-03-17 08:39:29 +0100
  • 5027b079eb Include page numbers in chunks label, needed for long documents with many pages Tissevert 2020-03-17 08:36:02 +0100
  • 5722dd1a04 Use IntMap for all Maps on Ids Tissevert 2020-03-15 15:13:00 +0100
  • f31e9eb38b Generalize Ids out of Content to handle Object Ids too Tissevert 2020-03-14 22:30:28 +0100
  • 0f857c457d Use a defined monadic stack in Pages to lift the MonadReader ambiguity and allow finishing to reimplement getText demo Tissevert 2020-03-14 16:57:16 +0100
  • 40475a3093 Clean unneeded stuff separating the monadic type constraint from the actual monad stack used, one more step towards MonadFail -> MonadError Tissevert 2020-03-14 16:55:05 +0100
  • a9d3e5d326 Clean unused dependencies from Map + use a more defined Monad for the Box Chunks instance, hoping we will be able to clear the whole stack someday and stop requiring that RoContext type, unboxing and reboxing the FontSet for no good Tissevert 2020-03-14 16:27:56 +0100
  • f2a99e1fd2 Reorder module PDF.Body in alphabetical order Tissevert 2020-03-14 16:25:26 +0100
  • 5bf2b08fa9 Try replacing general monadic type constraint by a definite monad stack Tissevert 2020-03-11 22:35:19 +0100
  • 5b8d951516 WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages Tissevert 2020-03-11 18:55:18 +0100
  • d3f1b97f3a Replace the fake instance of Box for Content over Indexed Text with the true one using renderText Tissevert 2020-03-11 18:53:41 +0100
  • c4c3e35e09 Write said instance Tissevert 2020-03-11 18:52:09 +0100
  • 10f8c711da Implement set and mapi on OrderedMap for convenience and to write a Box instance over OrderedMap like the one over Map Tissevert 2020-03-11 18:51:49 +0100
  • b6c1f670ef Generalize the search for FlateDecode (there can be several filters in an array) Tissevert 2020-03-11 10:47:52 +0100
  • 3b1a5152e4 Try connecting all the Box instance in the getText demo, try to encode pages contents with a simple assoc list Tissevert 2020-03-10 22:57:11 +0100
  • a04adff1d2 Prepare real instance of Box using renderText Tissevert 2020-03-10 22:55:16 +0100
  • 103037ffb2 Fix mistake in arity of operator " Tissevert 2020-03-10 22:53:27 +0100
  • dce10ae63a Keep Page as only a reference object keeping the ObjectId explicit so we can modify the actual objects one day, write an OrderedMap data structure to help Tissevert 2020-03-08 22:18:47 +0100
  • f2986da96d Simplify Content abstracting over MonadParser for no reason and provide instead a parse that's in MonadFail to avoid having to handle Either outside Tissevert 2020-03-08 22:16:23 +0100
  • 673321bf0a Implement encoder for good Tissevert 2020-03-08 22:14:36 +0100
  • 0ade9cc2f5 Implement proper text formatting into PDF instructions using the new encode feature available in Fonts Tissevert 2020-03-08 00:04:18 +0100
  • 457f1755e6 Prepare storing the reverse mapping for CMaps, divided by length to be able to implement encoding with a reasonable complexity Tissevert 2020-03-08 00:02:24 +0100
  • ca40d2df76 Don't use (!?) operator that doesn't exist before containers 0.5.9 for maximum compatibility Tissevert 2020-03-08 00:00:24 +0100
  • 44bc898ed3 Generalize the Indexed type to handle both arbitrary Content instructions and text-related ones that can be viewed as text chunks Tissevert 2020-03-06 19:21:16 +0100
  • 1ec47c5d07 Update Font type to cover both encoding and decoding — WIP for CMap, but complete though not tested yet for MacRoman encoding Tissevert 2020-03-06 19:19:53 +0100
  • 6e245189fd Add a simple Box instance that exposes IndexedInstructions within a Content Tissevert 2020-03-05 17:44:38 +0100
  • 90348c57d6 Disable text rendering and font loading from the Page abstraction, this code will have to be moved into a separate Box instance Tissevert 2020-03-05 17:40:58 +0100
  • 50ac0692b2 Implement r for access by PageNumber and clean the mess a bit Tissevert 2020-03-05 10:09:09 +0100
  • 2b9abc24b6 Add a separate instance for Raw streams that don't try to decode them Tissevert 2020-03-04 18:31:30 +0100
  • 309f6ed461 Actually re-implement getText with the simpler Box instance Tissevert 2020-03-04 18:19:10 +0100
  • 93c9863426 Remove accidentally committed trailing space on a line Tissevert 2020-03-04 18:14:54 +0100
  • 7cef65d799 Fixed vicious bug introduced by 6096a1a237 (since follow is now automatic for references, it's not called explicitly but should in case of 'several' Content, which is an array of references, each of which should be expanded) — TODO: add a unit test for that Tissevert 2020-03-04 18:14:33 +0100
  • d288ecf0ac Start reimplementing getAll as a Box instance and try to separate the various monad run steps Tissevert 2020-03-03 18:17:44 +0100
  • 3b3eeef218 Maybe we need a MonadState s m => MonadReader s m instance some day ? Tissevert 2020-03-03 18:16:49 +0100
  • 2c02e44adf Export the PDFContent monadic type used in PDF.Pages Tissevert 2020-03-03 18:16:12 +0100
  • 9ce1a48030 Optimistically prepare the instance declaration for Pages that should replace get / getAll, not really getting out of the Monad Tissevert 2020-02-28 18:15:40 +0100
  • 4969c6442e Simple String aliasing to prepare the day when we'll be able to have more complex Component than just PDF Names (and access elements in an array) Tissevert 2020-02-28 18:14:27 +0100
  • cb257fc07e Rename function for clarity : actually it's doing just what w StreamContent does, but without checking the headers to re-zlib-encode the stream content Tissevert 2020-02-27 17:30:42 +0100
  • d90eaf6f1c Add Box instances to allow handling some exceptions in monad and converting them to Traversable accessible from the data part of the type Tissevert 2020-02-27 17:22:12 +0100
  • 99014ff30d Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it Tissevert 2020-02-26 22:13:29 +0100
  • f4df4aab22 Found a nicer formulation that doesn't require transitivity or index agglomeration and swapped argument of w for more reusability with at / atAll Tissevert 2020-02-26 17:14:43 +0100
  • bdbc5f7351 Generalize the Box instances on containers from the particular cases of Document/Layers and Layers/Objects and move them to PDF.Box Tissevert 2020-02-25 17:36:54 +0100
  • 30fece6537 Notice the 'edit' I exported earlier could be reused to simplify the w implementation of the proof that Box is a transitive relation Tissevert 2020-02-24 21:39:02 +0100
  • 1a70f2972b Expose Box index flags in PDF and PDF.Layer Tissevert 2020-02-24 21:37:09 +0100
  • 67faa06ea2 Lift unused restriction on MonadFail for AllObjects instance of Box Layer Tissevert 2020-02-24 21:36:31 +0100
  • 83a63d4b02 Implement Box instance from Layer to Object, either all at once or indexed by an ObjectId Tissevert 2020-02-24 17:29:22 +0100
  • 85ee8519c4 Implement Box instances from Document to Layers and EOLStyle Tissevert 2020-02-24 17:28:17 +0100
  • e607f9cd37 Implement transitivity instance, extract a part of modifyAt as a convenient 'edit' function useful elsewhere and present a right-infix version of (,) to allow writing the nested tuple indexes more conveniently Tissevert 2020-02-24 17:27:37 +0100
  • a9252b129a Start a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down there Tissevert 2020-02-23 22:24:59 +0100
  • 71e62ee732 Add IDs to Instructions so that they can be selected in a given Content (and modified one day…) Tissevert 2020-02-23 22:21:09 +0100
  • 160999a7d7 A small renaming for more clarity and because I thought «update» could be needed for a function name but after all maybe not but it's still better that way Tissevert 2020-02-23 22:17:09 +0100
  • 36b1782464 Follow previous renaming for a local variable in Navigation for more clarity Tissevert 2020-02-23 22:15:52 +0100
  • bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) Tissevert 2020-02-17 15:29:59 +0100
  • 6096a1a237 Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary Tissevert 2020-02-15 13:51:24 +0100
  • 23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access Tissevert 2020-02-15 10:25:09 +0100
  • b916ab5206 Just noticed Streams are a kind of Dictionary too, since they have a header Tissevert 2020-02-15 10:23:32 +0100
  • 4a6dbda7d3 Move Error type from Pages to Navigation as a candidate for MonadFail required by PDFContent defined there Tissevert 2020-02-15 10:22:42 +0100
  • 923d1800b0 Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read Tissevert 2020-02-14 18:02:40 +0100
  • 1c457d71d8 Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec Tissevert 2020-02-14 18:00:12 +0100
  • a72d76e229 Add unit tests to make sure I'm not breaking things too much Tissevert 2020-02-14 11:53:05 +0100
  • e429dbf946 Trying to apply the same technique used for directObjects (apparently it ruins the performances) withHints Tissevert 2020-02-13 16:21:44 +0100
  • 1a25307c8c Minor but strict improvement : remove the general implementation of <?> for Alternative cutFail Tissevert 2020-02-12 17:48:26 +0100
  • 919f640443 Merge branch 'extract-text' into navigation Tissevert 2020-02-12 17:35:56 +0100
  • ae938acc02 Merge branch 'main' into extract-text Tissevert 2020-02-12 17:34:56 +0100
  • 32f9866106 Use peek to improve directObject parser avoiding a large <|> disjunction Tissevert 2020-02-12 17:34:27 +0100
  • eb4d76002c Finish the split of Navigation out of Page, generalize the use of MonadFail with a custom Error monad (~= Either String) Tissevert 2020-02-11 22:41:46 +0100
  • af994cb50c WIP: in the process of migrating to Object.Navigation in Pages, still unsure how to manage simple Content parsing and efficient font loading (+ giving a way to edit Contents) Tissevert 2020-02-11 17:59:15 +0100
  • 704d7a7fcf It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…) Tissevert 2020-02-11 17:35:35 +0100
  • 11647eb4eb Implement output for Content streams Tissevert 2020-02-11 17:26:47 +0100
  • aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell Tissevert 2020-02-11 08:29:08 +0100
  • e77bbbcda9 WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing Tissevert 2020-02-10 17:43:04 +0100
  • 195446e653 Allow resources with no /Font field, they won't cause any problem as long as no call to Tf (to load a font) is made Tissevert 2020-02-10 17:41:44 +0100
  • 9f1b1afafe Implement Text rendering from parsed Content Tissevert 2020-02-10 10:54:44 +0100
  • 20466c4f13 WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation Tissevert 2020-02-09 22:42:57 +0100
  • 325250383a Add support for fonts and implement MacRomanEncoding Tissevert 2020-02-08 08:15:32 +0100
  • dd6bfd90bd Using toEnum to convert from Int to Int ? Surely a left-over from some time when it was a different type fonts Tissevert 2020-02-08 08:10:32 +0100
  • 03fbbc3a96 Why did I implement this overly complicated lift by hand again ? Tissevert 2020-02-07 13:08:10 +0100
  • 95f9ab35b1 Implement MacRomanEncoding for real following their own vendor file https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT Tissevert 2020-02-07 10:59:28 +0100
  • fe055150a3 Migrate to Text to represent page contents and get rid of encoding concerns too early Tissevert 2020-02-07 10:03:17 +0100
  • 57996749c6 Fix loose parser not making sure endOfInput is reached; add two families of operators and simplify the «Show» instance with a dedicated function to allow deleting lines of uninteresting code Tissevert 2020-02-06 16:54:27 +0100
  • 3f6b0651f3 Expose the endOfLine parser through MonadParser to allow enforcing reaching the end of input in page parser Tissevert 2020-02-06 16:53:06 +0100
  • ecfd682b34 Simplify functions exposed (all part of the MonadParser class) Tissevert 2020-02-06 16:52:22 +0100