% Using the pandoc API
% John MacFarlane

Pandoc can be used as a Haskell library, to write your own
conversion tools or power a web application.  This document
offers an introduction to using the pandoc API.

Detailed API documentation at the level of individual functions
and types is available at
<https://hackage.haskell.org/package/pandoc>.

# Pandoc's architecture

Pandoc is structured as a set of *readers*, which translate
various input formats into an abstract syntax tree (the
Pandoc AST) representing a structured document, and a set of
*writers*, which render this AST into various output formats.
Pictorially:

```
[input format] ==reader==> [Pandoc AST] ==writer==> [output format]
```

This architecture allows pandoc to perform $M \times N$
conversions with $M$ readers and $N$ writers.

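As a minimal sketch of this point (an illustration, not code from pandoc itself), a single parsed AST can be handed to more than one writer. The helper name `toRstAndHtml` is hypothetical; `runPure`, which is described later in this document, is used only to run the conversion.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc
import Data.Text (Text)
import qualified Data.Text.IO as TIO

-- Hypothetical helper: parse Markdown once, then render the same
-- AST with two different writers.
toRstAndHtml :: Text -> Either PandocError (Text, Text)
toRstAndHtml input = runPure $ do
  ast  <- readMarkdown def input      -- reader: Markdown -> Pandoc AST
  rst  <- writeRST def ast            -- writer 1: AST -> reStructuredText
  html <- writeHtml5String def ast    -- writer 2: AST -> HTML5
  return (rst, html)

main :: IO ()
main =
  case toRstAndHtml "Hello *world*" of
    Left err          -> putStrLn ("conversion failed: " ++ show err)
    Right (rst, html) -> TIO.putStrLn rst >> TIO.putStrLn html
```

Adding a further output format here only means calling one more writer on `ast`; the reader is untouched.
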
The Pandoc AST is defined in the
[pandoc-types](https://hackage.haskell.org/package/pandoc-types)
package.  You should start by looking at the Haddock
documentation for
[Text.Pandoc.Definition](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).  As you'll see, a `Pandoc` is
composed of some metadata and a list of `Block`s.  There are
various kinds of `Block`, including `Para` (paragraph),
`Header` (section heading), and `BlockQuote`.  Some of the
`Block`s (like `BlockQuote`) contain lists of `Block`s,
while others (like `Para`) contain lists of `Inline`s, and
still others (like `CodeBlock`) contain plain text or
nothing.  `Inline`s are the basic elements of paragraphs.
The distinction between `Block` and `Inline` in the type
system makes it impossible to represent, for example,
a link (`Inline`) whose link text is a block quote (`Block`).
This expressive limitation is mostly a help rather than a
hindrance, since many of the formats pandoc supports have
similar limitations.

The best way to explore the pandoc AST is to use `pandoc -t
native`, which will display the AST corresponding to some
Markdown input:

```
% echo -e "1. *foo*\n2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
 [[Plain [Emph [Str "foo"]]]
 ,[Plain [Str "bar"]]]]
```

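The same structure can also be written directly as a Haskell value using the constructors from `Text.Pandoc.Definition`. The following is a small sketch; the name `listDoc` is hypothetical, and `OverloadedStrings` is enabled on the assumption that it lets the string literals type-check whether `Str` takes a `String` or a `Text` in your version of pandoc-types.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition

-- The ordered list from the native output, written out by hand:
-- a Pandoc value is metadata plus a list of Blocks.
listDoc :: Pandoc
listDoc = Pandoc nullMeta
  [ OrderedList (1, Decimal, Period)
      [ [Plain [Emph [Str "foo"]]]   -- first item: a Block holding Inlines
      , [Plain [Str "bar"]]          -- second item
      ]
  ]

main :: IO ()
main = print listDoc
```

Note how the types enforce the nesting: each list item is a list of `Block`s, and `Emph` can only wrap `Inline`s.
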
# A simple example

Here is a simple example of the use of a pandoc reader and
writer to perform a conversion inside ghci:

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst
```

Some notes:

1.  The first part constructs a conversion pipeline: the input
    string is passed to `readMarkdown`, and the resulting Pandoc
    AST (`doc`) is then rendered by `writeRST`.  The conversion
    pipeline is "run" by `runIO`---more on that below.

2.  `result` has the type `Either PandocError Text`.  We could
    pattern-match on this manually, but it's simpler in this
    context to use the `handleError` function from
    Text.Pandoc.Error.  This exits with an appropriate error
    code and message if the value is a `Left`, and returns the
    `Text` if the value is a `Right`.

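For comparison, here is a sketch of the manual pattern-match mentioned in the second note. The error message format and the use of `exitFailure` are choices made for this illustration; `handleError` itself picks an error code and message appropriate to the particular error.

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  -- result :: Either PandocError Text
  case result of
    Left err  -> do
      -- Report the error ourselves instead of calling handleError.
      hPutStrLn stderr ("pandoc error: " ++ show err)
      exitFailure
    Right rst -> TIO.putStrLn rst
```

Matching by hand is mainly useful when the program should recover from a failed conversion rather than exit.
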
# The PandocMonad class

Let's look at the types of `readMarkdown` and `writeRST`:

```haskell
readMarkdown :: PandocMonad m => ReaderOptions -> Text -> m Pandoc

writeRST :: PandocMonad m => WriterOptions -> Pandoc -> m Text
```

The `PandocMonad m =>` part is a typeclass constraint.
It says that `readMarkdown` and `writeRST` define computations
that can be used in any instance of the `PandocMonad`
type class.  `PandocMonad` is defined in the module
Text.Pandoc.Class.

Two instances of `PandocMonad` are provided: `PandocIO` and
`PandocPure`.  The difference is that computations run in
`PandocIO` are allowed to do IO (for example, read a file),
while computations in `PandocPure` are free of any side effects.
`PandocPure` is useful for sandboxed environments, when you want
to prevent users from doing anything malicious.  To run the
conversion in `PandocIO`, use `runIO` (as above).  To run it in
`PandocPure`, use `runPure`.

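A sketch of the earlier conversion run in `PandocPure` might look like the following. The function name `markdownToRST` is hypothetical. Because `runPure` returns an `Either` directly rather than an `IO` action, the conversion becomes an ordinary pure function.

```haskell
import Text.Pandoc
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper: the same reader/writer pipeline as before,
-- but run with runPure, so no IO can happen during the conversion.
markdownToRST :: Text -> Either PandocError Text
markdownToRST input = runPure $ do
  doc <- readMarkdown def input
  writeRST def doc

main :: IO ()
main =
  case markdownToRST (T.pack "[testing](url)") of
    Left err  -> putStrLn ("conversion failed: " ++ show err)
    Right rst -> putStrLn (T.unpack rst)
```

The `do` block is identical to the one passed to `runIO`; only the function used to run it changes, which is what the `PandocMonad m =>` constraint makes possible.
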
As you can see from the Haddocks,
[Text.Pandoc.Class](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Class.html)
exports many auxiliary functions that can be used in any
instance of `PandocMonad`.  For example:

```haskell
-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity

-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()

-- | Get the accumulated log messages (in temporal order).
getLog :: PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog

-- | Log a message using 'logOutput'.  Note that
-- 'logOutput' is called only if the verbosity
-- level exceeds the level of the message, but
-- the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless
-- of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()

-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem :: PandocMonad m
          => String
          -> m (B.ByteString, Maybe MimeType)

-- | Set the resource path searched by 'fetchItem'.
setResourcePath :: PandocMonad m => [FilePath] -> m ()
```

If we wanted more verbose informational messages
during the conversion we defined in the previous
section, we could do this:

```haskell
  result <- runIO $ do
    setVerbosity INFO
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
```

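Messages passed to `report` can also be collected programmatically with `getLog`. The following is a sketch under the assumption that you want the log returned alongside the converted text; for this particular input the list may well be empty.

```haskell
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    setVerbosity INFO
    doc  <- readMarkdown def (T.pack "[testing](url)")
    rst  <- writeRST def doc
    msgs <- getLog              -- everything logged so far, oldest first
    return (rst, msgs)
  (rst, msgs) <- handleError result
  TIO.putStrLn rst
  mapM_ print msgs              -- LogMessage has a Show instance
```

Since `getLog` returns messages regardless of the verbosity level, this also works without the call to `setVerbosity`.
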
# Options

The first argument of each reader or writer is for
options controlling the behavior of the reader or writer:
`ReaderOptions` for readers and `WriterOptions`
for writers.  These are defined in
[Text.Pandoc.Options](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Options.html).  It is a good idea to study these
options to see what can be adjusted.

`def` (from Data.Default) denotes a default value for
each kind of option.  (You can also use `defaultWriterOptions`
and `defaultReaderOptions`.)  Generally you'll want to use
the defaults and modify them only when needed, for example:

```haskell
writeRST def{ writerReferenceLinks = True }
```

Some particularly important options to know about:

1.  `writerTemplate`:  By default, this is `Nothing`, which
    means that a document fragment will be produced.  If you
    want a full document, you need to specify `Just template`,
    where `template` is a `String` containing the template's
    contents (not the path).

2.  `readerExtensions` and `writerExtensions`:  These specify
    the extensions to be used in parsing and rendering.
    Extensions are defined in [Text.Pandoc.Extensions](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Extensions.html).

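As an illustration of modifying the defaults, the sketch below enables pandoc's own Markdown extensions in the reader and reference-style links in the writer. The names `readerOpts`, `writerOpts`, and `convert` are hypothetical; `pandocExtensions` is the extension set exported by Text.Pandoc.Extensions.

```haskell
import Text.Pandoc
import Data.Text (Text)
import qualified Data.Text as T

-- Reader options: the defaults, plus pandoc's Markdown extensions.
readerOpts :: ReaderOptions
readerOpts = def { readerExtensions = pandocExtensions }

-- Writer options: the defaults, plus reference-style links.
writerOpts :: WriterOptions
writerOpts = def { writerReferenceLinks = True }

-- Hypothetical helper that threads both option records through
-- the reader and the writer.
convert :: Text -> Either PandocError Text
convert input = runPure $ do
  doc <- readMarkdown readerOpts input
  writeRST writerOpts doc

main :: IO ()
main =
  case convert (T.pack "[testing](url)") of
    Left err  -> putStrLn ("conversion failed: " ++ show err)
    Right rst -> putStrLn (T.unpack rst)
```

Keeping the option records as named top-level values makes it easy to reuse the same settings across several conversions.
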
# Builder

`Text.Pandoc.Builder` distinguishes `Inlines` from `Inline` and
`Blocks` from `Block`.  Concatenating lists is slow, so the
builder uses the special types `Inlines` and `Blocks`, which
wrap sequences (`Seq`) of `Inline` and `Block` elements.

Both types are instances of `Monoid`, which makes it easy to
build up documents programmatically.

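A small sketch of what this looks like in practice follows. The names `greeting`, `body`, and `letter` are hypothetical; `OverloadedStrings` lets string literals stand for `Inlines` values.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder

-- Inlines values are joined with (<>); a string literal is
-- converted to Inlines automatically.
greeting :: Inlines
greeting = "Hello, " <> emph "world" <> "!"

-- Blocks values are joined the same way.
body :: Blocks
body = para greeting <> bulletList [plain "one", plain "two"]

-- 'doc' wraps a Blocks value into a complete Pandoc document.
letter :: Pandoc
letter = doc body

main :: IO ()
main = do
  print (toList greeting)   -- the underlying list of Inline elements
  print letter
```

`toList` converts back to an ordinary list when you need to inspect or pattern-match on the individual elements.
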
An example: here's a JSON data source about CNG fueling stations in the
Chicago area, `cng_fuel_chicago.json`.  Boss says: write me a
letter in Word listing all the stations that take the Voyager
card.

```
[ {
  "state" : "IL",
  "city" : "Chicago",
  "fuel_type_code" : "CNG",
  "zip" : "60607",
  "station_name" : "Clean Energy - Yellow Cab",
  "cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
  "street_address" : "540 W Grenshaw"
}, ...
```

No need to open Word for this job!  Here is `fuel.hs`:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
import Control.Applicative
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List (intersperse)

-- NOTE: newer pandoc-types releases use Text rather than String in
-- the Builder API; there, declare these fields as T.Text and split
-- the card list with T.words.
data Station = Station{
    address        :: String
  , name           :: String
  , cardsAccepted  :: [String]
  } deriving Show

instance FromJSON Station where
    parseJSON (Object v) = Station <$>
       v .: "street_address" <*>
       v .: "station_name" <*>
       (words <$> (v .:? "cards_accepted" .!= ""))
    parseJSON _          = mzero

createLetter :: [Station] -> Pandoc
createLetter stations = doc $
    para "Dear Boss:" <>
    para "Here are the CNG stations that accept Voyager cards:" <>
    simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
           (map stationToRow stations) <>
    para "Your loyal servant," <>
    plain (image "JohnHancock.png" "" mempty)
  where
    stationToRow station =
      [ plain (text $ name station)
      , plain (text $ address station)
      , plain (mconcat $ intersperse linebreak $ map text $ cardsAccepted station)
      ]

main :: IO ()
main = do
  json <- BL.readFile "cng_fuel_chicago.json"
  let letter = case decode json of
                    Just stations -> createLetter [s | s <- stations,
                                        "Voyager" `elem` cardsAccepted s]
                    Nothing       -> error "Could not decode JSON"
  -- writeDocx runs in a PandocMonad, so run it with runIO and
  -- unwrap the result with handleError before writing the file.
  docx <- runIO (writeDocx def letter) >>= handleError
  BL.writeFile "letter.docx" docx
  putStrLn "Created letter.docx"
```

# Templates and other data files

`readDataFile`

# Handling errors and warnings

# Generic transformations

`Text.Pandoc.Walk` and syb ("Scrap Your Boilerplate" generics)
can be used for AST transformations.

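As a minimal sketch of a `walk`-based transformation, the following turns every emphasized span into a strong one. The function names `emphToStrong` and `strengthen` are hypothetical; `walk` comes from `Text.Pandoc.Walk` in pandoc-types.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)

-- Rewrite a single Inline: emphasis becomes strong emphasis,
-- everything else is left alone.
emphToStrong :: Inline -> Inline
emphToStrong (Emph xs) = Strong xs
emphToStrong x         = x

-- 'walk' applies the function to every Inline in the document,
-- however deeply it is nested.
strengthen :: Pandoc -> Pandoc
strengthen = walk emphToStrong

main :: IO ()
main = print (strengthen (Pandoc nullMeta [Para [Emph [Str "foo"], Space, Str "bar"]]))
```

The transformation is written for one node type only; `walk` takes care of traversing the surrounding `Block`s and `Inline`s.
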
Filters: see filters.md.

But how do you run filters from a program?  The functions
for this need to be exported from Text.Pandoc.App!

# PDF

# Creating a front-end

Text.Pandoc.App