pandoc/doc/using-the-pandoc-api.md

294 lines
9.1 KiB
Markdown
Raw Normal View History

% Using the pandoc API
% John MacFarlane
2017-10-25 01:08:05 +02:00
Pandoc can be used as a Haskell library, to write your own
conversion tools or power a web application. This document
offers an introduction to using the pandoc API.
2017-10-25 01:08:05 +02:00
Detailed API documentation at the level of individual functions
and types is available at
<https://hackage.haskell.org/package/pandoc>.
# Pandoc's architecture
2017-10-25 01:08:05 +02:00
Pandoc is structured as a set of *readers*, which translate
various input formats into an abstract syntax tree (the
Pandoc AST) representing a structured document, and a set of
*writers*, which render this AST into various input formats.
Pictorially:
2017-10-25 01:08:05 +02:00
```
[input format] ==reader==> [Pandoc AST] ==writer==> [output format]
```
This architecture allows pandoc to perform $M \times n$
conversions with $M$ readers and $N$ writers.
The Pandoc AST is defined in the
[pandoc-types](https://hackage.haskell.org/package/pandoc-types)
package. You should start by looking at the Haddock
documentation for
[Text.Pandoc.Definition](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html). As you'll see, a `Pandoc` is
composed of some metadata and a list of `Block`s. There are
various kinds of `Block`, including `Para` (paragraph),
`Header` (section heading), and `BlockQuote`. Some of the
`Block`s (like `BlockQuote`) contain lists of `Block`s,
while others (like `Para`) contain lists of `Inline`s, and
still others (like `CodeBlock`) contain plain text or
nothing. `Inline`s are the basic elements of paragraphs.
The distinction between `Block` and `Inline` in the type
system makes it impossible to represent, for example,
a link (`Inline`) whose link text is a block quote (`Block`).
This expressive limitation is mostly a help rather than a
hindrance, since many of the formats pandoc supports have
similar limitations.
The best way to explore the pandoc AST is to use `pandoc -t
native`, which will display the AST correspoding to some
Markdown input:
```
% echo -e "1. *foo*\n2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
[[Plain [Emph [Str "foo"]]]
,[Plain [Str "bar"]]]]
```
# A simple example
Here is a simple example of the use of a pandoc reader and
writer to perform a conversion inside ghci:
```
import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
main :: IO ()
main = do
result <- runIO $ do
doc <- readMarkdown def (T.pack "[testing](url)")
writeRST def doc
rst <- handleError result
TIO.putStrLn rst
```
2017-10-25 01:08:05 +02:00
Some notes:
2017-10-25 01:08:05 +02:00
1. The first part constructs a conversion pipeline: the input
string is passed to `readMarkdown`, and the resulting Pandoc
AST (`doc`) is then rendered by `writeRST`. The conversion
pipeline is "run" by `runIO`---more on that below.
2. `result` has the type `Either PandocError Text`. We could
pattern-match on this manually, but it's simpler in this
context to use the `handleError` function from
Text.Pandoc.Error. This exits with an appropriate error
code and message if the value is a `Left`, and returns the
`Text` if the value is a `Right`.
2017-10-25 01:08:05 +02:00
# The PandocMonad class
Let's look at the types of `readMarkdown` and `writeRST`:
```haskell
readMarkdown :: PandocMonad m => ReaderOptions -> Text -> m Pandoc
writeRST :: PandocMonad m => WriterOptions -> Pandoc -> m Text
```
2017-10-25 01:08:05 +02:00
The `PandocMonad m =>` part is a typeclass constraint.
It says that `readMarkdown` and `writeRST` define computations
that can be used in any instance of the `PandocMonad`
type class. `PandocMonad` is defined in the module
Text.Pandoc.Class.
Two instances of `PandocMonad` are provided: `PandocIO` and
`PandocPure`. The difference is that computations run in
`PandocIO` are allowed to do IO (for example, read a file),
while computations in `PandocPure` are free of any side effects.
`PandocPure` is useful for sandboxed environments, when you want
to prevent users from doing anything malicious. To run the
conversion in `PandocIO`, use `runIO` (as above). To run it in
`PandocPure`, use `runPure`.
As you can see from the Haddocks,
[Text.Pandoc.Class](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Class.html)
exports many auxiliary functions that can be used in any
instance of `PandocMonad`. For example:
2017-10-25 01:08:05 +02:00
```haskell
-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity
-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()
-- Get the accomulated log messages (in temporal order).
getLog :: PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog
-- | Log a message using 'logOutput'. Note that
-- 'logOutput' is called only if the verbosity
-- level exceeds the level of the message, but
-- the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless
-- of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()
-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem :: PandocMonad m
=> String
-> m (B.ByteString, Maybe MimeType)
setResourcePath :: PandocMonad m => [FilePath] -> m ()
```
If we wanted more verbose informational messages
during the conversion we defined in the previous
section, we could do this:
```haskell
result <- runIO $ do
setVerbosity INFO
doc <- readMarkdown def (T.pack "[testing](url)")
writeRST def doc
```
# Options
The first argument of each reader or writer is for
options controlling the behavior of the reader or writer:
`ReaderOptions` for readers and `WriterOptions`
for writers. These are defined in
[Text.Pandoc.Options](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Options.html). It is a good idea to study these
options to see what can be adjusted.
`def` (from Data.Default) denotes a default value for
each kind of option. (You can also use `defaultWriterOptions`
and `defaultReaderOptions`.) Generally you'll want to use
the defaults and modify them only when needed, for example:
```haskell
writeRST def{ writerReferenceLinks = True }
```
Some particularly important options to know about:
1. `writerTemplate`: By default, this is `Nothing`, which
means that a document fragment will be produced. If you
want a full document, you need to specify `Just template`,
where `template` is a String containing the template's
contents (not the path).
2. `readerExtensions` and `writerExtensions`: These specify
the extensions to be used in parsing and rendering.
Extensions are defined in [Text.Pandoc.Extensions](https://hackage.haskell.org/package/pandoc/docs/Text-Pandoc-Extensions.html).
# Builder
Inlines vs Inline, etc.
Concatenating lists is slow. So we use special types Inlines and Blocks that wrap Sequences of Inline and Block elements.
Monoid - makes it easy to build up docs programatically.
Example.
Heres a JSON data source about CNG fueling stations in the
Chicago area: cng_fuel_chicago.json. Boss says: write me a
letter in Word listing all the stations that take the Voyager
card.
```
[ {
"state" : "IL",
"city" : "Chicago",
"fuel_type_code" : "CNG",
"zip" : "60607",
"station_name" : "Clean Energy - Yellow Cab",
"cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
"street_address" : "540 W Grenshaw"
}, ...
```
No need to open Word for this job! fuel.hs
```
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
import Control.Applicative
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List (intersperse)
data Station = Station{
address :: String
, name :: String
, cardsAccepted :: [String]
} deriving Show
instance FromJSON Station where
parseJSON (Object v) = Station <$>
v .: "street_address" <*>
v .: "station_name" <*>
(words <$> (v .:? "cards_accepted" .!= ""))
parseJSON _ = mzero
createLetter :: [Station] -> Pandoc
createLetter stations = doc $
para "Dear Boss:" <>
para "Here are the CNG stations that accept Voyager cards:" <>
simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
(map stationToRow stations) <>
para "Your loyal servant," <>
plain (image "JohnHancock.png" "" mempty)
where
stationToRow station =
[ plain (text $ name station)
, plain (text $ address station)
, plain (mconcat $ intersperse linebreak $ map text $ cardsAccepted station)
]
main :: IO ()
main = do
json <- BL.readFile "cng_fuel_chicago.json"
let letter = case decode json of
Just stations -> createLetter [s | s <- stations,
"Voyager" `elem` cardsAccepted s]
Nothing -> error "Could not decode JSON"
BL.writeFile "letter.docx" =<< writeDocx def letter
putStrLn "Created letter.docx"
```
# Templates and other data files
readDataFile
# Handling errors and warnings
# Generic transformations
Walk and syb for AST transformations
Filters: see filters.md
but, how do you run filters from a program?
need to export these functions from Text.Pandoc.App!
# PDF
2017-10-25 01:08:05 +02:00
# Creating a front-end
Text.Pandoc.App