John MacFarlane 07bf48d07b More progress on using-the-pandoc-api.md.

2017-10-24 22:25:59 -07:00

9.1 KiB

Raw Blame History

% Using the pandoc API % John MacFarlane

Pandoc can be used as a Haskell library, to write your own conversion tools or power a web application. This document offers an introduction to using the pandoc API.

Detailed API documentation at the level of individual functions and types is available at https://hackage.haskell.org/package/pandoc.

Pandoc's architecture

Pandoc is structured as a set of readers, which translate various input formats into an abstract syntax tree (the Pandoc AST) representing a structured document, and a set of writers, which render this AST into various input formats. Pictorially:

[input format] ==reader==> [Pandoc AST] ==writer==> [output format]

This architecture allows pandoc to perform M \times n conversions with M readers and N writers.

The Pandoc AST is defined in the pandoc-types package. You should start by looking at the Haddock documentation for Text.Pandoc.Definition. As you'll see, a Pandoc is composed of some metadata and a list of Blocks. There are various kinds of Block, including Para (paragraph), Header (section heading), and BlockQuote. Some of the Blocks (like BlockQuote) contain lists of Blocks, while others (like Para) contain lists of Inlines, and still others (like CodeBlock) contain plain text or nothing. Inlines are the basic elements of paragraphs. The distinction between Block and Inline in the type system makes it impossible to represent, for example, a link (Inline) whose link text is a block quote (Block). This expressive limitation is mostly a help rather than a hindrance, since many of the formats pandoc supports have similar limitations.

The best way to explore the pandoc AST is to use pandoc -t native, which will display the AST correspoding to some Markdown input:

% echo -e "1. *foo*\n2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
 [[Plain [Emph [Str "foo"]]]
 ,[Plain [Str "bar"]]]]

A simple example

Here is a simple example of the use of a pandoc reader and writer to perform a conversion inside ghci:

import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst

Some notes:

The first part constructs a conversion pipeline: the input string is passed to readMarkdown, and the resulting Pandoc AST (doc) is then rendered by writeRST. The conversion pipeline is "run" by runIO---more on that below.
result has the type Either PandocError Text. We could pattern-match on this manually, but it's simpler in this context to use the handleError function from Text.Pandoc.Error. This exits with an appropriate error code and message if the value is a Left, and returns the Text if the value is a Right.

The PandocMonad class

Let's look at the types of readMarkdown and writeRST:

readMarkdown :: PandocMonad m => ReaderOptions -> Text -> m Pandoc

writeRST :: PandocMonad m => WriterOptions -> Pandoc -> m Text

The PandocMonad m => part is a typeclass constraint. It says that readMarkdown and writeRST define computations that can be used in any instance of the PandocMonad type class. PandocMonad is defined in the module Text.Pandoc.Class.

Two instances of PandocMonad are provided: PandocIO and PandocPure. The difference is that computations run in PandocIO are allowed to do IO (for example, read a file), while computations in PandocPure are free of any side effects. PandocPure is useful for sandboxed environments, when you want to prevent users from doing anything malicious. To run the conversion in PandocIO, use runIO (as above). To run it in PandocPure, use runPure.

As you can see from the Haddocks, Text.Pandoc.Class exports many auxiliary functions that can be used in any instance of PandocMonad. For example:

-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity

-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()

-- Get the accomulated log messages (in temporal order).
getLog :: PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog

-- | Log a message using 'logOutput'.  Note that
-- 'logOutput' is called only if the verbosity
-- level exceeds the level of the message, but
-- the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless
-- of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()

-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem :: PandocMonad m
          => String
          -> m (B.ByteString, Maybe MimeType)

setResourcePath :: PandocMonad m => [FilePath] -> m ()

If we wanted more verbose informational messages during the conversion we defined in the previous section, we could do this:

  result <- runIO $ do
    setVerbosity INFO
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc

Options

The first argument of each reader or writer is for options controlling the behavior of the reader or writer: ReaderOptions for readers and WriterOptions for writers. These are defined in Text.Pandoc.Options. It is a good idea to study these options to see what can be adjusted.

def (from Data.Default) denotes a default value for each kind of option. (You can also use defaultWriterOptions and defaultReaderOptions.) Generally you'll want to use the defaults and modify them only when needed, for example:

    writeRST def{ writerReferenceLinks = True }

Some particularly important options to know about:

writerTemplate: By default, this is Nothing, which means that a document fragment will be produced. If you want a full document, you need to specify Just template, where template is a String containing the template's contents (not the path).
readerExtensions and writerExtensions: These specify the extensions to be used in parsing and rendering. Extensions are defined in Text.Pandoc.Extensions.

Builder

Inlines vs Inline, etc.

Concatenating lists is slow. So we use special types Inlines and Blocks that wrap Sequences of Inline and Block elements.

Monoid - makes it easy to build up docs programatically.

Example. Here’s a JSON data source about CNG fueling stations in the Chicago area: cng_fuel_chicago.json. Boss says: write me a letter in Word listing all the stations that take the Voyager card.

[ {
  "state" : "IL",
  "city" : "Chicago",
  "fuel_type_code" : "CNG",
  "zip" : "60607",
  "station_name" : "Clean Energy - Yellow Cab",
  "cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
  "street_address" : "540 W Grenshaw"
}, ...

No need to open Word for this job! fuel.hs

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
import Control.Applicative
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List (intersperse)

data Station = Station{
    address        :: String
  , name           :: String
  , cardsAccepted  :: [String]
  } deriving Show

instance FromJSON Station where
    parseJSON (Object v) = Station <$>
       v .: "street_address" <*>
       v .: "station_name" <*>
       (words <$> (v .:? "cards_accepted" .!= ""))
    parseJSON _          = mzero

createLetter :: [Station] -> Pandoc
createLetter stations = doc $
    para "Dear Boss:" <>
    para "Here are the CNG stations that accept Voyager cards:" <>
    simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
           (map stationToRow stations) <>
    para "Your loyal servant," <>
    plain (image "JohnHancock.png" "" mempty)
  where
    stationToRow station =
      [ plain (text $ name station)
      , plain (text $ address station)
      , plain (mconcat $ intersperse linebreak $ map text $ cardsAccepted station)
      ]

main :: IO ()
main = do
  json <- BL.readFile "cng_fuel_chicago.json"
  let letter = case decode json of
                    Just stations -> createLetter [s | s <- stations,
                                        "Voyager" `elem` cardsAccepted s]
                    Nothing       -> error "Could not decode JSON"
  BL.writeFile "letter.docx" =<< writeDocx def letter
  putStrLn "Created letter.docx"

Templates and other data files

readDataFile

Handling errors and warnings

Generic transformations

Walk and syb for AST transformations

Filters: see filters.md

but, how do you run filters from a program? need to export these functions from Text.Pandoc.App!

PDF

Creating a front-end

Text.Pandoc.App

9.1 KiB Raw Blame History Unescape Escape