pandoc/doc/custom-readers.md
2021-11-29 08:59:37 -08:00

16 KiB

author date title
John MacFarlane
November 18, 2021 Creating Custom Pandoc Readers in Lua

Introduction

If you need to parse a format not already handled by pandoc, you can create a custom reader using the Lua language. Pandoc has a built-in Lua interpreter, so you needn't install any additional software to do this.

A custom reader is a Lua file that defines a function called Reader, which takes two arguments:

  • a string, the raw input to be parsed
  • optionally, a table of reader options, e.g. { columns = 62, standalone = true }.

The Reader function should return a Pandoc AST. This can be created using functions in the pandoc module, which is automatically in scope. (Indeed, all of the utility functions that are available for Lua filters are available in custom readers, too.)

A minimal example would be

function Reader(input)
  return pandoc.Pandoc({ pandoc.CodeBlock(input) })
end

This just returns a document containing a big code block with all of the input.

In a nontrivial reader, you'll want to parse the input. You can do this using standard Lua library functions (for example, the patterns library), or with the powerful and fast lpeg parsing library, which is automatically in scope. You can also use external Lua libraries (for example, an XML parser).

Example: plain text reader

This is a simple example using lpeg to parse the input into space-separated strings and blankline-separated paragraphs.

-- A sample custom reader that just parses text into blankline-separated
-- paragraphs with space-separated words.

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local wordchar = (1 - whitespacechar)
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blanklines = newline * (spacechar^0 * newline)^1
local endline = newline - blanklines

-- Grammar
G = P{ "Pandoc",
  Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
  Block = blanklines^0 * V"Para" ;
  Para = Ct(V"Inline"^1) / pandoc.Para;
  Inline = V"Str" + V"Space" + V"SoftBreak" ;
  Str = wordchar^1 / pandoc.Str;
  Space = spacechar^1 / pandoc.Space;
  SoftBreak = endline / pandoc.SoftBreak;
}

function Reader(input)
  return lpeg.match(G, input)
end

Example of use:

% pandoc -f plain.lua -t native
*Hello there*, this is plain text with no formatting
except paragraph breaks.

- Like this one.
^D
[ Para
    [ Str "*Hello"
    , Space
    , Str "there*,"
    , Space
    , Str "this"
    , Space
    , Str "is"
    , Space
    , Str "plain"
    , Space
    , Str "text"
    , Space
    , Str "with"
    , Space
    , Str "no"
    , Space
    , Str "formatting"
    , SoftBreak
    , Str "except"
    , Space
    , Str "paragraph"
    , Space
    , Str "breaks."
    ]
, Para
    [ Str "-"
    , Space
    , Str "Like"
    , Space
    , Str "this"
    , Space
    , Str "one."
    ]
]

Example: a RIS bibliography reader

This is a parser for RIS bibliography files. It can be used to convert them to CSL JSON or YAML, BibTeX, or BibLaTeX.

-- A sample custom reader for RIS bibliography format
-- https://en.wikipedia.org/wiki/RIS_(file_format)
-- The references are converted to inline pandoc/CSL YAML
-- references in the metadata.

local inspect = require"inspect"

local types =
  { ABST = "article",
    ADVS = "motion-picture",
    AGGR = "dataset",
    ANCIENT = "book",
    ART = "graphic",
    BILL = "bill",
    BLOG = "post-weblog",
    BOOK = "book",
    CASE = "legal_case",
    CHAP = "chapter",
    CHART = "graphic",
    CLSWK = "book",
    COMP = "program",
    CONF = "paper-conference",
    CPAPER = "paper-conference",
    CTLG = "catalog",
    DATA = "dataset",
    DBASE = "dataset",
    DICT = "book",
    EBOOK = "book",
    ECHAP = "chapter",
    EDBOOK = "book",
    EJOUR = "article",
    WEB = "webpage",
    ENCYC = "entry-encyclopedia",
    EQUA = "figure",
    FIGURE = "figure",
    GEN = "entry",
    GOVDOC = "report",
    GRANT = "report",
    HEAR = "report",
    ICOMM = "personal_communication",
    INPR = "article-journal",
    JFULL = "article-journal",
    JOUR = "article-journal",
    LEGAL = "legal_case",
    MANSCPT = "manuscript",
    MAP = "map",
    MGZN = "article-magazine",
    MPCT = "motion-picture",
    MULTI = "webpage",
    MUSIC = "musical_score",
    NEWS = "article-newspaper",
    PAMP = "pamphlet",
    PAT = "patent",
    PCOMM = "personal_communication",
    RPRT = "report",
    SER = "article",
    SLIDE = "graphic",
    SOUND = "musical_score",
    STAND = "report",
    STAT = "legislation",
    THES = "thesis",
    UNBILL = "bill",
    UNPB = "unpublished",
    VIDEO = "graphic"
  }

local function clean(refpairs)
  local ref = {}
  for i = 1, #refpairs do
    local k,v = table.unpack(refpairs[i])
    if k == "TY" then
      ref["type"] = types[v]
    elseif k == "VL" then
      ref.volume = v
    elseif k == "KW" then
      ref.keyword = v
    elseif k == "PB" then
      ref.publisher = v
    elseif k == "CY" or k == "PP" then
      ref["publisher-place"] = v
    elseif k == "SP" then
      if ref.page then
        ref.page = v .. ref.page
      else
        ref.page = v
      end
    elseif k == "EP" then
      if ref.page then
        ref.page = ref.page .. "-" .. v
      else
        ref.page = "-" .. v
      end
    elseif k == "AU" or k == "A1" or k == "A2" or k == "A3" then
      if ref.author then
        table.insert(ref.author, v)
      else
        ref.author = {v}
      end
    elseif k == "TI" or k == "T1" or k == "CT" or
            (k == "BT" and ref.type == "book") then
      ref.title = v
    elseif k == "ET" then
      ref.edition = v
    elseif k == "NV" then
      ref["number-of-volumes"] = v
    elseif k == "AB" then
      ref.abstract = v
    elseif k == "ED" then
      if ref.editor then
        table.insert(ref.editor, v)
      else
        ref.editor = {v}
      end
    elseif k == "JO" or k == "JF" or k == "T2" or
             (k == "BT" and ref.type ~= "book") then
      ref["container-title"] = v
    elseif k == "PY" or k == "Y1" then
      ref.issued = v
    elseif k == "IS" then
      ref.issue = v
    elseif k == "SN" then
      ref.ISSN = v
    elseif k == "L" then
      ref.lang = v
    elseif k == "UR" or k == "LK" then
      ref.URL = v
    end
  end
  return ref
end

function Reader(input, reader_options)
  local refs = {}
  local thisref = {}
  local ids = {}
  for line in string.gmatch(input, "[^\n]*") do
    key, val = string.match(line, "([A-Z][A-Z0-9])  %- (.*)")
    if key == "ER" then
      -- clean up fields
      local newref = clean(thisref)
      -- ensure we have an id and if not, create a sensible one
      if not newref.id then
        newref.id = ""
        for _,x in ipairs(newref.author) do
          newref.id = newref.id .. string.match(pandoc.utils.stringify(x), "%a+")
        end
        if newref.issued then
          newref.id = newref.id .. string.match(newref.issued, "%d+")
        end
        if ids[newref.id] then -- add disambiguator if needed
          newref.id = newref.id .. "-" .. #ids
        end
      end
      table.insert(ids, newref.id)
      table.insert(refs, newref)
      thisref = {}
    elseif key then
      table.insert(thisref, {key, val})
    end
  end
  return pandoc.Pandoc({}, pandoc.Meta { references = refs } )
end

Example of use:

% pandoc -f ris.lua -t bibtex
TY  - JOUR
AU  - Shannon, Claude E.
PY  - 1948
DA  - July
TI  - A Mathematical Theory of Communication
T2  - Bell System Technical Journal
SP  - 379
EP  - 423
VL  - 27
ER  - 
TY  - JOUR
T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  - 
^D
@article{Shannon1948,
  author = {Shannon, Claude E.},
  title = {A {Mathematical} {Theory} of {Communication}},
  journal = {Bell System Technical Journal},
  volume = {27},
  pages = {379-423},
  year = {1948}
}
@article{Turing1937,
  author = {Turing, Alan Mathison},
  title = {On Computable Numbers, with an Application to the
    {Entscheidungsproblem}},
  journal = {Proc. of London Mathematical Society},
  volume = {47},
  number = {1},
  pages = {230-265},
  year = {1937}
}

Example: a wiki Creole reader

This is a parser for Creole common wiki markup. It uses an lpeg grammar. Fun fact: this custom reader is faster than pandoc's built-in creole reader! This shows that high-performance readers can be designed in this way.

-- A sample custom reader for Creole 1.0 (common wiki markup)
-- http://www.wikicreole.org/wiki/CheatSheet

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local specialchar = S("/*~[]\\{}|")
local wordchar = (1 - (whitespacechar + specialchar))
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blankline = spacechar^0 * newline
local endline = newline * #-blankline
local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline
local cellsep = spacechar^0 * P"|"

local function trim(s)
   return (s:gsub("^%s*(.-)%s*$", "%1"))
end

local function ListItem(lev, ch)
  local start
  if ch == nil then
    start = S"*#"
  else
    start = P(ch)
  end
  local subitem = function(c)
    if lev < 6 then
      return ListItem(lev + 1, c)
    else
      return (1 - 1) -- fails
    end
  end
  local parser = spacechar^0
               * start^lev
               * #(- start)
               * spacechar^0
               * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0)
               * newline
               * (Ct(subitem("*")^1) / pandoc.BulletList
                  +
                  Ct(subitem("#")^1) / pandoc.OrderedList
                  +
                  Cc(nil))
               / function (ils, sublist)
                   return { pandoc.Plain(ils), sublist }
                 end
  return parser
end

-- Grammar
G = P{ "Doc",
  Doc = Ct(V"Block"^0)
      / pandoc.Pandoc ;
  Block = blankline^0
        * ( V"Header"
          + V"HorizontalRule"
          + V"CodeBlock"
          + V"List"
          + V"Table"
          + V"Para") ;
  Para = Ct(V"Inline"^1)
       * newline
       / pandoc.Para ;
  HorizontalRule = spacechar^0
                 * P"----"
                 * spacechar^0
                 * newline
                 / pandoc.HorizontalRule;
  Header = (P("=")^1 / string.len)
         * spacechar^1
         * Ct((V"Inline" - endequals)^1)
         * endequals
         / pandoc.Header;
  CodeBlock = P"{{{"
            * blankline
            * C((1 - (newline * P"}}}"))^0)
            * newline
            * P"}}}"
            / pandoc.CodeBlock;
  Placeholder = P"<<<"
              * C(P(1) - P">>>")^0
              * P">>>"
              / function() return pandoc.Div({}) end;
  List = V"BulletList"
       + V"OrderedList" ;
  BulletList = Ct(ListItem(1,'*')^1)
             / pandoc.BulletList ;
  OrderedList = Ct(ListItem(1,'#')^1)
             / pandoc.OrderedList ;
  Table = (V"TableHeader" + Cc{})
        * Ct(V"TableRow"^1)
        / function(headrow, bodyrows)
            local numcolumns = #(bodyrows[1])
            local aligns = {}
            local widths = {}
            for i = 1,numcolumns do
              aligns[i] = pandoc.AlignDefault
              widths[i] = 0
            end
            return pandoc.utils.from_simple_table(
              pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows))
          end ;
  TableHeader = Ct(V"HeaderCell"^1)
              * cellsep^-1
              * spacechar^0
              * newline ;
  TableRow   = Ct(V"BodyCell"^1)
             * cellsep^-1
             * spacechar^0
             * newline ;
  HeaderCell = cellsep
             * P"="
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  BodyCell   = cellsep
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  Inline = V"Emph"
         + V"Strong"
         + V"LineBreak"
         + V"Link"
         + V"URL"
         + V"Image"
         + V"Str"
         + V"Space"
         + V"SoftBreak"
         + V"Escaped"
         + V"Placeholder"
         + V"Code"
         + V"Special" ;
  Str = wordchar^1
      / pandoc.Str;
  Escaped = P"~"
          * C(P(1))
          / pandoc.Str ;
  Special = specialchar
          / pandoc.Str;
  Space = spacechar^1
        / pandoc.Space ;
  SoftBreak = endline
            * # -(V"HorizontalRule" + V"CodeBlock")
            / pandoc.SoftBreak ;
  LineBreak = P"\\\\"
            / pandoc.LineBreak ;
  Code = P"{{{"
       * C((1 - P"}}}")^0)
       * P"}}}"
       / trim / pandoc.Code ;
  Link = P"[["
       * C((1 - (P"]]" + P"|"))^0)
       * (P"|" * Ct((V"Inline" - P"]]")^1))^-1 * P"]]"
       / function(url, desc)
           local txt = desc or {pandoc.Str(url)}
           return pandoc.Link(txt, url)
         end ;
  Image = P"{{"
        * #-P"{"
        * C((1 - (S"}"))^0)
        * (P"|" * Ct((V"Inline" - P"}}")^1))^-1
        * P"}}"
        / function(url, desc)
            local txt = desc or ""
            return pandoc.Image(txt, url)
          end ;
  URL = P"http"
      * P"s"^-1
      * P":"
      * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1
      / function(url)
          return pandoc.Link(pandoc.Str(url), url)
        end ;
  Emph = P"//"
       * Ct((V"Inline" - P"//")^1)
       * P"//"
       / pandoc.Emph ;
  Strong = P"**"
         * Ct((V"Inline" -P"**")^1)
         * P"**"
         / pandoc.Strong ;
}

function Reader(input, reader_options)
  return lpeg.match(G, input)
end

Example of use:

% pandoc -f creole.lua -t markdown
== Wiki Creole

You can make things **bold** or //italic// or **//both//** or //**both**//.

Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [[internal links]] or [[http://www.wikicreole.org|external links]],
give the link a [[internal links|different]] name.
^D
## Wiki Creole

You can make things **bold** or *italic* or ***both*** or ***both***.

Character formatting extends across line breaks: \*\*bold, this is still
bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [internal links](internal links) or [external
links](http://www.wikicreole.org), give the link a
[different](internal links) name.

Example: parsing JSON from an API

This custom reader consumes the JSON output of https://www.reddit.com/r/haskell.json and produces a document containing the current top articles on the Haskell subreddit.

It assumes that the luajson library is available. (It can be installed using luarocks install luajson---but be sure you are installing it for Lua 5.3, which is the version packaged with pandoc.)

-- consumes the output of https://www.reddit.com/r/haskell.json

local json = require'json'  -- luajson must be available

local function read_inlines(raw)
  local doc = pandoc.read(raw, "commonmark")
  return pandoc.utils.blocks_to_inlines(doc.blocks)
end

local function read_blocks(raw)
  local doc = pandoc.read(raw, "commonmark")
  return doc.blocks
end

function Reader(input)

  local parsed = json.decode(input)
  local blocks = {}

  for _,entry in ipairs(parsed.data.children) do
    local d = entry.data
    table.insert(blocks, pandoc.Header(2,
                  pandoc.Link(read_inlines(d.title), d.url)))
    for _,block in ipairs(read_blocks(d.selftext)) do
      table.insert(blocks, block)
    end
  end

  return pandoc.Pandoc(blocks)

end

Similar code can be used to consume JSON output from other APIs.

Note that the content of the text fields is markdown, so we convert it using pandoc.read().