---
author:
- John MacFarlane
date: 'November 18, 2021'
title: Creating Custom Pandoc Readers in Lua
---

# Introduction

If you need to parse a format not already handled by pandoc,
you can create a custom reader using the [Lua] language.
Pandoc has a built-in Lua interpreter, so you needn't
install any additional software to do this.

[Lua]: https://www.lua.org

A custom reader is a Lua file that defines a function
called `Reader`, which takes two arguments:

- a string, the raw input to be parsed
- optionally, a table of reader options, e.g.
  `{ columns = 62, standalone = true }`.

The `Reader` function should return a `Pandoc` AST.
This can be created using functions in the [`pandoc` module],
which is automatically in scope.  (Indeed, all of the utility
functions that are available for [Lua filters] are available
in custom readers, too.)

[Lua filters]: https://pandoc.org/lua-filters.html
[`pandoc` module]: https://pandoc.org/lua-filters.html#module-pandoc

A minimal example would be

```lua
function Reader(input)
  return pandoc.Pandoc({ pandoc.CodeBlock(input) })
end
```

This just returns a document containing a big code block with
all of the input.
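
To illustrate the optional second argument, here is a minimal sketch
(not a definitive implementation) of a reader that reports one of the
options it received.  The helper `describe_options` is hypothetical;
`columns` is one of the option fields mentioned above.  Calling
`tostring` on the input is a defensive assumption, in case the input
is not a plain string in a given pandoc version.

```lua
-- Hypothetical helper: summarize the reader options as a string.
local function describe_options(opts)
  opts = opts or {}  -- the second argument is optional
  local columns = opts.columns or "unset"
  return "columns: " .. tostring(columns)
end

function Reader(input, reader_options)
  return pandoc.Pandoc({
    -- first block: the option summary, as inline code
    pandoc.Para({ pandoc.Code(describe_options(reader_options)) }),
    -- second block: the raw input, unparsed
    pandoc.CodeBlock(tostring(input)),
  })
end
```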

In a nontrivial reader, you'll want to parse the input.
You can do this using standard Lua library functions
(for example, the [patterns] library), or with the powerful
and fast [lpeg] parsing library, which is automatically in scope.
You can also use external Lua libraries (for example,
an XML parser).

[patterns]: http://lua-users.org/wiki/PatternsTutorial
[lpeg]: http://www.inf.puc-rio.br/~roberto/lpeg/
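
As a rough sketch of the first approach, the following reader uses only
standard Lua patterns to split the input into paragraphs and words.
The helper name `split_paragraphs` is an assumption made for this
illustration, and lines within a paragraph are simply joined with
spaces rather than soft breaks.

```lua
-- Split raw text into paragraphs separated by blank lines.
-- Lines belonging to one paragraph are joined with single spaces.
local function split_paragraphs(input)
  local paragraphs, current = {}, {}
  local function flush()
    if #current > 0 then
      table.insert(paragraphs, table.concat(current, " "))
      current = {}
    end
  end
  -- Append a newline so that the last line is also terminated.
  for line in (input .. "\n"):gmatch("(.-)\r?\n") do
    if line:match("^%s*$") then
      flush()  -- a blank line ends the current paragraph
    else
      table.insert(current, line)
    end
  end
  flush()
  return paragraphs
end

function Reader(input)
  local blocks = {}
  for _, text in ipairs(split_paragraphs(tostring(input))) do
    local inlines = {}
    for word in text:gmatch("%S+") do
      if #inlines > 0 then
        table.insert(inlines, pandoc.Space())
      end
      table.insert(inlines, pandoc.Str(word))
    end
    table.insert(blocks, pandoc.Para(inlines))
  end
  return pandoc.Pandoc(blocks)
end
```

Keeping the string handling in a plain Lua function makes it possible
to try it out in a standalone Lua interpreter, without pandoc.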

# Example: plain text reader

This is a simple example using [lpeg] to parse the input
into space-separated strings and blankline-separated paragraphs.

```lua
-- A sample custom reader that just parses text into blankline-separated
-- paragraphs with space-separated words.

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local wordchar = (1 - whitespacechar)
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blanklines = newline * (spacechar^0 * newline)^1
local endline = newline - blanklines

-- Grammar
local G = P{ "Pandoc",
  Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
  Block = blanklines^0 * V"Para" ;
  Para = Ct(V"Inline"^1) / pandoc.Para;
  Inline = V"Str" + V"Space" + V"SoftBreak" ;
  Str = wordchar^1 / pandoc.Str;
  Space = spacechar^1 / pandoc.Space;
  SoftBreak = endline / pandoc.SoftBreak;
}

function Reader(input)
  return lpeg.match(G, input)
end
```

Example of use:

```
% pandoc -f plain.lua -t native
*Hello there*, this is plain text with no formatting
except paragraph breaks.

- Like this one.
^D
[ Para
    [ Str "*Hello"
    , Space
    , Str "there*,"
    , Space
    , Str "this"
    , Space
    , Str "is"
    , Space
    , Str "plain"
    , Space
    , Str "text"
    , Space
    , Str "with"
    , Space
    , Str "no"
    , Space
    , Str "formatting"
    , SoftBreak
    , Str "except"
    , Space
    , Str "paragraph"
    , Space
    , Str "breaks."
    ]
, Para
    [ Str "-"
    , Space
    , Str "Like"
    , Space
    , Str "this"
    , Space
    , Str "one."
    ]
]
```

# Example: a RIS bibliography reader

This is a parser for [RIS bibliography] files.  It can be used
to convert them to CSL JSON or YAML, BibTeX, or BibLaTeX.

[RIS bibliography]: https://en.wikipedia.org/wiki/RIS_(file_format)

```lua
-- A sample custom reader for RIS bibliography format
-- https://en.wikipedia.org/wiki/RIS_(file_format)
-- The references are converted to inline pandoc/CSL YAML
-- references in the metadata.


local types =
  { ABST = "article",
    ADVS = "motion-picture",
    AGGR = "dataset",
    ANCIENT = "book",
    ART = "graphic",
    BILL = "bill",
    BLOG = "post-weblog",
    BOOK = "book",
    CASE = "legal_case",
    CHAP = "chapter",
    CHART = "graphic",
    CLSWK = "book",
    COMP = "program",
    CONF = "paper-conference",
    CPAPER = "paper-conference",
    CTLG = "catalog",
    DATA = "dataset",
    DBASE = "dataset",
    DICT = "book",
    EBOOK = "book",
    ECHAP = "chapter",
    EDBOOK = "book",
    EJOUR = "article",
    ELEC = "webpage",
    ENCYC = "entry-encyclopedia",
    EQUA = "figure",
    FIGURE = "figure",
    GEN = "entry",
    GOVDOC = "report",
    GRANT = "report",
    HEAR = "report",
    ICOMM = "personal_communication",
    INPR = "article-journal",
    JFULL = "article-journal",
    JOUR = "article-journal",
    LEGAL = "legal_case",
    MANSCPT = "manuscript",
    MAP = "map",
    MGZN = "article-magazine",
    MPCT = "motion-picture",
    MULTI = "webpage",
    MUSIC = "musical_score",
    NEWS = "article-newspaper",
    PAMP = "pamphlet",
    PAT = "patent",
    PCOMM = "personal_communication",
    RPRT = "report",
    SER = "article",
    SLIDE = "graphic",
    SOUND = "musical_score",
    STAND = "report",
    STAT = "legislation",
    THES = "thesis",
    UNBILL = "bill",
    UNPB = "unpublished",
    VIDEO = "graphic"
  }

local function clean(refpairs)
  local ref = {}
  for i = 1, #refpairs do
    local k,v = table.unpack(refpairs[i])
    if k == "TY" then
      ref["type"] = types[v]
    elseif k == "VL" then
      ref.volume = v
    elseif k == "KW" then
      ref.keyword = v
    elseif k == "PB" then
      ref.publisher = v
    elseif k == "CY" or k == "PP" then
      ref["publisher-place"] = v
    elseif k == "SP" then
      if ref.page then
        ref.page = v .. ref.page
      else
        ref.page = v
      end
    elseif k == "EP" then
      if ref.page then
        ref.page = ref.page .. "-" .. v
      else
        ref.page = "-" .. v
      end
    elseif k == "AU" or k == "A1" or k == "A2" or k == "A3" then
      if ref.author then
        table.insert(ref.author, v)
      else
        ref.author = {v}
      end
    elseif k == "TI" or k == "T1" or k == "CT" or
            (k == "BT" and ref.type == "book") then
      ref.title = v
    elseif k == "ET" then
      ref.edition = v
    elseif k == "NV" then
      ref["number-of-volumes"] = v
    elseif k == "AB" then
      ref.abstract = v
    elseif k == "ED" then
      if ref.editor then
        table.insert(ref.editor, v)
      else
        ref.editor = {v}
      end
    elseif k == "JO" or k == "JF" or k == "T2" or
             (k == "BT" and ref.type ~= "book") then
      ref["container-title"] = v
    elseif k == "PY" or k == "Y1" then
      ref.issued = v
    elseif k == "IS" then
      ref.issue = v
    elseif k == "SN" then
      ref.ISSN = v
    elseif k == "LA" then
      ref.lang = v
    elseif k == "UR" or k == "LK" then
      ref.URL = v
    end
  end
  return ref
end

function Reader(input, reader_options)
  local refs = {}
  local thisref = {}
  local ids = {}
  for line in string.gmatch(input, "[^\n]*") do
    -- the space after the dash is optional, so a bare "ER  -" also matches
    local key, val = string.match(line, "^([A-Z][A-Z0-9])  %- ?(.*)")
    if key == "ER" then
      -- clean up fields
      local newref = clean(thisref)
      -- ensure we have an id and if not, create a sensible one
      if not newref.id then
        newref.id = ""
        -- a reference may lack authors, or have names without letters
        for _,x in ipairs(newref.author or {}) do
          newref.id = newref.id ..
            (string.match(pandoc.utils.stringify(x), "%a+") or "")
        end
        if newref.issued then
          newref.id = newref.id .. (string.match(newref.issued, "%d+") or "")
        end
        if ids[newref.id] then -- add disambiguator if needed
          newref.id = newref.id .. "-" .. (#refs + 1)
        end
      end
      ids[newref.id] = true -- ids is a set of the ids used so far
      table.insert(refs, newref)
      thisref = {}
    elseif key then
      table.insert(thisref, {key, val})
    end
  end
  return pandoc.Pandoc({}, pandoc.Meta { references = refs } )
end
```
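
The two string-handling steps in this reader, matching a tagged line
and keeping ids unique, can be tried out on their own in plain Lua.
The following is only a sketch: the helper names `parse_ris_line` and
`unique_id` are hypothetical, and the numbering scheme for duplicate
ids is one possible choice, not the one used above.

```lua
-- Split one RIS line into its two-character tag and its value.
-- Returns nil for lines that are not tagged fields.
local function parse_ris_line(line)
  return line:match("^([A-Z][A-Z0-9])  %- ?(.*)$")
end

-- Return an id that is not yet in `seen` (a set keyed by id),
-- appending "-2", "-3", ... to `base` if necessary, and record it.
local function unique_id(seen, base)
  local id, n = base, 1
  while seen[id] do
    n = n + 1
    id = base .. "-" .. n
  end
  seen[id] = true
  return id
end

local seen = {}
print(parse_ris_line("AU  - Shannon, Claude E."))
print(unique_id(seen, "Shannon1948"))
print(unique_id(seen, "Shannon1948"))
```

Using a table keyed by id makes the "have we seen this id?" check a
single lookup, whereas a list of ids would have to be scanned.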

Example of use:

```
% pandoc -f ris.lua -t bibtex
TY  - JOUR
AU  - Shannon, Claude E.
PY  - 1948
DA  - July
TI  - A Mathematical Theory of Communication
T2  - Bell System Technical Journal
SP  - 379
EP  - 423
VL  - 27
ER  -
TY  - JOUR
T1  - On computable numbers, with an application to the Entscheidungsproblem
A1  - Turing, Alan Mathison
JO  - Proc. of London Mathematical Society
VL  - 47
IS  - 1
SP  - 230
EP  - 265
Y1  - 1937
ER  -
^D
@article{Shannon1948,
  author = {Shannon, Claude E.},
  title = {A {Mathematical} {Theory} of {Communication}},
  journal = {Bell System Technical Journal},
  volume = {27},
  pages = {379-423},
  year = {1948}
}
@article{Turing1937,
  author = {Turing, Alan Mathison},
  title = {On Computable Numbers, with an Application to the
    {Entscheidungsproblem}},
  journal = {Proc. of London Mathematical Society},
  volume = {47},
  number = {1},
  pages = {230-265},
  year = {1937}
}
```

# Example: a wiki Creole reader

This is a parser for [Creole common wiki markup].
It uses an [lpeg] grammar.  Fun fact: this custom reader is faster than
pandoc's built-in creole reader! This shows that high-performance
readers can be designed in this way.

[Creole common wiki markup]: http://www.wikicreole.org/wiki/CheatSheet


```lua
-- A sample custom reader for Creole 1.0 (common wiki markup)
-- http://www.wikicreole.org/wiki/CheatSheet

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local specialchar = S("/*~[]\\{}|")
local wordchar = (1 - (whitespacechar + specialchar))
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blankline = spacechar^0 * newline
local endline = newline * #-blankline
local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline
local cellsep = spacechar^0 * P"|"

local function trim(s)
   return (s:gsub("^%s*(.-)%s*$", "%1"))
end

local function ListItem(lev, ch)
  local start
  if ch == nil then
    start = S"*#"
  else
    start = P(ch)
  end
  local subitem = function(c)
    if lev < 6 then
      return ListItem(lev + 1, c)
    else
      return P(false) -- a pattern that always fails
    end
  end
  local parser = spacechar^0
               * start^lev
               * #(- start)
               * spacechar^0
               * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0)
               * newline
               * (Ct(subitem("*")^1) / pandoc.BulletList
                  +
                  Ct(subitem("#")^1) / pandoc.OrderedList
                  +
                  Cc(nil))
               / function (ils, sublist)
                   return { pandoc.Plain(ils), sublist }
                 end
  return parser
end

-- Grammar
local G = P{ "Doc",
  Doc = Ct(V"Block"^0)
      / pandoc.Pandoc ;
  Block = blankline^0
        * ( V"Header"
          + V"HorizontalRule"
          + V"CodeBlock"
          + V"List"
          + V"Table"
          + V"Para") ;
  Para = Ct(V"Inline"^1)
       * newline
       / pandoc.Para ;
  HorizontalRule = spacechar^0
                 * P"----"
                 * spacechar^0
                 * newline
                 / pandoc.HorizontalRule;
  Header = (P("=")^1 / string.len)
         * spacechar^1
         * Ct((V"Inline" - endequals)^1)
         * endequals
         / pandoc.Header;
  CodeBlock = P"{{{"
            * blankline
            * C((1 - (newline * P"}}}"))^0)
            * newline
            * P"}}}"
            / pandoc.CodeBlock;
  Placeholder = P"<<<"
              * C(P(1) - P">>>")^0
              * P">>>"
              / function() return pandoc.Span({}) end;
  List = V"BulletList"
       + V"OrderedList" ;
  BulletList = Ct(ListItem(1,'*')^1)
             / pandoc.BulletList ;
  OrderedList = Ct(ListItem(1,'#')^1)
             / pandoc.OrderedList ;
  Table = (V"TableHeader" + Cc{})
        * Ct(V"TableRow"^1)
        / function(headrow, bodyrows)
            local numcolumns = #(bodyrows[1])
            local aligns = {}
            local widths = {}
            for i = 1,numcolumns do
              aligns[i] = pandoc.AlignDefault
              widths[i] = 0
            end
            return pandoc.utils.from_simple_table(
              pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows))
          end ;
  TableHeader = Ct(V"HeaderCell"^1)
              * cellsep^-1
              * spacechar^0
              * newline ;
  TableRow   = Ct(V"BodyCell"^1)
             * cellsep^-1
             * spacechar^0
             * newline ;
  HeaderCell = cellsep
             * P"="
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  BodyCell   = cellsep
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  Inline = V"Emph"
         + V"Strong"
         + V"LineBreak"
         + V"Link"
         + V"URL"
         + V"Image"
         + V"Str"
         + V"Space"
         + V"SoftBreak"
         + V"Escaped"
         + V"Placeholder"
         + V"Code"
         + V"Special" ;
  Str = wordchar^1
      / pandoc.Str;
  Escaped = P"~"
          * C(P(1))
          / pandoc.Str ;
  Special = specialchar
          / pandoc.Str;
  Space = spacechar^1
        / pandoc.Space ;
  SoftBreak = endline
            * # -(V"HorizontalRule" + V"CodeBlock")
            / pandoc.SoftBreak ;
  LineBreak = P"\\\\"
            / pandoc.LineBreak ;
  Code = P"{{{"
       * C((1 - P"}}}")^0)
       * P"}}}"
       / trim / pandoc.Code ;
  Link = P"[["
       * C((1 - (P"]]" + P"|"))^0)
       * (P"|" * Ct((V"Inline" - P"]]")^1))^-1 * P"]]"
       / function(url, desc)
           local txt = desc or {pandoc.Str(url)}
           return pandoc.Link(txt, url)
         end ;
  Image = P"{{"
        * #-P"{"
        * C((1 - S"}|")^0)
        * (P"|" * Ct((V"Inline" - P"}}")^1))^-1
        * P"}}"
        / function(url, desc)
            local txt = desc or ""
            return pandoc.Image(txt, url)
          end ;
  URL = P"http"
      * P"s"^-1
      * P":"
      * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1
      / function(url)
          return pandoc.Link(pandoc.Str(url), url)
        end ;
  Emph = P"//"
       * Ct((V"Inline" - P"//")^1)
       * P"//"
       / pandoc.Emph ;
  Strong = P"**"
         * Ct((V"Inline" - P"**")^1)
         * P"**"
         / pandoc.Strong ;
}

function Reader(input, reader_options)
  return lpeg.match(G, input)
end
```

Example of use:

```
% pandoc -f creole.lua -t markdown
== Wiki Creole

You can make things **bold** or //italic// or **//both//** or //**both**//.

Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [[internal links]] or [[http://www.wikicreole.org|external links]],
give the link a [[internal links|different]] name.
^D
## Wiki Creole

You can make things **bold** or *italic* or ***both*** or ***both***.

Character formatting extends across line breaks: \*\*bold, this is still
bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [internal links](internal links) or [external
links](http://www.wikicreole.org), give the link a
[different](internal links) name.
```

# Example: parsing JSON from an API

This custom reader consumes the JSON output of
<https://www.reddit.com/r/haskell.json> and produces
a document containing the current top articles on the
Haskell subreddit.

It assumes that the `luajson` library is available.  (It can be
installed using `luarocks install luajson`, but be sure you are
installing it for the Lua version packaged with your pandoc, which
was Lua 5.3 when this was written.)


```lua
-- consumes the output of https://www.reddit.com/r/haskell.json

local json = require'json'  -- luajson must be available

local function read_inlines(raw)
  local doc = pandoc.read(raw, "commonmark")
  return pandoc.utils.blocks_to_inlines(doc.blocks)
end

local function read_blocks(raw)
  local doc = pandoc.read(raw, "commonmark")
  return doc.blocks
end

function Reader(input)

  local parsed = json.decode(input)
  local blocks = {}

  for _,entry in ipairs(parsed.data.children) do
    local d = entry.data
    table.insert(blocks, pandoc.Header(2,
                  pandoc.Link(read_inlines(d.title), d.url)))
    for _,block in ipairs(read_blocks(d.selftext)) do
      table.insert(blocks, block)
    end
  end

  return pandoc.Pandoc(blocks)

end
```

Similar code can be used to consume JSON output from other APIs.

Note that the content of the text fields is Markdown, so we
parse it with `pandoc.read()`, here using the `commonmark` reader.
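
When adapting this reader to another API, it may help to separate
the extraction of fields from the construction of the document.
The sketch below does this for the same Reddit structure; the helper
`extract_entries` and its fallback values are assumptions made for
illustration.  The defaults only cover keys that are absent from the
decoded table; how JSON `null` is represented depends on the JSON
library.

```lua
-- Hypothetical helper: collect the fields we need from the decoded
-- JSON, substituting defaults for missing keys.
local function extract_entries(parsed)
  local entries = {}
  local children = parsed and parsed.data and parsed.data.children or {}
  for _, entry in ipairs(children) do
    local d = entry.data or {}
    table.insert(entries, {
      title = d.title or "(untitled)",
      url = d.url or "",
      body = d.selftext or "",
    })
  end
  return entries
end

function Reader(input)
  local json = require("json")  -- luajson must be available
  local blocks = {}
  for _, e in ipairs(extract_entries(json.decode(tostring(input)))) do
    local title = pandoc.utils.blocks_to_inlines(
      pandoc.read(e.title, "commonmark").blocks)
    table.insert(blocks, pandoc.Header(2, pandoc.Link(title, e.url)))
    for _, block in ipairs(pandoc.read(e.body, "commonmark").blocks) do
      table.insert(blocks, block)
    end
  end
  return pandoc.Pandoc(blocks)
end
```

Only `extract_entries` needs to change when the JSON has a different
shape; the document-building loop can stay the same.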