--- author: - John MacFarlane date: 'November 18, 2021' title: Creating Custom Pandoc Readers in Lua --- # Introduction If you need to parse a format not already handled by pandoc, you can create a custom reader using the [Lua] language. Pandoc has a built-in Lua interpreter, so you needn't install any additional software to do this. [Lua]: https://www.lua.org A custom reader is a Lua file that defines a function called `Reader`, which takes two arguments: - the raw input to be parsed, as a list of sources - optionally, a table of reader options, e.g. `{ columns = 62, standalone = true }`. The `Reader` function should return a `Pandoc` AST. This can be created using functions in the [`pandoc` module], which is automatically in scope. (Indeed, all of the utility functions that are available for [Lua filters] are available in custom readers, too.) Each source item corresponds to a file or stream passed to pandoc containing its text and name. E.g., if a single file `input.txt` is passed to pandoc, then the list of sources will contain just a single element `s`, where `s.name == 'input.txt'` and `s.text` contains the file contents as a string. The sources list, as well as each of its elements, can be converted to a string via the Lua standard library function `tostring`. [Lua filters]: https://pandoc.org/lua-filters.html [`pandoc` module]: https://pandoc.org/lua-filters.html#module-pandoc A minimal example would be ```lua function Reader(input) return pandoc.Pandoc({ pandoc.CodeBlock(tostring(input)) }) end ``` This just returns a document containing a big code block with all of the input. Or, to create a separate code block for each input file, one might write ``` lua function Reader(input) return pandoc.Pandoc(input:map( function (s) return pandoc.CodeBlock(s.text) end)) end ``` In a nontrivial reader, you'll want to parse the input. You can do this using standard Lua library functions (for example, the [patterns] library), or with the powerful and fast [lpeg] parsing library, which is automatically in scope. You can also use external Lua libraries (for example, an XML parser). A previous pandoc version passed a raw string instead of a list of sources to the Reader function. Reader functions that rely on this are obsolete, but still supported: Pandoc analyzes any script error, detecting when code assumed the old behavior. The code is rerun with raw string input in this case, thereby ensuring backwards compatibility. [patterns]: http://lua-users.org/wiki/PatternsTutorial [lpeg]: http://www.inf.puc-rio.br/~roberto/lpeg/ # Example: plain text reader This is a simple example using [lpeg] to parse the input into space-separated strings and blankline-separated paragraphs. ```lua -- A sample custom reader that just parses text into blankline-separated -- paragraphs with space-separated words. -- For better performance we put these functions in local variables: local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt = lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V, lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt local whitespacechar = S(" \t\r\n") local wordchar = (1 - whitespacechar) local spacechar = S(" \t") local newline = P"\r"^-1 * P"\n" local blanklines = newline * (spacechar^0 * newline)^1 local endline = newline - blanklines -- Grammar G = P{ "Pandoc", Pandoc = Ct(V"Block"^0) / pandoc.Pandoc; Block = blanklines^0 * V"Para" ; Para = Ct(V"Inline"^1) / pandoc.Para; Inline = V"Str" + V"Space" + V"SoftBreak" ; Str = wordchar^1 / pandoc.Str; Space = spacechar^1 / pandoc.Space; SoftBreak = endline / pandoc.SoftBreak; } function Reader(input) return lpeg.match(G, tostring(input)) end ``` Example of use: ``` % pandoc -f plain.lua -t native *Hello there*, this is plain text with no formatting except paragraph breaks. - Like this one. ^D [ Para [ Str "*Hello" , Space , Str "there*," , Space , Str "this" , Space , Str "is" , Space , Str "plain" , Space , Str "text" , Space , Str "with" , Space , Str "no" , Space , Str "formatting" , SoftBreak , Str "except" , Space , Str "paragraph" , Space , Str "breaks." ] , Para [ Str "-" , Space , Str "Like" , Space , Str "this" , Space , Str "one." ] ] ``` # Example: a RIS bibliography reader This is a parser for [RIS bibliography] files. It can be used to convert them to CSL JSON or YAML, BibTeX, or BibLaTeX. [RIS bibliography]: https://en.wikipedia.org/wiki/RIS_(file_format) ```lua -- A sample custom reader for RIS bibliography format -- https://en.wikipedia.org/wiki/RIS_(file_format) -- The references are converted to inline pandoc/CSL YAML -- references in the metadata. local inspect = require"inspect" local types = { ABST = "article", ADVS = "motion-picture", AGGR = "dataset", ANCIENT = "book", ART = "graphic", BILL = "bill", BLOG = "post-weblog", BOOK = "book", CASE = "legal_case", CHAP = "chapter", CHART = "graphic", CLSWK = "book", COMP = "program", CONF = "paper-conference", CPAPER = "paper-conference", CTLG = "catalog", DATA = "dataset", DBASE = "dataset", DICT = "book", EBOOK = "book", ECHAP = "chapter", EDBOOK = "book", EJOUR = "article", WEB = "webpage", ENCYC = "entry-encyclopedia", EQUA = "figure", FIGURE = "figure", GEN = "entry", GOVDOC = "report", GRANT = "report", HEAR = "report", ICOMM = "personal_communication", INPR = "article-journal", JFULL = "article-journal", JOUR = "article-journal", LEGAL = "legal_case", MANSCPT = "manuscript", MAP = "map", MGZN = "article-magazine", MPCT = "motion-picture", MULTI = "webpage", MUSIC = "musical_score", NEWS = "article-newspaper", PAMP = "pamphlet", PAT = "patent", PCOMM = "personal_communication", RPRT = "report", SER = "article", SLIDE = "graphic", SOUND = "musical_score", STAND = "report", STAT = "legislation", THES = "thesis", UNBILL = "bill", UNPB = "unpublished", VIDEO = "graphic" } local function clean(refpairs) local ref = {} for i = 1, #refpairs do local k,v = table.unpack(refpairs[i]) if k == "TY" then ref["type"] = types[v] elseif k == "VL" then ref.volume = v elseif k == "KW" then ref.keyword = v elseif k == "PB" then ref.publisher = v elseif k == "CY" or k == "PP" then ref["publisher-place"] = v elseif k == "SP" then if ref.page then ref.page = v .. ref.page else ref.page = v end elseif k == "EP" then if ref.page then ref.page = ref.page .. "-" .. v else ref.page = "-" .. v end elseif k == "AU" or k == "A1" or k == "A2" or k == "A3" then if ref.author then table.insert(ref.author, v) else ref.author = {v} end elseif k == "TI" or k == "T1" or k == "CT" or (k == "BT" and ref.type == "book") then ref.title = v elseif k == "ET" then ref.edition = v elseif k == "NV" then ref["number-of-volumes"] = v elseif k == "AB" then ref.abstract = v elseif k == "ED" then if ref.editor then table.insert(ref.editor, v) else ref.editor = {v} end elseif k == "JO" or k == "JF" or k == "T2" or (k == "BT" and ref.type ~= "book") then ref["container-title"] = v elseif k == "PY" or k == "Y1" then ref.issued = v elseif k == "IS" then ref.issue = v elseif k == "SN" then ref.ISSN = v elseif k == "L" then ref.lang = v elseif k == "UR" or k == "LK" then ref.URL = v end end return ref end function Reader(input, reader_options) local refs = {} local thisref = {} local ids = {} for line in string.gmatch(tostring(input), "[^\n]*") do key, val = string.match(line, "([A-Z][A-Z0-9]) %- (.*)") if key == "ER" then -- clean up fields local newref = clean(thisref) -- ensure we have an id and if not, create a sensible one if not newref.id then newref.id = "" for _,x in ipairs(newref.author) do newref.id = newref.id .. string.match(pandoc.utils.stringify(x), "%a+") end if newref.issued then newref.id = newref.id .. string.match(newref.issued, "%d+") end if ids[newref.id] then -- add disambiguator if needed newref.id = newref.id .. "-" .. #ids end end table.insert(ids, newref.id) table.insert(refs, newref) thisref = {} elseif key then table.insert(thisref, {key, val}) end end return pandoc.Pandoc({}, pandoc.Meta { references = refs } ) end ``` Example of use: ``` % pandoc -f ris.lua -t bibtex TY - JOUR AU - Shannon, Claude E. PY - 1948 DA - July TI - A Mathematical Theory of Communication T2 - Bell System Technical Journal SP - 379 EP - 423 VL - 27 ER - TY - JOUR T1 - On computable numbers, with an application to the Entscheidungsproblem A1 - Turing, Alan Mathison JO - Proc. of London Mathematical Society VL - 47 IS - 1 SP - 230 EP - 265 Y1 - 1937 ER - ^D @article{Shannon1948, author = {Shannon, Claude E.}, title = {A {Mathematical} {Theory} of {Communication}}, journal = {Bell System Technical Journal}, volume = {27}, pages = {379-423}, year = {1948} } @article{Turing1937, author = {Turing, Alan Mathison}, title = {On Computable Numbers, with an Application to the {Entscheidungsproblem}}, journal = {Proc. of London Mathematical Society}, volume = {47}, number = {1}, pages = {230-265}, year = {1937} } ``` # Example: a wiki Creole reader This is a parser for [Creole common wiki markup]. It uses an [lpeg] grammar. Fun fact: this custom reader is faster than pandoc's built-in creole reader! This shows that high-performance readers can be designed in this way. [Creole common wiki markup]: http://www.wikicreole.org/wiki/CheatSheet ```lua -- A sample custom reader for Creole 1.0 (common wiki markup) -- http://www.wikicreole.org/wiki/CheatSheet -- For better performance we put these functions in local variables: local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt = lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V, lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt local whitespacechar = S(" \t\r\n") local specialchar = S("/*~[]\\{}|") local wordchar = (1 - (whitespacechar + specialchar)) local spacechar = S(" \t") local newline = P"\r"^-1 * P"\n" local blankline = spacechar^0 * newline local endline = newline * #-blankline local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline local cellsep = spacechar^0 * P"|" local function trim(s) return (s:gsub("^%s*(.-)%s*$", "%1")) end local function ListItem(lev, ch) local start if ch == nil then start = S"*#" else start = P(ch) end local subitem = function(c) if lev < 6 then return ListItem(lev + 1, c) else return (1 - 1) -- fails end end local parser = spacechar^0 * start^lev * #(- start) * spacechar^0 * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0) * newline * (Ct(subitem("*")^1) / pandoc.BulletList + Ct(subitem("#")^1) / pandoc.OrderedList + Cc(nil)) / function (ils, sublist) return { pandoc.Plain(ils), sublist } end return parser end -- Grammar G = P{ "Doc", Doc = Ct(V"Block"^0) / pandoc.Pandoc ; Block = blankline^0 * ( V"Header" + V"HorizontalRule" + V"CodeBlock" + V"List" + V"Table" + V"Para") ; Para = Ct(V"Inline"^1) * newline / pandoc.Para ; HorizontalRule = spacechar^0 * P"----" * spacechar^0 * newline / pandoc.HorizontalRule; Header = (P("=")^1 / string.len) * spacechar^1 * Ct((V"Inline" - endequals)^1) * endequals / pandoc.Header; CodeBlock = P"{{{" * blankline * C((1 - (newline * P"}}}"))^0) * newline * P"}}}" / pandoc.CodeBlock; Placeholder = P"<<<" * C(P(1) - P">>>")^0 * P">>>" / function() return pandoc.Div({}) end; List = V"BulletList" + V"OrderedList" ; BulletList = Ct(ListItem(1,'*')^1) / pandoc.BulletList ; OrderedList = Ct(ListItem(1,'#')^1) / pandoc.OrderedList ; Table = (V"TableHeader" + Cc{}) * Ct(V"TableRow"^1) / function(headrow, bodyrows) local numcolumns = #(bodyrows[1]) local aligns = {} local widths = {} for i = 1,numcolumns do aligns[i] = pandoc.AlignDefault widths[i] = 0 end return pandoc.utils.from_simple_table( pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows)) end ; TableHeader = Ct(V"HeaderCell"^1) * cellsep^-1 * spacechar^0 * newline ; TableRow = Ct(V"BodyCell"^1) * cellsep^-1 * spacechar^0 * newline ; HeaderCell = cellsep * P"=" * spacechar^0 * Ct((V"Inline" - (newline + cellsep))^0) / function(ils) return { pandoc.Plain(ils) } end ; BodyCell = cellsep * spacechar^0 * Ct((V"Inline" - (newline + cellsep))^0) / function(ils) return { pandoc.Plain(ils) } end ; Inline = V"Emph" + V"Strong" + V"LineBreak" + V"Link" + V"URL" + V"Image" + V"Str" + V"Space" + V"SoftBreak" + V"Escaped" + V"Placeholder" + V"Code" + V"Special" ; Str = wordchar^1 / pandoc.Str; Escaped = P"~" * C(P(1)) / pandoc.Str ; Special = specialchar / pandoc.Str; Space = spacechar^1 / pandoc.Space ; SoftBreak = endline * # -(V"HorizontalRule" + V"CodeBlock") / pandoc.SoftBreak ; LineBreak = P"\\\\" / pandoc.LineBreak ; Code = P"{{{" * C((1 - P"}}}")^0) * P"}}}" / trim / pandoc.Code ; Link = P"[[" * C((1 - (P"]]" + P"|"))^0) * (P"|" * Ct((V"Inline" - P"]]")^1))^-1 * P"]]" / function(url, desc) local txt = desc or {pandoc.Str(url)} return pandoc.Link(txt, url) end ; Image = P"{{" * #-P"{" * C((1 - (S"}"))^0) * (P"|" * Ct((V"Inline" - P"}}")^1))^-1 * P"}}" / function(url, desc) local txt = desc or "" return pandoc.Image(txt, url) end ; URL = P"http" * P"s"^-1 * P":" * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1 / function(url) return pandoc.Link(pandoc.Str(url), url) end ; Emph = P"//" * Ct((V"Inline" - P"//")^1) * P"//" / pandoc.Emph ; Strong = P"**" * Ct((V"Inline" -P"**")^1) * P"**" / pandoc.Strong ; } function Reader(input, reader_options) return lpeg.match(G, tostring(input)) end ``` Example of use: ``` % pandoc -f creole.lua -t markdown == Wiki Creole You can make things **bold** or //italic// or **//both//** or //**both**//. Character formatting extends across line breaks: **bold, this is still bold. This line deliberately does not end in star-star. Not bold. Character formatting does not cross paragraph boundaries. You can use [[internal links]] or [[http://www.wikicreole.org|external links]], give the link a [[internal links|different]] name. ^D ## Wiki Creole You can make things **bold** or *italic* or ***both*** or ***both***. Character formatting extends across line breaks: \*\*bold, this is still bold. This line deliberately does not end in star-star. Not bold. Character formatting does not cross paragraph boundaries. You can use [internal links](internal links) or [external links](http://www.wikicreole.org), give the link a [different](internal links) name. ``` # Example: parsing JSON from an API This custom reader consumes the JSON output of and produces a document containing the current top articles on the Haskell subreddit. It assumes that the `luajson` library is available. (It can be installed using `luarocks install luajson`---but be sure you are installing it for Lua 5.3, which is the version packaged with pandoc.) ```lua -- consumes the output of https://www.reddit.com/r/haskell.json local json = require'json' -- luajson must be available local function read_inlines(raw) local doc = pandoc.read(raw, "commonmark") return pandoc.utils.blocks_to_inlines(doc.blocks) end local function read_blocks(raw) local doc = pandoc.read(raw, "commonmark") return doc.blocks end function Reader(input) local parsed = json.decode(tostring(input)) local blocks = {} for _,entry in ipairs(parsed.data.children) do local d = entry.data table.insert(blocks, pandoc.Header(2, pandoc.Link(read_inlines(d.title), d.url))) for _,block in ipairs(read_blocks(d.selftext)) do table.insert(blocks, block) end end return pandoc.Pandoc(blocks) end ``` Similar code can be used to consume JSON output from other APIs. Note that the content of the text fields is markdown, so we convert it using `pandoc.read()`. # Example: syntax-highlighted code files This is a reader that puts the content of each input file into a code block, sets the file's extension as the block's class to enable code highlighting, and places the filename as a header above each code block. ``` lua function to_code_block (source) local _, lang = pandoc.path.split_extension(source.name) return pandoc.Div{ pandoc.Header(1, source.name == '' and '' or source.name), pandoc.CodeBlock(source.text, {class=lang}), } end function Reader (input, opts) return pandoc.Pandoc(input:map(to_code_block)) end ``` # Example: "readable HTML" reader This reader uses the command-line program `readable` (install via `npm install -g readability-cli`) to clean out parts of HTML input that have to do with navigation, leaving only the content. ``` lua -- Custom reader for "readable HTML." This pipes HTML content -- through the 'readable' program (npm install -g readability-cli) -- and then calls the HTML reader. In addition, Divs that seem -- to have only a layout function are removed to avoid clutter. function make_readable(source) local result if not pcall(function () local name = source.name if not name:match("http") then name = "file:///" .. name end result = pandoc.pipe("readable", {"--keep-classes","--base",name}, source.text) end) then io.stderr:write("Error running 'readable': do you have it installed?\n") io.stderr:write("npm install -g readability-cli\n") os.exit(1) end return result end local boring_classes = { row = true, page = true, container = true } local boring_attributes = { "role" } local function is_boring_class(cl) return boring_classes[cl] or cl:match("col%-") or cl:match("pull%-") end local function handle_div(el) for i,class in ipairs(el.classes) do if is_boring_class(class) then el.classes[i] = nil end end for i,k in ipairs(boring_attributes) do el.attributes[k] = nil end if el.identifier:match("readability-") then el.identifier = "" end if #el.classes == 0 and #el.attributes == 0 and #el.identifier == 0 then return el.content else return el end end function Reader(sources) local readable = '' for _,source in ipairs(sources) do readable = readable .. make_readable(source) end local doc = pandoc.read(readable, "html", PANDOC_READER_OPTIONS) -- Now remove Divs used only for layout return doc:walk{ Div = handle_div } end ``` Example of use: ``` pandoc -f readable.lua -t markdown https://pandoc.org ``` and compare the output to ``` pandoc -f html -t markdown https://pandoc.org ```