Update filter documentation.

Remove example using pandoc API directly (we have other
docs for that and it was outdated).

Closes #6065.
John MacFarlane 2020-01-14 11:18:24 -08:00
parent 9009bda179
commit dfac1239d9


The pandoc AST format is defined in the module
[`Text.Pandoc.Definition`](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).
A "filter" is a program that modifies the AST, between the
reader and the writer.

    INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Pandoc supports two kinds of filters:

- **Lua filters** use the Lua language to
  define transformations on the pandoc AST. They are
  described in a [separate document](lua-filters.html).

- **JSON filters**, described here, are pipes that read from
  standard input and write to standard output, consuming and
  producing a JSON representation of the pandoc AST:

                     source format
                           ↓
                       (pandoc)
                           ↓
                 JSON-formatted AST
                           ↓
                     (JSON filter)
                           ↓
                 JSON-formatted AST
                           ↓
                       (pandoc)
                           ↓
                     target format

Lua filters have a couple of advantages. They use a Lua
interpreter that is embedded in pandoc, so you don't need
to have any external software installed. And they are
usually faster than JSON filters. But if you wish to
write your filter in a language other than Lua, you may
prefer to use a JSON filter. JSON filters may be written
in any programming language.

You can use a JSON filter directly in a pipeline:

    pandoc -s input.txt -t json | \
    pandoc-citeproc | \
    pandoc -s -f json -o output.html

But it is more convenient to use the `--filter` option,
which handles the plumbing automatically:

    pandoc -s input.txt --filter pandoc-citeproc -o output.html
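
If you are curious what the JSON serialization looks like, you can
inspect it directly. A quick peek (output reformatted across two
lines here; the exact field order and `pandoc-api-version` depend
on your pandoc version):

    % echo '*hello*' | pandoc -t json
    {"blocks":[{"t":"Para","c":[{"t":"Emph","c":[{"t":"Str","c":"hello"}]}]}],
     "pandoc-api-version":[1,20],"meta":{}}

Each AST node is encoded as an object with a type tag `t` and,
when the node has contents, a `c` field.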

For a gentle introduction to writing your own filters,
continue with this guide. There's also a [list of third-party
filters on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).

# A simple example

Suppose you wanted to replace all level 2+ headings in a markdown
document with regular paragraphs, with text in italics. How would you go
about doing this?

A first thought would be to use regular expressions. Something
like this:

    perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt

This should work most of the time. But don't forget
that ATX style headings can end with a sequence of `#`s
that is not part of the heading text:

    ## My heading ##

And what if your document contains a line starting with `##` in an HTML
comment or delimited code block?

    <!--
    ## a comment, not a heading
    -->

~~~~
### A third level heading in standard markdown
~~~~

We don't want to touch *these* lines. Moreover, what about Setext
style second-level headings?

    A heading
    ---------

We need to handle those too. Finally, can we be sure that adding
asterisks to each side of our string will put it in italics?
What if the string already contains asterisks around it? Then we'll
end up with bold text, which is not what we want. And what if it
contains a regular unescaped asterisk?

How would you modify your regular expression to handle these cases?
It would be hairy, to say the least.

A better approach is to let pandoc handle the parsing, and
then modify the AST before the document is written. For this,
we can use a filter.

To see what sort of AST is produced when pandoc parses our text,
we can use pandoc's `native` output format:

~~~~
% cat test.txt
## my heading

text with *italics*
% pandoc -s -t native test.txt
Pandoc (Meta {unMeta = fromList []})
[Header 2 ("my-heading",[],[]) [Str "my",Space,Str "heading"]
,Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]]]
~~~~
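
The `native` format can also be read back by pandoc, which makes it
handy for experimenting with the AST by hand. For instance, if the
native output above were saved as `test.native` (a hypothetical
file name):

    % pandoc -f native -t html test.native
    <h2 id="my-heading">my heading</h2>
    <p>text with <em>italics</em></p>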

For a full description of
the pandoc AST, see the [haddock documentation for `Text.Pandoc.Definition`].

[haddock documentation for `Text.Pandoc.Definition`]: https://hackage.haskell.org/package/pandoc-types

We can use Haskell to create a JSON filter that transforms this
AST, replacing each `Header` block of level 2 or higher with a `Para`
whose contents are wrapped inside an `Emph` inline:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- behead.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter behead

behead :: Block -> Block
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
behead x = x
~~~~

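Applied to the document above, this filter leaves the `Para` block
alone and rewrites the `Header`; in native terms, the first block
becomes:

    Para [Emph [Str "my",Space,Str "heading"]]
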

The `toJSONFilter` function does two things. First, it lifts
the `behead` function (which maps `Block -> Block`) onto a
transformation of the entire `Pandoc` AST, walking the AST
and transforming each block. Second, it wraps this `Pandoc ->
Pandoc` transformation with the necessary JSON serialization
and deserialization, producing an executable that consumes
JSON from stdin and produces JSON to stdout.
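
`toJSONFilter` is flexible about the function you give it: it also
accepts functions over other AST types such as `Inline`, and
functions that take the target format as an extra first argument
(pandoc passes the format name when a filter is run via `--filter`).
Here is a minimal sketch of a format-aware filter; `strengthen.hs`
is a hypothetical example, not part of the pandoc docs:

~~~~ {.haskell}
-- strengthen.hs (illustrative sketch): turn emphasis into strong
-- emphasis, but only when the target format is HTML.
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter strengthen

strengthen :: Maybe Format -> Inline -> Inline
strengthen (Just (Format "html")) (Emph xs) = Strong xs
strengthen _ x = x
~~~~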

To use the filter, make it executable:

    chmod +x behead.hs

and then

    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead.hs

(It is also necessary that `pandoc-types` be installed in the
local package repository: `cabal install pandoc-types` should
ensure this.)

Alternatively, we could compile the filter:

    ghc --make behead.hs
    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead

Note that if the filter is placed in the system PATH, then the
initial `./` is not needed. You can use multiple instances of
`--filter`: the filters will be applied in sequence.

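For example (an illustrative command, assuming both filters are
available):

    pandoc input.md --filter ./behead --filter pandoc-citeproc -o output.html

Here `behead` runs first, and `pandoc-citeproc` then processes the
AST that `behead` produced.
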
# LaTeX for WordPress

Another easy example. WordPress blogs require a special format for
displaying mathematical formulas (`$latex ...$`).

Here's our "beheading" filter in python:

~~~~ {.python}
#!/usr/bin/env python

"""
Pandoc filter to convert all level 2+ headings to paragraphs with
emphasized text.
"""

from pandocfilters import toJSONFilter, Emph, Para

def behead(key, value, format, meta):
    if key == 'Header' and value[0] >= 2:
        return Para([Emph(value[2])])

if __name__ == "__main__":
    toJSONFilter(behead)
~~~~