Update filter documentation.
Remove example using pandoc API directly (we have other docs for that and it was outdated). Closes #6065.
parent 9009bda179
commit dfac1239d9

1 changed file with 84 additions and 142 deletions

doc/filters.md

The pandoc AST format is defined in the module
[`Text.Pandoc.Definition`](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).

A "filter" is a program that modifies the AST, between the
reader and the writer.

    INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

Pandoc supports two kinds of filters:

- **Lua filters** use the Lua language to
  define transformations on the pandoc AST. They are
  described in a [separate document](lua-filters.html).

- **JSON filters**, described here, are pipes that read from
  standard input and write to standard output, consuming and
  producing a JSON representation of the pandoc AST:

                 source format
                       ↓
                    (pandoc)
                       ↓
               JSON-formatted AST
                       ↓
                  (JSON filter)
                       ↓
               JSON-formatted AST
                       ↓
                    (pandoc)
                       ↓
                 target format

Lua filters have a couple of advantages. They use a Lua
interpreter that is embedded in pandoc, so you don't need
to have any external software installed. And they are
usually faster than JSON filters. But if you wish to
write your filter in a language other than Lua, you may
prefer to use a JSON filter. JSON filters may be written
in any programming language.

You can use a JSON filter directly in a pipeline:

    pandoc -s input.txt -t json | \
      pandoc-citeproc | \
      pandoc -s -f json -o output.html

But it is more convenient to use the `--filter` option,
which handles the plumbing automatically:

    pandoc -s input.txt --filter pandoc-citeproc -o output.html

For a gentle introduction to writing your own filters,
continue with this guide. There’s also a [list of third party filters
on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).

# A simple example

Suppose you wanted to replace all level 2+ headings in a markdown
document with regular paragraphs, with text in italics. How would you go
about doing this?

You might try a regular expression search and replace, something
like this:

    perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt

This should work most of the time. But don't forget
that ATX style headings can end with a sequence of `#`s
that is not part of the heading text:

    ## My heading ##

And what if your document contains a line starting with `##` in an HTML
comment or delimited code block?

    <!--
    ## this is a comment, not a heading
    -->

    ~~~~
    ### A third level heading in standard markdown
    ~~~~

We don't want to touch *these* lines. Moreover, what about Setext
style second-level headings?

    A heading
    ---------

We need to handle those too. Finally, can we be sure that adding
asterisks to each side of our string will put it in italics?
What if the string already contains asterisks around it? Then we'll
end up with bold text, which is not what we want. And what if it contains
a regular unescaped asterisk?
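(For example, the heading `## *all in italics*` would be turned into
`**all in italics**`, which is bold, not italic.)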

How would you modify your regular expression to handle these cases? It
would be hairy, to say the least.

A better approach is to let pandoc handle the parsing, and
then modify the AST before the document is written. For this,
we can use a filter.

To see what sort of AST is produced when pandoc parses our text,
we can use pandoc's `native` output format:

~~~~
% cat test.txt
## my heading

text with *italics*
% pandoc -s -t native test.txt
Pandoc (Meta {unMeta = fromList []})
[Header 2 ("my-heading",[],[]) [Str "my",Space,Str "heading"]
, Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]] ]
~~~~
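
To see the JSON representation that filters actually consume, you can
use `json` instead of `native` as the output format, for example:

    pandoc -t json test.txt

(The result is a single line of compact JSON, so it is not shown here.)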

For more details on the pandoc AST, see the
[haddock documentation for `Text.Pandoc.Definition`].

[haddock documentation for `Text.Pandoc.Definition`]: https://hackage.haskell.org/package/pandoc-types

We can use Haskell to create a JSON filter that transforms this
AST, replacing each `Header` block of level 2 or higher with a
`Para` whose contents are wrapped in an `Emph` inline:

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- behead.hs
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter behead

behead :: Block -> Block
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
behead x = x
~~~~

The `toJSONFilter` function does two things. First, it lifts
the `behead` function (which maps `Block -> Block`) onto a
transformation of the entire `Pandoc` AST, walking the AST
and transforming each block. Second, it wraps this
`Pandoc -> Pandoc` transformation with the necessary JSON
serialization and deserialization, producing an executable
that consumes JSON from stdin and produces JSON to stdout.

To use the filter, make it executable:

    chmod +x behead.hs

and then

    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead.hs

(It is also necessary that `pandoc-types` be installed in the
local package repository: `cabal install pandoc-types` should
ensure this.)

Alternatively, we could compile the filter:

    ghc --make behead.hs
    pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead
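
`toJSONFilter` is not limited to functions over `Block` elements; it
also accepts functions over other AST types, such as `Inline`. As a
minimal sketch (the name `deemph.hs` is invented for this
illustration, not an example from pandoc's own documentation):

~~~~ {.haskell}
#!/usr/bin/env runhaskell
-- deemph.hs: turn every emphasized span into strong emphasis
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter deemph
  where deemph :: Inline -> Inline
        deemph (Emph xs) = Strong xs
        deemph x         = x
~~~~

It is used the same way as `behead.hs`: make it executable (or
compile it) and pass it to pandoc with `--filter`.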

Note that if the filter is placed in the system PATH, then the
initial `./` is not needed. You can specify multiple instances of
`--filter`: the filters will be applied in sequence.
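
For example, this command (using the compiled `behead` filter from
above together with `pandoc-citeproc`) applies both filters in turn:

    pandoc -s input.txt --filter ./behead --filter pandoc-citeproc -o output.html
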
# LaTeX for WordPress
Another easy example. WordPress blogs require a special format for

Here's our "beheading" filter in python:

    #!/usr/bin/env python

    """
    Pandoc filter to convert all level 2+ headings to paragraphs with
    emphasized text.
    """