2017-09-17 08:00:20 +02:00
|
|
|
|
% Pandoc filters
|
|
|
|
|
% John MacFarlane
|
|
|
|
|
|
|
|
|
|
# Summary
|
|
|
|
|
|
|
|
|
|
Pandoc provides an interface for users to write programs (known
|
|
|
|
|
as filters) which act on pandoc’s AST.
|
|
|
|
|
|
|
|
|
|
Pandoc consists of a set of readers and writers. When converting
|
|
|
|
|
a document from one format to another, text is parsed by a
|
|
|
|
|
reader into pandoc’s intermediate representation of the
|
|
|
|
|
document---an "abstract syntax tree" or AST---which is then
|
|
|
|
|
converted by the writer into the target format.
|
|
|
|
|
The pandoc AST format is defined in the module
|
2018-10-16 18:10:34 +02:00
|
|
|
|
[`Text.Pandoc.Definition` in the `pandoc-types` package
|
|
|
|
|
](https://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html).
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
A "filter" is a program that modifies the AST, between the
|
2020-01-14 20:18:24 +01:00
|
|
|
|
reader and the writer.
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
INPUT --reader--> AST --filter--> AST --writer--> OUTPUT
|
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
Pandoc supports two kinds of filters:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
- **Lua filters** use the Lua language to
|
|
|
|
|
define transformations on the pandoc AST. They are
|
|
|
|
|
described in a [separate document](lua-filters.html).
|
|
|
|
|
|
|
|
|
|
- **JSON filters**, described here, are pipes that read from
|
|
|
|
|
standard input and write to standard output, consuming and
|
|
|
|
|
producing a JSON representation of the pandoc AST:
|
|
|
|
|
|
|
|
|
|
source format
|
|
|
|
|
↓
|
|
|
|
|
(pandoc)
|
|
|
|
|
↓
|
|
|
|
|
JSON-formatted AST
|
|
|
|
|
↓
|
|
|
|
|
(JSON filter)
|
|
|
|
|
↓
|
|
|
|
|
JSON-formatted AST
|
|
|
|
|
↓
|
|
|
|
|
(pandoc)
|
|
|
|
|
↓
|
|
|
|
|
target format
|
|
|
|
|
|
|
|
|
|
Lua filters have a couple of advantages. They use a Lua
|
|
|
|
|
interpreter that is embedded in pandoc, so you don't need
|
|
|
|
|
to have any external software installed. And they are
|
|
|
|
|
usually faster than JSON filters. But if you wish to
|
|
|
|
|
write your filter in a language other than Lua, you may
|
|
|
|
|
prefer to use a JSON filter. JSON filters may be written
|
|
|
|
|
in any programming language.
|
|
|
|
|
|
|
|
|
|
You can use a JSON filter directly in a pipeline:
|
|
|
|
|
|
|
|
|
|
pandoc -s input.txt -t json | \
|
|
|
|
|
pandoc-citeproc | \
|
|
|
|
|
pandoc -s -f json -o output.html
|
|
|
|
|
|
|
|
|
|
But it is more convenient to use the `--filter` option,
|
|
|
|
|
which handles the plumbing automatically:
|
|
|
|
|
|
|
|
|
|
pandoc -s input.txt --filter pandoc-citeproc -o output.html
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
For a gentle introduction into writing your own filters,
|
|
|
|
|
continue this guide. There’s also a [list of third party filters
|
|
|
|
|
on the wiki](https://github.com/jgm/pandoc/wiki/Pandoc-Filters).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# A simple example
|
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
Suppose you wanted to replace all level 2+ headings in a markdown
|
2017-09-17 08:00:20 +02:00
|
|
|
|
document with regular paragraphs, with text in italics. How would you go
|
|
|
|
|
about doing this?
|
|
|
|
|
|
|
|
|
|
A first thought would be to use regular expressions. Something
|
|
|
|
|
like this:
|
|
|
|
|
|
|
|
|
|
perl -pe 's/^##+ (.*)$/\*\1\*/' source.txt
|
|
|
|
|
|
|
|
|
|
This should work most of the time. But don't forget
|
2020-01-14 20:18:24 +01:00
|
|
|
|
that ATX style headings can end with a sequence of `#`s
|
|
|
|
|
that is not part of the heading text:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
## My heading ##
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
And what if your document contains a line starting with `##` in an HTML
|
|
|
|
|
comment or delimited code block?
|
|
|
|
|
|
|
|
|
|
<!--
|
|
|
|
|
## This is just a comment
|
|
|
|
|
-->
|
|
|
|
|
|
|
|
|
|
~~~~
|
2020-01-14 20:18:24 +01:00
|
|
|
|
### A third level heading in standard markdown
|
2017-09-17 08:00:20 +02:00
|
|
|
|
~~~~
|
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
We don't want to touch *these* lines. Moreover, what about Setext
|
|
|
|
|
style second-level heading?
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
A heading
|
|
|
|
|
---------
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
We need to handle those too. Finally, can we be sure that adding
|
|
|
|
|
asterisks to each side of our string will put it in italics?
|
|
|
|
|
What if the string already contains asterisks around it? Then we'll
|
|
|
|
|
end up with bold text, which is not what we want. And what if it contains
|
|
|
|
|
a regular unescaped asterisk?
|
|
|
|
|
|
|
|
|
|
How would you modify your regular expression to handle these cases? It
|
2020-01-14 20:18:24 +01:00
|
|
|
|
would be hairy, to say the least.
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
A better approach is to let pandoc handle the parsing, and
|
|
|
|
|
then modify the AST before the document is written. For this,
|
|
|
|
|
we can use a filter.
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
To see what sort of AST is produced when pandoc parses our text,
|
|
|
|
|
we can use pandoc's `native` output format:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
~~~~
|
|
|
|
|
% cat test.txt
|
2020-01-14 20:18:24 +01:00
|
|
|
|
## my heading
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
text with *italics*
|
|
|
|
|
% pandoc -s -t native test.txt
|
|
|
|
|
Pandoc (Meta {unMeta = fromList []})
|
2020-01-14 20:18:24 +01:00
|
|
|
|
[Header 2 ("my-heading",[],[]) [Str "My",Space,Str "heading"]
|
2017-09-17 08:00:20 +02:00
|
|
|
|
, Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]] ]
|
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
|
|
A `Pandoc` document consists of a `Meta` block (containing
|
|
|
|
|
metadata like title, authors, and date) and a list of `Block`
|
|
|
|
|
elements. In this case, we have two `Block`s, a `Header` and a `Para`.
|
|
|
|
|
Each has as its content a list of `Inline` elements. For more details on
|
|
|
|
|
the pandoc AST, see the [haddock documentation for `Text.Pandoc.Definition`].
|
|
|
|
|
|
2019-05-02 02:09:36 +02:00
|
|
|
|
[haddock documentation for `Text.Pandoc.Definition`]: https://hackage.haskell.org/package/pandoc-types
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
We can use Haskell to create a JSON filter that transforms this
|
|
|
|
|
AST, replacing each `Header` block with level >= 2 with a `Para`
|
|
|
|
|
with its contents wrapped inside an `Emph` inline:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
~~~~ {.haskell}
|
|
|
|
|
#!/usr/bin/env runhaskell
|
2020-01-14 20:18:24 +01:00
|
|
|
|
-- behead.hs
|
2017-09-17 08:00:20 +02:00
|
|
|
|
import Text.Pandoc.JSON
|
|
|
|
|
|
|
|
|
|
main :: IO ()
|
|
|
|
|
main = toJSONFilter behead
|
2020-01-14 20:18:24 +01:00
|
|
|
|
|
|
|
|
|
behead :: Block -> Block
|
|
|
|
|
behead (Header n _ xs) | n >= 2 = Para [Emph xs]
|
|
|
|
|
behead x = x
|
2017-09-17 08:00:20 +02:00
|
|
|
|
~~~~
|
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
The `toJSONFilter` function does two things. First, it lifts
|
|
|
|
|
the `behead` function (which maps `Block -> Block`) onto a
|
|
|
|
|
transformation of the entire `Pandoc` AST, walking the AST
|
|
|
|
|
and transforming each block. Second, it wraps this `Pandoc ->
|
|
|
|
|
Pandoc` transformation with the necessary JSON serialization
|
|
|
|
|
and deserialization, producing an executable that consumes
|
|
|
|
|
JSON from stdin and produces JSON to stdout.
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
To use the filter, make it executable:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
chmod +x behead.hs
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
and then
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead.hs
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
(It is also necessary that `pandoc-types` be installed in the
|
2020-01-15 01:31:09 +01:00
|
|
|
|
local package repository. To do this using cabal-install,
|
|
|
|
|
`cabal v2-update && cabal v2-install --lib pandoc-types`.)
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
Alternatively, we could compile the filter:
|
|
|
|
|
|
2020-01-15 01:31:09 +01:00
|
|
|
|
ghc -package-env=default --make behead.hs
|
2017-09-17 08:00:20 +02:00
|
|
|
|
pandoc -f SOURCEFORMAT -t TARGETFORMAT --filter ./behead
|
|
|
|
|
|
|
|
|
|
Note that if the filter is placed in the system PATH, then the initial
|
|
|
|
|
`./` is not needed. Note also that the command line can include
|
|
|
|
|
multiple instances of `--filter`: the filters will be applied in
|
|
|
|
|
sequence.
|
|
|
|
|
|
2020-01-14 20:18:24 +01:00
|
|
|
|
|
2017-09-17 08:00:20 +02:00
|
|
|
|
# LaTeX for WordPress
|
|
|
|
|
|
|
|
|
|
Another easy example. WordPress blogs require a special format for
|
|
|
|
|
LaTeX math. Instead of `$e=mc^2$`, you need: `$LaTeX e=mc^2$`.
|
|
|
|
|
How can we convert a markdown document accordingly?
|
|
|
|
|
|
|
|
|
|
Again, it's difficult to do the job reliably with regexes.
|
|
|
|
|
A `$` might be a regular currency indicator, or it might occur in
|
|
|
|
|
a comment or code block or inline code span. We just want to find
|
|
|
|
|
the `$`s that begin LaTeX math. If only we had a parser...
|
|
|
|
|
|
|
|
|
|
We do. Pandoc already extracts LaTeX math, so:
|
|
|
|
|
|
|
|
|
|
~~~~ {.haskell}
|
|
|
|
|
#!/usr/bin/env runhaskell
|
|
|
|
|
-- wordpressify.hs
|
|
|
|
|
import Text.Pandoc.JSON
|
|
|
|
|
|
|
|
|
|
main = toJSONFilter wordpressify
|
|
|
|
|
where wordpressify (Math x y) = Math x ("LaTeX " ++ y)
|
|
|
|
|
wordpressify x = x
|
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
|
|
Mission accomplished. (I've omitted type signatures here,
|
|
|
|
|
just to show it can be done.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# But I don't want to learn Haskell!
|
|
|
|
|
|
|
|
|
|
While it's easiest to write pandoc filters in Haskell, it is fairly
|
|
|
|
|
easy to write them in python using the `pandocfilters` package.
|
|
|
|
|
The package is in PyPI and can be installed using `pip install
|
|
|
|
|
pandocfilters` or `easy_install pandocfilters`.
|
|
|
|
|
|
|
|
|
|
Here's our "beheading" filter in python:
|
|
|
|
|
|
|
|
|
|
~~~ {.python}
|
|
|
|
|
#!/usr/bin/env python
|
|
|
|
|
|
|
|
|
|
"""
|
2020-01-14 20:18:24 +01:00
|
|
|
|
Pandoc filter to convert all level 2+ headings to paragraphs with
|
2017-09-17 08:00:20 +02:00
|
|
|
|
emphasized text.
|
|
|
|
|
"""
|
|
|
|
|
|
|
|
|
|
from pandocfilters import toJSONFilter, Emph, Para
|
|
|
|
|
|
|
|
|
|
def behead(key, value, format, meta):
|
|
|
|
|
if key == 'Header' and value[0] >= 2:
|
|
|
|
|
return Para([Emph(value[2])])
|
|
|
|
|
|
|
|
|
|
if __name__ == "__main__":
|
|
|
|
|
toJSONFilter(behead)
|
|
|
|
|
~~~
|
|
|
|
|
|
|
|
|
|
`toJSONFilter(behead)` walks the AST and applies the `behead` action
|
|
|
|
|
to each element. If `behead` returns nothing, the node is unchanged;
|
|
|
|
|
if it returns an object, the node is replaced; if it returns a list,
|
|
|
|
|
the new list is spliced in.
|
|
|
|
|
|
|
|
|
|
Note that, although these parameters are not used in this example,
|
|
|
|
|
`format` provides access to the target format, and `meta` provides access to
|
|
|
|
|
the document's metadata.
|
|
|
|
|
|
|
|
|
|
There are many examples of python filters in [the pandocfilters
|
2019-05-02 02:09:36 +02:00
|
|
|
|
repository](https://github.com/jgm/pandocfilters).
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
For a more Pythonic alternative to pandocfilters, see
|
2019-05-02 02:09:36 +02:00
|
|
|
|
the [panflute](https://pypi.org/project/panflute) library.
|
2020-12-02 21:28:38 +01:00
|
|
|
|
Don't like Python? There are also ports of pandocfilters in
|
|
|
|
|
|
|
|
|
|
- [PHP](https://github.com/vinai/pandocfilters-php),
|
|
|
|
|
- [perl](https://metacpan.org/pod/Pandoc::Filter),
|
|
|
|
|
- TypeScript/JavaScript via Node.js
|
|
|
|
|
- [pandoc-filter](https://github.com/mvhenderson/pandoc-filter-node),
|
|
|
|
|
- [node-pandoc-filter](https://github.com/mu-io/node-pandoc-filter),
|
|
|
|
|
- [Groovy](https://github.com/dfrommi/groovy-pandoc), and
|
|
|
|
|
- [Ruby](https://heerdebeer.org/Software/markdown/paru/).
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
Starting with pandoc 2.0, pandoc includes built-in support for
|
|
|
|
|
writing filters in lua. The lua interpreter is built in to
|
|
|
|
|
pandoc, so a lua filter does not require any additional software
|
|
|
|
|
to run. See the [documentation on lua
|
2019-05-02 02:09:36 +02:00
|
|
|
|
filters](https://pandoc.org/lua-filters.html).
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
# Include files
|
|
|
|
|
|
|
|
|
|
So none of our transforms have involved IO. How about a script that
|
|
|
|
|
reads a markdown document, finds all the inline code blocks with
|
|
|
|
|
attribute `include`, and replaces their contents with the contents of
|
|
|
|
|
the file given?
|
|
|
|
|
|
|
|
|
|
~~~~ {.haskell}
|
|
|
|
|
#!/usr/bin/env runhaskell
|
|
|
|
|
-- includes.hs
|
|
|
|
|
import Text.Pandoc.JSON
|
2020-03-15 17:59:44 +01:00
|
|
|
|
import qualified Data.Text.IO as TIO
|
|
|
|
|
import qualified Data.Text as T
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
doInclude :: Block -> IO Block
|
|
|
|
|
doInclude cb@(CodeBlock (id, classes, namevals) contents) =
|
|
|
|
|
case lookup "include" namevals of
|
2020-03-15 17:59:44 +01:00
|
|
|
|
Just f -> CodeBlock (id, classes, namevals) <$>
|
|
|
|
|
TIO.readFile (T.unpack f)
|
2017-09-17 08:00:20 +02:00
|
|
|
|
Nothing -> return cb
|
|
|
|
|
doInclude x = return x
|
|
|
|
|
|
|
|
|
|
main :: IO ()
|
|
|
|
|
main = toJSONFilter doInclude
|
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
|
|
Try this on the following:
|
|
|
|
|
|
|
|
|
|
Here's the pandoc README:
|
|
|
|
|
|
|
|
|
|
~~~~ {include="README"}
|
|
|
|
|
this will be replaced by contents of README
|
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
|
|
# Removing links
|
|
|
|
|
|
|
|
|
|
What if we want to remove every link from a document, retaining
|
|
|
|
|
the link's text?
|
|
|
|
|
|
|
|
|
|
~~~~ {.haskell}
|
|
|
|
|
#!/usr/bin/env runhaskell
|
|
|
|
|
-- delink.hs
|
|
|
|
|
import Text.Pandoc.JSON
|
|
|
|
|
|
|
|
|
|
main = toJSONFilter delink
|
|
|
|
|
|
|
|
|
|
delink :: Inline -> [Inline]
|
|
|
|
|
delink (Link _ txt _) = txt
|
|
|
|
|
delink x = [x]
|
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
|
|
Note that `delink` can't be a function of type `Inline -> Inline`,
|
|
|
|
|
because the thing we want to replace the link with is not a single
|
|
|
|
|
`Inline` element, but a list of them. So we make `delink` a function
|
|
|
|
|
from an `Inline` element to a list of `Inline` elements.
|
|
|
|
|
`toJSONFilter` can still lift this function to a transformation of type
|
|
|
|
|
`Pandoc -> Pandoc`.
|
|
|
|
|
|
|
|
|
|
# A filter for ruby text
|
|
|
|
|
|
|
|
|
|
Finally, here's a nice real-world example, developed on the
|
2019-05-02 02:09:36 +02:00
|
|
|
|
[pandoc-discuss](https://groups.google.com/group/pandoc-discuss/browse_thread/thread/7baea325565878c8) list. Qubyte wrote:
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
> I'm interested in using pandoc to turn my markdown notes on Japanese
|
|
|
|
|
> into nicely set HTML and (Xe)LaTeX. With HTML5, ruby (typically used to
|
|
|
|
|
> phonetically read chinese characters by placing text above or to the
|
|
|
|
|
> side) is standard, and support from browsers is emerging (Webkit based
|
|
|
|
|
> browsers appear to fully support it). For those browsers that don't
|
|
|
|
|
> support it yet (notably Firefox) the feature falls back in a nice way
|
|
|
|
|
> by placing the phonetic reading inside brackets to the side of each
|
|
|
|
|
> Chinese character, which is suitable for other output formats too. As
|
|
|
|
|
> for (Xe)LaTeX, ruby is not an issue.
|
|
|
|
|
>
|
|
|
|
|
> At the moment, I use inline HTML to achieve the result when the
|
|
|
|
|
> conversion is to HTML, but it's ugly and uses a lot of keystrokes, for
|
|
|
|
|
> example
|
|
|
|
|
>
|
|
|
|
|
> ~~~ {.xml}
|
|
|
|
|
> <ruby>ご<rt></rt>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby>
|
|
|
|
|
> ~~~
|
|
|
|
|
>
|
|
|
|
|
> sets ご飯 "gohan" with "han" spelt phonetically above the second
|
|
|
|
|
> character, or to the right of it in brackets if the browser does not
|
|
|
|
|
> support ruby. I'd like to have something more like
|
|
|
|
|
>
|
|
|
|
|
> r[はん](飯)
|
|
|
|
|
>
|
|
|
|
|
> or any keystroke saving convention would be welcome.
|
|
|
|
|
|
|
|
|
|
We came up with the following script, which uses the convention that a
|
|
|
|
|
markdown link with a URL beginning with a hyphen is interpreted as ruby:
|
|
|
|
|
|
|
|
|
|
[はん](-飯)
|
|
|
|
|
|
|
|
|
|
~~~ {.haskell}
|
2020-03-15 17:59:44 +01:00
|
|
|
|
{-# LANGUAGE OverloadedStrings #-}
|
2017-09-17 08:00:20 +02:00
|
|
|
|
-- handleruby.hs
|
|
|
|
|
import Text.Pandoc.JSON
|
|
|
|
|
import System.Environment (getArgs)
|
2020-03-15 17:59:44 +01:00
|
|
|
|
import qualified Data.Text as T
|
2017-09-17 08:00:20 +02:00
|
|
|
|
|
|
|
|
|
handleRuby :: Maybe Format -> Inline -> Inline
|
2020-03-15 17:59:44 +01:00
|
|
|
|
handleRuby (Just format) x@(Link attr [Str ruby] (src,_)) =
|
|
|
|
|
case T.uncons src of
|
|
|
|
|
Just ('-',kanji)
|
|
|
|
|
| format == Format "html" -> RawInline format $
|
|
|
|
|
"<ruby>" <> kanji <> "<rp>(</rp><rt>" <> ruby <>
|
|
|
|
|
"</rt><rp>)</rp></ruby>"
|
|
|
|
|
| format == Format "latex" -> RawInline format $
|
|
|
|
|
"\\ruby{" <> kanji <> "}{" <> ruby <> "}"
|
|
|
|
|
| otherwise -> Str ruby
|
|
|
|
|
_ -> x
|
2017-09-17 08:00:20 +02:00
|
|
|
|
handleRuby _ x = x
|
|
|
|
|
|
|
|
|
|
main :: IO ()
|
|
|
|
|
main = toJSONFilter handleRuby
|
|
|
|
|
~~~
|
|
|
|
|
|
|
|
|
|
Note that, when a script is called using `--filter`, pandoc passes
|
|
|
|
|
it the target format as the first argument. When a function's
|
|
|
|
|
first argument is of type `Maybe Format`, `toJSONFilter` will
|
|
|
|
|
automatically assign it `Just` the target format or `Nothing`.
|
|
|
|
|
|
|
|
|
|
We compile our script:
|
|
|
|
|
|
|
|
|
|
ghc --make handleRuby
|
|
|
|
|
|
|
|
|
|
Then run it:
|
|
|
|
|
|
|
|
|
|
% pandoc -F ./handleRuby -t html
|
|
|
|
|
[はん](-飯)
|
|
|
|
|
^D
|
|
|
|
|
<p><ruby>飯<rp>(</rp><rt>はん</rt><rp>)</rp></ruby></p>
|
|
|
|
|
% pandoc -F ./handleRuby -t latex
|
|
|
|
|
[はん](-飯)
|
|
|
|
|
^D
|
|
|
|
|
\ruby{飯}{はん}
|
|
|
|
|
|
2020-03-15 17:59:44 +01:00
|
|
|
|
Note: to use this to generate PDFs via LaTeX, you'll need
|
|
|
|
|
to use `--pdf-engine=xelatex`, specify a `mainfont` that has
|
|
|
|
|
the Japanese characters (e.g. "Noto Sans CJK TC"), and add
|
|
|
|
|
`\usepackage{ruby}` to your template or header-includes.
|
|
|
|
|
|
2017-09-17 08:00:20 +02:00
|
|
|
|
# Exercises
|
|
|
|
|
|
|
|
|
|
1. Put all the regular text in a markdown document in ALL CAPS
|
|
|
|
|
(without touching text in URLs or link titles).
|
|
|
|
|
|
|
|
|
|
2. Remove all horizontal rules from a document.
|
|
|
|
|
|
|
|
|
|
3. Renumber all enumerated lists with roman numerals.
|
|
|
|
|
|
|
|
|
|
4. Replace each delimited code block with class `dot` with an
|
|
|
|
|
image generated by running `dot -Tpng` (from graphviz) on the
|
|
|
|
|
contents of the code block.
|
|
|
|
|
|
|
|
|
|
5. Find all code blocks with class `python` and run them
|
|
|
|
|
using the python interpreter, printing the results to the console.
|
|
|
|
|
|
2020-11-07 00:37:24 +01:00
|
|
|
|
# Technical details of JSON filters
|
|
|
|
|
|
|
|
|
|
A JSON filter is any program which can consume and produce a
|
|
|
|
|
valid pandoc JSON document representation. This section describes
|
|
|
|
|
the technical details surrounding the invocation of filters.
|
|
|
|
|
|
|
|
|
|
## Arguments
|
|
|
|
|
|
|
|
|
|
The program will always be called with the target format as the
|
|
|
|
|
only argument. A pandoc invocation like
|
|
|
|
|
|
|
|
|
|
pandoc --filter demo --to=html
|
|
|
|
|
|
|
|
|
|
will cause pandoc to call the program `demo` with argument `html`.
|
|
|
|
|
|
|
|
|
|
## Environment variables
|
|
|
|
|
|
|
|
|
|
Pandoc sets additional environment variables before calling a
|
|
|
|
|
filter.
|
|
|
|
|
|
|
|
|
|
`PANDOC_VERSION`
|
|
|
|
|
: The version of the pandoc binary used to process the document.
|
|
|
|
|
Example: `2.11.1`.
|
|
|
|
|
|
|
|
|
|
`PANDOC_READER_OPTIONS`
|
|
|
|
|
: JSON object representation of the options passed to the input
|
|
|
|
|
parser.
|
|
|
|
|
|
|
|
|
|
Object fields:
|
|
|
|
|
|
|
|
|
|
`readerAbbreviations`
|
|
|
|
|
: set of known abbreviations (array of strings).
|
|
|
|
|
|
|
|
|
|
`readerColumns`
|
|
|
|
|
: number of columns in terminal; an integer.
|
|
|
|
|
|
|
|
|
|
`readerDefaultImageExtension`
|
|
|
|
|
: default extension for images; a string.
|
|
|
|
|
|
|
|
|
|
`readerExtensions`
|
|
|
|
|
: integer representation of the syntax extensions bit
|
|
|
|
|
field.
|
|
|
|
|
|
|
|
|
|
`readerIndentedCodeClasses`
|
|
|
|
|
: default classes for indented code blocks; array of
|
|
|
|
|
strings.
|
|
|
|
|
|
|
|
|
|
`readerStandalone`
|
|
|
|
|
: whether the input was a standalone document with header;
|
|
|
|
|
either `true` or `false`.
|
|
|
|
|
|
|
|
|
|
`readerStripComments`
|
|
|
|
|
: HTML comments are stripped instead of parsed as raw HTML;
|
|
|
|
|
either `true` or `false`.
|
|
|
|
|
|
|
|
|
|
`readerTabStop`
|
|
|
|
|
: width (i.e. equivalent number of spaces) of tab stops;
|
|
|
|
|
integer.
|
|
|
|
|
|
|
|
|
|
`readerTrackChanges`
|
|
|
|
|
: track changes setting for docx; one of
|
|
|
|
|
`"accept-changes"`, `"reject-changes"`, and
|
|
|
|
|
`"all-changes"`.
|
|
|
|
|
|
|
|
|
|
## Supported interpreters
|
|
|
|
|
|
|
|
|
|
Files passed to the `--filter`/`-F` parameter are expected to be
|
|
|
|
|
executable. However, if the executable bit is not set, then
|
|
|
|
|
pandoc tries to guess a suitable interpreter from the file
|
|
|
|
|
extension.
|
|
|
|
|
|
|
|
|
|
file extension interpreter
|
|
|
|
|
---------------- --------------
|
|
|
|
|
.py `python`
|
|
|
|
|
.hs `runhaskell`
|
2020-12-20 12:11:42 +01:00
|
|
|
|
.pl `perl`
|
|
|
|
|
.rb `ruby`
|
2020-11-07 00:37:24 +01:00
|
|
|
|
.php `php`
|
|
|
|
|
.js `node`
|
|
|
|
|
.r `Rscript`
|