% pandoc
% John MacFarlane
% August 10, 2006

`pandoc` is a [Haskell] library for converting from one markup format
to another, and a command-line tool that uses this library. It can read
[markdown] and (subsets of) [reStructuredText], [HTML], and [LaTeX],
and it can write [markdown], [reStructuredText], [HTML], [LaTeX], [RTF],
and [S5] HTML slide shows. `pandoc`'s version of markdown contains some
enhancements, like footnotes and embedded LaTeX.

In contrast to existing tools for converting markdown to HTML, which
use regex substitutions, `pandoc` has a modular design: it consists of a
set of readers, which parse text in a given format and produce a native
representation of the document, and a set of writers, which convert
this native representation into a target format. Thus, adding an input
or output format requires only adding a reader or writer.

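The reader/writer split described above can be illustrated with a toy
sketch. This is not `pandoc`'s Haskell code: the `Block` type, the
function names, and the two-element "native representation" are all
invented for illustration, and the markdown reader understands only
level-one headers and paragraphs.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Dict, List


@dataclass
class Block:
    """One element of the toy native representation (hypothetical)."""
    kind: str   # "header" or "para", a deliberately tiny vocabulary
    text: str


Document = List[Block]


def read_markdown(source: str) -> Document:
    """Toy reader: '# ' starts a header, anything else is a paragraph."""
    blocks = []
    for chunk in source.split("\n\n"):
        chunk = " ".join(chunk.split())
        if not chunk:
            continue
        if chunk.startswith("# "):
            blocks.append(Block("header", chunk[2:]))
        else:
            blocks.append(Block("para", chunk))
    return blocks


def write_html(doc: Document) -> str:
    tags = {"header": "h1", "para": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{escape(b.text)}</{tags[b.kind]}>" for b in doc
    )


def write_latex(doc: Document) -> str:
    # No LaTeX escaping here; a real writer would need it.
    return "\n\n".join(
        f"\\section{{{b.text}}}" if b.kind == "header" else b.text for b in doc
    )


# Adding a format means adding one entry to one of these tables.
READERS: Dict[str, Callable[[str], Document]] = {"markdown": read_markdown}
WRITERS: Dict[str, Callable[[Document], str]] = {
    "html": write_html,
    "latex": write_latex,
}


def convert(source: str, reader: str = "markdown", writer: str = "html") -> str:
    return WRITERS[writer](READERS[reader](source))
```

With this shape, supporting a new output format touches only the
`WRITERS` table; no reader needs to know that the new writer exists.
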
[markdown]: http://daringfireball.net/projects/markdown/
[reStructuredText]: http://docutils.sourceforge.net/docs/ref/rst/introduction.html
[S5]: http://meyerweb.com/eric/tools/s5/
[HTML]: http://www.w3.org/TR/html40/
[LaTeX]: http://www.latex-project.org/
[RTF]: http://en.wikipedia.org/wiki/Rich_Text_Format
[Haskell]: http://www.haskell.org/

(c) 2006 John MacFarlane (jgm At berkeley.edu). Released under the
[GPL], version 2 or greater. This software carries no warranty of
any kind. (See LICENSE for full copyright and warranty notices.)
Recai Oktaş (roktas At debian.org) deserves credit for the build
system, the Debian package, and the robust wrapper scripts.

[GPL]: http://www.gnu.org/copyleft/gpl.html

# Installation

## Installing GHC

To compile `pandoc`, you'll need [GHC] version 6.4 or greater.

If you don't have GHC already, you can get it from the
[GHC Download] page.

[GHC]: http://www.haskell.org/ghc/
[GHC Download]: http://www.haskell.org/ghc/download.html

You'll also need standard build tools: GNU `make`, `sed`, `bash`, and `perl`.
These are standard on Unix systems (including Mac OS X). If you're
using Windows, you can install [Cygwin].

[Cygwin]: http://www.cygwin.com/

## Installing `pandoc`

1. Change to the directory containing the `pandoc` distribution.

2. Compile:

        make

3. See if it worked (optional, but recommended):

        make test

4. Install:

        make install

    Note: This installs `pandoc`, together with its wrappers and
    documentation, into the `/usr/local` directory, which requires root
    privileges. If you don't have root privileges or would prefer to
    install `pandoc` and the associated shell scripts into your `~/bin`
    directory, type this instead:

        PREFIX=~ make install-exec

5. Install Haskell libraries (optional):

        make install-lib

6. Install library documentation into `/usr/local/pandoc-doc` (optional):

        make install-lib-doc

## Removing `pandoc`

Each of the installation steps described above can be reversed:

    make uninstall

    PREFIX=~ make uninstall-exec

    make uninstall-lib

    make uninstall-lib-doc

# Using `pandoc`

If you run `pandoc` without arguments, it will accept input from
STDIN. If you run it with file names as arguments, it will take input
from those files. It accepts several command-line options. For a
list, type

    pandoc -h

The most important options specify the format of the source file and
the output. The default reader is markdown; the default writer is
HTML. So if you don't specify a reader or writer, `pandoc` will
convert markdown to HTML. For example,

    pandoc hello.txt

will convert `hello.txt` from markdown to HTML. For other conversions,
you must specify a reader and/or a writer using the `-r` and `-w`
flags. To convert markdown to LaTeX, you would write:

    pandoc -w latex hello.txt

To convert HTML to markdown:

    pandoc -r html -w markdown hello.txt

Supported writers include `markdown`, `latex`, `html`, `rtf` (rich text
format), `rst` (reStructuredText), and `s5` (which produces an HTML
file that acts like a PowerPoint slide show). Supported readers include
`markdown`, `html`, `latex`, and `rst`. Note that the `rst` reader only parses
a subset of reStructuredText syntax. For example, it doesn't handle
tables, definition lists, option lists, or footnotes. It handles only the
constructs expressible in unextended markdown. But for simple documents
it should be adequate. The `latex` and `html` readers are also limited
in what they can do.

`pandoc` writes its output to STDOUT. If you want to write to a file,
use redirection:

    pandoc hello.txt > hello.html

Note that you can specify multiple input files on the command line.
`pandoc` will concatenate them all (with blank lines between them)
before parsing:

    pandoc -s chapter1.txt chapter2.txt chapter3.txt references.txt > book.html

(The `-s` option here tells `pandoc` to produce a standalone HTML file,
with a proper header, rather than a fragment. For more details on this
and many other command-line options, see below.)

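If you drive `pandoc` from a script, the invocations above can be
assembled programmatically. The helper below is a hypothetical
convenience, not part of `pandoc`; it uses only the `-r`, `-w`, and `-s`
flags described in this section and leaves out the reader and writer
when the defaults (markdown and HTML) are wanted.

```python
from typing import List, Optional, Sequence


def pandoc_command(
    inputs: Sequence[str],
    reader: Optional[str] = None,
    writer: Optional[str] = None,
    standalone: bool = False,
) -> List[str]:
    """Build an argv list for pandoc from the options described above."""
    argv = ["pandoc"]
    if reader is not None:      # omitted: pandoc defaults to markdown
        argv += ["-r", reader]
    if writer is not None:      # omitted: pandoc defaults to HTML
        argv += ["-w", writer]
    if standalone:
        argv.append("-s")
    argv += list(inputs)        # several files are concatenated by pandoc
    return argv
```

The resulting list can be handed to `subprocess.run(argv, stdout=out_file)`
with an open file object, which plays the role of the shell's `>`
redirection.
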
## Character encoding

Unfortunately, due to limitations in GHC, `pandoc` does not automatically
detect the system's local character encoding. Hence, all input and
output is assumed to be in the UTF-8 encoding. If your input contains
non-ASCII characters (accented letters, for example) and is not already
in UTF-8, you should convert the input file to UTF-8 before
processing it with `pandoc`. This can be done by piping the input through
[`iconv`]: for example,

    iconv -t utf-8 source.txt | pandoc > output.html

will convert `source.txt` from the local encoding to UTF-8, then
convert it to HTML, putting the output in `output.html`.

[`iconv`]: http://www.gnu.org/software/libiconv/

The shell scripts (described below) automatically convert the source
from the local encoding to UTF-8 before running it through `pandoc`.

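The same re-encoding step can be done without `iconv`. The function
below is a minimal sketch, with `to_utf8` a name chosen here for
illustration; when no source encoding is given it falls back to the
locale's preferred encoding, which approximates what `iconv` does when
`-f` is omitted.

```python
import locale
from typing import Optional


def to_utf8(raw: bytes, source_encoding: Optional[str] = None) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    if source_encoding is None:
        # Assumption: the input is in the locale's preferred encoding.
        source_encoding = locale.getpreferredencoding(False)
    return raw.decode(source_encoding).encode("utf-8")
```

For example, the Latin-1 bytes `b"caf\xe9"` become `b"caf\xc3\xa9"`,
which `pandoc` can then read as UTF-8.
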
## The shell scripts

For convenience, five shell scripts have been included that make it
easy to run `pandoc` without remembering all the command-line options.
All of the scripts presuppose that `pandoc` is in the path, and
some have additional requirements. (For example, `html2markdown`
uses `tidy`, and `markdown2pdf` uses `pdflatex`.)

1. `markdown2html` converts markdown to HTML, running `iconv` first to
    convert the file to UTF-8. (This can be used as a replacement for
    `Markdown.pl`.)

2. `html2markdown` can take either a filename or a URL as argument. If
    it is given a URL, it uses `curl`, `wget`, or an available text-based
    browser to fetch the contents of the specified URL, then filters this
    through `tidy` to straighten up the HTML and convert to UTF-8,
    and finally passes this HTML to `pandoc` to produce markdown text:

        html2markdown http://www.fsf.org

        html2markdown www.fsf.org

        html2markdown subdir/mylocalfile.html

3. `latex2markdown` converts a LaTeX file to markdown.

        latex2markdown mytexfile.tex

4. `markdown2latex` converts markdown to LaTeX:

        markdown2latex mytextfile.txt

5. `markdown2pdf` converts markdown to PDF using `pdflatex`. Example:

        markdown2pdf mytextfile.txt

    creates a file `mytextfile.pdf`.

# Command-line options

Various command-line options can be used to customize the output.
For a complete list, type

    pandoc --help

`-p` or `--preserve-tabs` causes tabs in the source text to be
preserved, rather than converted to spaces (the default).

`--tabstop` allows the user to set the tab stop (which defaults to 4).

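To make the default behavior concrete, here is a sketch of the usual
way tabs are converted to spaces for a given tab stop. It assumes the
conventional rule (each tab advances to the next multiple of the tab
stop); the text above does not spell out `pandoc`'s exact algorithm, and
`expand_tabs` is a name invented for this example.

```python
def expand_tabs(line: str, tabstop: int = 4) -> str:
    """Replace each tab with spaces up to the next multiple of tabstop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            width = tabstop - (column % tabstop)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

Under this rule a leading tab becomes four spaces with the default tab
stop, so a tab-indented line is read as an indented code block just as
a line indented with four spaces would be.
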
`-R` or `--parse-raw` causes the HTML and LaTeX readers to parse HTML
codes and LaTeX environments that they can't translate as raw HTML or
LaTeX. Raw HTML can be printed in markdown, reStructuredText, HTML,
and S5 output; raw LaTeX can be printed in markdown, reStructuredText,
and LaTeX output. The default is for the readers to omit
untranslatable HTML codes and LaTeX environments. (The LaTeX reader
does pass through untranslatable LaTeX commands, even if `-R` is not
specified.)

`-s` or `--standalone` causes `pandoc` to produce a standalone file,
complete with appropriate document headers. By default, `pandoc`
produces a fragment.

`--custom-header` can be used to specify a custom document header. To
see the headers used by default, use the `-D` option: for example,
`pandoc -D html` prints the default HTML header.

`-c` or `--css` allows the user to specify a custom stylesheet that
will be linked to in HTML and S5 output.

`-H` or `--include-in-header` specifies a file to be included
(verbatim) at the end of the document header. This can be used, for
example, to include special CSS or JavaScript in HTML documents.

`-B` or `--include-before-body` specifies a file to be included
(verbatim) at the beginning of the document body (after the `<body>`
tag in HTML, or the `\begin{document}` command in LaTeX). This can be
used to include navigation bars or banners in HTML documents.

`-A` or `--include-after-body` specifies a file to be included
(verbatim) at the end of the document body (before the `</body>` tag in
HTML, or the `\end{document}` command in LaTeX).

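The placement of the three include options relative to the body can be
pictured with a short sketch. The default `header` string here is a
placeholder invented for the example, not what `pandoc -D html` prints,
and `assemble_html` is a hypothetical name.

```python
def assemble_html(
    body: str,
    header: str = "<html>\n<head>\n",
    include_in_header: str = "",
    include_before_body: str = "",
    include_after_body: str = "",
) -> str:
    """Place the three optional includes around an HTML body fragment."""
    return (
        header
        + include_in_header          # -H: end of the document header
        + "</head>\n<body>\n"
        + include_before_body        # -B: right after <body>
        + body
        + include_after_body         # -A: right before </body>
        + "</body>\n</html>\n"
    )
```

Each include is inserted verbatim, so a file passed with `-H` must
already contain complete elements such as `<style>` or `<script>`.
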
`-T` or `--title-prefix` specifies a string to be included as a prefix
at the beginning of the title that appears in the HTML header (but not
in the title as it appears at the beginning of the HTML body). (See
the section on title blocks below.)

`-S` or `--smartypants` causes `pandoc` to produce typographically
correct HTML output, along the lines of John Gruber's [Smartypants].
Straight quotes are converted to curly quotes, `---` to dashes, and
`...` to ellipses.

[Smartypants]: http://daringfireball.net/projects/smartypants/

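As a rough illustration of the three conversions just listed, the
sketch below applies them with simple string and regex substitutions.
It is a simplification under stated assumptions: a quote counts as
"opening" only at the start of the text or after whitespace or an
opening bracket, and nothing here reflects how `pandoc` or Smartypants
actually implement the feature.

```python
import re


def smarten(text: str) -> str:
    """Apply a simplified version of the three conversions named above."""
    text = text.replace("---", "\u2014")          # three hyphens -> em dash
    text = text.replace("...", "\u2026")          # three dots -> ellipsis
    # A double quote at the start or after whitespace/open bracket opens.
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")            # remaining ones close
    # Same heuristic for single quotes; leftovers are apostrophes/closers.
    text = re.sub(r"(^|[\s(\[{])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

The order matters: opening quotes are identified first, so that every
straight quote left over can be treated as a closing quote or an
apostrophe.
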
`-m` or `--asciimathml` will cause LaTeX formulas (between $ signs) in
HTML or S5 to display as formulas rather than as code. The trick will
not work in all browsers, but it works in Firefox. Peter Jipsen's
[ASCIIMathML] script is used to do the magic.

[ASCIIMathML]: http://www1.chapman.edu/~jipsen/mathml/asciimath.html

`-i` or `--incremental` causes all lists in S5 output to be displayed
incrementally by default (one item at a time). The normal default
is for lists to be displayed all at once.

`-N` or `--number-sections` causes sections to be numbered in LaTeX
output. By default, sections are not numbered.

# `pandoc`'s markdown vs. standard markdown

In parsing markdown, `pandoc` departs from and extends [standard markdown]
in a few respects. (To run `pandoc` on the official
markdown test suite, type `make test-markdown`.)

[standard markdown]: http://daringfireball.net/projects/markdown/syntax

## Lists

`pandoc` behaves differently from standard markdown on some "edge
cases" involving lists. Consider this source:

    1. First
    2. Second:
        - Fee
        - Fie
        - Foe

    3. Third

`pandoc` transforms this into a "compact list" (with no `<p>` tags
around "First", "Second", or "Third"), while markdown puts `<p>`
tags around "Second" and "Third" (but not "First"), because of
the blank line before "Third". `pandoc` follows a simple rule:
if the text is followed by a blank line, it is treated as a
paragraph. Since "Second" is followed by a list, and not a blank
line, it isn't treated as a paragraph. The fact that the list
is followed by a blank line is irrelevant.

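The rule can be sketched as a check on the line that immediately
follows each top-level item's text. This is a toy model, not `pandoc`'s
parser: it assumes numbered top-level items whose text fits on one
line, and it treats the end of the input as "not a blank line", which
is an assumption made here so that the last item stays compact.

```python
import re
from typing import Dict

# A top-level numbered item: digits, a period, whitespace, then the text.
ITEM = re.compile(r"^\d+\.\s+(.*)$")


def paragraph_items(source: str) -> Dict[str, bool]:
    """Map each top-level item's text to whether it becomes a paragraph."""
    lines = source.rstrip("\n").split("\n")
    result = {}
    for i, line in enumerate(lines):
        match = ITEM.match(line)
        if match is None:
            continue  # nested items and blank lines are skipped
        followed_by_blank = i + 1 < len(lines) and lines[i + 1].strip() == ""
        result[match.group(1)] = followed_by_blank
    return result
```

Run on the example above, every item maps to `False`, matching the
compact list that `pandoc` produces; inserting a blank line directly
after "First" would flip only that item to `True`.
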
## Literal quotes in titles

Standard markdown allows unescaped literal quotes in titles, as
in

    [foo]: "bar "embedded" baz"

`pandoc` requires all quotes within titles to be escaped:

    [foo]: "bar \"embedded\" baz"

## Reference links

`pandoc` allows implicit reference links in either of two styles:

    1. Here's my [link]
    2. Here's my [link][]

    [link]: linky.com

If there's no corresponding reference, the implicit reference link
will appear as regular bracketed text. Note: even `[link][]` will
appear as `[link]` if there's no reference for `link`. If you want
`[link][]`, use a backslash escape: `\[link]\[]`.

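The fallback behavior described above can be modeled in a few lines.
This sketch handles only the two implicit styles shown (it ignores
inline links and explicit `[text][ref]` links), matches labels
case-sensitively, and takes the reference definitions as a ready-made
dictionary; all names in it are invented for the example.

```python
import re
from typing import Dict

# Matches an unescaped [label], optionally followed by an empty [] pair.
IMPLICIT_LINK = re.compile(r"(?<!\\)\[([^\]\[]+)\](\[\])?")


def resolve_implicit_links(text: str, references: Dict[str, str]) -> str:
    def replace(match: "re.Match[str]") -> str:
        label = match.group(1)
        url = references.get(label)
        if url is None:
            return f"[{label}]"      # no reference: plain bracketed text
        return f'<a href="{url}">{label}</a>'

    return IMPLICIT_LINK.sub(replace, text)
```

Note how the trailing `[]` is dropped in the unresolved case, so
`[missing][]` comes out as `[missing]`, as the paragraph above says.
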
## Footnotes

`pandoc`'s markdown allows footnotes, using the following syntax:

    here is a footnote reference,^(1) and another.^(longnote)

    ^(1) Here is the footnote. It can go anywhere in the document,
    except in embedded contexts like block quotes or lists.

    ^(longnote) Here's the other note. This one contains multiple
    blocks.
    ^
    ^ Caret characters are used to indicate that the blocks all belong
    to a single footnote (as with block quotes).
    ^
    ^ If you want, you can use a caret at the beginning of every line,
    ^ as with blockquotes, but all that you need is a caret at the
    ^ beginning of the first line of the block and any preceding
    ^ blank lines.

Footnote references may not contain spaces, tabs, or newlines.

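The restriction on footnote references translates directly into a
pattern. The regex below is a sketch: it accepts any run of characters
that are not whitespace, and it additionally excludes parentheses from
the label, which is an assumption made here to keep the match
unambiguous. It also does not distinguish references from the `^(1)`
markers that begin footnote definitions.

```python
import re
from typing import List

# ^(label), where label has no spaces, tabs, newlines, or parentheses.
FOOTNOTE_REF = re.compile(r"\^\(([^\s()]+)\)")


def footnote_labels(text: str) -> List[str]:
    """Return the labels of all ^(label) markers found in text."""
    return FOOTNOTE_REF.findall(text)
```

A marker such as `^(two words)` is not recognized, since its label
contains a space.
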
## Embedded HTML

`pandoc` treats embedded HTML in markdown a bit differently than
Markdown 1.0. While Markdown 1.0 leaves HTML blocks exactly as they
are, `pandoc` treats text between HTML tags as markdown. Thus, for
example, `pandoc` will turn

    <table>
      <tr>
        <td>*one*</td>
        <td>[a link](http://google.com)</td>
      </tr>
    </table>

into

    <table>
      <tr>
        <td><em>one</em></td>
        <td><a href="http://google.com">a link</a></td>
      </tr>
    </table>

whereas Markdown 1.0 will preserve it as is.

There is one exception to this rule: text between `<script>` and
`</script>` tags is not interpreted as markdown.

This departure from standard markdown should make it easier to mix
markdown with HTML block elements. For example, one can surround
a block of markdown text with `<div>` tags without preventing it
from being interpreted as markdown.

## Title blocks

If the file begins with a title block

    % title
    % author(s) (separated by commas)
    % date

it will be parsed as bibliographic information, not regular text. (It
will be used, for example, in the title of standalone LaTeX or HTML
output.) The block may contain just a title, a title and an author,
or all three lines. Each must begin with a % and fit on one line.
The title may contain standard inline formatting. If you want to
include an author but no title, or a title and a date but no author,
you need a blank line:

    % My title
    %
    % June 15, 2006

Titles will be written only when the `--standalone` (`-s`) option is
chosen. In HTML output, titles will appear twice: once in the
document head -- this is the title that will appear at the top of the
window in a browser -- and once at the beginning of the document body.
The title in the document head can have an optional prefix attached
(`--title-prefix` or `-T` option). The title in the body appears as
an H1 element with class "title", so it can be suppressed or
reformatted with CSS.

If a title prefix is specified with `-T` and no title block appears
in the document, the title prefix will be used by itself as the
HTML title.

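The title-block layout lends itself to a small parser. The following is
a sketch under the rules stated above (up to three leading lines, each
starting with `%`, authors separated by commas); the function name and
the returned dictionary are invented for illustration, and inline
formatting in the title is left unparsed.

```python
from typing import Tuple


def parse_title_block(source: str) -> Tuple[dict, str]:
    """Split up to three leading '%' lines from the rest of the document."""
    lines = source.split("\n")
    count = 0
    while count < min(3, len(lines)) and lines[count].startswith("%"):
        count += 1
    fields = [line[1:].strip() for line in lines[:count]]
    fields += [""] * (3 - count)                # missing lines become empty
    title, author_line, date = fields
    authors = [name.strip() for name in author_line.split(",") if name.strip()]
    meta = {"title": title, "authors": authors, "date": date}
    return meta, "\n".join(lines[count:])
```

A bare `%` line yields an empty field, which is how the "title and date
but no author" case shown above is represented.
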
## Box-style blockquotes

`pandoc` supports Emacs-style boxquote block quotes, in addition to
standard markdown (email-style) block quotes:

    ,----
    | They look like this.
    `----

## Inline LaTeX

Anything between two $ characters will be parsed as LaTeX math. The
opening $ must have a character immediately to its right, while the
closing $ must have a character immediately to its left. Thus,
`$20,000 and $30,000` won't parse as math. The $ character can be
escaped with a backslash if needed.

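The delimiter rule can be expressed as a regular expression. This is a
sketch, not `pandoc`'s parser: it reads "a character" as "a
non-whitespace character", skips `$` signs preceded by a backslash, and
does not match across line breaks.

```python
import re
from typing import List

# Opening $ followed by non-space; closing $ preceded by non-space;
# neither may be escaped with a backslash.
INLINE_MATH = re.compile(r"(?<!\\)\$(?=\S)(.+?)(?<=\S)(?<!\\)\$")


def find_inline_math(text: str) -> List[str]:
    """Return the contents of every $...$ span that satisfies the rule."""
    return INLINE_MATH.findall(text)
```

In `$20,000 and $30,000` the only candidate for a closing `$` has a
space to its left, so nothing matches and the text stays ordinary
prose.
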
If you pass the `-m` (`--asciimathml`) option to `pandoc`, it will
include the [ASCIIMathML] script in the resulting HTML. This will
cause LaTeX math to be displayed as formulas in browsers that
support the script (such as Firefox).

[ASCIIMathML]: http://www1.chapman.edu/~jipsen/mathml/asciimath.html

Inline LaTeX commands will also be preserved and passed unchanged
to the LaTeX writer. Thus, for example, you can use LaTeX to
include BibTeX citations:

    This result was proved in \cite{jones.1967}.

You can also use LaTeX environments. For example,

    \begin{tabular}{|l|l|}\hline
    Age & Frequency \\ \hline
    18--25 & 15 \\
    26--35 & 33 \\
    36--45 & 22 \\ \hline
    \end{tabular}

Note, however, that material between the begin and end tags will
be interpreted as raw LaTeX, not as markdown.

## Custom headers

When run with the "standalone" option (`-s`), `pandoc` creates a
standalone file, complete with an appropriate header. To see the
default headers used for HTML and LaTeX, use the following commands:

    pandoc -D html

    pandoc -D latex

If you want to use a different header, just create a file containing
it and specify it on the command line as follows:

    pandoc --custom-header=MyHeaderFile

# Producing S5 with `pandoc`

Producing an [S5] slide show with `pandoc` is easy. A title page is
constructed automatically from the document's title block (see above).
Each section (with a level-one header) produces a single slide. (Note
that if the section is too big, the slide will not fit on the page; S5
is not smart enough to produce multiple pages.)

Here's the markdown source for a simple slide show, `eating.txt`:

    % Eating Habits
    % John Doe
    % March 22, 2005

    # In the morning

    - Eat eggs
    - Drink coffee

    # In the evening

    - Eat spaghetti
    - Drink wine

To produce the slide show, simply type

    pandoc -w s5 -s eating.txt > eating.html

and open up `eating.html` in a browser. The HTML file embeds
all the required JavaScript and CSS, so no other files are necessary.

Note that by default, the S5 writer produces lists that display
"all at once." If you want your lists to display incrementally
(one item at a time), use the `-i` option. If you want a
particular list to depart from the default (that is, to display
incrementally without the `-i` option and all at once with the
`-i` option), put it in a block quote:

    > - Eat spaghetti
    > - Drink wine

In this way incremental and nonincremental lists can be mixed in
a single document.

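The interaction between the `-i` option and the block-quote trick
amounts to an exclusive-or, which the following sketch makes explicit.
The function name and parameters are invented for illustration.

```python
def displays_incrementally(incremental_flag: bool, in_block_quote: bool) -> bool:
    """A block quote flips whatever default the -i option selected."""
    return incremental_flag != in_block_quote
```

So a block-quoted list is incremental without `-i` and displayed all at
once with `-i`, while an ordinary list simply follows the option.
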