Removed html2markdown and hsmarkdown.

html2markdown is no longer needed, since you can pass URI arguments
to pandoc and directly convert web pages. (Note, however, that pandoc
assumes the pages are UTF-8. html2markdown made an attempt to guess the
encoding and convert them.)
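
For pages that are not UTF-8, one possible workaround is to fetch and re-encode the page before handing it to pandoc. The sketch below is an assumption, not part of this commit: `html_to_markdown` is a hypothetical helper name, `example.com` is a placeholder URL, and it presumes `iconv` (and, for fetching, `curl`) are installed. The `-r html -w markdown` options are the ones the removed script itself passed to pandoc.

```shell
# Direct conversion, relying on the URI support described above
# (example.com is a placeholder URL):
#   pandoc -r html -w markdown http://example.com/

# Hypothetical helper: read HTML in the given encoding on stdin,
# re-encode it as UTF-8, and convert it to markdown.
# The encoding defaults to utf-8 when no argument is given.
html_to_markdown() {
    encoding="${1:-utf-8}"
    iconv -f "$encoding" -t utf-8 | pandoc -r html -w markdown
}

# Usage (assumes curl is available to fetch the page):
#   curl -s http://example.com/ | html_to_markdown latin1
```

Unlike html2markdown, this does not guess the encoding from the "Content-type" meta tag; the caller has to supply it.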

hsmarkdown is pointless -- a large executable that could be replaced
by 'pandoc --strict'.
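
For anyone who still wants an `hsmarkdown` command, a small shell function can approximate it. This is a sketch under assumptions rather than a shipped replacement: it mirrors the behavior of the removed `hsmarkdown.hs` (each file is fed to pandoc on stdin, so arguments that look like options are treated as filenames), and it assumes `pandoc` is on the path.

```shell
# Hypothetical stand-in for the removed hsmarkdown wrapper.
# With no arguments, convert stdin; otherwise convert each named file
# in turn. Files are passed on stdin so that pandoc never sees the
# arguments as options.
hsmarkdown() {
    if [ $# -eq 0 ]; then
        pandoc --from markdown --to html --strict
    else
        for f in "$@"; do
            pandoc --from markdown --to html --strict < "$f"
        done
    fi
}

# Usage:
#   hsmarkdown input.txt > output.html
```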

git-svn-id: https://pandoc.googlecode.com/svn/trunk@1834 788f1e2b-df1e-0410-8736-df70ead52e1b
fiddlosopher 2010-02-06 18:55:28 +00:00
parent 645d5d48b9
commit 997ea5ea1d
7 changed files with 36 additions and 498 deletions

README

@ -127,92 +127,49 @@ will convert `source.txt` from the local encoding to UTF-8, then
convert it to HTML, then convert back to the local encoding,
putting the output in `output.html`.
The wrapper scripts (described below) automatically convert the input
from the local encoding to UTF-8 before running them through `pandoc`,
then convert the output back to the local encoding.
Wrappers
========
Three wrapper scripts, `markdown2pdf`, `html2markdown`, and
`hsmarkdown`, are included in the standard Pandoc installation. (The
Windows binary package does not include `html2markdown`, which is
a POSIX shell script. It does include portable Haskell versions of
`markdown2pdf` and `hsmarkdown`.)
`markdown2pdf`
--------------
1. `markdown2pdf` produces a PDF file from markdown-formatted
text, using `pandoc` and `pdflatex`. The default
behavior of `markdown2pdf` is to create a file with the same
base name as the first argument and the extension `pdf`; thus,
for example,
The standard Pandoc installation includes `markdown2pdf`, a wrapper
around `pandoc` and `pdflatex` that produces PDFs directly from markdown
sources. The default behavior of `markdown2pdf` is to create a file with
the same base name as the first argument and the extension `pdf`; thus,
for example,
markdown2pdf sample.txt endnotes.txt
markdown2pdf sample.txt endnotes.txt
will produce `sample.pdf`. (If `sample.pdf` exists already,
it will be backed up before being overwritten.) An output file
name can be specified explicitly using the `-o` option:
will produce `sample.pdf`. (If `sample.pdf` exists already,
it will be backed up before being overwritten.) An output file
name can be specified explicitly using the `-o` option:
markdown2pdf -o book.pdf chap1 chap2
markdown2pdf -o book.pdf chap1 chap2
If no input file is specified, input will be taken from stdin.
All of `pandoc`'s options will work with `markdown2pdf` as well.
If no input file is specified, input will be taken from stdin.
All of `pandoc`'s options will work with `markdown2pdf` as well.
`markdown2pdf` assumes that `pdflatex` is in the path. It also
assumes that the following LaTeX packages are available:
`unicode`, `fancyhdr` (if you have verbatim text in footnotes),
`graphicx` (if you use images), `array` (if you use tables),
and `ulem` (if you use strikeout text). If they are not already
included in your LaTeX distribution, you can get them from
[CTAN]. A full [TeX Live] or [MacTeX] distribution will have all of
these packages.
`markdown2pdf` assumes that `pdflatex` is in the path. It also
assumes that the following LaTeX packages are available:
`unicode`, `fancyhdr` (if you have verbatim text in footnotes),
`graphicx` (if you use images), `array` (if you use tables),
and `ulem` (if you use strikeout text). If they are not already
included in your LaTeX distribution, you can get them from
[CTAN]. A full [TeX Live] or [MacTeX] distribution will have all of
these packages.
2. `html2markdown` grabs a web page from a file or URL and converts
it to markdown-formatted text, using `tidy` and `pandoc`.
`hsmarkdown`
------------
All of `pandoc`'s options will work with `html2markdown` as well.
In addition, the following special options may be used.
The special options must be separated from the `html2markdown`
command and any regular Pandoc options by the delimiter `--`:
html2markdown -o out.txt -- -e latin1 -g curl google.com
The `-e` or `--encoding` option specifies the character encoding
of the HTML input. If this option is not specified, and input
is not from stdin, `html2markdown` will attempt to determine the
page's character encoding from the "Content-type" meta tag.
If this is not present, UTF-8 is assumed.
The `-g` or `--grabber` option specifies the command to be used to
fetch the contents of a URL:
html2markdown -g 'curl --user foo:bar' www.mysite.com
If this option is not specified, `html2markdown` searches for an
available program (`wget`, `curl`, or a text-mode browser) to fetch
the contents of a URL.
`html2markdown` requires [HTML Tidy], which must be in the path.
It uses [`iconv`] for character encoding conversions; if `iconv`
is absent, it will still work, but it will treat everything as UTF-8.
3. `hsmarkdown` is designed to be used as a drop-in replacement for
`Markdown.pl`. It forces `pandoc` to convert from markdown to
HTML, and to use the `--strict` flag for maximal compliance with
official markdown syntax. (All of Pandoc's syntax extensions and
variants, described below, are disabled.) No other command-line
options are allowed. (In fact, options will be interpreted as
filenames.)
As an alternative to using the `hsmarkdown` script, the
user may create a symbolic link to `pandoc` called `hsmarkdown`.
When invoked under the name `hsmarkdown`, `pandoc` will behave
as if the `--strict` flag had been selected, and no command-line
options will be recognized. However, this approach does not work
under Cygwin, due to problems with its simulation of symbolic
links.
A user who wants a drop-in replacement for `Markdown.pl` may create
a symbolic link to the `pandoc` executable called `hsmarkdown`. When
invoked under the name `hsmarkdown`, `pandoc` will behave as if the
`--strict` flag had been selected, and no command-line options will be
recognized. However, this approach does not work under Cygwin, due to
problems with its simulation of symbolic links.
[Cygwin]: http://www.cygwin.com/
[HTML Tidy]: http://tidy.sourceforge.net/
[`iconv`]: http://www.gnu.org/software/libiconv/
[CTAN]: http://www.ctan.org "Comprehensive TeX Archive Network"
[TeX Live]: http://www.tug.org/texlive/
@ -562,8 +519,7 @@ Pandoc's markdown vs. standard markdown
In parsing markdown, Pandoc departs from and extends [standard markdown]
in a few respects. Except where noted, these differences can
be suppressed by specifying the `--strict` command-line option or by
using the `hsmarkdown` wrapper.
be suppressed by specifying the `--strict` command-line option.
[standard markdown]: http://daringfireball.net/projects/markdown/syntax
"Markdown syntax description"


@ -51,7 +51,7 @@ makeManPages :: Args -> BuildFlags -> PackageDescription -> LocalBuildInfo -> IO
makeManPages _ flags _ _ = mapM_ (makeManPage (fromFlag $ buildVerbosity flags)) manpages
manpages :: [FilePath]
manpages = ["pandoc.1", "hsmarkdown.1", "html2markdown.1", "markdown2pdf.1"]
manpages = ["pandoc.1", "markdown2pdf.1"]
manDir :: FilePath
manDir = "man" </> "man1"
@ -80,7 +80,7 @@ installScripts pkg lbi verbosity copy =
(zip (repeat ".") (wrappers \\ exes))
where exes = map exeName $ filter isBuildable $ executables pkg
isBuildable = buildable . buildInfo
wrappers = ["html2markdown", "hsmarkdown", "markdown2pdf"]
wrappers = ["markdown2pdf"]
installManpages :: PackageDescription -> LocalBuildInfo
-> Verbosity -> CopyDest -> IO ()


@ -1,221 +0,0 @@
#!/bin/sh -e
# converts HTML from a URL, file, or stdin to markdown
# uses an available program to fetch URL and tidy to normalize it first
REQUIRED="tidy"
SYNOPSIS="converts HTML from a URL, file, or STDIN to markdown-formatted text."
THIS=${0##*/}
NEWLINE='
'
err () { echo "$*" | fold -s -w ${COLUMNS:-110} >&2; }
errn () { printf "$*" | fold -s -w ${COLUMNS:-110} >&2; }
usage () {
err "$1 - $2" # short description
err "See the $1(1) man page for usage."
}
# Portable which(1).
pathfind () {
oldifs="$IFS"; IFS=':'
for _p in $PATH; do
if [ -x "$_p/$*" ] && [ -f "$_p/$*" ]; then
IFS="$oldifs"
return 0
fi
done
IFS="$oldifs"
return 1
}
for p in pandoc $REQUIRED; do
pathfind $p || {
err "You need '$p' to use this program!"
exit 1
}
done
CONF=$(pandoc --dump-args "$@" 2>&1) || {
errcode=$?
echo "$CONF" | sed -e '/^pandoc \[OPTIONS\] \[FILES\]/,$d' >&2
[ $errcode -eq 2 ] && usage "$THIS" "$SYNOPSIS"
exit $errcode
}
OUTPUT=$(echo "$CONF" | sed -ne '1p')
ARGS=$(echo "$CONF" | sed -e '1d')
grab_url_with () {
url="${1:?internal error: grab_url_with: url required}"
shift
cmdline="$@"
prog=
prog_opts=
if [ -n "$cmdline" ]; then
eval "set -- $cmdline"
prog=$1
shift
prog_opts="$@"
fi
if [ -z "$prog" ]; then
# Locate a sensible web grabber (note the order).
for p in wget lynx w3m curl links w3c; do
if pathfind $p; then
prog=$p
break
fi
done
[ -n "$prog" ] || {
errn "$THIS: Couldn't find a program to fetch the file from URL "
err "(e.g. wget, w3m, lynx, w3c, or curl)."
return 1
}
else
pathfind "$prog" || {
err "$THIS: No such web grabber '$prog' found; aborting."
return 1
}
fi
# Setup proper base options for known grabbers.
base_opts=
case "$prog" in
wget) base_opts="-O-" ;;
lynx) base_opts="-source" ;;
w3m) base_opts="-dump_source" ;;
curl) base_opts="" ;;
links) base_opts="-source" ;;
w3c) base_opts="-n -get" ;;
*) err "$THIS: unhandled web grabber '$prog'; hope it succeeds."
esac
err "$THIS: invoking '$prog $base_opts $prog_opts $url'..."
eval "set -- $base_opts $prog_opts"
$prog "$@" "$url"
}
# Parse command-line arguments
parse_arguments () {
while [ $# -gt 0 ]; do
case "$1" in
--encoding=*)
wholeopt="$1"
# extract encoding from after =
encoding="${wholeopt#*=}" ;;
-e|--encoding|-encoding)
shift
encoding="$1" ;;
--grabber=*)
wholeopt="$1"
# extract grabber command from after =
grabber="\"${wholeopt#*=}\"" ;;
-g|--grabber|-grabber)
shift
grabber="$1" ;;
*)
if [ -z "$argument" ]; then
argument="$1"
else
err "Warning: extra argument '$1' will be ignored."
fi ;;
esac
shift
done
}
argument=
encoding=
grabber=
oldifs="$IFS"
IFS=$NEWLINE
parse_arguments $ARGS
IFS="$oldifs"
inurl=
if [ -n "$argument" ] && ! [ -f "$argument" ]; then
# Treat given argument as an URL.
inurl="$argument"
fi
# As a security measure refuse to proceed if mktemp is not available.
pathfind mktemp || { err "Couldn't find 'mktemp'; aborting."; exit 1; }
# Avoid issues with /tmp directory on Windows/Cygwin
cygwin=
cygwin=$(uname | sed -ne '/^CYGWIN/p')
if [ -n "$cygwin" ]; then
TMPDIR=.
export TMPDIR
fi
THIS_TEMPDIR=
THIS_TEMPDIR="$(mktemp -d -t $THIS.XXXXXXXX)" || exit 1
readonly THIS_TEMPDIR
trap 'exitcode=$?
[ -z "$THIS_TEMPDIR" ] || rm -rf "$THIS_TEMPDIR"
exit $exitcode' 0 1 2 3 13 15
if [ -n "$inurl" ]; then
err "Attempting to fetch file from '$inurl'..."
grabber_out=$THIS_TEMPDIR/grabber.out
grabber_log=$THIS_TEMPDIR/grabber.log
if ! grab_url_with "$inurl" "$grabber" 1>$grabber_out 2>$grabber_log; then
errn "grab_url_with failed"
if [ -f $grabber_log ]; then
err " with the following error log."
err
cat >&2 $grabber_log
else
err .
fi
exit 1
fi
argument="$grabber_out"
fi
if [ -z "$encoding" ] && [ "x$argument" != "x" ]; then
# Try to determine character encoding if not specified
# and input is not STDIN.
encoding=$(
head "$argument" |
LC_ALL=C tr 'A-Z' 'a-z' |
sed -ne '/<meta .*content-type.*charset=/ {
s/.*charset=["'\'']*\([-a-zA-Z0-9]*\).*["'\'']*/\1/p
}'
)
fi
if [ -n "$encoding" ] && pathfind iconv; then
alias to_utf8='iconv -f "$encoding" -t utf-8'
else # assume UTF-8
alias to_utf8='cat'
fi
htmlinput=$THIS_TEMPDIR/htmlinput
if [ -z "$argument" ]; then
to_utf8 > $htmlinput # read from STDIN
elif [ -f "$argument" ]; then
to_utf8 "$argument" > $htmlinput # read from file
else
err "File '$argument' not found."
exit 1
fi
if ! cat $htmlinput | pandoc --ignore-args -r html -w markdown "$@" ; then
err "Failed to parse HTML. Trying again with tidy..."
tidy -q -asxhtml -utf8 $htmlinput | \
pandoc --ignore-args -r html -w markdown "$@"
fi


@ -1,42 +0,0 @@
% HSMARKDOWN(1) Pandoc User Manuals
% John MacFarlane
% January 8, 2008
# NAME
hsmarkdown - convert markdown-formatted text to HTML
# SYNOPSIS
hsmarkdown [*input-file*]...
# DESCRIPTION
`hsmarkdown` converts markdown-formatted text to HTML. It is designed
to be usable as a drop-in replacement for John Gruber's `Markdown.pl`.
If no *input-file* is specified, input is read from *stdin*.
Otherwise, the *input-files* are concatenated (with a blank
line between each) and used as input. Output goes to *stdout* by
default. For output to a file, use shell redirection:
hsmarkdown input.txt > output.html
`hsmarkdown` uses the UTF-8 character encoding for both input and output.
If your local character encoding is not UTF-8, you should pipe input
and output through `iconv`:
iconv -t utf-8 input.txt | hsmarkdown | iconv -f utf-8
`hsmarkdown` is implemented as a wrapper around `pandoc`(1). It
calls `pandoc` with the options `--from markdown --to html
--strict` and disables all other options. (Command-line options
will be interpreted as filenames, as they are by `Markdown.pl`.)
# SEE ALSO
`pandoc`(1). The *README*
file distributed with Pandoc contains full documentation.
The Pandoc source code and all documentation may be downloaded from
<http://johnmacfarlane.net/pandoc/>.


@ -1,95 +0,0 @@
% HTML2MARKDOWN(1) Pandoc User Manuals
% John MacFarlane and Recai Oktas
% January 8, 2008
# NAME
html2markdown - converts HTML to markdown-formatted text
# SYNOPSIS
html2markdown [*pandoc-options*] [\-- *special-options*] [*input-file* or
*URL*]
# DESCRIPTION
`html2markdown` converts *input-file* or *URL* (or text
from *stdin*) from HTML to markdown-formatted plain text.
If a URL is specified, `html2markdown` uses an available program
(e.g. wget, w3m, lynx or curl) to fetch its contents. Output is sent
to *stdout* unless an output file is specified using the `-o`
option.
`html2markdown` uses the character encoding specified in the
"Content-type" meta tag. If this is not present, or if input comes
from *stdin*, UTF-8 is assumed. A character encoding may be specified
explicitly using the `-e` special option.
# OPTIONS
`html2markdown` is a wrapper for `pandoc`, so all of
`pandoc`'s options may be used. See `pandoc`(1) for
a complete list. The following options are most relevant:
-s, \--standalone
: Include title, author, and date information (if present) at the
top of markdown output.
-o *FILE*, \--output=*FILE*
: Write output to *FILE* instead of *stdout*.
\--strict
: Use strict markdown syntax, with no extensions or variants.
\--reference-links
: Use reference-style links, rather than inline links, in writing markdown
or reStructuredText.
-R, \--parse-raw
: Parse untranslatable HTML codes as raw HTML.
\--no-wrap
: Disable text wrapping in output. (Default is to wrap text.)
-H *FILE*, \--include-in-header=*FILE*
: Include contents of *FILE* at the end of the header. Implies
`-s`.
-B *FILE*, \--include-before-body=*FILE*
: Include contents of *FILE* at the beginning of the document body.
-A *FILE*, \--include-after-body=*FILE*
: Include contents of *FILE* at the end of the document body.
-C *FILE*, \--custom-header=*FILE*
: Use contents of *FILE*
as the document header (overriding the default header, which can be
printed using `pandoc -D markdown`). Implies `-s`.
# SPECIAL OPTIONS
In addition, the following special options may be used. The special
options must be separated from the `html2markdown` command and any
regular `pandoc` options by the delimiter \``--`', as in
html2markdown -o foo.txt -- -g 'curl -u bar:baz' -e latin1 \
www.foo.com
-e *encoding*, \--encoding=*encoding*
: Assume the character encoding *encoding* in reading HTML.
(Note: *encoding* will be passed to `iconv`; a list of
available encodings may be obtained using `iconv -l`.)
If this option is not specified and input is not from
*stdin*, `html2markdown` will try to extract the character encoding
from the "Content-type" meta tag. If no character encoding is
specified in this way, or if input is from *stdin*, UTF-8 will be
assumed.
-g *command*, \--grabber=*command*
: Use *command* to fetch the contents of a URL. (By default,
`html2markdown` searches for an available program or text-based
browser to fetch the contents of a URL.)
# SEE ALSO
`pandoc`(1), `iconv`(1)


@ -59,11 +59,10 @@ Data-Files:
-- documentation
README, INSTALL, COPYRIGHT, BUGS, changelog,
-- wrappers
markdown2pdf, html2markdown, hsmarkdown
markdown2pdf
Extra-Source-Files:
-- sources for man pages
man/man1/pandoc.1.md, man/man1/markdown2pdf.1.md,
man/man1/html2markdown.1.md, man/man1/hsmarkdown.1.md,
-- tests
tests/bodybg.gif,
tests/writer.latex,
@ -120,8 +119,7 @@ Extra-Source-Files:
tests/lhs-test.html+lhs,
tests/lhs-test.fragment.html+lhs,
tests/RunTests.hs
Extra-Tmp-Files: man/man1/pandoc.1, man/man1/hsmarkdown.1,
man/man1/html2markdown.1, man/man1/markdown2pdf.1
Extra-Tmp-Files: man/man1/pandoc.1, man/man1/markdown2pdf.1
Flag highlighting
Description: Compile in support for syntax highlighting of code blocks.
@ -130,7 +128,7 @@ Flag executable
Description: Build the pandoc executable.
Default: True
Flag wrappers
Description: Build the wrappers (hsmarkdown, markdown2pdf).
Description: Build the wrappers (markdown2pdf).
Default: True
Flag library
Description: Build the pandoc library.
@ -219,17 +217,6 @@ Executable pandoc
else
Buildable: False
Executable hsmarkdown
Hs-Source-Dirs: src
Main-Is: hsmarkdown.hs
Ghc-Options: -Wall -threaded
Ghc-Prof-Options: -auto-all
Extensions: CPP
if flag(wrappers)
Buildable: True
else
Buildable: False
Executable markdown2pdf
Hs-Source-Dirs: src
Main-Is: markdown2pdf.hs


@ -1,47 +0,0 @@
{-
Copyright (C) 2006-8 John MacFarlane <jgm@berkeley.edu>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-}
{- |
Copyright : Copyright (C) 2009 John MacFarlane
License : GNU GPL, version 2 or above
Maintainer : John MacFarlane <jgm@berkeley.edu>
Stability : alpha
Portability : portable
Wrapper around pandoc that emulates Markdown.pl as closely as possible.
-}
module Main where
import System.Process
import System.Environment ( getArgs )
-- Note: ghc >= 6.12 (base >=4.2) supports unicode through iconv
-- So we use System.IO.UTF8 only if we have an earlier version
#if MIN_VERSION_base(4,2,0)
#else
import Prelude hiding ( putStr, putStrLn, writeFile, readFile, getContents )
import System.IO.UTF8
#endif
import Control.Monad (forM_)
main :: IO ()
main = do
files <- getArgs
let runPandoc inp = readProcess "pandoc" ["--from", "markdown", "--to", "html", "--strict"] inp >>= putStrLn
if null files
then getContents >>= runPandoc
else forM_ files $ \f -> readFile f >>= runPandoc