They should by default scope over the group in which they
are defined (except `\gdef` and `\xdef`, which are global).
In addition, environments must be treated as groups.
We handle this by making sMacros in the LaTeX parser state
a STACK of macro tables. Opening a group adds a table to
the stack, closing one removes one. Only the top of the stack
is queried.
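The stack-of-tables idea can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: `MacroTable`, `openGroup`, `closeGroup`, `defLocal`, `defGlobal`, and `lookupMacro` are hypothetical names, macros are reduced to plain strings, and it assumes that opening a group pushes a copy of the current top table (which is what makes querying only the top sufficient).

```haskell
import qualified Data.Map as M

-- Hypothetical stand-ins: the real parser state stores richer macro values.
type MacroTable = M.Map String String
type MacroStack = [MacroTable]  -- head of the list = top of the stack

-- Opening a group pushes a copy of the current top table (assumption),
-- so definitions from enclosing groups stay visible through the top.
openGroup :: MacroStack -> MacroStack
openGroup []       = [M.empty]
openGroup (t : ts) = t : t : ts

-- Closing a group discards the top table and its local definitions.
closeGroup :: MacroStack -> MacroStack
closeGroup (_ : ts@(_ : _)) = ts
closeGroup ts               = ts  -- never pop the outermost table

-- Group-scoped definition: only the top table changes.
defLocal :: String -> String -> MacroStack -> MacroStack
defLocal name body []       = [M.singleton name body]
defLocal name body (t : ts) = M.insert name body t : ts

-- Global definition (as for \gdef and \xdef): written into every table.
defGlobal :: String -> String -> MacroStack -> MacroStack
defGlobal name body [] = [M.singleton name body]
defGlobal name body ts = map (M.insert name body) ts

-- Only the top of the stack is queried.
lookupMacro :: String -> MacroStack -> Maybe String
lookupMacro _    []      = Nothing
lookupMacro name (t : _) = M.lookup name t
```

With this model, a local redefinition inside a group disappears when the group closes, while a global definition made inside the group survives.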
This commit adds a parameter for scope to the Macro constructor
(not exported).
Closes #7494.
- Fixed semantics for `\let`.
- Implement `\edef`, `\gdef`, and `\xdef`.
- Add comment noting that currently `\def` and `\edef` set global
macros (so are equivalent to `\gdef` and `\xdef`). This should be
fixed by scoping macro definitions to groups, in a future commit.
Closes #7474.
When the slide level is set to 0, headings won't be used at all
in splitting the document into slides. Horizontal rules must be
used to separate slides.
Closes #7476.
We only depend on the urlEncode function in the package, which is also
provided by http-types. The HTTP package also depends on the network
package, which has difficulty building on ghcjs.
Add internal module Text.Pandoc.Network.HTTP, exporting `urlEncode`.
In some cases, the rounding performed by the LaTeX table
writer would introduce visible overrun outside the text
area.
This adds two more decimal places to the width values.
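To see how coarse rounding can cause an overrun, here is a small sketch using exact rational arithmetic. The widths and the choice of two versus four decimal places are made up for illustration; `roundTo`, `exampleWidths`, and `totalAt` are hypothetical names, not the writer's code.

```haskell
import Data.Ratio ((%))

-- Round a fraction to the given number of decimal places.
-- Haskell's 'round' rounds exact halves to the nearest even integer.
roundTo :: Int -> Rational -> Rational
roundTo places x = fromInteger (round (x * scale)) / scale
  where
    scale = 10 ^ places

-- Hypothetical relative column widths that sum to exactly 1.
exampleWidths :: [Rational]
exampleWidths = [67 % 200, 67 % 200, 33 % 100]

-- Total width after rounding every column to the given precision.
totalAt :: Int -> Rational
totalAt places = sum (map (roundTo places) exampleWidths)
```

For these widths, `totalAt 2` is `101 % 100` (slightly wider than the text area), while `totalAt 4` is exactly `1`.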
Previously we just set the source name to "chunk" when parsing
from strings, to avoid misleading source positions.
This had the side effect that `rebase_relative_paths` would break
inside sections that were parsed as strings.
So, now we use "ORIGINAL_SOURCE_PATH_chunk" instead of just "chunk".
Closes #7464.
ulem is conditionally included already when the `strikeout`
variable is set, so we set this when there is underlined text,
and use `\uline` instead of `\underline`.
This fixes wrapping for underlined text.
Closes #7351.
Originally intended for referring to UNIX manual pages, either part of the same DocBook document as a refentry element, or external; hence the manvolnum element.
These days, refentry is more general; for example, the element documentation pages linked below are each a refentry.
As per the *Processing expectations* section of citerefentry, the element is supposed to be a hyperlink to a refentry (when in the same document), but pandoc does not support the refentry tag at the moment, so that is moot.
https://tdg.docbook.org/tdg/5.1/citerefentry.html
https://tdg.docbook.org/tdg/5.1/manvolnum.html
https://tdg.docbook.org/tdg/5.1/refentry.html
This roughly corresponds to a `manpage` role in rST syntax, which produces a `Code` AST node with attributes `.interpreted-text role=manpage`, but that does not fit the DocBook parser.
https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#role-manpage
We now use source positions from the token stream to tell us
how much of the text stream to consume. Getting this to
work required a few other changes to make token source positions
accurate.
Closes #7434.
Just like it is possible to avoid incorporating an image in EPUB by
passing `data-external="1"` to a raw HTML snippet, this makes the same
possible for native Images, by looking for an associated `external`
attribute.
Preserve all attributes in img tags. If attributes have a `data-`
prefix, it will be stripped. In particular, this preserves a
`data-external` attribute as an `external` attribute in the pandoc AST.
Use latest citeproc, which uses a Span with a class rather
than a Note for notes. This helps us distinguish between
user notes and citation notes.
Don't put citations at the beginning of a note in parentheses.
(Closes #7394.)
Previously, using `--citeproc` could cause punctuation to move in
quotes even when there are no citations. This has been changed;
now, punctuation moving is limited to citations.
In addition, we only move footnotes around punctuation if the
style is a note style, even if `notes-after-punctuation` is `true`.
In a previous commit we used strings because boolean False
wouldn't render as `false`. This is changed in the dev
version of doctemplates, so we can go back to the more
straightforward approach.
This ensures that we have proper spacing before the next
line (which might e.g. be a table bottom border).
This gives better results in cases like test/command/7272.md.
Previously the parser would accept characters that are
illegal in domain names, and this sometimes caused it
to gobble bits of the following text.
Closes #7398.
Note that this change, by itself, caused some txt2tags reader
tests to fail. txt2tags allows bare email addresses with
a following form query. So, in addition to the change
to emailAddress, we modify the txt2tags parser so it can
still handle these cases.
Previously it was impossible to specify false values for
options that default to true; setting the option to false
just caused the portion of the template setting the option
to be omitted.
Now we prepopulate all the variables with their default
values, including them unconditionally and allowing them
to be overridden.
Long URLs cannot be treated as mediaPaths, but System.FilePath's
`isRelative` often returns True for them. So we add a check
for an absolute URL. We also ensure that extensions are derived
only from the path portion of URLs (previously a following query
was being included).
Closes #7391.
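The extension part of this fix can be illustrated with a short sketch. It only shows the idea of cutting off the query (and fragment) before looking for an extension; `urlPathPortion` and `extensionFromUrl` are hypothetical helpers, and the example URL is made up.

```haskell
import System.FilePath (takeExtension)

-- Keep only the part of the URL before any query string or fragment.
urlPathPortion :: String -> String
urlPathPortion = takeWhile (`notElem` "?#")

-- Derive the extension from the path portion only.
extensionFromUrl :: String -> String
extensionFromUrl = takeExtension . urlPathPortion
```

Applied to `https://example.com/img/logo.png?size=large`, `extensionFromUrl` yields `.png`, whereas calling `takeExtension` on the whole URL would yield `.png?size=large`.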
If inline references are used (in the metadata `references` field),
we should still only include in the bibliography items that are
actually cited -- unless `nocite` is used.
Closes #7376.
With the 2.14 release `--extract-media` stopped working as before;
there could be mismatches between the paths in the rendered document and
the extracted media.
This patch makes several changes (while keeping the same API).
The `mediaPath` in 2.14 was always constructed from the SHA1 hash of
the media contents. Now, we preserve the original path unless it's
an absolute path or contains `..` segments (in that case we use a path
based on the SHA1 hash of the contents).
When constructing a path from the SHA1 hash, we always use the
original extension, if there is one. Otherwise we look up an
appropriate extension for the mime type.
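The path-selection rule described in the last two paragraphs might look roughly like this. It is a sketch under assumptions: `chooseMediaPath` is a hypothetical name, the SHA1 digest is assumed to be computed elsewhere and passed in as a hex string, and the MIME-derived extension is passed in as an optional argument rather than looked up.

```haskell
import System.FilePath (isAbsolute, splitDirectories, takeExtension, (<.>))

-- 'hashHex' is assumed to be the hex-encoded SHA1 of the contents;
-- 'mimeExt' is an extension derived from the MIME type, if one is known.
chooseMediaPath :: String -> Maybe String -> FilePath -> FilePath
chooseMediaPath hashHex mimeExt original
  | isAbsolute original || ".." `elem` splitDirectories original = hashed
  | otherwise = original  -- safe relative path: keep it unchanged
  where
    -- Prefer the original extension; fall back to the MIME-derived one.
    hashed = case takeExtension original of
      ""  -> maybe hashHex (hashHex <.>) mimeExt
      ext -> hashHex <.> ext
```

For example, `images/fig.png` is kept unchanged, while `../fig.png` becomes the hash with a `.png` extension.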
`mediaDirectory` and `mediaItems` now use the `mediaPath`, rather
than the mediabag key, for the first component of the tuple.
This makes more sense, I think, and fits with the documentation
of these functions; eventually, though, we should rework the API so that
`mediaItems` returns both the keys and the MediaItems.
Rewriting of source paths in `extractMedia` has been fixed.
`fillMediaBag` has been modified so that it doesn't modify
image paths (that was part of the problem in #7345).
We now do path normalization (e.g. `\` separators on Windows) only
in writing the media; the paths are left unchanged in the image
links (sensibly, since they might be URLs and not file paths).
These changes should restore the original behavior from before 2.14.
Closes #7345.
When we do a reverse lookup in the MIME table, we just get the
last match, so when the same mime type is associated with several
different extensions, we sometimes got weird results, e.g. `.vs`
for `text/plain`. These special cases help us get the most standard
extensions for mime types like `text/plain`.
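A toy version of the problem and the fix, with a made-up excerpt of the table: `mimeTable`, `lastMatch`, `preferred`, and `extensionFor` are hypothetical names, and the special-case list here contains a single entry for illustration.

```haskell
-- A tiny, made-up excerpt of an extension-to-MIME table.
mimeTable :: [(String, String)]
mimeTable =
  [ ("png", "image/png")
  , ("txt", "text/plain")
  , ("conf", "text/plain")
  , ("vs", "text/plain")
  ]

-- Naive reverse lookup: the last entry with a matching MIME type wins.
lastMatch :: String -> Maybe String
lastMatch mime =
  case [ext | (ext, m) <- mimeTable, m == mime] of
    []   -> Nothing
    exts -> Just (last exts)

-- Special cases that are consulted before the reverse lookup.
preferred :: [(String, String)]
preferred = [("text/plain", "txt")]

extensionFor :: String -> Maybe String
extensionFor mime =
  case lookup mime preferred of
    Just ext -> Just ext
    Nothing  -> lastMatch mime
```

Here `lastMatch "text/plain"` gives `Just "vs"`, while `extensionFor "text/plain"` gives `Just "txt"`.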
In recent versions the table headers were no longer bottom-aligned
(if more than one line). This patch fixes that by using minipages
for table headers in non-simple tables.
Closes #7347.
Previously pipe tables with empty headers (that is, a header
line with all empty cells) would be rendered as headerless
tables. This broke in 2.11.4.
The fix here is to produce an AST with an empty table head
when a pipe table has all empty header cells.
Closes #7343.
Generally we allow optional starred variants of LaTeX commands
(since many allow them, and if we don't accept these explicitly,
ignoring the star usually gives acceptable results). But we
don't want to do this for `\(*\)` and similar cases.
Closes #7340.
* Column spans
* Row spans
- The spec says that if the `val` attribute is omitted, its value
should be assumed to be `continue`, and that its values are
restricted to {`restart`, `continue`}. If the attribute has any other
value, I think it seems reasonable to default it to `continue`. It
might cause problems if the spec is extended in the future by adding
a third possible value, in which case this would probably give
incorrect behaviour, and wouldn't error.
* Allow multiple header rows
* Include table description in simple caption
- The table description element is like alt text for a table (along
with the table caption element). It seems like we should include
this somewhere, but I’m not 100% sure how – I’m pairing it with the
simple caption for the moment. (Should it maybe go in the block
caption instead?)
* Detect table captions
- Check for caption paragraph style /and/ either the simple or
complex table field. This means the caption detection fails for
captions which don’t contain a field, as in an example doc I added
as a test. However, I think it’s better to be too conservative: a
missed table caption will still show up as a paragraph next to the
table, whereas if I incorrectly classify something else as a table
caption it could cause havoc by pairing it up with a table it’s
not at all related to, or dropping it entirely.
* Update tests and add new ones
Partially fixes: #6316
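The defaulting rule for the row-span `val` attribute described above can be sketched as a small total function. `RowSpanVal` and `parseRowSpanVal` are hypothetical names, and the attribute is modelled as an optional string.

```haskell
data RowSpanVal = Restart | Continue
  deriving (Eq, Show)

-- 'Nothing' models an omitted attribute. Anything other than "restart",
-- including values the spec does not define, is treated as 'Continue'.
parseRowSpanVal :: Maybe String -> RowSpanVal
parseRowSpanVal (Just "restart") = Restart
parseRowSpanVal _                = Continue
```

The catch-all case is what makes an unknown future value silently behave like `continue` instead of raising an error, which is the trade-off noted above.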
- Recognize locators spelled with a capital letter.
Closes #7323.
- Add a comma and a space in front of the suffix if it doesn't start
with space or punctuation. Closes #7324.
- Add manual entry for (non-default) extension
`rebase_relative_paths`.
- Add constructor `Ext_rebase_relative_paths` to `Extensions`
in Text.Pandoc.Extensions [API change]. When enabled, this
extension rewrites relative image and link paths by prepending
the (relative) directory of the containing file.
- Make Markdown reader sensitive to the new extension.
- Add tests for #3752.
Closes #3752.
NB. currently the extension applies to markdown and associated
readers but not commonmark/gfm.
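The rewriting that the extension performs can be approximated as below. This is a sketch, not the reader's implementation: `rebasePath` is a hypothetical name, and the checks for which targets to leave alone (fragment-only links, anything containing `://`, absolute paths) are simplifying assumptions.

```haskell
import Data.List (isInfixOf, isPrefixOf)
import System.FilePath (isAbsolute, takeDirectory, (</>))

-- Prepend the directory of the containing file to a relative target.
rebasePath :: FilePath -> String -> String
rebasePath containingFile target
  | null target              = target
  | "#" `isPrefixOf` target  = target  -- fragment-only link
  | "://" `isInfixOf` target = target  -- crude absolute-URL check (assumption)
  | isAbsolute target        = target
  | dir == "."               = target  -- file is in the working directory
  | otherwise                = dir </> target
  where
    dir = takeDirectory containingFile
```

So an image `img/fig.png` referenced from `chap1/text.md` would be rewritten to `chap1/img/fig.png` on a POSIX system.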
- Improve parsing of `\def` macros. We previously set "verbatim mode"
even for parsing the initial `\def`; this caused problems for things
like
```
\def\foo{\def\bar{BAR}}
\foo
\bar
```
- Implement `\newif`.
- Add tests.
In the current dev version, we will sometimes add
a version of an image with a hashed name, keeping
the original version with the original name, which
would lead to undesirable duplication.
This change separates the media's filename from the
media's canonical name (which is the path of the link
in the document itself). Filenames are based on SHA1
hashes and assigned automatically.
In Text.Pandoc.MediaBag:
- Export MediaItem type [API change].
- Change MediaBag type to a map from Text to MediaItem [API change].
- `lookupMedia` now returns a `MediaItem` [API change].
- Change `insertMedia` so it sets the `mediaPath` to
a filename based on the SHA1 hash of the contents.
This will be used when contents are extracted.
In Text.Pandoc.Class.PandocMonad:
- Remove `fetchMediaResource` [API change].
Lua MediaBag module has been changed minimally. In the future
it would be better, probably, to give Lua access to the full
MediaItem type.
See <https://www.w3.org/TR/html4/types.html#h-6.6>.
"A relative length has the form "i*", where "i" is an integer. When
allotting space among elements competing for that space, user agents
allot pixel and percentage lengths first, then divide up remaining
available space among relative lengths. Each relative length receives a
portion of the available space that is proportional to the integer
preceding the "*". The value "*" is equivalent to "1*". Thus, if 60
pixels of space are available after the user agent allots pixel and
percentage space, and the competing relative lengths are 1*, 2*, and 3*,
the 1* will be alloted 10 pixels, the 2* will be alloted 20 pixels, and
the 3* will be alloted 30 pixels."
Closes #4063.
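The arithmetic in the quoted passage can be written down directly. `allotRelative` is a hypothetical helper that divides the remaining space in proportion to the integers preceding the `*`; it uses integer division, so it is only a sketch of the proportional rule.

```haskell
-- Divide the available space among relative lengths, given their weights
-- (the integer before the "*", with a bare "*" counting as 1).
allotRelative :: Int -> [Int] -> [Int]
allotRelative available weights
  | total <= 0 = map (const 0) weights
  | otherwise  = [available * w `div` total | w <- weights]
  where
    total = sum weights
```

With 60 pixels and weights 1, 2, and 3, this gives 10, 20, and 30, matching the example in the specification.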
There's still one slight divergence from the siunitx behavior:
we get 'kg m/A/s' instead of 'kg m/(A s)'. At the moment I'm
not going to worry about that.
Closes #5016
- change ordered list from itemize to enumerate
- add new itemgroup for ordered lists
- add fontfeature for table figures
- remove width from itemize in context writer
Successive quote characters are separated with a thin space to improve
readability and to prevent unwanted ligatures. Detection of these quotes
sometimes failed if the second quote was nested in a span element.
Closes: #6958
...to a Div with id 'refs'. Previously we just left the
attributes of such a Div alone, which meant that style
options like entry-spacing had no effect there.
If a code block is defined with `<pre><code
class="language-x">…</code></pre>`, where the `<pre>` element has no
attributes, then the attributes from the `<code>` element are used
instead. Any leading `language-` prefix in the code's *class*
attribute is dropped to improve syntax highlighting.
Closes: #7221
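A rough sketch of the attribute handling, with attributes modelled as an (identifier, classes, key-value pairs) triple. `dropLanguagePrefix` and `codeBlockAttr` are hypothetical names, and keeping the `<pre>` attributes unchanged in the other case is an assumption.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Attributes as (identifier, classes, key-value pairs).
type Attr = (String, [String], [(String, String)])

-- Remove a leading "language-" from a class name, if present.
dropLanguagePrefix :: String -> String
dropLanguagePrefix cls = fromMaybe cls (stripPrefix "language-" cls)

-- Use the <code> attributes only when <pre> has none at all.
codeBlockAttr :: Attr -> Attr -> Attr
codeBlockAttr preAttr (ident, classes, kvs)
  | preAttr == ("", [], []) = (ident, map dropLanguagePrefix classes, kvs)
  | otherwise               = preAttr
```

For instance, an empty `<pre>` with `<code class="language-haskell">` ends up with the single class `haskell`.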
From settings.xml in the reference-doc, we now include:
`zoom`, `embedSystemFonts`, `doNotTrackMoves`, `defaultTabStop`,
`drawingGridHorizontalSpacing`, `drawingGridVerticalSpacing`,
`displayHorizontalDrawingGridEvery`, `displayVerticalDrawingGridEvery`,
`characterSpacingControl`, `savePreviewPicture`, `mathPr`, `themeFontLang`,
`decimalSymbol`, `listSeparator`, `autoHyphenation`, `compat`.
Closes #7240.
The tags `<title>` and `<h1 class="title">` often contain the same
information, so the latter was dropped from the document. However, as
this can lead to loss of information, the heading is now always
retained.
Use `--shift-heading-level-by=-1` to turn the `<h1>` into the document
title, or a filter to restore the previous behavior.
Closes: #2293
This fixes a regression introduced with the colspan/rowspan
changes that caused column alignments to be ignored. The column
alignment is used only if a default alignment is specified at the cell
level; otherwise the cell-level alignment takes precedence.
The change provides a way to use citation keys that contain
special characters not usable with the standard citation
key syntax. Example: `@{foo_bar{x}'}` for the key `foo_bar{x}'`.
Closes #6026.
The change requires adding a new parameter to the `citeKey`
parser from Text.Pandoc.Parsing [API change].
Markdown reader: recognize @{..} syntax for citations.
Markdown writer: use @{..} syntax for citations when needed.
Update manual with curly-brace syntax for citations.
Closes #6026.
The settings we can carry over from a reference.docx are
autoHyphenation, consecutiveHyphenLimit, hyphenationZone,
doNotHyphenateCap, evenAndOddHeaders, and proofState.
Previously this was implemented in a buggy way, so that the
reference doc's values AND the new values were included.
This change allows users to create a reference.docx that
sets w:proofState for spelling or grammar to "dirty,"
so that spell/grammar checking will be triggered on the
generated docx.
Closes #1209.
Previously, when multiple file arguments were provided, pandoc
simply concatenated them and passed the contents to the readers,
which took a Text argument.
As a result, the readers had no way of knowing which file
was the source of any particular bit of text. This meant that
we couldn't report accurate source positions on errors or
include accurate source positions as attributes in the AST.
More seriously, it meant that we couldn't resolve resource
paths relative to the files containing them
(see e.g. #5501, #6632, #6384, #3752).
Add Text.Pandoc.Sources (exported module), with a `Sources` type
and a `ToSources` class. A `Sources` wraps a list of `(SourcePos,
Text)` pairs. [API change] A parsec `Stream` instance is provided for
`Sources`. The module also exports versions of parsec's `satisfy` and
other Char parsers that track source positions accurately from a
`Sources` stream (or any instance of the new `UpdateSourcePos` class).
Text.Pandoc.Parsing now exports these modified Char parsers instead of
the ones parsec provides. Modified parsers to use a `Sources` as stream
[API change].
The readers that previously took a `Text` argument have been
modified to take any instance of `ToSources`. So, they may still
be used with a `Text`, but they can also be used with a `Sources`
object.
In Text.Pandoc.Error, modified the constructor PandocParsecError
to take a `Sources` rather than a `Text` as first argument,
so parse error locations can be accurately reported.
T.P.Error: showPos, do not print "-" as source name.
Treat a leading " with no closing " as a left curly quote.
This supports the practice, in fiction, of continuing
paragraphs quoting the same speaker without an end quote.
It also helps with quotes that break over lines in line
blocks.
Closes #7216.
If the element has a content-type attribute, or at least one class, then
that value is used as `content-type` and the span is put inside a
`<named-content>` element. Otherwise a `<styled-content>` element is
used instead.
Closes: #7211
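The choice between the two wrapper elements can be sketched as a small decision function. `spanWrapper` is a hypothetical name; it returns the element name together with the `content-type` value, if any, and it assumes that the first class is the one used when no explicit `content-type` attribute is present.

```haskell
-- Given a span's key-value attributes and classes, pick the wrapper
-- element and the value for its content-type attribute.
spanWrapper :: [(String, String)] -> [String] -> (String, Maybe String)
spanWrapper kvs classes =
  case lookup "content-type" kvs of
    Just ct -> ("named-content", Just ct)
    Nothing ->
      case classes of
        (c : _) -> ("named-content", Just c)  -- first class (assumption)
        []      -> ("styled-content", Nothing)
```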
Also taking this opportunity to note, for the record, that
the commit for #7241 should be marked [API change].
It changes the type of `languagesByExtension` in Highlighting,
adding a parameter for a `SyntaxMap`.
Languages defined using `--syntax-definition` were not recognized by `languagesByExtension`.
This patch corrects that, allowing the writers to see all custom definitions.
The LaTeX writer still uses the default syntax map, but that's okay in that context, since
`--syntax-definition` won't create new listings styles.
When a block only has a single class and no other attributes,
it is not necessary to wrap the class attribute in curly braces –
the class name can be placed after the opening mark as is.
This will result in a bit cleaner output when pandoc is used
as a markdown pretty-printer.
This fixes a bug, which caused the writer to look at the LAST
rather than the FIRST character in determining whether quotes
were needed. So we got spurious quotes in some cases and
didn't get necessary quotes in others.
Closes #7245. Updated a number of test cases accordingly.
even if it differs from localeLanguage. (It is designed
to be possible to override the locale language, and this
is especially useful when one wants to use the unicode
extension syntax, e.g. fr-u-kb.)
The `<p>` element is used for wrapping in cases where the contents would
otherwise not be allowed in a certain context. Unnecessary wrapping is
avoided, especially around quotes (`<disp-quote>` elements).
Closes: #7227
Spans with attributes are converted to `<named-content>` elements
instead of being wrapped with `<milestone-start/>` and `<milestone-end>`
elements. Milestone elements are not allowed in documents using the
articleauthoring tag set, so this change ensures the creation of valid
documents.
Closes: #7211
Footnotes in the backmatter are given the footnote's number as a label.
The articleauthoring output is unaffected by this change, as footnotes
are placed inline there.
Closes: #7210
XML identifiers must start with an underscore or letter, and can contain
only a limited set of punctuation characters. Any IDs not adhering to
these rules are rewritten by writing the offending characters as Uxxxx,
where `xxxx` is the character's hex code.
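A simplified sketch of such a rewrite follows. The validity rules here are an approximation (first character a letter or underscore, later characters letters, digits, `_`, `-`, or `.`), and the four-digit uppercase hex format is an assumption; `sanitizeId` is a hypothetical name.

```haskell
import Data.Char (isAlphaNum, isLetter, ord)
import Text.Printf (printf)

-- Rewrite characters that are not allowed at their position as Uxxxx,
-- where xxxx is the character's code point in hexadecimal.
sanitizeId :: String -> String
sanitizeId []       = []
sanitizeId (c : cs) = encodeUnless okStart c ++ concatMap (encodeUnless okRest) cs
  where
    okStart x = isLetter x || x == '_'
    okRest x  = isAlphaNum x || x `elem` "_-."
    encodeUnless ok x
      | ok x      = [x]
      | otherwise = printf "U%04X" (ord x)
```

Under these assumptions, `sanitizeId "1st sec"` produces `U0031stU0020sec`, and an already valid ID such as `_ok-id.1` is returned unchanged.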
Instead of encoding a giant (and incomplete) map, we now
just use unicode-transforms to normalize the text to
a canonical decomposition, and manipulate the result.
The new `toAsciiText` is equivalent to the old
`T.pack . mapMaybe toAsciiChar . T.unpack` but should be faster.
This is a bit more limited than with markdown, as documented
in the manual:
- The YAML block must be the first thing in the input.
- The leaf nodes are parsed in isolation from the rest of
the document. So, for example, you can't use reference
links if the references are defined later in the document.
Closes #6537.
Add key-value pairs found in the attributes list of Header.Attr as
XML attributes on the corresponding section element.
Any key name not allowed as an XML attribute name is dropped, as
are keys with invalid values where they are defined as enums in
DocBook, and xml:id (for DocBook 5)/id (for DocBook 4), so as not to
interfere with computed identifiers.
[API change]
These are inefficient association list lookups.
Replace with more efficient functions in the writers that
used them (with 10-25% performance improvements in
haddock, org, rtf, texinfo writers).
T.P.Parsing: revise type of readWithM so that it takes a Text
rather than a polymorphic ToText value.
These typeclasses were there to ease the transition from String
to Text. They are no longer needed, and they may clash with
more useful versions under the same name.
This will require a bump to 2.13.
Previously we assigned a random number (though in a deterministic
way). But changes in the random package mean we get different
results now on different architectures, even with the same random
seed. We don't need random values; so now we just assign a value
based on the list number id, which is guaranteed to be unique
to the list marker.
Previously the nocite metadata field was ignored with
these formats. Now it populates a `nocite-ids` template
variable and causes a `\nocite` command to be issued.
Closes #4585.
This is code that incorporates a prefix like `https://doi.org/`
into a following link when appropriate. But it didn't work because
we were walking with a `[Inline] -> [Inline]` function on an `Inlines`.
Changed the point of application of `fixLink` to resolve the issue.
Closes #7130.
This prevents emitting invalid HTML.
Ultimately it would be good to prevent this in the types
themselves, but this is better for now.
T.P.Logging: Add DuplicateAttribute constructor to LogMessage.
[API change]
Code blocks that are not marked as a language supported by Jira are
rendered as preformatted text with `{noformat}` blocks.
Fixes: tarleb/jira-wiki-markup#4
Previously, if `--resource-path` were used multiple times, the last
resource path would replace the others.
With this change, each time `--resource-path` is used, it prepends
the specified path components to the existing resource path.
Similarly, when `resource-path` is specified in a defaults file,
the paths provided will be prepended to the existing resource
path.
This change also allows one to avoid using the OS-specific path
separator; instead, one can simply use `--resource-path`
a number of times with single paths. This form of command
will not have an OS-dependent behavior.
This change facilitates the use of multiple, small defaults
files: each can specify a directory containing its own
resources without clobbering the resource paths set by
the others.
Closes #6152.
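The prepending behavior can be modelled as a left fold over the option's occurrences. `addResourcePath` and `resolveResourcePath` are hypothetical names, and starting from a resource path containing only the working directory is an assumption made for the example.

```haskell
import System.FilePath (splitSearchPath)

-- Each occurrence of the option prepends its path components
-- to whatever resource path has been accumulated so far.
addResourcePath :: [FilePath] -> String -> [FilePath]
addResourcePath existing arg = splitSearchPath arg ++ existing

-- Process the occurrences in command-line order.
resolveResourcePath :: [String] -> [FilePath]
resolveResourcePath = foldl addResourcePath ["."]
```

Passing the option twice, first with `assets` and then with `theme/img`, yields `["theme/img", "assets", "."]`: later occurrences are searched first, and nothing is clobbered.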
to refer to the directory where the defaults file is.
This will make it possible to create moveable
"packages" of resources in a directory.
Closes #5871.
This allows the syntax `${HOME}` to be used, in fields that expect
file paths only. Any environment variable may be interpolated
in this way. A warning will be raised for undefined variables.
The special variable `USERDATA` is automatically set to the
user data directory in force when the defaults file is parsed.
(Note: it may be different from the eventual user data directory,
if the defaults file or further command line options change that.)
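A pure sketch of this kind of interpolation is shown below. The environment is passed in as an association list so that the function stays testable; `interpolate` is a hypothetical name, and replacing an undefined variable with the empty string while reporting its name is an assumption about the behavior.

```haskell
-- Replace each ${NAME} with its value from the environment. Returns the
-- resulting string and the names of undefined variables (for warnings).
interpolate :: [(String, String)] -> String -> (String, [String])
interpolate env = go
  where
    go ('$' : '{' : rest) =
      case break (== '}') rest of
        (name, '}' : after) ->
          let (out, missing) = go after
          in case lookup name env of
               Just val -> (val ++ out, missing)
               Nothing  -> (out, name : missing)
        _ -> ("${" ++ rest, [])  -- no closing brace: leave text untouched
    go (c : cs) = let (out, missing) = go cs in (c : out, missing)
    go []       = ([], [])
```

With `[("HOME", "/home/u")]` as the environment, `"${HOME}/filters/x.lua"` becomes `"/home/u/filters/x.lua"` with no warnings.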
Closes #5982.
Closes #5977.
Closes #6108 (path not taken).
Rationale: the manual says that the XDG data directory will
be used if it exists, otherwise the legacy data directory.
So we should just determine this and use this directory,
rather than having a search path which could cause some
things to be taken from one data directory and others from
others.
[API change]
[API change]
These were only exported for testing, which seems the
wrong thing to do. They don't belong in the public
API and are not really usable as they are, without access
to the Tok type which is not exported.
Removed the tokenize/untokenize roundtrip test.
We put a quickcheck property in the comments which
may be used when this code is touched (if it is).
...when handling URL argument served with no charset in the mime type.
The assumption is that most pages that don't specify a charset
in the mime type are either UTF-8 or latin1. I think that's a good
assumption, though I'm not sure.
[API change] This affects `readFile`, `getContents`, `writeFileWith`,
`writeFile`, `putStrWith`, `putStr`, `putStrLnWith`, `putStrLn`,
`hPutStrWith`, `hPutStr`, `hPutStrLnWith`, `hPutStrLn`, `hGetContents`.
This avoids the need to uselessly create a linked list of characters
when emitting output.
Also, remove exported class NamedTag(..) [API change].
This was just intended to smooth over the transition from String to Text
and is no longer needed.
The functions isInlineTag and isBlockTag are no longer
polymorphic.
With the new XML parser, we can avoid the expensive tree
normalization step we used to do.
This gives a significant speed boost in docbook and JATS
parsing (e.g. 9.7 to 6 ms).
This is to prevent accidental creation of ligatures like
`` ?` `` and `` !` `` (especially in languages with quotations
like German), and similar ligature issues.
See jgm/citeproc#54.
Use `\vadjust pre` so that the hypertarget takes you to the
beginning of the paragraph rather than one line down.
Closes #7078.
This makes a particular difference for links to citations
using `--citeproc` and `link-citations: true`.
The org-ref syntax allows listing multiple citations separated by commas.
This fixes a bug that accepted commas as part of the citation id, so all
citation lists were parsed as one single citation.
Fixes: #7101
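The intended behavior amounts to treating the comma as a separator rather than as part of a key. A minimal sketch, where `splitKeys` is a hypothetical helper operating on the text after the citation prefix:

```haskell
-- Split a comma-separated list of citation keys.
splitKeys :: String -> [String]
splitKeys s =
  case break (== ',') s of
    (key, ',' : rest) -> key : splitKeys rest
    (key, _)          -> [key]
```

For example, `splitKeys "doe2020,smith2019"` gives two keys instead of the single key `doe2020,smith2019`.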
...and add new definitions isomorphic to xml-light's, but with
Text instead of String. This allows us to keep most of the code in
existing readers that use xml-light, but avoid lots of unnecessary
allocation.
We also add versions of the functions from xml-light's
Text.XML.Light.Output and Text.XML.Light.Proc that operate
on our modified XML types, and functions that convert
xml-light types to our types (since some of our dependencies,
like texmath, use xml-light).
Update golden tests for docx and pptx.
OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.
Docx: Do a manual traversal to unwrap sdt and smartTag.
This is faster, and needed to pass the tests.
Benchmarks:
A = prior to 8ca191604d (Feb 8)
B = as of 8ca191604d (Feb 8)
C = this commit
| Reader | A | B | C |
| ------- | ----- | ------ | ----- |
| docbook | 18 ms | 12 ms | 10 ms |
| opml | 65 ms | 62 ms | 35 ms |
| jats | 15 ms | 11 ms | 9 ms |
| docx | 72 ms | 69 ms | 44 ms |
| odt | 78 ms | 41 ms | 28 ms |
| epub | 64 ms | 61 ms | 56 ms |
| fb2 | 14 ms | 5 ms | 4 ms |
- If src is empty, we simply skip the iframe.
- If src is invalid or cannot be fetched, we issue a warning
and skip instead of failing with an error.
- Closes #7099.
* Rewrote `withRaw` so it doesn't rely on fragile assumptions
about token positions (which break when macros are expanded).
This requires the addition of `sEnableWithRaw` and `sRawTokens`
in `LaTeXState`, and a new combinator `disablingWithRaw` to
disable collecting of raw tokens in certain contexts.
* Add `parseFromToks` to T.P.Readers.LaTeX.Parsing.
* Fix parsing of single character tokens so it doesn't mess
up the new raw token collecting.
* These changes slightly increase allocations and have a small
performance impact, but it's minor.
Closes #7092.
Setting SOURCE_DATE_EPOCH will allow reproducible builds.
Partially addresses #7093. This does not suffice to fully enable
reproducible builds of EPUB, since a unique id is being generated for each
build.
This attempts to read the SOURCE_DATE_EPOCH environment variable
and parse a UTC time from it (treating it as a unix date stamp,
see https://reproducible-builds.org/specs/source-date-epoch/).
If the variable is not set or can't be parsed as a unix date
stamp, then the function returns the current date.
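The lookup-and-fallback logic described here could be sketched like this. `parseEpoch` and `timestampFromEnv` are hypothetical names, not the function added by the commit, and the sketch only accepts whole numbers of seconds.

```haskell
import Data.Time.Clock (UTCTime, getCurrentTime)
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)
import System.Environment (lookupEnv)
import Text.Read (readMaybe)

-- Interpret a string as seconds since the Unix epoch, if it is an integer.
parseEpoch :: String -> Maybe UTCTime
parseEpoch s = posixSecondsToUTCTime . fromInteger <$> readMaybe s

-- Use SOURCE_DATE_EPOCH when it is set and parses; otherwise the current time.
timestampFromEnv :: IO UTCTime
timestampFromEnv = do
  mval <- lookupEnv "SOURCE_DATE_EPOCH"
  case mval >>= parseEpoch of
    Just t  -> pure t
    Nothing -> getCurrentTime
```

Keeping the parsing step pure makes the fallback easy to test without touching the process environment.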
This exports functions that use xml-conduit's parser to
produce an xml-light Element or [Content]. This allows
existing pandoc code to use a better parser without
much modification.
The new parser is used in all places where xml-light's
parser was previously used. Benchmarks show a significant
performance improvement in parsing XML-based formats
(especially ODT and FB2).
Note that the xml-light types use String, so the
conversion from xml-conduit types involves a lot
of extra allocation. It would be desirable to
avoid that in the future by gradually switching
to using xml-conduit directly. This can be done
module by module.
The new parser also reports errors, which we report
when possible.
A new constructor PandocXMLError has been added to
PandocError in T.P.Error [API change].
Closes #7091, which was the main stimulus.
These changes revealed the need for some changes
in the tests. The docbook-reader.docbook test
lacked definitions for the entities it used; these
have been added. And the docx golden tests have been
updated, because the new parser does not preserve
the order of attributes.
Add entity defs to docbook-reader.docbook.
Update golden tests for docx.