Text.Pandoc.Shared: add parameter to uniqueIdent, inlineListToIdentifier.

The parameter is Extensions. This allows these functions to
be sensitive to the settings of `Ext_gfm_auto_identifiers` and
`Ext_ascii_identifiers`.

This allows us to use `uniqueIdent` in the CommonMark reader,
replacing some custom code.

It also means that `gfm_auto_identifiers` can now be used
in all formats.

Semantically, `gfm_auto_identifiers` is now a modifier of
`auto_identifiers`; for identifiers to be set, `auto_identifiers`
must be turned on, and then the type of identifier produced
depends on `gfm_auto_identifiers` and `ascii_identifiers` are set.

Closes #5057.
This commit is contained in:
John MacFarlane 2018-11-11 13:27:25 -08:00
parent ca17ae5246
commit a36d202e86
15 changed files with 109 additions and 83 deletions

View file

@ -1852,7 +1852,8 @@ output formats
enabled by default in
: `markdown`, `muse`
The algorithm used to derive the identifier from the header text is:
The default algorithm used to derive the identifier from the
header text is:
- Remove all formatting, links, etc.
- Remove all footnotes.
@ -1879,6 +1880,9 @@ same text; in this case, the first will get an identifier as described
above; the second will get the same identifier with `-1` appended; the
third with `-2`; and so on.
(However, a different algorithm is used if
`gfm_auto_identifiers` is enabled; see below.)
These identifiers are used to provide link targets in the table of
contents generated by the `--toc|--table-of-contents` option. They
also make it easy to provide links from one section of a document to
@ -1903,6 +1907,13 @@ Causes the identifiers produced by `auto_identifiers` to be pure ASCII.
Accents are stripped off of accented Latin letters, and non-Latin
letters are omitted.
#### Extension: `gfm_auto_identifiers` ####
Changes the algorithm used by `auto_identifiers` to conform to
GitHub's method. Spaces are converted to dashes (`-`),
uppercase characters to lowercase characters, and punctuation
characters other than `-` and `_` are removed.
Math Input
----------
@ -4189,10 +4200,11 @@ variants are supported:
`shortcut_reference_links`, `spaced_reference_links`.
`markdown_github` (deprecated GitHub-Flavored Markdown)
: `pipe_tables`, `raw_html`, `fenced_code_blocks`, `gfm_auto_identifiers`,
`ascii_identifiers`, `backtick_code_blocks`, `autolink_bare_uris`,
`space_in_atx_header`, `intraword_underscores`, `strikeout`,
`emoji`, `shortcut_reference_links`, `angle_brackets_escapable`,
: `pipe_tables`, `raw_html`, `fenced_code_blocks`, `auto_identifiers`,
`gfm_auto_identifiers`, `backtick_code_blocks`,
`autolink_bare_uris`, `space_in_atx_header`,
`intraword_underscores`, `strikeout`, `emoji`,
`shortcut_reference_links`, `angle_brackets_escapable`,
`lists_without_preceding_blankline`.
`markdown_mmd` (MultiMarkdown)
@ -4219,10 +4231,13 @@ individually disabled.
Also, `raw_tex` only affects `gfm` output, not input.
`gfm` (GitHub-Flavored Markdown)
: `pipe_tables`, `raw_html`, `fenced_code_blocks`, `gfm_auto_identifiers`,
`backtick_code_blocks`, `autolink_bare_uris`,
`intraword_underscores`, `strikeout`, `hard_line_breaks`, `emoji`,
`shortcut_reference_links`, `angle_brackets_escapable`.
: `pipe_tables`, `raw_html`, `fenced_code_blocks`, `auto_identifiers`,
`gfm_auto_identifiers`, `backtick_code_blocks`,
`autolink_bare_uris`, `space_in_atx_header`,
`intraword_underscores`, `strikeout`, `emoji`,
`shortcut_reference_links`, `angle_brackets_escapable`,
`lists_without_preceding_blankline`.
Producing slide shows with pandoc
=================================

View file

@ -56,7 +56,7 @@ import Data.Bits (clearBit, setBit, testBit, (.|.))
import Data.Data (Data)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)
import Text.Pandoc.Shared (safeRead)
import Safe (readMay)
import Text.Parsec
#ifdef DERIVE_JSON_VIA_TH
@ -96,7 +96,8 @@ data Extension =
| Ext_all_symbols_escapable -- ^ Make all non-alphanumerics escapable
| Ext_amuse -- ^ Enable Text::Amuse extensions to Emacs Muse markup
| Ext_angle_brackets_escapable -- ^ Make < and > escapable
| Ext_ascii_identifiers -- ^ ascii-only identifiers for headers
| Ext_ascii_identifiers -- ^ ascii-only identifiers for headers;
-- presupposes Ext_auto_identifiers
| Ext_auto_identifiers -- ^ Automatic identifiers for headers
| Ext_autolink_bare_uris -- ^ Make all absolute URIs into links
| Ext_backtick_code_blocks -- ^ GitHub style ``` code blocks
@ -123,8 +124,9 @@ data Extension =
| Ext_fenced_divs -- ^ Allow fenced div syntax :::
| Ext_footnotes -- ^ Pandoc\/PHP\/MMD style footnotes
| Ext_four_space_rule -- ^ Require 4-space indent for list contents
| Ext_gfm_auto_identifiers -- ^ Automatic identifiers for headers, using
-- GitHub's method for generating identifiers
| Ext_gfm_auto_identifiers -- ^ Use GitHub's method for generating
-- header identifiers; presupposes
-- Ext_auto_identifiers
| Ext_grid_tables -- ^ Grid tables (pandoc, reST)
| Ext_hard_line_breaks -- ^ All newlines become hard line breaks
| Ext_header_attributes -- ^ Explicit header attributes {#id .class k=v}
@ -265,6 +267,7 @@ githubMarkdownExtensions = extensionsFromList
, Ext_pipe_tables
, Ext_raw_html
, Ext_fenced_code_blocks
, Ext_auto_identifiers
, Ext_gfm_auto_identifiers
, Ext_backtick_code_blocks
, Ext_autolink_bare_uris
@ -384,7 +387,7 @@ parseFormatSpec = parse formatSpec ""
extMod = do
polarity <- oneOf "-+"
name <- many $ noneOf "-+"
ext <- case safeRead ("Ext_" ++ name) of
ext <- case readMay ("Ext_" ++ name) of
Just n -> return n
Nothing
| name == "lhs" -> return Ext_literate_haskell

View file

@ -1291,7 +1291,7 @@ registerHeader (ident,classes,kvs) header' = do
let insert' = M.insertWith (\_new old -> old)
if null ident && Ext_auto_identifiers `extensionEnabled` exts
then do
let id' = uniqueIdent (B.toList header') ids
let id' = uniqueIdent exts (B.toList header') ids
let id'' = if Ext_ascii_identifiers `extensionEnabled` exts
then mapMaybe toAsciiChar id'
else id'

View file

@ -36,23 +36,20 @@ where
import Prelude
import CMarkGFM
import Control.Monad.State
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.List (groupBy)
import qualified Data.Map as Map
import Data.Maybe (mapMaybe)
import qualified Data.Set as Set
import Data.Text (Text, unpack)
import Text.Pandoc.Asciify (toAsciiChar)
import Text.Pandoc.Class (PandocMonad)
import Text.Pandoc.Definition
import Text.Pandoc.Emoji (emojiToInline)
import Text.Pandoc.Options
import Text.Pandoc.Shared (stringify)
import Text.Pandoc.Shared (uniqueIdent)
import Text.Pandoc.Walk (walkM)
-- | Parse a CommonMark formatted string into a 'Pandoc' structure.
readCommonMark :: PandocMonad m => ReaderOptions -> Text -> m Pandoc
readCommonMark opts s = return $
(if isEnabled Ext_gfm_auto_identifiers opts
(if isEnabled Ext_auto_identifiers opts
then addHeaderIdentifiers opts
else id) $
nodeToPandoc opts $ commonmarkToNode opts' exts s
@ -78,30 +75,14 @@ convertEmojis s =
addHeaderIdentifiers :: ReaderOptions -> Pandoc -> Pandoc
addHeaderIdentifiers opts doc = evalState (walkM (addHeaderId opts) doc) mempty
addHeaderId :: ReaderOptions -> Block -> State (Map.Map String Int) Block
addHeaderId :: ReaderOptions -> Block -> State (Set.Set String) Block
addHeaderId opts (Header lev (_,classes,kvs) ils) = do
idmap <- get
let ident = toIdent opts ils
ident' <- case Map.lookup ident idmap of
Nothing -> do
put (Map.insert ident 1 idmap)
return ident
Just i -> do
put (Map.adjust (+ 1) ident idmap)
return (ident ++ "-" ++ show i)
return $ Header lev (ident',classes,kvs) ils
ids <- get
let ident = uniqueIdent (readerExtensions opts) ils ids
modify (Set.insert ident)
return $ Header lev (ident,classes,kvs) ils
addHeaderId _ x = return x
toIdent :: ReaderOptions -> [Inline] -> String
toIdent opts =
filterAscii . filterPunct . spaceToDash . map toLower. stringify
where
filterAscii = if isEnabled Ext_ascii_identifiers opts
then mapMaybe toAsciiChar
else id
filterPunct = filter (\c -> isAlphaNum c || c == '_' || c == '-')
spaceToDash = map (\c -> if isSpace c then '-' else c)
nodeToPandoc :: ReaderOptions -> Node -> Pandoc
nodeToPandoc opts (Node _ DOCUMENT nodes) =
Pandoc nullMeta $ foldr (addBlock opts) [] nodes

View file

@ -442,9 +442,11 @@ parPartToInlines' (BookMark _ anchor) =
(modify $ \s -> s { docxAnchorMap = M.insert anchor prevAnchor anchorMap})
return mempty
Nothing -> do
exts <- readerExtensions <$> asks docxOptions
let newAnchor =
if not inHdrBool && anchor `elem` M.elems anchorMap
then uniqueIdent [Str anchor] (Set.fromList $ M.elems anchorMap)
then uniqueIdent exts [Str anchor]
(Set.fromList $ M.elems anchorMap)
else anchor
unless inHdrBool
(modify $ \s -> s { docxAnchorMap = M.insert anchor newAnchor anchorMap})
@ -487,8 +489,9 @@ makeHeaderAnchor' (Header n (ident, classes, kvs) ils)
| (c:_) <- filter isAnchorSpan ils
, (Span (anchIdent, ["anchor"], _) cIls) <- c = do
hdrIDMap <- gets docxAnchorMap
exts <- readerExtensions <$> asks docxOptions
let newIdent = if null ident
then uniqueIdent ils (Set.fromList $ M.elems hdrIDMap)
then uniqueIdent exts ils (Set.fromList $ M.elems hdrIDMap)
else ident
newIls = concatMap f ils where f il | il == c = cIls
| otherwise = [il]
@ -499,8 +502,9 @@ makeHeaderAnchor' (Header n (ident, classes, kvs) ils)
makeHeaderAnchor' (Header n (ident, classes, kvs) ils) =
do
hdrIDMap <- gets docxAnchorMap
exts <- readerExtensions <$> asks docxOptions
let newIdent = if null ident
then uniqueIdent ils (Set.fromList $ M.elems hdrIDMap)
then uniqueIdent exts ils (Set.fromList $ M.elems hdrIDMap)
else ident
modify $ \s -> s {docxAnchorMap = M.insert newIdent newIdent hdrIDMap}
return $ Header n (newIdent, classes, kvs) ils

View file

@ -54,6 +54,7 @@ import qualified Text.XML.Light as XML
import Text.Pandoc.Builder
import Text.Pandoc.MediaBag (MediaBag, insertMedia)
import Text.Pandoc.Shared
import Text.Pandoc.Extensions (extensionsFromList, Extension(..))
import Text.Pandoc.Readers.Odt.Base
import Text.Pandoc.Readers.Odt.Namespaces
@ -253,7 +254,9 @@ getPrettyAnchor = proc (baseIdent, uglyAnchor) -> do
getHeaderAnchor :: OdtReaderSafe Inlines Anchor
getHeaderAnchor = proc title -> do
state <- getExtraState -< ()
let anchor = uniqueIdent (toList title) (Set.fromList $ usedAnchors state)
let exts = extensionsFromList [Ext_auto_identifiers]
let anchor = uniqueIdent exts (toList title)
(Set.fromList $ usedAnchors state)
modifyExtraState (putPrettyAnchor anchor anchor) -<< anchor
@ -768,6 +771,7 @@ read_maybe_nested_img_frame = matchingElement NsDraw "frame"
read_frame :: OdtReaderSafe Inlines Inlines
read_frame =
proc blocks -> do
let exts = extensionsFromList [Ext_auto_identifiers]
w <- ( findAttr' NsSVG "width" ) -< ()
h <- ( findAttr' NsSVG "height" ) -< ()
titleNodes <- ( matchChildContent' [ read_frame_title ] ) -< blocks
@ -776,7 +780,8 @@ read_frame =
_ <- updateMediaWithResource -< resource
alt <- (matchChildContent [] read_plain_text) -< blocks
arr (uncurry4 imageWith ) -<
(image_attributes w h, src, inlineListToIdentifier (toList titleNodes), alt)
(image_attributes w h, src,
inlineListToIdentifier exts (toList titleNodes), alt)
image_attributes :: Maybe String -> Maybe String -> Attr
image_attributes x y =

View file

@ -116,8 +116,7 @@ import Control.Monad (MonadPlus (..), msum, unless)
import qualified Control.Monad.State.Strict as S
import qualified Data.ByteString.Lazy as BL
import qualified Data.Bifunctor as Bifunctor
import Data.Char (isAlpha, isDigit, isLetter, isLower, isSpace, isUpper,
toLower)
import Data.Char (isAlpha, isLower, isSpace, isUpper, toLower, isAlphaNum)
import Data.Data (Data, Typeable)
import Data.List (find, intercalate, intersperse, stripPrefix)
import qualified Data.Map as M
@ -137,7 +136,9 @@ import Text.HTML.TagSoup (RenderOptions (..), Tag (..), renderOptions,
import Text.Pandoc.Builder (Blocks, Inlines, ToMetaValue (..))
import qualified Text.Pandoc.Builder as B
import Data.Time
import Text.Pandoc.Asciify (toAsciiChar)
import Text.Pandoc.Definition
import Text.Pandoc.Extensions (Extensions, Extension(..), extensionEnabled)
import Text.Pandoc.Generic (bottomUp)
import Text.Pandoc.Pretty (charWidth)
import Text.Pandoc.Walk
@ -483,18 +484,29 @@ instance Walkable Block Element where
query f (Blk x) = query f x
query f (Sec _ _ _ ils elts) = query f ils `mappend` query f elts
-- | Convert Pandoc inline list to plain text identifier. HTML
-- identifiers must start with a letter, and may contain only
-- letters, digits, and the characters _-.
inlineListToIdentifier :: [Inline] -> String
inlineListToIdentifier =
dropWhile (not . isAlpha) . intercalate "-" . words .
map (nbspToSp . toLower) .
filter (\c -> isLetter c || isDigit c || c `elem` "_-. ") .
stringify
where nbspToSp '\160' = ' '
nbspToSp x = x
inlineListToIdentifier :: Extensions -> [Inline] -> String
inlineListToIdentifier exts =
dropNonLetter . filterAscii . toIdent . stringify
where
dropNonLetter
| extensionEnabled Ext_gfm_auto_identifiers exts = id
| otherwise = dropWhile (not . isAlpha)
filterAscii
| extensionEnabled Ext_ascii_identifiers exts
= mapMaybe toAsciiChar
| otherwise = id
toIdent
| extensionEnabled Ext_gfm_auto_identifiers exts =
filterPunct . spaceToDash . map toLower
| otherwise = intercalate "-" . words . filterPunct . map toLower
filterPunct = filter (\c -> isSpace c || isAlphaNum c || isAllowedPunct c)
isAllowedPunct c
| extensionEnabled Ext_gfm_auto_identifiers exts = c == '_' || c == '-'
| otherwise = c == '_' || c == '-' || c == '.'
spaceToDash = map (\c -> if isSpace c then '-' else c)
-- | Convert list of Pandoc blocks into (hierarchical) list of Elements
hierarchicalize :: [Block] -> [Element]
@ -530,17 +542,20 @@ headerLtEq _ _ = False
-- | Generate a unique identifier from a list of inlines.
-- Second argument is a list of already used identifiers.
uniqueIdent :: [Inline] -> Set.Set String -> String
uniqueIdent title' usedIdents
= let baseIdent = case inlineListToIdentifier title' of
"" -> "section"
x -> x
numIdent n = baseIdent ++ "-" ++ show n
in if baseIdent `Set.member` usedIdents
then case find (\x -> not $ numIdent x `Set.member` usedIdents) ([1..60000] :: [Int]) of
Just x -> numIdent x
Nothing -> baseIdent -- if we have more than 60,000, allow repeats
else baseIdent
uniqueIdent :: Extensions -> [Inline] -> Set.Set String -> String
uniqueIdent exts title' usedIdents =
if baseIdent `Set.member` usedIdents
then case find (\x -> not $ numIdent x `Set.member` usedIdents)
([1..60000] :: [Int]) of
Just x -> numIdent x
Nothing -> baseIdent
-- if we have more than 60,000, allow repeats
else baseIdent
where
baseIdent = case inlineListToIdentifier exts title' of
"" -> "section"
x -> x
numIdent n = baseIdent ++ "-" ++ show n
-- | True if block is a Header block.
isHeaderBlock :: Block -> Bool

View file

@ -170,7 +170,7 @@ blockToAsciiDoc _ HorizontalRule =
blockToAsciiDoc opts (Header level (ident,_,_) inlines) = do
contents <- inlineListToAsciiDoc opts inlines
ids <- gets autoIds
let autoId = uniqueIdent inlines ids
let autoId = uniqueIdent (writerExtensions opts) inlines ids
modify $ \st -> st{ autoIds = Set.insert autoId ids }
let identifier = if null ident ||
(isEnabled Ext_auto_identifiers opts && ident == autoId)

View file

@ -495,7 +495,7 @@ pandocToEPUB version opts doc@(Pandoc meta _) = do
-- body pages
-- add level 1 header to beginning if none there
let blocks' = addIdentifiers
let blocks' = addIdentifiers opts
$ case blocks of
(Header 1 _ _ : _) -> blocks
_ -> Header 1 ("",["unnumbered"],[])
@ -1056,12 +1056,12 @@ showChapter :: Int -> String
showChapter = printf "ch%03d.xhtml"
-- Add identifiers to any headers without them.
addIdentifiers :: [Block] -> [Block]
addIdentifiers bs = evalState (mapM go bs) Set.empty
addIdentifiers :: WriterOptions -> [Block] -> [Block]
addIdentifiers opts bs = evalState (mapM go bs) Set.empty
where go (Header n (ident,classes,kvs) ils) = do
ids <- get
let ident' = if null ident
then uniqueIdent ils ids
then uniqueIdent (writerExtensions opts) ils ids
else ident
modify $ Set.insert ident'
return $ Header n (ident',classes,kvs) ils

View file

@ -508,7 +508,7 @@ blockToMarkdown' opts (Header level attr inlines) = do
-- we calculate the id that would be used by auto_identifiers
-- so we know whether to print an explicit identifier
ids <- gets stIds
let autoId = uniqueIdent inlines ids
let autoId = uniqueIdent (writerExtensions opts) inlines ids
modify $ \st -> st{ stIds = Set.insert autoId ids }
let attr' = case attr of
("",[],[]) -> empty

View file

@ -270,7 +270,7 @@ blockToMuse (Header level (ident,_,_) inlines) = do
topLevel <- asks envTopLevel
contents <- local (\env -> env { envOneLine = True }) $ inlineListToMuse' inlines
ids <- gets stIds
let autoId = uniqueIdent inlines ids
let autoId = uniqueIdent (writerExtensions opts) inlines ids
modify $ \st -> st{ stIds = Set.insert autoId ids }
let attr' = if null ident || (isEnabled Ext_auto_identifiers opts && ident == autoId)

View file

@ -735,6 +735,7 @@ makeEndNotesSlideBlocks :: Pres [Block]
makeEndNotesSlideBlocks = do
noteIds <- gets stNoteIds
slideLevel <- asks envSlideLevel
exts <- writerExtensions <$> asks envOpts
meta <- asks envMetadata
-- Get identifiers so we can give the notes section a unique ident.
anchorSet <- M.keysSet <$> gets stAnchorMap
@ -743,7 +744,7 @@ makeEndNotesSlideBlocks = do
else let title = case lookupMetaInlines "notes-title" meta of
[] -> [Str "Notes"]
ls -> ls
ident = Shared.uniqueIdent title anchorSet
ident = Shared.uniqueIdent exts title anchorSet
hdr = Header slideLevel (ident, [], []) title
blks = concatMap (\(n, bs) -> makeNoteEntry n bs) $
M.toList noteIds

View file

@ -260,7 +260,8 @@ blockToRST (Header level (name,classes,_) inlines) = do
contents <- inlineListToRST inlines
-- we calculate the id that would be used by auto_identifiers
-- so we know whether to print an explicit identifier
let autoId = uniqueIdent inlines mempty
opts <- gets stOptions
let autoId = uniqueIdent (writerExtensions opts) inlines mempty
isTopLevel <- gets stTopLevel
if isTopLevel
then do

View file

@ -231,7 +231,8 @@ blockToTexinfo (Header level _ lst)
node <- inlineListForNode lst
txt <- inlineListToTexinfo lst
idsUsed <- gets stIdentifiers
let id' = uniqueIdent lst idsUsed
opts <- gets stOptions
let id' = uniqueIdent (writerExtensions opts) lst idsUsed
modify $ \st -> st{ stIdentifiers = Set.insert id' idsUsed }
sec <- seccmd level
return $ if (level > 0) && (level <= 4)

View file

@ -2,24 +2,24 @@ Check that the commonmark reader handles the `ascii_identifiers`
extension properly.
```
% pandoc -f commonmark+gfm_auto_identifiers+ascii_identifiers -t native
% pandoc -f commonmark+auto_identifiers+gfm_auto_identifiers+ascii_identifiers -t native
# non ascii ⚠️ räksmörgås
^D
[Header 1 ("non-ascii--raksmorgas",[],[]) [Str "non",Space,Str "ascii",Space,Str "\9888\65039",Space,Str "r\228ksm\246rg\229s"]]
```
```
% pandoc -f commonmark+gfm_auto_identifiers-ascii_identifiers -t native
% pandoc -f commonmark+auto_identifiers+gfm_auto_identifiers-ascii_identifiers -t native
# non ascii ⚠️ räksmörgås
^D
[Header 1 ("non-ascii-\65039-r\228ksm\246rg\229s",[],[]) [Str "non",Space,Str "ascii",Space,Str "\9888\65039",Space,Str "r\228ksm\246rg\229s"]]
```
`gfm` should have `ascii_identifiers` enabled by default.
`gfm` should have `ascii_identifiers` disabled by default.
```
% pandoc -f gfm -t native
# non ascii ⚠️ räksmörgås
^D
[Header 1 ("non-ascii--raksmorgas",[],[]) [Str "non",Space,Str "ascii",Space,Str "\9888\65039",Space,Str "r\228ksm\246rg\229s"]]
[Header 1 ("non-ascii-\65039-r\228ksm\246rg\229s",[],[]) [Str "non",Space,Str "ascii",Space,Str "\9888\65039",Space,Str "r\228ksm\246rg\229s"]]
```