Markdown reader: use CommonMark rules for list item nesting.

Closes #3511.

Previously pandoc used the four-space rule: continuation paragraphs, sublists, and other block-level content had to be indented 4 spaces. Now the indentation required is determined by the first line of the list item: to be included in the list item, blocks must be indented to the level of the first non-space content after the list marker. Exception: if there are 5 or more spaces after the list marker, then the content is interpreted as an indented code block, and continuation paragraphs must be indented two spaces beyond the end of the list marker. See the CommonMark spec for more details and examples.

Documents that adhere to the four-space rule should, in most cases, be parsed the same way by the new rules. Here are some examples of texts that will be parsed differently:

    - a
      - b

will be parsed as a list item with a sublist; under the four-space rule, it would be a list with two items.

    - a

          code

Here we have an indented code block under the list item, even though it is only indented six spaces from the margin, because it is four spaces past the point where a continuation paragraph could begin. With the four-space rule, this would be a regular paragraph rather than a code block.

    - a

            code

Here the code block will start with two spaces, whereas under the four-space rule, it would start with `code`. With the four-space rule, indented code under a list item must always be indented eight spaces from the margin, while the new rules require only that it be indented four spaces from the beginning of the first non-space text after the list marker (here, `a`).

This change was motivated by a slew of bug reports from people who expected lists to work differently (#3125, #2367, #2575, #2210, #1990, #1137, #744, #172, #137, #128) and by the growing prevalence of CommonMark (now used by GitHub, for example).

Users who want to use the old rules can select the `four_space_rule` extension.

* Added `four_space_rule` extension.
* Added `Ext_four_space_rule` to `Extensions`.
* `Parsing` now exports `gobbleAtMostSpaces`, and the type of `gobbleSpaces` has been changed so that a `ReaderOptions` parameter is not needed.
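
The indentation rule described above can be sketched as a small standalone Haskell program. This is only an illustration under simplifying assumptions, not pandoc's implementation: `Rule`, `markerWidth`, and `continuationIndent` are hypothetical names, only bullet markers and `N.` ordered markers are recognized, tabs are ignored, and the indent is measured from the left edge of the line passed in.

```haskell
import Data.Char (isDigit)

-- | Which nesting rule to apply (hypothetical type, not part of pandoc).
data Rule = CommonMarkRule | FourSpaceRule
  deriving (Eq, Show)

-- | Width of a list marker at the start of the string: a bullet
-- (-, + or *) or a run of digits followed by a period.
markerWidth :: String -> Maybe Int
markerWidth (c:_) | c `elem` "-+*" = Just 1
markerWidth s =
  case span isDigit s of
    (ds@(_:_), '.':_) -> Just (length ds + 1)
    _                 -> Nothing

-- | Columns by which a following block must be indented to count as
-- part of the list item that starts on the given line.
-- Returns Nothing if the line does not start a list item.
continuationIndent :: Rule -> String -> Maybe Int
continuationIndent rule line = do
  let (lead, rest) = span (== ' ') line
  w <- markerWidth rest
  -- number of spaces between the marker and the first content
  let gap = length (takeWhile (== ' ') (drop w rest))
  if gap == 0
    then Nothing
    else Just $ case rule of
      FourSpaceRule -> 4
      CommonMarkRule
        -- 5+ spaces: content is indented code, so continuation
        -- starts one space past the marker
        | gap >= 5  -> length lead + w + 1
        -- otherwise: line up with the first non-space content
        | otherwise -> length lead + w + gap

main :: IO ()
main = mapM_ report ["- a", "1000. one", "-     code", "not a list"]
  where
    report l = putStrLn $
      show l ++ ": " ++ show (continuationIndent CommonMarkRule l)
             ++ " vs " ++ show (continuationIndent FourSpaceRule l)
```

Under these assumptions, `- a` needs an indent of 2, `1000. one` needs 6, and `-     code` (five spaces after the marker) needs 2, while the four-space rule always asks for 4. That is why, in the second example above, code indented six spaces under `- a` is four columns past the continuation point and becomes a code block.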
2017-08-19 10:56:15 -07:00
commit a31241a08b
parent 5ab1162def
8 changed files with 178 additions and 117 deletions
--- a/MANUAL.txt
+++ b/MANUAL.txt
@@ -2078,12 +2078,12 @@ But Markdown also allows a "lazy" format:
    list item.
    * and my second.
-### The four-space rule ###
+### Block content in list items ###
 A list item may contain multiple paragraphs and other block-level
 content. However, subsequent paragraphs must be preceded by a blank line
-and indented four spaces or a tab. The list will look better if the first
-paragraph is aligned with the rest:
+and indented to line up with the first non-space content after
+the list marker.
      * First paragraph.
@@ -2094,19 +2094,29 @@ paragraph is aligned with the rest:
            { code }
+Exception: if the list marker is followed by an indented code
+block, which must begin 5 spaces after the list marker, then
+subsequent paragraphs must begin two columns after the last
+character of the list marker:
+    *     code
+      continuation paragraph
 List items may include other lists.  In this case the preceding blank
-line is optional.  The nested list must be indented four spaces or
-one tab:
+line is optional.  The nested list must be indented to line up with
+the first non-space character after the list marker of the
+containing list item.
    * fruits
-        + apples
+      + apples
-            - macintosh
+        - macintosh
-            - red delicious
+        - red delicious
-        + pears
+      + pears
-        + peaches
+      + peaches
    * vegetables
-        + broccoli
+      + broccoli
-        + chard
+      + chard
 As noted above, Markdown allows you to write list items "lazily," instead of
 indenting continuation lines. However, if there are multiple paragraphs or
@@ -2121,21 +2131,6 @@ other blocks in a list item, the first line of each must be indented.
        Second paragraph of second
    list item.
-**Note:**  Although the four-space rule for continuation paragraphs
-comes from the official [Markdown syntax guide], the reference implementation,
-`Markdown.pl`, does not follow it. So pandoc will give different results than
-`Markdown.pl` when authors have indented continuation paragraphs fewer than
-four spaces.
-The [Markdown syntax guide] is not explicit whether the four-space
-rule applies to *all* block-level content in a list item; it only
-mentions paragraphs and code blocks.  But it implies that the rule
-applies to all block-level content (including nested lists), and
-pandoc interprets it that way.
-  [Markdown syntax guide]:
-    http://daringfireball.net/projects/markdown/syntax#list
 ### Ordered lists ###
 Ordered lists work just like bulleted lists, except that the items
@@ -3606,6 +3601,12 @@ implied by pandoc's default `all_symbols_escapable`.
 Allow a list to occur right after a paragraph, with no intervening
 blank space.
+#### Extension: `four_space_rule` ####
+Selects the pandoc <= 2.0 behavior for parsing lists, so that
+four spaces indent are needed for list item continuation
+paragraphs.
 #### Extension: `spaced_reference_links` ####
 Allow whitespace between the two components of a reference link,
--- a/src/Text/Pandoc/Extensions.hs
+++ b/src/Text/Pandoc/Extensions.hs
@@ -111,6 +111,7 @@ data Extension =
    | Ext_autolink_bare_uris  -- ^ Make all absolute URIs into links
    | Ext_fancy_lists         -- ^ Enable fancy list numbers and delimiters
    | Ext_lists_without_preceding_blankline -- ^ Allow lists without preceding blank
+    | Ext_four_space_rule     -- ^ Require 4-space indent for list contents
    | Ext_startnum            -- ^ Make start number of ordered list significant
    | Ext_definition_lists    -- ^ Definition lists as in pandoc, mmd, php
    | Ext_compact_definition_lists  -- ^ Definition lists without
--- a/src/Text/Pandoc/Parsing.hs
+++ b/src/Text/Pandoc/Parsing.hs
@@ -50,6 +50,7 @@ module Text.Pandoc.Parsing ( takeWhileP,
                             blankline,
                             blanklines,
                             gobbleSpaces,
+                             gobbleAtMostSpaces,
                             enclosed,
                             stringAnyCase,
                             parseFromString,
@@ -380,14 +381,33 @@ blanklines = many1 blankline
 -- | Gobble n spaces; if tabs are encountered, expand them
 -- and gobble some or all of their spaces, leaving the rest.
-gobbleSpaces :: Monad m => ReaderOptions -> Int -> ParserT [Char] st m ()
-gobbleSpaces _    0 = return ()
-gobbleSpaces opts n = try $ do
-  char ' ' <|> do char '\t'
-                  inp <- getInput
-                  setInput $ replicate (readerTabStop opts - 1) ' ' ++ inp
-                  return ' '
-  gobbleSpaces opts (n - 1)
+gobbleSpaces :: (HasReaderOptions st, Monad m)
+             => Int -> ParserT [Char] st m ()
+gobbleSpaces 0 = return ()
+gobbleSpaces n
+  | n < 0     = error "gobbleSpaces called with negative number"
+  | otherwise = try $ do
+      char ' ' <|> eatOneSpaceOfTab
+      gobbleSpaces (n - 1)
+eatOneSpaceOfTab :: (HasReaderOptions st, Monad m) => ParserT [Char] st m Char
+eatOneSpaceOfTab = do
+  char '\t'
+  tabstop <- getOption readerTabStop
+  inp <- getInput
+  setInput $ replicate (tabstop - 1) ' ' ++ inp
+  return ' '
+-- | Gobble up to n spaces; if tabs are encountered, expand them
+-- and gobble some or all of their spaces, leaving the rest.
+gobbleAtMostSpaces :: (HasReaderOptions st, Monad m)
+                   => Int -> ParserT [Char] st m Int
+gobbleAtMostSpaces 0 = return 0
+gobbleAtMostSpaces n
+  | n < 0     = error "gobbleAtMostSpaces called with negative number"
+  | otherwise = option 0 $ do
+      char ' ' <|> eatOneSpaceOfTab
+      (+ 1) <$> gobbleAtMostSpaces (n - 1)
 -- | Parses material enclosed between start and end parsers.
 enclosed :: (Show end, Stream s  m Char) => ParserT s st m t   -- ^ start parser
--- a/src/Text/Pandoc/Readers/Markdown.hs
+++ b/src/Text/Pandoc/Readers/Markdown.hs
@@ -138,12 +138,7 @@ nonindentSpaces = do
 skipNonindentSpaces :: PandocMonad m => MarkdownParser m Int
 skipNonindentSpaces = do
  tabStop <- getOption readerTabStop
-  atMostSpaces (tabStop - 1) <* notFollowedBy spaceChar
+  gobbleAtMostSpaces (tabStop - 1) <* notFollowedBy spaceChar
-atMostSpaces :: PandocMonad m => Int -> MarkdownParser m Int
-atMostSpaces n
-  | n > 0     = (char ' ' >> (+1) <$> atMostSpaces (n-1)) <|> return 0
-  | otherwise = return 0
 litChar :: PandocMonad m => MarkdownParser m Char
 litChar = escapedChar'
@@ -809,49 +804,51 @@ blockQuote = do
 bulletListStart :: PandocMonad m => MarkdownParser m ()
 bulletListStart = try $ do
  optional newline -- if preceded by a Plain block in a list context
-  startpos <- sourceColumn <$> getPosition
  skipNonindentSpaces
  notFollowedBy' (() <$ hrule)     -- because hrules start out just like lists
  satisfy isBulletListMarker
-  endpos <- sourceColumn <$> getPosition
-  tabStop <- getOption readerTabStop
-  lookAhead (newline <|> spaceChar)
-  () <$ atMostSpaces (tabStop - (endpos - startpos))
+  gobbleSpaces 1 <|> () <$ lookAhead newline
+  try (gobbleAtMostSpaces 3 >> notFollowedBy spaceChar) <|> return ()
-anyOrderedListStart :: PandocMonad m => MarkdownParser m (Int, ListNumberStyle, ListNumberDelim)
-anyOrderedListStart = try $ do
+orderedListStart :: PandocMonad m
+                 => Maybe (ListNumberStyle, ListNumberDelim)
+                 -> MarkdownParser m (Int, ListNumberStyle, ListNumberDelim)
+orderedListStart mbstydelim = try $ do
  optional newline -- if preceded by a Plain block in a list context
-  startpos <- sourceColumn <$> getPosition
  skipNonindentSpaces
  notFollowedBy $ string "p." >> spaceChar >> digit  -- page number
-  res <- do guardDisabled Ext_fancy_lists
-            start <- many1 digit >>= safeRead
-            char '.'
-            return (start, DefaultStyle, DefaultDelim)
-     <|> do (num, style, delim) <- anyOrderedListMarker
-            -- if it could be an abbreviated first name,
-            -- insist on more than one space
-            when (delim == Period && (style == UpperAlpha ||
-                 (style == UpperRoman &&
-                  num `elem` [1, 5, 10, 50, 100, 500, 1000]))) $
-               () <$ spaceChar
-            return (num, style, delim)
-  endpos <- sourceColumn <$> getPosition
-  tabStop <- getOption readerTabStop
-  lookAhead (newline <|> spaceChar)
-  atMostSpaces (tabStop - (endpos - startpos))
-  return res
+  (do guardDisabled Ext_fancy_lists
+      start <- many1 digit >>= safeRead
+      char '.'
+      gobbleSpaces 1 <|> () <$ lookAhead newline
+      optional $ try (gobbleAtMostSpaces 3 >> notFollowedBy spaceChar)
+      return (start, DefaultStyle, DefaultDelim))
+   <|>
+   (do (num, style, delim) <- maybe
+          anyOrderedListMarker
+          (\(sty,delim) -> (\start -> (start,sty,delim)) <$>
+               orderedListMarker sty delim)
+          mbstydelim
+       gobbleSpaces 1 <|> () <$ lookAhead newline
+       -- if it could be an abbreviated first name,
+       -- insist on more than one space
+       when (delim == Period && (style == UpperAlpha ||
+            (style == UpperRoman &&
+             num `elem` [1, 5, 10, 50, 100, 500, 1000]))) $
+              () <$ lookAhead (newline <|> spaceChar)
+       optional $ try (gobbleAtMostSpaces 3 >> notFollowedBy spaceChar)
+       return (num, style, delim))
 listStart :: PandocMonad m => MarkdownParser m ()
-listStart = bulletListStart <|> (anyOrderedListStart >> return ())
+listStart = bulletListStart <|> (orderedListStart Nothing >> return ())
-listLine :: PandocMonad m => MarkdownParser m String
+listLine :: PandocMonad m => Int -> MarkdownParser m String
-listLine = try $ do
+listLine continuationIndent = try $ do
-  notFollowedBy' (do indentSpaces
+  notFollowedBy' (do gobbleSpaces continuationIndent
-                     many spaceChar
+                     skipMany spaceChar
                     listStart)
  notFollowedByHtmlCloser
-  optional (() <$ indentSpaces)
+  optional (() <$ gobbleSpaces continuationIndent)
  listLineCommon
 listLineCommon :: PandocMonad m => MarkdownParser m String
@@ -864,26 +861,39 @@ listLineCommon = concat <$> manyTill
 -- parse raw text for one list item, excluding start marker and continuations
 rawListItem :: PandocMonad m
            => MarkdownParser m a
-            -> MarkdownParser m String
+            -> MarkdownParser m (String, Int)
 rawListItem start = try $ do
+  pos1 <- getPosition
  start
+  pos2 <- getPosition
+  continuationIndent <- (4 <$ guardEnabled Ext_four_space_rule)
+                    <|> return (sourceColumn pos2 - sourceColumn pos1)
  first <- listLineCommon
  rest <- many (do notFollowedBy listStart
                   notFollowedBy (() <$ codeBlockFenced)
                   notFollowedBy blankline
-                   listLine)
+                   listLine continuationIndent)
  blanks <- many blankline
-  return $ unlines (first:rest) ++ blanks
+  let result = unlines (first:rest) ++ blanks
+  return (result, continuationIndent)
 -- continuation of a list item - indented and separated by blankline
 -- or (in compact lists) endline.
 -- note: nested lists are parsed as continuations
-listContinuation :: PandocMonad m => MarkdownParser m String
-listContinuation = try $ do
-  lookAhead indentSpaces
-  result <- many1 listContinuationLine
+listContinuation :: PandocMonad m => Int -> MarkdownParser m String
+listContinuation continuationIndent = try $ do
+  x <- try $ do
+         notFollowedBy blankline
+         notFollowedByHtmlCloser
+         gobbleSpaces continuationIndent
+         anyLineNewline
+  xs <- many $ try $ do
+         notFollowedBy blankline
+         notFollowedByHtmlCloser
+         gobbleSpaces continuationIndent <|> notFollowedBy' listStart
+         anyLineNewline
  blanks <- many blankline
-  return $ concat result ++ blanks
+  return $ concat (x:xs) ++ blanks
 notFollowedByHtmlCloser :: PandocMonad m => MarkdownParser m ()
 notFollowedByHtmlCloser = do
@@ -892,20 +902,12 @@ notFollowedByHtmlCloser = do
        Just t  -> notFollowedBy' $ htmlTag (~== TagClose t)
        Nothing -> return ()
-listContinuationLine :: PandocMonad m => MarkdownParser m String
-listContinuationLine = try $ do
-  notFollowedBy blankline
-  notFollowedBy' listStart
-  notFollowedByHtmlCloser
-  optional indentSpaces
-  anyLineNewline
 listItem :: PandocMonad m
         => MarkdownParser m a
         -> MarkdownParser m (F Blocks)
 listItem start = try $ do
-  first <- rawListItem start
+  (first, continuationIndent) <- rawListItem start
-  continuations <- many listContinuation
+  continuations <- many (listContinuation continuationIndent)
  -- parsing with ListItemState forces markers at beginning of lines to
  -- count as list item markers, even if not separated by blank space.
  -- see definition of "endline"
@@ -920,23 +922,14 @@ listItem start = try $ do
 orderedList :: PandocMonad m => MarkdownParser m (F Blocks)
 orderedList = try $ do
-  (start, style, delim) <- lookAhead anyOrderedListStart
+  (start, style, delim) <- lookAhead (orderedListStart Nothing)
  unless (style `elem` [DefaultStyle, Decimal, Example] &&
          delim `elem` [DefaultDelim, Period]) $
    guardEnabled Ext_fancy_lists
  when (style == Example) $ guardEnabled Ext_example_lists
  items <- fmap sequence $ many1 $ listItem
-                 ( try $ do
-                     optional newline -- if preceded by Plain block in a list
-                     startpos <- sourceColumn <$> getPosition
-                     skipNonindentSpaces
-                     res <- orderedListMarker style delim
-                     endpos <- sourceColumn <$> getPosition
-                     tabStop <- getOption readerTabStop
-                     lookAhead (newline <|> spaceChar)
-                     atMostSpaces (tabStop - (endpos - startpos))
-                     return res )
-  start' <- option 1 $ guardEnabled Ext_startnum >> return start
+                 (orderedListStart (Just (style, delim)))
+  start' <- (start <$ guardEnabled Ext_startnum) <|> return 1
  return $ B.orderedListWith (start', style, delim) <$> fmap compactify items
 bulletList :: PandocMonad m => MarkdownParser m (F Blocks)
@@ -1122,7 +1115,7 @@ rawHtmlBlocks = do
  updateState $ \st -> st{ stateInHtmlBlock = Just tagtype }
  let closer = htmlTag (\x -> x ~== TagClose tagtype)
  let block' = do notFollowedBy' closer
-                  atMostSpaces indentlevel
+                  gobbleAtMostSpaces indentlevel
                  block
  contents <- mconcat <$> many block'
  result <-
--- a/test/command/2434.md
+++ b/test/command/2434.md
@@ -31,7 +31,7 @@
 ```
 % pandoc -t opendocument
-(@)  text
+(@) text
    some text
--- a/test/command/3511.md
+++ b/test/command/3511.md
@@ -0,0 +1,46 @@
 ```
 % pandoc -t native
 - a
  - b
    - c
 -     code
 1000. one
    not continuation
 ^D
 [BulletList
 [[Plain [Str "a"]
  ,BulletList
   [[Plain [Str "b"]
    ,BulletList
     [[Plain [Str "c"]]]]]]
 ,[CodeBlock ("",[],[]) "code"]]
 ,OrderedList (1000,Decimal,Period)
 [[Plain [Str "one"]]]
 ,CodeBlock ("",[],[]) "not continuation"]
 ```
 ```
 % pandoc -t native -f markdown+four_space_rule
 - a
  - b
    - c
 -     not code
 1000. one
    continuation
 ^D
 [BulletList
 [[Plain [Str "a"]]
 ,[Plain [Str "b"]
  ,BulletList
   [[Plain [Str "c"]]]]
 ,[CodeBlock ("",[],[]) "not code"]]
 ,OrderedList (1000,Decimal,Period)
 [[Para [Str "one"]
  ,Para [Str "continuation"]]]]
 ```
--- a/test/markdown-reader-more.native
+++ b/test/markdown-reader-more.native
@@ -29,13 +29,13 @@ Pandoc (Meta {unMeta = fromList [("author",MetaList [MetaInlines [Str "Author",S
 ,[Plain [Str "three"]]]
 ,Header 2 ("indented-code-at-beginning-of-list",[],[]) [Str "Indented",Space,Str "code",Space,Str "at",Space,Str "beginning",Space,Str "of",Space,Str "list"]
 ,BulletList
- [[CodeBlock ("",[],[]) "code\ncode"]]
+ [[CodeBlock ("",[],[]) "code\ncode"
-,OrderedList (1,Decimal,Period)
+  ,OrderedList (1,Decimal,Period)
- [[CodeBlock ("",[],[]) "code\ncode"]
+   [[CodeBlock ("",[],[]) "code\ncode"]
- ,[CodeBlock ("",[],[]) "code\ncode"]]
+   ,[CodeBlock ("",[],[]) "code\ncode"]]
-,BulletList
+  ,BulletList
- [[CodeBlock ("",[],[]) "code\ncode"]
+   [[CodeBlock ("",[],[]) "code\ncode"]
- ,[Plain [Str "no",Space,Str "code"]]]
+   ,[Plain [Str "no",Space,Str "code"]]]]]
 ,Header 2 ("backslash-newline",[],[]) [Str "Backslash",Space,Str "newline"]
 ,Para [Str "hi",LineBreak,Str "there"]
 ,Header 2 ("code-spans",[],[]) [Str "Code",Space,Str "spans"]
--- a/test/markdown-reader-more.txt
+++ b/test/markdown-reader-more.txt
@@ -84,14 +84,14 @@ $PATH 90 $PATH
 ## Indented code at beginning of list
-       code
+-     code
-        code
+      code
-  1.    code
+  1.     code
-        code
+         code
-  12345678.    code
+  12345678.     code
-        code
+                code
  -     code
        code