pandoc/src/Text/Pandoc/UTF8.hs

{-
Copyright (C) 2010-2016 John MacFarlane <jgm@berkeley.edu>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
-}

{- |
   Module      : Text.Pandoc.UTF8
   Copyright   : Copyright (C) 2010-2016 John MacFarlane
   License     : GNU GPL, version 2 or above

   Maintainer  : John MacFarlane <jgm@berkeley.edu>
   Stability   : alpha
   Portability : portable

UTF-8 aware string IO functions that will work with GHC 6.10, 6.12, or 7.
-}
module Text.Pandoc.UTF8 ( readFile
                        , writeFile
                        , getContents
                        , putStr
                        , putStrLn
                        , hPutStr
                        , hPutStrLn
                        , hGetContents
                        , toString
                        , fromString
                        , toStringLazy
                        , fromStringLazy
                        , encodePath
                        , decodeArg
                        )

where

import System.IO hiding (readFile, writeFile, getContents,
                          putStr, putStrLn, hPutStr, hPutStrLn, hGetContents)
import Prelude hiding (readFile, writeFile, getContents, putStr, putStrLn)
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Encoding as T
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL

-- | Read a file as UTF-8 (see 'hGetContents').
readFile :: FilePath -> IO String
readFile f = do
  h <- openFile (encodePath f) ReadMode
  hGetContents h

writeFile :: FilePath -> String -> IO ()
writeFile f s = withFile (encodePath f) WriteMode $ \h -> hPutStr h s

getContents :: IO String
getContents = hGetContents stdin

putStr :: String -> IO ()
putStr s = hPutStr stdout s

putStrLn :: String -> IO ()
putStrLn s = hPutStrLn stdout s

-- | Write a String to a handle, first setting the handle's encoding to UTF-8.
hPutStr :: Handle -> String -> IO ()
hPutStr h s = hSetEncoding h utf8 >> IO.hPutStr h s

hPutStrLn :: Handle -> String -> IO ()
hPutStrLn h s = hSetEncoding h utf8 >> IO.hPutStrLn h s

-- | Read the entire contents of a handle strictly and decode them as
-- UTF-8 (see 'toString').
hGetContents :: Handle -> IO String
hGetContents = fmap toString . B.hGetContents
-- hGetContents h = hSetEncoding h utf8_bom
--                   >> hSetNewlineMode h universalNewlineMode
--                   >> IO.hGetContents h

-- | Drop BOM (byte order marker) if present at beginning of string.
-- Note that Data.Text converts the BOM to code point FEFF, zero-width
-- no-break space, so if the string begins with this, we strip it off.
dropBOM :: String -> String
dropBOM ('\xFEFF':xs) = xs
dropBOM xs = xs

-- | Normalize line endings: convert CRLF pairs and bare CRs to LF.
filterCRs :: String -> String
filterCRs ('\r':'\n':xs) = '\n' : filterCRs xs
filterCRs ('\r':xs) = '\n' : filterCRs xs
filterCRs (x:xs) = x : filterCRs xs
filterCRs []     = []
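
-- Illustrative sketch (not from the original source): how 'filterCRs'
-- treats the common line-ending styles. The input string is a made-up
-- example; the equation follows from the four clauses above.
--
-- > filterCRs "a\r\nb\rc\n" == "a\nb\nc\n"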

-- | Convert UTF8-encoded strict ByteString to String, dropping a
-- leading BOM and converting CRLF and bare CR line endings to LF.
toString :: B.ByteString -> String
toString = filterCRs . dropBOM . T.unpack . T.decodeUtf8

-- | Convert String to UTF8-encoded strict ByteString.
fromString :: String -> B.ByteString
fromString = T.encodeUtf8 . T.pack

-- | Convert UTF8-encoded lazy ByteString to String, dropping a
-- leading BOM and converting CRLF and bare CR line endings to LF.
toStringLazy :: BL.ByteString -> String
toStringLazy = filterCRs . dropBOM . TL.unpack . TL.decodeUtf8

-- | Convert String to UTF8-encoded lazy ByteString.
fromStringLazy :: String -> BL.ByteString
fromStringLazy = TL.encodeUtf8 . TL.pack

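-- | Currently the identity. Before base 4.5, file paths were treated as
-- unencoded byte lists and had to be encoded explicitly.
encodePath :: FilePath -> FilePath
encodePath = id

-- | Currently the identity (see 'encodePath').
decodeArg :: String -> String
decodeArg = id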
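
-- Usage sketch (an assumption about typical use, not part of the original
-- module). The file name is hypothetical; the module is imported qualified
-- because its exported names clash with the Prelude.
--
-- > import qualified Text.Pandoc.UTF8 as UTF8
-- >
-- > main :: IO ()
-- > main = do
-- >   s <- UTF8.readFile "input.txt"   -- decoded as UTF-8, leading BOM dropped
-- >   UTF8.putStrLn s                   -- stdout is switched to UTF-8 first
-- >   -- Round trip through a strict ByteString; CRLF becomes LF on decoding.
-- >   print (UTF8.toString (UTF8.fromString "a\r\nb") == "a\nb")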