pandoc API python

Using the pandoc API
Pandoc can be used as a Haskell library for writing your own conversion tools or powering web applications. This document describes how to use the pandoc API.

Detailed API documentation at the individual function and type levels can be found at https://hackage.haskell.org/package/pandoc.

Pandoc’s architecture
Pandoc is structured as a set of readers and a set of writers, which convert various input formats into an abstract syntax tree (Pandoc AST) representing a structured document, while a set of writers render this AST into various output formats. as the picture shows:

[input format] ==reader==> [Pandoc AST] ==writer==> [output format]

This architecture allows pandoc to perform M × ? N transformations using M readers and N writers.

Pandoc AST is defined in the pandoc-types package. You should first look at Text.Pandoc.Definition in the Haddock documentation. As you will see, aPandoc consists of some metadata and a list of s that make up the Block. There are many types of Blocks, including Para (paragraph), Header (chapter title) and BlockQuote. Some of these Blocks (like BlockQuote) contain a list of Block s, while others (like Para) contain a list of s Inline, and still others (like CodeBlock) contain plain text or no content. Inlines are the basic elements of paragraphs. The difference between Block and Inline For example, in the type system, there is no way to represent an Inline link text as a link to a block reference (Block). This limitation on expression is mostly a help rather than a hindrance, since many of the formats supported by pandoc have similar limitations.

The best way to explore the pandoc AST is to use pandoc -t native, which will display the AST corresponding to some Markdown input:

% echo -e "1. *foo*\
2. bar" | pandoc -t native
[OrderedList (1,Decimal,Period)
 [[Plain [Emph [Str "foo"]]]
 ,[Plain [Str "bar"]]]]

a simple example
Here is a simple example of using pandoc reader and writer to perform conversion:

import Text.Pandoc
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main::IO()
main=do
  result <- runIO $ do
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc
  rst <- handleError result
  TIO.putStrLn rst

Some notes:

The first part constructs a conversion pipeline: the input string is passed to readMarkdown, doc is then rendered into the resulting Pandoc AST by () writeRST. The conversion pipeline is “run” by runIO.

result has type Either PandocError Text. We can pattern match this manually, but in this case it is simpler to use the function in handleErrorText.Pandoc.Error. If the value is a , exit with the appropriate error code and message; if the value is a , Left returns. TextRight

PandocMonad class
Let’s look at the types of readMarkdown and writeRST:

readMarkdown :: (PandocMonad m, ToSources a)
             => ReaderOptions
             -> a
             -> m Pandoc
writeRST::PandocMonad m
             => WriterOptions
             -> Pandoc
             ->mText

The PandocMonad m => part is a type class constraint. It states that readMarkdown and writeRST define calculations that can be used in any instance of the PandocMonad type class. PandocMonad is defined in the module Text.Pandoc.Class.

Two instances of PandocMonad are provided: PandocIO and PandocPure. The difference is that the calculation in PandocIO allows IO (e.g. reading a file), while the calculation in PandocPure does not have any side effects. PandocPure is useful for sandbox environments when you want to prevent users from performing any malicious actions. To run conversions in PandocIO, use runIO (as described above). To run PandocPure, use runPure.

As you can see from Haddocks, Text.Pandoc.Class exports a number of helper functions that can be used in any instance of PandocMonad. For example:

-- | Get the verbosity level.
getVerbosity :: PandocMonad m => m Verbosity

-- | Set the verbosity level.
setVerbosity :: PandocMonad m => Verbosity -> m ()

-- Get the accumulated log messages (in temporal order).
getLog::PandocMonad m => m [LogMessage]
getLog = reverse <$> getsCommonState stLog

-- | Log a message using 'logOutput'. Note that 'logOutput' is
-- called only if the verbosity level exceeds the level of the
-- message, but the message is added to the list of log messages
-- that will be retrieved by 'getLog' regardless of its verbosity level.
report :: PandocMonad m => LogMessage -> m ()

-- | Fetch an image or other item from the local filesystem or the net.
-- Returns raw content and maybe mime type.
fetchItem::PandocMonad m
          => Text
          -> m (B.ByteString, Maybe MimeType)

-- Set the resource path searched by 'fetchItem'.
setResourcePath :: PandocMonad m => [FilePath] -> m ()

If we want a more detailed information message during the conversion process defined in the previous section, we can do this:

 result <- runIO $ do
    setVerbosity INFO
    doc <- readMarkdown def (T.pack "[testing](url)")
    writeRST def doc

Note that PandocIO is an instance of MonadIO , so you can use liftIO to perform arbitrary IO operations within the pandoc transformation chain.

readMarkdown’s second parameter is polymorphic, it can be any type that is an instance of the type class ToSources. You can use Text as shown in the example above. But you can also use [(FilePath, Text)] if the input comes from multiple files and you want to track the source location accurately.

Options
The first argument to each reader or writer are options that control the behavior of the reader or writer: ReaderOptions for readers and WriterOptions for writers. These are defined in Text.Pandoc.Options. It’s a good idea to research the options to see what you can adjust.

def (from Data.Default) represents the default value for each option. (You can also use defaultWriterOptions and defaultReaderOptions.) Typically you’ll want to use the defaults and only modify them when needed, for example:

writeRST def{ writerReferenceLinks = True }

Some particularly important options to know about:

writerTemplate: Default is , Nothing means a document fragment will be generated. If you want complete documentation, you need to specify Just template, where template is a Text.Pandoc.Templates containing the template content (not the path). Template Text

readerExtensions and writerExtensions: These specify extensions used in parsing and rendering. Extensions are defined in Text.Pandoc.Extensions.

builder
Sometimes it is useful to build Pandoc documents programmatically. To make this easier, we provide the module Text.Pandoc.Builder pandoc-types.

Because concatenating lists is slow, we use the special type Inlines and Blocks to wrap SequenceofInline and Block elements. These are instances of the Monoid type class and can be easily wired:

import Text.Pandoc.Builder

mydoc::Pandoc
mydoc = doc $ header 1 (text (T.pack "Hello!"))
           <> para (emph (text (T.pack "hello world")) <> text (T.pack "."))

main::IO()
main = print mydoc

If you use OverloadedStringspragma you can simplify further:

mydoc = doc $ header 1 "Hello!"
           <> para (emph "hello world" <> ".")

This is a more realistic example. Suppose your boss says: Write me a letter in Word listing all the gas stations in Chicago that accept Voyager cards. You’ll find some JSON data ( fuel.json ) in the following format:

[ {<!-- -->
  "state" : "IL",
  "city" : "Chicago",
  "fuel_type_code" : "CNG",
  "zip" : "60607",
  "station_name" : "Clean Energy - Yellow Cab",
  "cards_accepted" : "A D M V Voyager Wright_Exp CleanEnergy",
  "street_address" : "540 W Grenshaw"
}, ...

Then use aeson and pandoc to parse the JSON and create a Word document:

{<!-- -->-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder
import Text.Pandoc
import Data.Monoid ((<>), mempty, mconcat)
import Data.Aeson
importControl.Applicative
importControl.Monad(mzero)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.List(intersperse)

data Station = Station{
    address::T.Text
  , name :: T.Text
  , cardsAccepted :: [T.Text]
  } deriving Show

instance FromJSON Station where
    parseJSON (Object v) = Station <$>
       v .: "street_address" <*>
       v .: "station_name" <*>
       (T.words <$> (v .:? "cards_accepted" .!= ""))
    parseJSON_ = mzero

createLetter :: [Station] -> Pandoc
createLetter stations = doc $
    para "Dear Boss:" <>
    para "Here are the CNG stations that accept Voyager cards:" <>
    simpleTable [plain "Station", plain "Address", plain "Cards accepted"]
           (map stationToRow stations) <>
    para "Your loyal servant," <>
    plain (image "JohnHancock.png" "" mempty)
  where
    stationToRow station =
      [ plain (text $ name station)
      , plain (text $ address station)
      , plain (mconcat $ intersperse linebreak
                       $ map text $ cardsAccepted station)
      ]

main::IO()
main=do
  json <- BL.readFile "fuel.json"
  let letter = case decode json of
                    Just stations -> createLetter [s | s <- stations,
                                        "Voyager" `elem` cardsAccepted s]
                    Nothing -> error "Could not decode JSON"
  docx <- runIO (writeDocx def letter) >>= handleError
  BL.writeFile "letter.docx" docx
  putStrLn "Created letter.docx"

Look! You wrote this letter without using Word and without viewing the data.

data file
Pandoc has a number of data files, which can be found in the data/ subdirectory of the repository. They are installed with pandoc (or, if pandoc was compiled with the embed_data_files flag, they are embedded in the binary). You can retrieve the data file using Text.Pandoc.Class readDataFile. First, the file will be searched in the “User Data Directory” ( , ) readDataFile. If not found, the file installed by default in the system will be returned. To force the default value, .setUserDataDirgetUserDataDirsetUserDataDir Nothing

metadata file
Pandoc can add metadata to documents, as described in the user guide. Similar to data files, metadata YAML files can be retrieved using readMetadataFileText.Pandoc.Class. The file will first be searched in the working directory. If not found, it will be searched in the subdirectory (, ) readMetadataFile of the user data directory. metadatasetUserDataDirgetUserDataDir

template
Pandoc has its own template system, which is described in the user guide. To retrieve the system’s default template, use getDefaultTemplate with Text.Pandoc.Templates. Note that this will first look in the templates subdirectory of the user data directory, allowing the user to override the system defaults. If you want to disable this behavior, use setUserDataDir Nothing.

To render a template, use renderTemplate’, which accepts two parameters: the template (Text) and the context (any instance of ToJSON). If you want to create a context from the metadata part of a Pandoc document, metaToJSON’ use Text.Pandoc.Writers.Shared. If you also want to combine values in variables, use metaToJSON instead and make sure writerVariables is set in WriterOptions.

Handle errors and warnings
runIO and runPure return an Either PandocError a. All errors raised when running a calculation by the PandocMonad will be caught and returned as Left values so they can be handled by the calling program. To see the constructor for , see the documentation for Text.Pandoc.ErrorPandocError.

PandocError To raise a PandocMonad from within a calculation, use throwError.

In addition to errors that stop execution of the transformation pipeline, informational messages can be generated. Emit ._LogMessage using reportText.Pandoc.Class. For a list of constructors for LogMessage, see Text.Pandoc.Logging. Note that each type of log message is associated with a verbosity level. The verbosity level ( setVerbosity/getVerbosity) determines whether the report will be printed to stderr (when running in PandocIO), but regardless of the verbosity level, all reported messages are stored internally and can be retrieved using getLog.

Take the AST
It’s often useful to walk the Pandoc AST to extract information (e.g., what are all the URLs linked to in this document? Do all code examples compile?) or to transform the document (e.g., increase the level of each section). title, remove emphasis, or replace specifically marked code blocks with images). To make this easier and more efficient, pandoc-types includes a module Text.Pandoc.Walk.

This is the necessary documentation:

class Walkable a b where
  -- | @walk f x@ walks the structure @x@ (bottom up) and replaces every
  -- occurrence of an @a@ with the result of applying @f@ to it.
  walk :: (a -> a) -> b -> b
  walk f = runIdentity . walkM (return . f)
  -- | A monadic version of 'walk'.
  walkM :: (Monad m, Functor m) => (a -> m a) -> b -> m b
  -- | @query f x@ walks the structure @x@ (bottom up) and applies @f@
  -- to every @a@, appending the results.
  query :: Monoid c => (a -> c) -> b -> c

Walkable instances are defined for most combinations of Pandoc types. For example, the Walkable Inline Block instance allows you to take a function Inline -> Inline and apply it to the Block. And the Walkable [Inline] Pandoc allows you to take a function [Inline] -> [Inline] and apply it to Inline in a List of max s per Pandoc.

Here’s a simple example of a function that raises the title level:

promoteHeaderLevels::Pandoc -> Pandoc
promoteHeaderLevels = walk promote
  where promote :: Block -> Block
        promote (Header lev attr ils) = Header (lev + 1) attr ils
        promote x = x

walkM is the unary version of walk; you can use it, for example, when you need a transformation to perform IO operations, use PandocMonad operations, or update internal state. Here’s an example of using the State monad to add a unique identifier to each block of code:

addCodeIdentifiers :: Pandoc -> Pandoc
addCodeIdentifiers doc = evalState (walkM addCodeId doc) 1
  where addCodeId :: Block -> State Int Block
        addCodeId (CodeBlock (_,classes,kvs) code) = do
          curId <- get
          put (curId + 1)
          return $CodeBlock (show curId,classes,kvs) code
        addCodeId x = return x

query is used to collect information from the AST. Its argument is a query function that produces some monoid type result (such as a list). The results are joined together. Here’s an example that returns a list of URLs linked in a document:

listURLs::Pandoc -> [Text]
listURLs = query urls
  where urls (Link _ _ (src, _)) = [src]
        urls_ = []

Create frontend
All functionality of the command line program pandoc has been abstracted into the Text.Pandoc.AppconvertWithOpts module. Therefore, creating a GUI frontend for pandoc is just a matter of filling in the structure and calling this function. Opts

Things to note when using pandoc in web applications
Pandoc’s parser may exhibit pathological behavior with certain inputs. System.Timeout.timeout Therefore, it is always a good idea to wrap the use of pandoc in a timeout function (e.g. from base) to prevent DoS attacks.

If pandoc generates HTML from untrusted user input, it’s always a good idea to filter the generated HTML through a sanitizer such as xss-sanitize to avoid security issues.

Using runPure instead of runIO ensures that pandoc’s functions do not perform IO operations (such as writing to files). If you need to make some resources available, provide a “fake environment” runPure within the available state (see Text.Pandoc.ClassPureState and its related functions). It is also possible to write a custom instance, for example, to make wiki resources available as files in a fake environment while isolating pandoc from the rest of the system. PandocMonad