Pandoc

Unstable API 0.10.0 @project-lakechain/pandoc-text-converter TypeScript

The Pandoc middleware uses the Pandoc project to convert documents at scale across a matrix of input and output formats. For example, you can convert HTML, Docx, or Markdown documents into plain text to run NLP analysis, or convert Markdown documents into PDF to produce well-formatted reports.


🔁 Converting Documents

To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.

💁 The example below takes documents of a supported type uploaded to a source S3 bucket and converts them to plain text.

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { PandocTextConverter } from '@project-lakechain/pandoc-text-converter';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    // Initialize the parent stack before using `this`.
    super(scope, id);

    // The source bucket; declared here for completeness, substitute your own.
    const bucket = new s3.Bucket(this, 'Bucket');

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the S3 event trigger.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(bucket)
      .build();

    // Convert uploaded documents to plain text.
    trigger.pipe(new PandocTextConverter.Builder()
      .withScope(this)
      .withIdentifier('PandocTextConverter')
      .withCacheStorage(cache)
      .withSource(trigger)
      .build());
  }
}


Conversion Matrix

By default, the Pandoc text converter converts supported input documents to plain text. You can, however, explicitly specify a conversion matrix describing which inputs to convert into which outputs.

💁 The example below demonstrates how to convert Docx documents to both plain text and PDF, and Markdown documents to HTML.

import { PandocTextConverter, from } from '@project-lakechain/pandoc-text-converter';

// `cache` and `trigger` are created as in the previous example.
const pandoc = new PandocTextConverter.Builder()
  .withScope(this)
  .withIdentifier('PandocTextConverter')
  .withCacheStorage(cache)
  .withSource(trigger)
  .withConversions(
    from('docx').to('plain', 'pdf'),
    from('md').to('html')
  )
  .build();
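
To make the idea of a conversion matrix more concrete, the following is a minimal, self-contained sketch of how a fluent `from(...).to(...)` expression could be reduced to a table of inputs and outputs. It is not the middleware's actual implementation: the `sketchFrom` and `buildMatrix` helpers and the `ConversionRule` type are hypothetical names used only to illustrate the data shape.

```typescript
// Hypothetical sketch: models a conversion matrix as plain data.
// None of these names come from @project-lakechain/pandoc-text-converter.

interface ConversionRule {
  input: string;
  outputs: string[];
}

// Mimics the shape of an expression such as `from('docx').to('plain', 'pdf')`.
function sketchFrom(input: string) {
  return {
    to: (...outputs: string[]): ConversionRule => ({ input, outputs }),
  };
}

// Merges rules into a lookup of input format -> unique output formats.
function buildMatrix(...rules: ConversionRule[]): Record<string, string[]> {
  const matrix: Record<string, string[]> = {};
  for (const rule of rules) {
    const merged = matrix[rule.input] ?? [];
    for (const output of rule.outputs) {
      if (merged.indexOf(output) === -1) {
        merged.push(output);
      }
    }
    matrix[rule.input] = merged;
  }
  return matrix;
}

const matrix = buildMatrix(
  sketchFrom('docx').to('plain', 'pdf'),
  sketchFrom('md').to('html'),
);
```

In this model, a single Docx input maps to two outputs (plain text and PDF), while a Markdown input maps to one (HTML), which mirrors the conversions declared in the example above.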


🏗️ Architecture

This middleware runs on a Python Lambda function that invokes Pandoc through the pypandoc library, which is packaged as a Lambda layer.
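
For orientation, a conversion of this kind boils down to a Pandoc invocation with a source format, a target format, and an output file. The sketch below builds the equivalent `pandoc` command-line arguments in TypeScript. It is an illustration only, since the middleware itself calls Pandoc from Python through pypandoc. The `buildPandocArgs` helper and the file names are hypothetical; `--from`, `--to`, and `--output` are standard Pandoc command-line options.

```typescript
// Hypothetical helper: assembles pandoc CLI arguments for one conversion.
function buildPandocArgs(
  from: string,
  to: string,
  inputPath: string,
  outputPath: string,
): string[] {
  const args = ['--from', from];
  // Pandoc selects PDF output from the `.pdf` extension of the output file
  // rather than from a `pdf` writer, so `--to` is omitted in that case.
  if (to !== 'pdf') {
    args.push('--to', to);
  }
  args.push('--output', outputPath, inputPath);
  return args;
}

// Docx to plain text.
const textArgs = buildPandocArgs('docx', 'plain', 'report.docx', 'report.txt');

// Markdown to PDF.
const pdfArgs = buildPandocArgs('markdown', 'pdf', 'report.md', 'report.pdf');
```

The resulting arrays could be handed to a process spawner such as Node's `execFile('pandoc', textArgs)`, assuming Pandoc is installed. PDF output additionally requires a PDF engine, such as a LaTeX distribution, to be available to Pandoc.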

Pandoc Architecture



🏷️ Properties


Supported Inputs

| Mime Type | Description |
| --- | --- |
| application/epub+zip | EPUB documents. |
| text/csv | CSV documents. |
| text/tab-separated-values | TSV documents. |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents. |
| text/markdown | Markdown documents. |
| text/html | HTML documents. |
| application/vnd.oasis.opendocument.text | OpenOffice documents. |
| application/rtf | RTF documents. |
| application/x-tex | LaTeX documents. |
| text/x-rst | RST documents. |
| text/x-textile | Textile documents. |
| application/x-ipynb+json | Jupyter Notebook documents. |
| text/troff | Manual documents. |
| application/x-bibtex | BibTex documents. |
| application/docbook+xml | Docbook documents. |
| application/x-fictionbook+xml | FictionBook documents. |
| text/x-opml | OPML documents. |
| application/x-texinfo | Texinfo documents. |

Supported Outputs

| Mime Type | Description |
| --- | --- |
| text/x-asciidoc | Asciidoc documents. |
| application/x-bibtex | BibTex documents. |
| application/docbook+xml | Docbook documents. |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents. |
| application/epub+zip | EPUB documents. |
| application/x-fictionbook+xml | FictionBook documents. |
| text/x-haskell | Haskell documents. |
| text/html | HTML documents. |
| application/xml | XML documents. |
| application/x-ipynb+json | Jupyter Notebook documents. |
| application/json | JSON documents. |
| application/x-tex | LaTeX documents. |
| text/troff | Manual documents. |
| text/markdown | Markdown documents. |
| text/plain | Plain text documents. |
| application/vnd.oasis.opendocument.text | OpenOffice documents. |
| text/x-opml | OPML documents. |
| application/pdf | PDF documents. |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | PowerPoint documents. |
| text/x-rst | RST documents. |
| application/rtf | RTF documents. |
| application/x-texinfo | Texinfo documents. |
| text/x-textile | Textile documents. |

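
The MIME types listed above can be turned into a simple pre-flight check, for example to reject unsupported uploads before they reach the pipeline. The following is a minimal sketch: the `canConvert` helper is hypothetical and not part of the middleware, and the two arrays simply transcribe the supported input and output MIME types from this page.

```typescript
// Supported input MIME types, transcribed from the list above.
const SUPPORTED_INPUTS: ReadonlyArray<string> = [
  'application/epub+zip',
  'text/csv',
  'text/tab-separated-values',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'text/markdown',
  'text/html',
  'application/vnd.oasis.opendocument.text',
  'application/rtf',
  'application/x-tex',
  'text/x-rst',
  'text/x-textile',
  'application/x-ipynb+json',
  'text/troff',
  'application/x-bibtex',
  'application/docbook+xml',
  'application/x-fictionbook+xml',
  'text/x-opml',
  'application/x-texinfo',
];

// Supported output MIME types, transcribed from the list above.
const SUPPORTED_OUTPUTS: ReadonlyArray<string> = [
  'text/x-asciidoc',
  'application/x-bibtex',
  'application/docbook+xml',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'application/epub+zip',
  'application/x-fictionbook+xml',
  'text/x-haskell',
  'text/html',
  'application/xml',
  'application/x-ipynb+json',
  'application/json',
  'application/x-tex',
  'text/troff',
  'text/markdown',
  'text/plain',
  'application/vnd.oasis.opendocument.text',
  'text/x-opml',
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  'text/x-rst',
  'application/rtf',
  'application/x-texinfo',
  'text/x-textile',
];

// Hypothetical helper: true when both ends of a conversion are listed.
function canConvert(inputMime: string, outputMime: string): boolean {
  return (
    SUPPORTED_INPUTS.indexOf(inputMime) !== -1 &&
    SUPPORTED_OUTPUTS.indexOf(outputMime) !== -1
  );
}
```

Note that the two lists are not symmetric: PDF and plain text, for instance, appear only as outputs, while CSV and TSV appear only as inputs.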
Supported Compute Types

| Type | Description |
| --- | --- |
| CPU | This middleware only supports CPU compute. |


📖 Examples