Trafilatura

Unstable API 0.10.0 @project-lakechain/trafilatura TypeScript

The Trafilatura middleware is based on the Trafilatura library, which provides one of the most accurate rule-based engines for HTML article text extraction. It analyzes HTML documents from the Web, extracts their main text content, and makes that text available as input to other middlewares in your pipeline.


📰 Extracting Text

To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { TrafilaturaParser } from '@project-lakechain/trafilatura';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // The trafilatura parser.
    // `source` stands for a middleware created earlier in the stack.
    const trafilatura = new TrafilaturaParser.Builder()
      .withScope(this)
      .withIdentifier('Trafilatura')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .build();
  }
}


🏗️ Architecture

This middleware runs on AWS Lambda, with the Trafilatura library packaged as a Docker container.

Trafilatura Architecture



🏷️ Properties


Supported Inputs

Mime Type | Description
text/html | HTML documents.

Supported Outputs

Mime Type | Description
text/plain | Plain text documents.

Supported Compute Types

Type | Description
CPU | This middleware only supports CPU compute.
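
The properties above can be read as a small contract: the middleware accepts `text/html` and emits `text/plain`. The sketch below illustrates how such a contract might be checked before a document is routed to the parser. It is only an illustration. The `Capabilities` interface, the `trafilaturaCapabilities` object, and both helper functions are hypothetical names introduced here, and are not part of the `@project-lakechain/trafilatura` API.

```typescript
// Hypothetical model of the capabilities listed in the tables above.
// None of these names come from the Lakechain packages.
interface Capabilities {
  inputs: readonly string[];
  outputs: readonly string[];
  compute: readonly string[];
}

const trafilaturaCapabilities: Capabilities = {
  inputs: ['text/html'],
  outputs: ['text/plain'],
  compute: ['CPU'],
};

// Drop parameters such as "; charset=utf-8" and normalize the case,
// so that "Text/HTML; charset=utf-8" compares equal to "text/html".
function normalizeMimeType(value: string): string {
  const [type = ''] = value.split(';');
  return type.trim().toLowerCase();
}

// Return true when the given MIME type is a declared input.
function acceptsInput(caps: Capabilities, mimeType: string): boolean {
  return caps.inputs.indexOf(normalizeMimeType(mimeType)) !== -1;
}
```

With this sketch, `acceptsInput(trafilaturaCapabilities, 'text/html; charset=utf-8')` evaluates to `true`, while a PDF document (`application/pdf`) is rejected.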