Translate

Unstable API 0.7.0 @project-lakechain/translate-text-processor TypeScript

The Translate text processor translates documents from one language into a set of target languages, at scale, using the Amazon Translate service. It supports several document formats: Text, HTML, Docx, PowerPoint, Excel, and Xliff.

Amazon Translate preserves the formatting and structure of input documents during translation, and output documents are stored in the same format as their inputs.


💬 Translating Documents

To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.

💁 The below example takes supported input document uploaded into a source S3 bucket, and translates them to French and Spanish.

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { TranslateTextProcessor } from '@project-lakechain/translate-text-processor';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The source bucket receiving uploaded documents
    // (created here so that the example is complete).
    const bucket = new s3.Bucket(this, 'Bucket');

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the S3 event trigger.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(bucket)
      .build();

    // Translate uploaded documents to French and Spanish.
    const translate = new TranslateTextProcessor.Builder()
      .withScope(this)
      .withIdentifier('TranslateTextProcessor')
      .withCacheStorage(cache)
      .withSource(trigger)
      .withOutputLanguages(['fr', 'es'])
      .build();
  }
}


Profanity Detection

Amazon Translate can mask profane words and phrases in translation results. To enable profanity masking, use the .withProfanityRedaction method.

const translate = new TranslateTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('TranslateTextProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withOutputLanguages(['fr', 'es'])
  .withProfanityRedaction(true) // 👈 Enable profanity masking
  .build();


Tone Formality

You can also adjust the formality of the translation results with the .withFormalityTone method, which accepts the FORMAL and INFORMAL tones.

const translate = new TranslateTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('TranslateTextProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withOutputLanguages(['fr', 'es'])
  .withFormalityTone('FORMAL') // 👈 Set the tone formality
  .build();
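
On the Amazon Translate side, profanity masking and formality are expressed through the request's `Settings` parameter, which accepts `Profanity: 'MASK'` and `Formality: 'FORMAL' | 'INFORMAL'`. The sketch below shows one plausible way builder options like the ones above could map to that parameter. `TranslateOptions` and `toTranslationSettings` are hypothetical names used for illustration only; they are not part of the middleware's API, and its actual internals may differ.

```typescript
// The formality values accepted by Amazon Translate.
type FormalityTone = 'FORMAL' | 'INFORMAL';

// Hypothetical option bag mirroring the two builder methods shown above.
interface TranslateOptions {
  profanityRedaction?: boolean;
  formalityTone?: FormalityTone;
}

// Subset of the Amazon Translate `Settings` request parameter.
interface TranslationSettings {
  Profanity?: 'MASK';
  Formality?: FormalityTone;
}

// Hypothetical helper: only sets the fields that were explicitly requested,
// so that Amazon Translate falls back to its defaults otherwise.
function toTranslationSettings(options: TranslateOptions): TranslationSettings {
  const settings: TranslationSettings = {};
  if (options.profanityRedaction) {
    settings.Profanity = 'MASK';
  }
  if (options.formalityTone) {
    settings.Formality = options.formalityTone;
  }
  return settings;
}

// `example` is { Profanity: 'MASK', Formality: 'FORMAL' }.
const example = toTranslationSettings({
  profanityRedaction: true,
  formalityTone: 'FORMAL'
});
```

Leaving a field unset rather than sending an explicit "off" value keeps the request minimal when neither option is used.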


⏱ī¸ Sync vs Async Jobs

ℹī¸ This middleware uses both the real-time synchronous API, and the asynchronous batch translation API provided by Amazon Translate to translate documents.

Synchronous translations are faster, but have a limit of 100KB per document with support for Text, Docx, and HTML documents. Asynchronous batch jobs on the other hand support much larger documents sizes (up to 20MB per document) and a wider array of document types, but are significantly slower than synchronous translations.

This middleware will intelligently determines the right job type to use for each input document based on its size and format in order to optimize the translation process.
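
The selection rule described above can be sketched as a small function. This is a minimal illustration built only from the limits and formats quoted in this section, not the middleware's actual implementation: `selectJobType` and `FORMATS` are hypothetical names, and treating 1KB as 1024 bytes is an assumption.

```typescript
type JobType = 'sync' | 'async';

const KB = 1024; // Assumption: binary units.
const MB = 1024 * KB;

// Limits quoted in the text above.
const SYNC_MAX_BYTES = 100 * KB;
const ASYNC_MAX_BYTES = 20 * MB;

// Supported mime types, flagged by whether the synchronous API handles them.
const FORMATS: Record<string, { syncCapable: boolean } | undefined> = {
  'text/plain': { syncCapable: true },
  'text/html': { syncCapable: true },
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': { syncCapable: true },
  'application/vnd.openxmlformats-officedocument.presentationml.presentation': { syncCapable: false },
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': { syncCapable: false },
  'application/x-xliff+xml': { syncCapable: false }
};

// Hypothetical helper: prefers the faster synchronous API when the document
// qualifies, and falls back to a batch job otherwise.
function selectJobType(mimeType: string, sizeInBytes: number): JobType {
  const format = FORMATS[mimeType];
  if (!format) {
    throw new Error(`Unsupported document type: ${mimeType}`);
  }
  if (format.syncCapable && sizeInBytes <= SYNC_MAX_BYTES) {
    return 'sync';
  }
  if (sizeInBytes <= ASYNC_MAX_BYTES) {
    return 'async';
  }
  throw new Error(`Document too large to translate: ${sizeInBytes} bytes`);
}
```

Under these assumptions, a 10KB HTML document would be translated synchronously, while a 10KB XLIFF document or a 5MB text document would go through a batch job.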



ℹ️ Limits

Using Amazon Translate as a backbone, the Translate middleware can translate between 70+ different languages. Note, however, that Amazon Translate supports specific language-to-language translation pairs (e.g., English to French).

As such, it is possible that not all combinations of languages are supported given the original language of the document. In such a case, an exception will be raised within the pipeline at runtime and the execution for that specific document will fail.



🏗ī¸ Architecture

The processing flow implemented by this middleware depends on whether synchronous or asynchronous jobs are used to translate documents.

When using synchronous translations, the middleware runs a Lambda function that calls the Amazon Translate real-time API and waits for the translations to complete before forwarding them to the next middlewares in the pipeline.

When using asynchronous translations, the middleware uses an event-driven architecture that relies on Amazon Translate batch jobs, DynamoDB to maintain a mapping between jobs, and several Lambda functions running on the ARM64 architecture to orchestrate the overall translations.

[Figure: Architecture diagram]
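
To make the asynchronous flow more concrete, the sketch below shows the general idea of keeping a mapping between batch jobs and the documents that triggered them, so that a completion event can be routed back to the right document. It is only an in-memory illustration under stated assumptions: `DocumentRef`, `JobStore`, `InMemoryJobStore`, `onJobStarted`, and `onJobCompleted` are all hypothetical names, and the real middleware stores this mapping in DynamoDB with a schema that is not described here.

```typescript
// Hypothetical reference to a document flowing through the pipeline.
interface DocumentRef {
  url: string;
  mimeType: string;
}

// Hypothetical storage contract; DynamoDB plays this role in the middleware.
interface JobStore {
  put(jobId: string, document: DocumentRef): void;
  take(jobId: string): DocumentRef | undefined;
}

// In-memory stand-in used to keep the example self-contained.
class InMemoryJobStore implements JobStore {
  private readonly jobs: Record<string, DocumentRef | undefined> = {};

  put(jobId: string, document: DocumentRef): void {
    this.jobs[jobId] = document;
  }

  // Returns the document associated with the job and forgets the mapping.
  take(jobId: string): DocumentRef | undefined {
    const document = this.jobs[jobId];
    delete this.jobs[jobId];
    return document;
  }
}

// Called after a batch translation job has been started for a document.
function onJobStarted(store: JobStore, jobId: string, document: DocumentRef): void {
  store.put(jobId, document);
}

// Called when a job completion event is received. Returns `false` when the
// job is unknown, in which case nothing is forwarded.
function onJobCompleted(
  store: JobStore,
  jobId: string,
  forward: (document: DocumentRef) => void
): boolean {
  const document = store.take(jobId);
  if (!document) {
    return false;
  }
  forward(document);
  return true;
}
```

Removing the mapping once a job completes means a duplicated completion event is ignored instead of forwarding the same document twice.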



🏷ī¸ Properties


Supported Inputs
Mime Type | Description
text/plain | Plain text documents.
text/html | HTML documents.
application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents.
application/vnd.openxmlformats-officedocument.presentationml.presentation | PowerPoint documents.
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Excel documents.
application/x-xliff+xml | XLIFF documents.

Supported Outputs

This middleware supports the same output types as its input types.

Supported Compute Types
Type | Description
CPU | This middleware only supports CPU compute.


📖 Examples