
S3 Trigger

Unstable API · 0.10.0 · @project-lakechain/s3-event-trigger · TypeScript

The S3 trigger starts processing pipelines based on Amazon S3 object events. Specifically, it monitors the creation, modification, and deletion of objects in the bucket(s) you specify.


đŸĒŖ Monitoring Buckets

To use this middleware, you import it in your CDK stack and specify the bucket(s) you want to monitor.

ℹī¸ The below example monitors a single bucket.

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Sample bucket.
    const bucket = new s3.Bucket(this, 'Bucket', {});

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the S3 event trigger.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(bucket)
      .build();
  }
}

You can also monitor multiple buckets by passing an array of S3 buckets to the .withBuckets method.

const trigger = new S3EventTrigger.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withBuckets([bucket1, bucket2])
  .build();


Filtering

It is also possible to provide finer-grained filtering instructions to the .withBucket method to monitor specific prefixes and/or suffixes.

const trigger = new S3EventTrigger.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withBucket({
    bucket,
    prefix: 'data/',
    suffix: '.csv',
  })
  .build();


🗂ī¸ Metadata

The S3 event trigger middleware can optionally fetch the metadata associated with the S3 object and enrich the created cloud event with that metadata.

💁 Metadata retrieval is disabled by default, and can be enabled by using the .withFetchMetadata API.

const trigger = new S3EventTrigger.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withBucket(bucket)
  .withFetchMetadata(true)
  .build();


👨‍đŸ’ģ Algorithm

The S3 event trigger middleware converts native S3 events into events following the CloudEvents specification and enriches the document description with required metadata, such as the mime-type, the size, and the ETag associated with the document.

Not all of this information can be inferred from the S3 event alone. To compile it efficiently, this middleware uses the following algorithm (a code sketch of the mime-type fallback follows the list).

  1. The size, ETag, and URL of the S3 object are taken from the S3 event and added to the cloud event.
  2. If the object is a directory, it is ignored, as this middleware only processes documents.
  3. The middleware tries to infer the mime-type of the document from the object extension.
  4. If the mime-type cannot be inferred from the extension, the middleware tries to infer it from the content type reported by S3.
  5. If the mime-type cannot be inferred from the S3 reported content type, the middleware tries to infer it from the first bytes of the document using a ranged (chunked) request.
  6. If the mime-type cannot be inferred at all, it is set to application/octet-stream.
  7. If S3 object metadata retrieval is enabled, the middleware issues a request to S3 and enriches the cloud event with the object metadata.
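Below is a minimal sketch of the mime-type fallback chain described in steps 3 to 6. It is an illustration only, not the middleware's actual implementation: the inferMimeType and sniffMimeType helpers and the magic-number checks are hypothetical, and the sketch assumes the AWS SDK for JavaScript v3 and the mime-types package.

import { S3Client, HeadObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import mime from 'mime-types';

const s3 = new S3Client({});

// Attempts to detect a mime-type from the first bytes of the object.
// Example signatures only; a real implementation would cover many more.
const sniffMimeType = (bytes: Uint8Array): string | null => {
  if (bytes[0] === 0x25 && bytes[1] === 0x50 && bytes[2] === 0x44 && bytes[3] === 0x46) {
    return ('application/pdf');
  }
  if (bytes[0] === 0x50 && bytes[1] === 0x4b) {
    return ('application/zip');
  }
  return (null);
};

export const inferMimeType = async (bucket: string, key: string): Promise<string> => {
  // 3. Try to infer the mime-type from the object extension.
  const fromExtension = mime.lookup(key);
  if (fromExtension) {
    return (fromExtension);
  }

  // 4. Fall back to the content type reported by S3, ignoring the
  // generic default (an assumption made for this sketch).
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  if (head.ContentType && head.ContentType !== 'application/octet-stream') {
    return (head.ContentType);
  }

  // 5. Fall back to sniffing the first bytes using a ranged (chunked) request.
  const chunk = await s3.send(new GetObjectCommand({
    Bucket: bucket,
    Key: key,
    Range: 'bytes=0-63'
  }));
  const bytes = await chunk.Body!.transformToByteArray();
  const sniffed = sniffMimeType(bytes);

  // 6. Default to application/octet-stream when nothing could be inferred.
  return (sniffed ?? 'application/octet-stream');
};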


📤 Events

This middleware emits Cloud Events whenever an object is created, modified, or deleted in the monitored bucket(s). Below are examples of the events emitted by the S3 trigger middleware upon the creation (or modification) and the deletion of an object.

Document Creation or Update
{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain",
      "size": 26378,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain",
      "size": 26378,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "metadata": {},
    "callStack": [
      "s3-event-trigger"
    ]
  }
}
Document Deletion
{
  "specversion": "1.0",
  "id": "2f20a29d-c96f-4e2f-a64e-855a9c1e14bb",
  "type": "document-deleted",
  "time": "2023-10-22T13:20:00.657Z",
  "data": {
    "chainId": "dd50a7f2-4263-4266-bb5f-dea2ab8970c3",
    "source": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain"
    },
    "document": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain"
    },
    "metadata": {},
    "callStack": [
      "s3-event-trigger"
    ]
  }
}


🏗ī¸ Architecture

The S3 trigger receives S3 events from subscribed buckets on its SQS input queue. These events are consumed by a Lambda function that translates them into CloudEvents. The Lambda function also identifies the mime-type of a document based on its extension, the content type reported by S3, or the content of the document itself.

(Architecture diagram)
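For illustration, here is a simplified sketch of the translation performed by that Lambda function when consuming S3 event notifications from the SQS input queue. The handler and its field mapping are hypothetical and only show the general shape of the conversion; the real function also performs the mime-type inference and metadata enrichment described above.

import { SQSEvent } from 'aws-lambda';
import { randomUUID } from 'crypto';

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // Each SQS record wraps an S3 event notification.
    const s3Event = JSON.parse(record.body);

    for (const e of s3Event.Records ?? []) {
      const bucket = e.s3.bucket.name;
      const key = decodeURIComponent(e.s3.object.key.replace(/\+/g, ' '));

      // Translate the S3 notification into a CloudEvent-shaped document event.
      const cloudEvent = {
        specversion: '1.0',
        id: randomUUID(),
        type: e.eventName.startsWith('ObjectRemoved')
          ? 'document-deleted'
          : 'document-created',
        time: new Date().toISOString(),
        data: {
          document: {
            url: `s3://${bucket}/${key}`,
            size: e.s3.object.size,
            etag: e.s3.object.eTag
          }
        }
      };

      // The enriched event is then forwarded to the next middlewares
      // in the pipeline (simplified here as a log statement).
      console.log(JSON.stringify(cloudEvent));
    }
  }
};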



🏷ī¸ Properties


Supported Inputs

This middleware does not accept any inputs from other middlewares.

Supported Outputs

Mime Type | Description
*/* | The S3 event trigger middleware can produce any type of document.

Supported Compute Types

Type | Description
CPU | This middleware is based on a Lambda architecture.


📖 Examples

  • Face Detection Pipeline - An example showcasing how to build face detection pipelines using Project Lakechain.
  • NLP Pipeline - Builds a pipeline for extracting metadata from text-oriented documents.
  • E-mail NLP Pipeline - An example showcasing how to analyze e-mails using E-mail parsing and Amazon Comprehend.