API
In Lakechain, there is no control plane API in the traditional sense (e.g., a REST or GraphQL API) that lets you create, mutate, and delete pipelines. Instead, the control plane is exposed directly as Infrastructure as Code using the AWS CDK. This gives developers the flexibility of a programming language like TypeScript to author pipelines, while keeping the infrastructure versioned, auditable, and repeatable.
🧑‍🍳 API Cookbook
All middlewares inherit from the Middleware class, which is part of Lakechain Core. This class exposes high-level methods usable by developers and implemented by each middleware. In this section, we describe how to instantiate a typical middleware and the common methods made available by the Middleware API.
Instantiation
Below is a minimal example of how to instantiate a middleware.
In its most basic form, a middleware requires at least three components:
- A CDK scope used to create the middleware resources.
- A string identifier used to identify the middleware.
- A cache storage used to share data between middlewares.
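As an illustration, the sketch below uses an S3 event trigger middleware. The package names, the CacheStorage import path, and the builder pattern shown here are assumptions based on typical Lakechain usage and may differ for other middlewares.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { CacheStorage } from '@project-lakechain/core';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';

export class PipelineStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // The cache storage shared between the middlewares of the pipeline.
    const cache = new CacheStorage(this, 'Cache');

    // The bucket monitored by the trigger (specific to this middleware).
    const bucket = new s3.Bucket(this, 'Bucket');

    // The middleware is created with a CDK scope, a string identifier,
    // and the shared cache storage.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(bucket)
      .build();
  }
}
```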
Source & Destination
As described in the architecture section, middlewares can consume documents from other middlewares in a pipeline. Similarly, they produce documents that can be consumed by other middlewares.
In the below example, we show how to connect two middlewares together in a sequential manner, meaning that whenever A produces a document, B receives it as input.
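For example, assuming A is the S3 event trigger instantiated above and B is a hypothetical PdfTextConverter middleware (both names are illustrative), the connection could be declared as follows.

```typescript
// B declares A as its source; every document produced by the trigger
// is forwarded to the converter as input.
const converter = new PdfTextConverter.Builder()
  .withScope(this)
  .withIdentifier('Converter')
  .withCacheStorage(cache)
  .withSource(trigger)
  .build();
```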
ℹ️ You can also connect multiple middlewares to the same destination using .withSources.
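For instance, a fan-in connection could look like the sketch below, assuming .withSources accepts an array of middlewares (the signature and the ExampleMiddleware name are illustrative assumptions).

```typescript
// A hypothetical middleware consuming documents from two producers.
const aggregator = new ExampleMiddleware.Builder()
  .withScope(this)
  .withIdentifier('Aggregator')
  .withCacheStorage(cache)
  .withSources([converterA, converterB])
  .build();
```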
The .withSource API provides a simple but very powerful way to express the most complex pipelines and middleware relationships.
Pipe API
Another way to connect middlewares together is to use the .pipe API. This API makes it more convenient to connect middlewares that are part of a sequential pipeline, and draws its inspiration from the Node.js Stream API.
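A sequential pipeline could therefore be expressed as below, assuming .pipe returns the destination middleware so that calls can be chained (an assumption drawn from the Stream API analogy; the converter and translator middlewares are illustrative).

```typescript
// Documents flow from the trigger, to the converter, to the translator.
trigger
  .pipe(converter)
  .pipe(translator);
```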
ℹ️ For complex pipelines involving maps and branches, it is advised to use the .withSource API.
Filters
The Middleware API exposes a built-in filtering API that can be applied on a middleware connection. It makes it possible to describe a condition that gets compiled at deployment time into an SNS filtering rule, and is used to filter the documents coming from another middleware.
ℹ️ The below example keeps only documents that have a size smaller than 1MB.
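A sketch of such a filter is shown below; the import path of the when DSL, the CloudEvent attribute path, and the ability to pass the filter alongside the source are assumptions.

```typescript
import { when } from '@project-lakechain/core/dsl';

// Keep only documents smaller than 1MB (the attribute path is assumed).
const filter = when('data.document.size').lt(1024 * 1024);

// The filter is applied on the connection between the two middlewares.
const converter = new PdfTextConverter.Builder()
  .withScope(this)
  .withIdentifier('Converter')
  .withCacheStorage(cache)
  .withSource(trigger, filter)
  .build();
```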
Filters can be expressed with the when API, and are applied on the structure of the input CloudEvent document using different operators such as lt, gt, lte, gte, equals, includes and startsWith.
You can also combine multiple filters using and statements.
ℹ️ The below example keeps only images that have a size greater than 1MB and a width equal to 1920px.
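A sketch of such a combined filter could look like this; the attribute paths on the document metadata are assumptions.

```typescript
// Keep only images larger than 1MB with a width of exactly 1920 pixels.
const filter = when('data.document.size')
  .gt(1024 * 1024)
  .and(
    when('data.metadata.properties.dimensions.width').equals(1920)
  );
```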
ℹ️ or statements are not supported yet.
If you need to express more complex conditions that also apply to the content of the document, see the Condition middleware.
Compute Types
Some middlewares support running their processing jobs on different types of compute. Using the .withComputeType API, you can specify the type of compute (CPU, GPU, ACCELERATOR) that you want the middleware to run on.
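For example, requesting GPU compute could look like the snippet below; the import path of ComputeType and the middleware name are assumptions.

```typescript
import { ComputeType } from '@project-lakechain/core';

// A hypothetical GPU-capable middleware.
const processor = new ExampleImageProcessor.Builder()
  .withScope(this)
  .withIdentifier('Processor')
  .withCacheStorage(cache)
  .withSource(trigger)
  .withComputeType(ComputeType.GPU)
  .build();
```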
ℹ️ Note that middlewares use the most efficient (cost vs. performance) compute by default. If a middleware does not support the given compute type, an exception will be thrown at deployment time.
Retries
When documents are being processed, errors may arise for various reasons. You can control the maximum number of times a middleware should retry processing a document from its input queue using the .withMaxRetry API.
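For example, lowering the retry count to 3 could look like this (ExampleMiddleware is a placeholder for any middleware).

```typescript
const middleware = new ExampleMiddleware.Builder()
  // Scope, identifier, cache storage and source omitted for brevity.
  .withMaxRetry(3)
  .build();
```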
ℹ️ The default number of retries is set to 5. If a middleware fails to process a document past that threshold, the document will be moved to the middleware dead-letter queue.
Batch Size
You can control the maximum number of documents a given middleware can process in a single execution. This gives developers a certain level of control over the performance of middlewares.
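As a sketch, assuming a .withBatchSize builder method (the method name is a hypothetical illustration, and ExampleMiddleware is a placeholder):

```typescript
const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  // Process at most 10 documents per execution (method name assumed).
  .withBatchSize(10)
  .build();
```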
ℹ️ Note that this is only a hint that a middleware may choose to ignore given its implementation details.
Batching Window
A batching window represents the amount of time a middleware can wait before pulling documents from its input queue. This is used by some middlewares to optimize the cost of running compute instances by batching as many documents as possible in a single execution.
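As a sketch, assuming a .withBatchingWindow builder method taking a CDK Duration (the method name is a hypothetical illustration, and ExampleMiddleware is a placeholder):

```typescript
import * as cdk from 'aws-cdk-lib';

const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  // Wait up to 30 seconds to batch documents (method name assumed).
  .withBatchingWindow(cdk.Duration.seconds(30))
  .build();
```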
ℹ️ Note that this is only a hint that a middleware may choose to ignore given its implementation details.
Concurrency
Some middlewares apply a concurrency limit to the number of documents they consume from their input queue to implement a throttling mechanism. This is especially useful when they invoke external services that have a limited throughput. You can use the .withMaxConcurrency API to tune the performance of a middleware by either increasing or lowering this value.
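For example, capping a middleware at 5 parallel executions could look like this (ExampleMiddleware is a placeholder).

```typescript
const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  .withMaxConcurrency(5)
  .build();
```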
💁 Note that middlewares already use an optimized concurrency limit by default, or no concurrency limit at all, depending on the nature of their processing logic.
CloudWatch Insights
You can enable additional insights for a given middleware to monitor its performance. In practice, this means that middlewares based on AWS Lambda will use Lambda Insights, and middlewares based on AWS ECS will use Container Insights.
ℹ️ This option is turned off by default as it comes with additional costs.
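As a sketch, assuming a .withCloudWatchInsights builder method (the method name is a hypothetical illustration, and ExampleMiddleware is a placeholder):

```typescript
const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  // Enable Lambda Insights or Container Insights (method name assumed).
  .withCloudWatchInsights(true)
  .build();
```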
Log Retention
Every middleware in a pipeline is integrated with AWS CloudWatch Logs. You can control the log retention period of each middleware using the .withLogRetention API. This can come in handy to reduce costs or to meet compliance requirements.
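For example, retaining logs for one month could look like the snippet below, which assumes the API accepts the CDK RetentionDays enum (ExampleMiddleware is a placeholder).

```typescript
import { RetentionDays } from 'aws-cdk-lib/aws-logs';

const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  .withLogRetention(RetentionDays.ONE_MONTH)
  .build();
```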
ℹ️ The default log retention is set to 7 days.
Memory Size
You can manually tweak the maximum amount of memory (in MB) that a middleware can use to process a document. For example, you can lower this value to reduce costs, or increase it to improve performance.
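As a sketch, assuming a .withMaxMemorySize builder method taking a value in MB (the method name is a hypothetical illustration, and ExampleMiddleware is a placeholder):

```typescript
const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  // Allocate up to 2048 MB of memory (method name assumed).
  .withMaxMemorySize(2048)
  .build();
```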
ℹ️ Note that this is only a hint that a middleware may choose to ignore given its implementation details.
Metrics
You can retrieve important CloudWatch metrics related to a given middleware. There are 5 metrics made available by the Middleware API.
Permissions
Each middleware implements the CDK IGrantable interface and can be granted IAM permissions by other constructs. Similarly, each middleware can also grant IAM permissions to other IGrantable constructs. In the below example, we show how to explicitly grant a middleware read-only access to an S3 bucket.
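Because middlewares are IGrantable, standard CDK grant methods work with them; a sketch could look like this (the middleware variable is assumed to be an already instantiated middleware).

```typescript
import * as s3 from 'aws-cdk-lib/aws-s3';

const bucket = new s3.Bucket(this, 'SourceBucket');

// Grant the middleware read-only access to the bucket; this works
// because the middleware implements the CDK IGrantable interface.
bucket.grantRead(middleware);
```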
In the below example, we show how to give a Lambda function that is external to the pipeline read-only access to the documents processed by a middleware.
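A sketch of such a grant is shown below; the grantReadProcessedDocuments method name is an assumption used for illustration.

```typescript
import * as lambda from 'aws-cdk-lib/aws-lambda';

// A function that is not part of the pipeline.
declare const fn: lambda.Function;

// Grant the function read-only access to the documents processed
// by the middleware (method name assumed).
middleware.grantReadProcessedDocuments(fn);
```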
Encryption
By default, Lakechain uses AWS managed keys for encryption at rest, and encryption in-transit provided by AWS services. Customers can also provide their own Customer Managed Keys (CMK) to encrypt data at rest and create end-to-end encrypted pipelines.
💁 See the Encryption Security Model for more information.
You can use the .withKmsKey API to provide a CMK to a middleware.
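For example, providing a customer managed key could look like this (ExampleMiddleware is a placeholder).

```typescript
import * as kms from 'aws-cdk-lib/aws-kms';

// A customer managed key used to encrypt data at rest.
const key = new kms.Key(this, 'Key', {
  enableKeyRotation: true
});

const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  .withKmsKey(key)
  .build();
```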
Events
Each middleware emits events at deploy time that can be captured whenever a producer or a consumer is connected to it. You can capture these events in your code using the event emitter API.
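As a sketch, assuming a Node.js-style .on method and a hypothetical consumer-added event name (both are assumptions for illustration):

```typescript
// Invoked at synthesis time whenever a consumer middleware is
// connected to this middleware (event name assumed).
middleware.on('consumer-added', (consumer) => {
  console.log('A consumer was connected to the middleware:', consumer);
});
```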
💁 Note that those events are triggered at deploy time; they are not runtime events.
VPC
Some middlewares require a VPC to run (such as those based on ECS), while others, based on AWS Lambda, do not. You can specify a VPC to be used by a middleware using the .withVpc API.
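For example, running a middleware inside an existing VPC could look like this (ExampleMiddleware is a placeholder).

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// The VPC in which the middleware compute resources will run.
const vpc = new ec2.Vpc(this, 'Vpc');

const middleware = new ExampleMiddleware.Builder()
  // Other builder options omitted for brevity.
  .withVpc(vpc)
  .build();
```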
💁 Customers with strong security and compliance requirements can also pass a VPC instance to middlewares (including those based on AWS Lambda) that is configured with specific VPC Endpoints. For more information, please read our Using VPC Endpoints guide.