RSS Feeds

Unstable API · Version 0.10.0 · @project-lakechain/syndication-feed-processor · TypeScript

The syndication feed processor parses RSS and Atom feeds from upstream documents, extracts each item from those feeds, and forwards the items, along with their metadata, to other middlewares in the pipeline.


📰 Parsing Feeds

To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { SyndicationFeedProcessor } from '@project-lakechain/syndication-feed-processor';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Cache storage used by the middlewares of the pipeline.
    const cache = new CacheStorage(this, 'Cache');

    // Create the syndication feed processor.
    // `source` is the upstream middleware providing the feeds (defined elsewhere).
    const syndicationProcessor = new SyndicationFeedProcessor.Builder()
      .withScope(this)
      .withIdentifier('SyndicationProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .build();
  }
}


📝 Metadata

This middleware automatically extracts metadata from each feed item and makes it available in the output CloudEvents. The following metadata is extracted from feed items when available.

| Metadata | Description |
| --- | --- |
| title | The title of the feed item. |
| description | The description of the feed item. |
| createdAt | The creation date of the feed item. |
| updatedAt | The last update date of the feed item. |
| authors | The authors associated with the feed item. |
| keywords | The keywords associated with the feed item. |
| language | The language of the feed item. |
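As an illustration of how a consumer might handle these optional fields, here is a minimal TypeScript sketch. The `FeedItemMetadata` interface and the `summarize` helper are hypothetical names, not part of the Lakechain SDK, and the sketch assumes the language is nested under `properties.attrs.language`, as in the output example in this document.

```typescript
// Hypothetical shape of the metadata listed above. Every field is optional
// because the middleware only extracts what the feed item provides.
interface FeedItemMetadata {
  title?: string;
  description?: string;
  createdAt?: string; // ISO 8601 date string
  updatedAt?: string; // ISO 8601 date string
  authors?: string[];
  keywords?: string[];
  // Assumption: language is carried in `properties.attrs.language`.
  properties?: { kind: string; attrs: { language?: string } };
}

// Builds a short, human-readable summary, with fallbacks for missing fields.
function summarize(metadata: FeedItemMetadata): string {
  const title = metadata.title ?? '(untitled)';
  const authors = metadata.authors?.length
    ? metadata.authors.join(', ')
    : 'unknown author';
  const language = metadata.properties?.attrs.language ?? 'unknown language';
  return `${title} by ${authors} [${language}]`;
}
```

Treating every field as optional keeps downstream code robust for feeds that omit authors, keywords, or dates.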


📄 Output

This middleware takes RSS or Atom syndication feeds as input and outputs one HTML document for each extracted feed item. Downstream middlewares can therefore process, in parallel, each HTML document that is part of the original feed.

Below is an example of an output HTML document extracted from a feed item by the syndication feed processor.

{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "https://aws.amazon.com/blogs/aws/feed/",
      "type": "application/rss+xml",
      "size": 24536,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-ecs-rds-for-mysql-emr-studio-aws-community-and-more-january-22-2024/",
      "type": "text/html",
      "size": 19526,
      "etag": "2a3b4c5d6e7f8d9e0a1b2c3d4e5f6a7b"
    },
    "metadata": {
      "title": "AWS Weekly Roundup: Amazon ECS, RDS for MySQL, and More – January 22, 2024",
      "description": "Check out the latest announcements from AWS in the AWS Weekly Roundup.",
      "createdAt": "2024-01-22T00:00:00.000Z",
      "updatedAt": "2024-01-22T00:00:00.000Z",
      "authors": ["Jeff Barr"],
      "keywords": ["Amazon ECS", "RDS for MySQL", "EMR Studio", "AWS Community"],
      "properties": {
        "kind": "text",
        "attrs": {
          "language": "en"
        }
      }
    },
    "callStack": []
  }
}
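To show how a consumer outside the pipeline might read such an event, the sketch below parses a serialized event and pulls out the feed URL, the item URL, and the title. The `DocumentRef` and `FeedItemEvent` types and the `readFeedItem` function are illustrative names that are not part of the Lakechain SDK; only the fields used here are typed.

```typescript
// Minimal, hypothetical view of the output event shown above.
interface DocumentRef {
  url: string;
  type: string;
  size?: number;
  etag?: string;
}

interface FeedItemEvent {
  type: string;
  data: {
    chainId: string;
    source: DocumentRef;   // The RSS or Atom feed.
    document: DocumentRef; // The HTML document of the feed item.
    metadata: { title?: string; authors?: string[] };
  };
}

interface FeedItemSummary {
  feedUrl: string;
  itemUrl: string;
  title: string | undefined;
}

// Parses a serialized event and returns the fields of interest.
// Throws if the event does not describe an HTML feed item.
function readFeedItem(json: string): FeedItemSummary {
  const event = JSON.parse(json) as FeedItemEvent;
  if (event.type !== 'document-created') {
    throw new Error(`Unexpected event type: ${event.type}`);
  }
  if (event.data.document.type !== 'text/html') {
    throw new Error(`Unexpected document type: ${event.data.document.type}`);
  }
  return {
    feedUrl: event.data.source.url,
    itemUrl: event.data.document.url,
    title: event.data.metadata.title
  };
}
```

The cast after `JSON.parse` performs no runtime validation beyond the two explicit checks, so a production consumer would likely validate the payload more thoroughly.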


ℹī¸ Limits

This middleware does not issue HTTP requests to fetch feed items and compute their size. As a result, the size property of the document in output events is not set for feed items.

Another limitation is that this middleware only outputs HTML documents; it does not currently forward RSS enclosures (e.g., associated images or videos) to downstream middlewares.
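Since the size may be absent, downstream code should treat it as optional rather than assume a number. A minimal sketch of that defensive handling follows; `DocumentDescriptor` and `summarizeSizes` are hypothetical names used only for illustration.

```typescript
// Hypothetical descriptor of a document carried by an output event.
interface DocumentDescriptor {
  url: string;
  type: string;
  size?: number; // May be missing for feed items.
}

// Sums the sizes that are known and counts the documents whose size is unknown,
// instead of silently treating a missing size as zero.
function summarizeSizes(
  documents: DocumentDescriptor[]
): { knownBytes: number; unknownCount: number } {
  let knownBytes = 0;
  let unknownCount = 0;
  for (const doc of documents) {
    if (typeof doc.size === 'number') {
      knownBytes += doc.size;
    } else {
      unknownCount += 1;
    }
  }
  return { knownBytes, unknownCount };
}
```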



🏗ī¸ Architecture

This middleware runs on AWS Lambda and uses the feedparser Python library to parse feeds and extract their items.

Figure: Syndication Feed Processor architecture.
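The fan-out performed by the middleware can be pictured with the conceptual sketch below. It is written in TypeScript for consistency with the rest of this page and is not the middleware's actual code, which runs in Python with feedparser. All type and function names (`ParsedFeed`, `FeedItem`, `OutputDocument`, `fanOut`) are assumptions made for illustration.

```typescript
// Hypothetical, simplified representation of a feed that has already been parsed.
interface FeedItem {
  link: string;
  title?: string;
}

interface ParsedFeed {
  url: string;
  type: 'application/rss+xml' | 'application/atom+xml';
  items: FeedItem[];
}

// Hypothetical, simplified representation of one output document.
interface OutputDocument {
  source: { url: string; type: string };
  document: { url: string; type: 'text/html' };
  metadata: { title?: string };
}

// Produces one HTML document descriptor per feed item that has a link.
// Each descriptor keeps a reference to the feed it was extracted from.
function fanOut(feed: ParsedFeed): OutputDocument[] {
  return feed.items
    .filter((item) => item.link.length > 0)
    .map((item) => ({
      source: { url: feed.url, type: feed.type },
      document: { url: item.link, type: 'text/html' as const },
      metadata: item.title !== undefined ? { title: item.title } : {}
    }));
}
```

Because each output is independent of the others, downstream middlewares can process them in parallel.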



🏷ī¸ Properties


Supported Inputs

| Mime Type | Description |
| --- | --- |
| application/rss+xml | RSS feeds. |
| application/atom+xml | Atom feeds. |

Supported Outputs

| Mime Type | Description |
| --- | --- |
| text/html | HTML documents. |

Supported Compute Types

| Type | Description |
| --- | --- |
| CPU | This middleware only supports CPU compute. |
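A producer that wants to check whether a document qualifies as input before sending it through the pipeline could use a check along the following lines. This is a sketch only: `isSupportedInput` is a hypothetical helper, and the middleware performs its own input filtering independently of it.

```typescript
// The input MIME types listed in the table above.
const SUPPORTED_INPUTS = new Set<string>([
  'application/rss+xml',
  'application/atom+xml'
]);

// Returns true when the given Content-Type value designates a supported feed.
// Parameters such as "; charset=utf-8" are ignored, and case is normalized.
function isSupportedInput(contentType: string): boolean {
  const mimeType = (contentType.split(';')[0] ?? '').trim().toLowerCase();
  return SUPPORTED_INPUTS.has(mimeType);
}
```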


📖 Examples