Solution end-to-end architecture
Deploying this solution with the default parameters builds the following environment in AWS:
This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings (see the sketch after this list).
- Amazon CloudFront distributes the frontend web UI assets hosted in the Amazon S3 bucket, and the backend APIs hosted with Amazon API Gateway and AWS Lambda.
- The Amazon Cognito user pool or OpenID Connect (OIDC) is used for authentication.
- The web UI console uses Amazon DynamoDB to store persistent data.
- AWS Step Functions, AWS CloudFormation, AWS Lambda, and Amazon EventBridge are used for orchestrating the lifecycle management of data pipelines.
- The data pipeline is provisioned in the region specified by the system operator. It consists of Application Load Balancer (ALB), Amazon ECS, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon S3, Amazon EMR Serverless, Amazon Redshift, and Amazon QuickSight.
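For orientation, creating one of these stacks through the AWS CloudFormation API looks roughly like the following boto3 sketch. The stack name, template URL, and parameters are placeholders for illustration, not the solution's actual values.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Hypothetical template URL and parameters for illustration only.
response = cfn.create_stack(
    StackName="clickstream-pipeline",
    TemplateURL="https://example-bucket.s3.amazonaws.com/pipeline-template.json",
    Parameters=[
        {"ParameterKey": "VpcId", "ParameterValue": "vpc-0123456789abcdef0"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles
)
print("Stack ARN:", response["StackId"])

# Block until the stack has finished creating.
cfn.get_waiter("stack_create_complete").wait(StackName="clickstream-pipeline")
```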
Data pipeline
The key functionality of this solution is to build a data pipeline to collect, process, and analyze your clickstream data. The data pipeline consists of four modules:
- ingestion module
- data processing module
- data modeling module
- reporting module
The following sections introduce the architecture of each module.
Ingestion module
Suppose you create a data pipeline in the solution. This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings.
Note
The ingestion module supports three types of data sinks.
- (Optional) The ingestion module creates an AWS Global Accelerator endpoint to reduce the latency of sending events from your clients (web applications or mobile applications).
- Elastic Load Balancing (ELB) is used to load balance the ingestion web servers.
- (Optional) If you enable the authentication feature, the ALB communicates with the OIDC provider to authenticate the requests.
- ALB forwards all authenticated and valid requests to the ingestion servers.
- An Amazon ECS cluster hosts the ingestion fleet servers. Each server consists of a proxy and a worker service. The proxy handles the incoming HTTP requests, and the worker sends the events to a data sink based on your choice.
- Amazon Kinesis Data Streams is used as a buffer. AWS Lambda consumes the events in Kinesis Data Streams and then sinks them to Amazon S3 in batches (see the sketch after this list).
- Amazon MSK or a self-built Kafka cluster is used as a buffer. An MSK Connect connector is provisioned with an S3 sink plugin that writes the events to Amazon S3 in batches.
- The ingestion server buffers a batch of events and sinks them directly to Amazon S3.
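To illustrate the Kinesis Data Streams sink option, a Lambda consumer can decode the stream records and write each batch to S3 as a single object, roughly as follows. This is a minimal sketch under assumed names: the bucket, key prefix, and newline-delimited JSON layout are illustrative, not the solution's actual implementation.

```python
import base64
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
# Hypothetical bucket and prefix; the real solution configures its own sink bucket.
BUCKET = os.environ.get("SINK_BUCKET", "my-clickstream-sink")
PREFIX = os.environ.get("SINK_PREFIX", "raw-events/")

def handler(event, context):
    # Kinesis delivers record payloads base64-encoded; decode the whole batch.
    lines = [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event["Records"]
    ]
    if not lines:
        return {"written": 0}

    # Write the batch as one newline-delimited JSON object in S3.
    key = (
        f"{PREFIX}{datetime.now(timezone.utc).strftime('%Y/%m/%d/%H%M%S')}"
        f"-{context.aws_request_id}.jsonl"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
    return {"written": len(lines), "key": key}
```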
Data processing module
Suppose you create a data pipeline in the solution and enable ETL. This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings.
- Amazon EventBridge is used to trigger the ETL jobs periodically.
- The configurable time-based scheduler invokes an AWS Lambda function.
- The Lambda function kicks off an EMR Serverless application based on Spark to process a batch of clickstream events (see the sketch after this list).
- The EMR Serverless application uses the configurable transformer and enrichment plug-ins to process the clickstream events from the source S3 bucket.
- After processing the clickstream events, the EMR Serverless application sinks the processed events to the sink S3 bucket.
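To make the handoff from the scheduler to EMR Serverless concrete, the Lambda function above could start the Spark job roughly as in the following sketch. The application ID, execution role, entry point, and S3 paths are hypothetical stand-ins for the stack's actual configuration.

```python
import os

import boto3

emr = boto3.client("emr-serverless")

def handler(event, context):
    # All identifiers below are hypothetical and would come from the stack's configuration.
    response = emr.start_job_run(
        applicationId=os.environ["EMR_APPLICATION_ID"],
        executionRoleArn=os.environ["EMR_JOB_ROLE_ARN"],
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-solution-assets/etl/data-processing.jar",
                "entryPointArguments": [
                    "--source", "s3://my-clickstream-sink/raw-events/",
                    "--output", "s3://my-clickstream-sink/processed-events/",
                ],
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    )
    return {"jobRunId": response["jobRunId"]}
```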
Data modeling module
Suppose you create a data pipeline in the solution and enable data modeling in Amazon Redshift. This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings.
- After the processed clickstream event data is written to the Amazon S3 bucket, an `Object Created` event is emitted.
- An Amazon EventBridge rule is created for the event emitted in step 1, and an AWS Lambda function is invoked when the event happens.
- The Lambda function records the source event in an Amazon DynamoDB table so that it can be loaded later.
- When the data processing job is done, an event is emitted to Amazon EventBridge.
- The pre-defined Amazon EventBridge rule processes the `EMR job success` event.
- The rule invokes the AWS Step Functions workflow.
- The workflow invokes the `list objects` Lambda function, which queries the DynamoDB table to find the data to be loaded, then creates a manifest file for a batch of event data to optimize the load performance (see the first sketch after this list).
- After a few seconds, the `check status` Lambda function starts to check the status of the loading job.
- If the load is still in progress, the `check status` Lambda function waits a few more seconds.
- After all objects are loaded, the workflow ends.
- Once the load data workflow completes, the scan metadata workflow is triggered.
- The Lambda function checks whether the workflow should be started or not. If the interval since the last workflow initiation is less than one day or if the previous workflow has not yet finished, the current workflow is skipped.
- If it is necessary to start the current workflow, the `submit job` Lambda function is triggered.
- The Lambda function submits the stored procedure of the scan metadata job, initiating the metadata scanning process (see the second sketch after this list).
- After a few seconds, the `check status` Lambda function starts to check the status of the scan job.
- If the scan is still in progress, the `check status` Lambda function waits a few more seconds.
- Once the data scanning is completed, the `store metadata` Lambda function is triggered.
- The Lambda function saves the metadata to the DynamoDB table, and the workflow ends.
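The manifest mentioned in the load workflow is a standard Amazon Redshift COPY manifest: a JSON list of the S3 objects to load in one batch. The first sketch below shows, under assumed table, bucket, and role names, how a `list objects`-style function might build the manifest and submit the COPY through the Redshift Data API; the solution's actual SQL and naming will differ.

```python
import json

import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# Hypothetical names for illustration only.
SINK_BUCKET = "my-clickstream-sink"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"

def load_batch(keys: list[str]) -> str:
    # Build a Redshift COPY manifest listing each processed object in the batch.
    manifest = {
        "entries": [
            {"url": f"s3://{SINK_BUCKET}/{key}", "mandatory": True} for key in keys
        ]
    }
    manifest_key = "manifests/batch.manifest"
    s3.put_object(Bucket=SINK_BUCKET, Key=manifest_key, Body=json.dumps(manifest))

    # Kick off the load and return the statement ID, which a separate
    # "check status" step can poll with describe_statement.
    stmt = redshift_data.execute_statement(
        WorkgroupName="clickstream",  # or ClusterIdentifier=... for provisioned
        Database="clickstream",
        Sql=(
            f"COPY clickstream.event FROM 's3://{SINK_BUCKET}/{manifest_key}' "
            f"IAM_ROLE '{COPY_ROLE_ARN}' FORMAT AS PARQUET MANIFEST"
        ),
    )
    return stmt["Id"]
```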
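Likewise, the `submit job` and `check status` steps of the scan metadata workflow map naturally onto the Redshift Data API: submit a CALL statement, then poll its status. A second sketch, with a hypothetical stored procedure name; in the real workflow the polling is spread across Step Functions wait states and repeated `check status` invocations rather than a loop.

```python
import time

import boto3

redshift_data = boto3.client("redshift-data")

def scan_metadata() -> None:
    # Hypothetical stored procedure name; the solution defines its own.
    stmt = redshift_data.execute_statement(
        WorkgroupName="clickstream",
        Database="clickstream",
        Sql="CALL clickstream.sp_scan_metadata()",
    )

    # Poll until the statement finishes or fails.
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"scan metadata job ended with status {status}")
        time.sleep(5)
```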
Suppose you create a data pipeline in the solution and enable data modeling in Amazon Athena. This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings.
- Amazon EventBridge initiates the data load into Amazon Athena periodically.
- The configurable time-based scheduler invokes an AWS Lambda function.
- The AWS Lambda function creates the partitions of the AWS Glue table for the processed clickstream data (see the sketch after this list).
- Amazon Athena is used for interactive querying of clickstream events.
- The processed clickstream data is scanned through the AWS Glue table.
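As a sketch of the partition-creation step, the Lambda function can register each day's partition with the AWS Glue Data Catalog as follows. The database, table, partition keys, and S3 location are assumptions for illustration.

```python
from datetime import date

import boto3

glue = boto3.client("glue")

def add_daily_partition(day: date) -> None:
    # Hypothetical names; the solution provisions its own Glue database and table.
    database, table = "clickstream", "processed_events"
    location = (
        f"s3://my-clickstream-sink/processed-events/"
        f"partition_year={day:%Y}/partition_month={day:%m}/partition_day={day:%d}/"
    )

    # Reuse the table's storage descriptor so the partition inherits its format.
    descriptor = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    descriptor["Location"] = location

    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={
            "Values": [f"{day:%Y}", f"{day:%m}", f"{day:%d}"],
            "StorageDescriptor": descriptor,
        },
    )
```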
Reporting module
Suppose you create a data pipeline in the solution, enable data modeling in Amazon Redshift, and enable reporting in Amazon QuickSight. This solution deploys an AWS CloudFormation template in your AWS account and completes the following settings.
- A VPC connection in Amazon QuickSight is used to securely connect to your Redshift cluster within the VPC (see the sketch after this list).
- The data source, data sets, template, analysis, and dashboard are created in Amazon QuickSight for out-of-the-box analysis and visualization.
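As an illustration of the VPC connection step, registering a Redshift data source that routes through a QuickSight VPC connection looks roughly like this; all identifiers are placeholders, and credentials and permissions are omitted for brevity.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
ACCOUNT_ID = "123456789012"  # hypothetical account ID

quicksight.create_data_source(
    AwsAccountId=ACCOUNT_ID,
    DataSourceId="clickstream-redshift",
    Name="Clickstream Redshift",
    Type="REDSHIFT",
    DataSourceParameters={
        "RedshiftParameters": {
            "Database": "clickstream",
            "ClusterId": "clickstream-cluster",
        }
    },
    # Route QuickSight's queries through the VPC connection rather than
    # a publicly accessible Redshift endpoint.
    VpcConnectionProperties={
        "VpcConnectionArn": f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:vpcConnection/clickstream"
    },
)
```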
Analytics Studio
Analytics Studio is a unified web interface for business analysts or data analysts to view and create dashboards, query and explore clickstream data, and manage metadata.
- When analysts access Analytics Studio, requests are sent to Amazon CloudFront, which distributes the web application.
- When the analysts log in to Analytics Studio, the requests are redirected to the Amazon Cognito user pool or OpenID Connect (OIDC) for authentication.
- Amazon API Gateway hosts the backend API requests and uses a custom Lambda authorizer to authorize the requests with the public key of the OIDC provider (see the first sketch after this list).
- API Gateway integrates with AWS Lambda to serve the API requests.
- The Lambda function uses Amazon DynamoDB to retrieve and persist the data.
- When analysts create analyses, the Lambda function requests Amazon QuickSight to create assets and gets the embed URL in the data pipeline region (see the second sketch after this list).
- The analysts' browsers access the QuickSight embed URL to view the QuickSight dashboards and visuals.
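To make the authorizer step concrete, the first sketch below verifies a request's JWT against the OIDC provider's published keys using PyJWT. The issuer and audience values are assumptions, and a production authorizer would construct whatever response format its API Gateway integration type expects.

```python
import jwt  # PyJWT

# Hypothetical OIDC issuer and audience; the solution reads these from its configuration.
ISSUER = "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_example"
AUDIENCE = "example-client-id"
jwks_client = jwt.PyJWKClient(f"{ISSUER}/.well-known/jwks.json")

def handler(event, context):
    token = event["headers"]["authorization"].removeprefix("Bearer ")

    # Look up the signing key matching the token's "kid" header, then verify
    # the signature, audience, and issuer in one call.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )

    # This sketch returns the simple-response format used by HTTP API
    # authorizers; a REST API authorizer would return an IAM policy instead.
    return {"isAuthorized": True, "context": {"sub": claims["sub"]}}
```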
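For the embedding step, the second sketch requests a short-lived embed URL for a registered QuickSight user in the data pipeline region; the account, user, and dashboard identifiers are placeholders.

```python
import boto3

# The client targets the pipeline region, which may differ from the web console's region.
quicksight = boto3.client("quicksight", region_name="us-east-1")

def get_embed_url(account_id: str, user_arn: str, dashboard_id: str) -> str:
    response = quicksight.generate_embed_url_for_registered_user(
        AwsAccountId=account_id,
        UserArn=user_arn,
        ExperienceConfiguration={
            "Dashboard": {"InitialDashboardId": dashboard_id}
        },
        SessionLifetimeInMinutes=60,
    )
    # The returned URL is short-lived and is loaded by the analyst's browser.
    return response["EmbedUrl"]
```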