Quick start
:heavy_exclamation_mark: If you're new to AWS CDK, we recommend going through a few basic examples first.
The DSF on AWS library is available in TypeScript or Python. Select the tab for your preferred language to see the matching code examples.
In this quick start, we show how to use DSF to deploy an EMR Serverless application and an S3 bucket configured with AWS best practices, run a Spark word count application, and store the result in the created S3 bucket. You can find the full quick start example here.
The sections below take you through creating the CDK application and using it to deploy the infrastructure.
Create a CDK app
mkdir dsf-example && cd dsf-example
- TypeScript
cdk init app --language typescript
- Python
cdk init app --language python
# Once you create the app, activate the Python virtual environment:
source .venv/bin/activate
We can now install DSF on AWS:
- TypeScript
npm i @cdklabs/aws-data-solutions-framework --save
- Python
# Add DSF on AWS to requirements.txt
# requirements.txt:
...
cdklabs.aws_data_solutions_framework
...
# Then you can install CDK app requirements:
python -m pip install -r requirements.txt
Create a data lake storage
We will now use AnalyticsBucket to create a storage layer for our data lake on AWS.
- TypeScript
- Python
In lib/dsf-example-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as dsf from '@cdklabs/aws-data-solutions-framework';
import { Key } from 'aws-cdk-lib/aws-kms';
import { Policy, PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
export class DsfExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Create an encrypted bucket for the data lake; the key is destroyed with the stack
    const storage = new dsf.storage.AnalyticsBucket(this, 'AnalyticsBucket', {
      encryptionKey: new Key(this, 'DataKey', {
        enableKeyRotation: true,
        removalPolicy: cdk.RemovalPolicy.DESTROY
      }),
    });
  }
}
In dsf_example/dsf_example_stack.py
import cdklabs.aws_data_solutions_framework as dsf
from aws_cdk.aws_s3 import Bucket
from aws_cdk import Stack, RemovalPolicy, CfnOutput
from aws_cdk.aws_iam import Policy, PolicyStatement
from aws_cdk.aws_kms import Key
from constructs import Construct
class DsfExampleStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        storage = dsf.storage.AnalyticsBucket(
            self, "DataLakeStorage",
            removal_policy=RemovalPolicy.DESTROY,
            encryption_key=Key(self, "StorageEncryptionKey",
                removal_policy=RemovalPolicy.DESTROY,
                enable_key_rotation=True
            )
        )
Create the EMR Serverless Application and execution role
We will now use SparkEmrServerlessRuntime. In this step we create an EMR Serverless application and an IAM execution role, and grant the role read and write access to the created S3 bucket.
- TypeScript
- Python
In lib/dsf-example-stack.ts
// Use DSF on AWS to create the Spark EMR Serverless runtime
const runtimeServerless = new dsf.processing.SparkEmrServerlessRuntime(this, 'SparkRuntimeServerless', {
  name: 'WordCount',
});
// Define a policy that lets the execution role read the data transformation script
const s3ReadPolicy = new Policy(this, 's3ReadPolicy', {
  statements: [
    new PolicyStatement({
      actions: ['s3:GetObject', 's3:ListBucket'],
      resources: ['arn:aws:s3:::*.elasticmapreduce/*', 'arn:aws:s3:::*.elasticmapreduce'],
    }),
  ],
});
// Use DSF on AWS to create the execution role for the Spark job
const executionRole = dsf.processing.SparkEmrServerlessRuntime.createExecutionRole(this, 'ProcessingExecRole');
// Allow the execution role to read the data transformation script
executionRole.attachInlinePolicy(s3ReadPolicy);
// Allow the execution role to read and write data in the created bucket
storage.grantReadWrite(executionRole);
In dsf_example/dsf_example_stack.py
# Use DSF on AWS to create the Spark EMR Serverless runtime
spark_runtime = dsf.processing.SparkEmrServerlessRuntime(
    self, "SparkProcessingRuntime", name="WordCount",
    removal_policy=RemovalPolicy.DESTROY,
)
# Define a policy that lets the execution role read the transformation script from the S3 bucket where it is stored
s3_read_policy = Policy(self, 'S3ReadPolicy',
    statements=[
        PolicyStatement(
            actions=["s3:GetObject", "s3:ListBucket"],
            resources=["arn:aws:s3:::*.elasticmapreduce/*", "arn:aws:s3:::*.elasticmapreduce"]
        )
    ]
)
# Use DSF on AWS to create the execution role for the Spark job
processing_exec_role = dsf.processing.SparkEmrServerlessRuntime.create_execution_role(self, "ProcessingExecRole")
# Allow the execution role to read the data transformation script
processing_exec_role.attach_inline_policy(s3_read_policy)
# Allow the execution role to read and write data in the created bucket
storage.grant_read_write(processing_exec_role)
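To see why the wildcard resources in the policy above cover the sample script used in the job submission step, the following sketch approximates IAM's `*` wildcard with Python's `fnmatch`. The helper name `is_covered` is hypothetical, and `fnmatch` is only an approximation of IAM matching (it also gives `[` and `]` a special meaning), so treat this as an illustration rather than a policy evaluator.

```python
from fnmatch import fnmatchcase

# Resource patterns copied from the policy statement above.
RESOURCE_PATTERNS = [
    "arn:aws:s3:::*.elasticmapreduce/*",  # objects in buckets whose name ends in .elasticmapreduce
    "arn:aws:s3:::*.elasticmapreduce",    # the buckets themselves (needed for s3:ListBucket)
]


def is_covered(arn: str, patterns=RESOURCE_PATTERNS) -> bool:
    """Return True if the ARN matches at least one resource pattern.

    Approximation: fnmatchcase stands in for IAM wildcard matching.
    """
    return any(fnmatchcase(arn, pattern) for pattern in patterns)


# ARN of the word count script referenced in the job submission step.
script_arn = (
    "arn:aws:s3:::us-east-1.elasticmapreduce/"
    "emr-containers/samples/wordcount/scripts/wordcount.py"
)

print(is_covered(script_arn))                                  # sample script
print(is_covered("arn:aws:s3:::my-private-bucket/data.csv"))   # unrelated bucket
```

Access to the created bucket does not come from this policy; it is granted separately by the read/write grant on the storage construct.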
Output resource IDs and ARNs
Last, we output the ID and ARN of the EMR Serverless application, the ARN of the execution role, and the URI of the bucket. These values are passed to the AWS CLI when executing the StartJobRun command.
- TypeScript
- Python
In lib/dsf-example-stack.ts
new cdk.CfnOutput(this, "EMRServerlessApplicationId", { value : runtimeServerless.application.attrApplicationId });
new cdk.CfnOutput(this, "EMRServerlessApplicationARN", { value : runtimeServerless.application.attrArn });
new cdk.CfnOutput(this, "EMRServelessExecutionRoleARN", { value : executionRole.roleArn });
new cdk.CfnOutput(this, "BucketURI", { value : `s3://${storage.bucketName}` });
In dsf_example/dsf_example_stack.py
CfnOutput(self, "EMRServerlessApplicationId", value=spark_runtime.application.attr_application_id)
CfnOutput(self, "EMRServerlessApplicationARN", value=spark_runtime.application.attr_arn)
CfnOutput(self, "EMRServelessExecutionRoleARN", value=processing_exec_role.role_arn)
CfnOutput(self, "BucketURI", value=f"s3://{storage.bucket_name}")
Deploy the CDK app
If this is the first time you deploy an AWS CDK app into an environment (account/region), you need to install a “bootstrap stack” first.
cdk bootstrap
After the bootstrap completes, you can deploy the stack.
cdk deploy
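The stack outputs are needed in the next step. One way to capture them is to deploy with `cdk deploy --outputs-file outputs.json` and read the file with a small script. The sketch below assumes the stack is named `DsfExampleStack` and that the file maps each stack name to its output keys and values; the `load_outputs` helper and the sample values are illustrative placeholders, not part of DSF.

```python
import json
import tempfile
from pathlib import Path


def load_outputs(path, stack_name: str = "DsfExampleStack") -> dict:
    """Return the outputs of one stack from a CDK outputs file."""
    data = json.loads(Path(path).read_text())
    if stack_name not in data:
        raise KeyError(f"stack {stack_name!r} not found in {path}")
    return data[stack_name]


# Demonstration with placeholder values. A real file would be written by
# `cdk deploy --outputs-file outputs.json`.
sample = {
    "DsfExampleStack": {
        "EMRServerlessApplicationId": "<application-id>",
        "EMRServelessExecutionRoleARN": "<role-arn>",
        "BucketURI": "s3://<bucket-name>",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    outputs_file = Path(tmp) / "outputs.json"
    outputs_file.write_text(json.dumps(sample))
    outputs = load_outputs(outputs_file)

print(outputs["BucketURI"])
```

Reading the values from a file avoids copying them by hand from the terminal output of the deployment.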
Submit a job
Start the word count job with the AWS CLI, replacing the EMRServerlessApplicationId, EMRServelessExecutionRoleARN, and BucketURI placeholders with the corresponding stack output values:
aws emr-serverless start-job-run \
--application-id EMRServerlessApplicationId \
--execution-role-arn EMRServelessExecutionRoleARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
      "entryPointArguments": [
        "BucketURI/wordcount_output/"
      ],
      "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }'
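The command above can also be assembled programmatically, which avoids hand-editing the JSON. The sketch below builds the same job driver payload with the standard library only. `build_job_driver` and `build_command` are hypothetical helpers, the argument values are placeholders, and the bucket URI is assumed to already include the `s3://` prefix, as the BucketURI stack output does.

```python
import json
import shlex

ENTRY_POINT = (
    "s3://us-east-1.elasticmapreduce/emr-containers/samples/"
    "wordcount/scripts/wordcount.py"
)

# Spark settings taken from the command above.
SPARK_CONF = {
    "spark.executor.cores": "1",
    "spark.executor.memory": "4g",
    "spark.driver.cores": "1",
    "spark.driver.memory": "4g",
    "spark.executor.instances": "1",
    "spark.hadoop.hive.metastore.client.factory.class":
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
}


def build_job_driver(bucket_uri: str) -> dict:
    """Build the sparkSubmit job driver; bucket_uri includes the s3:// prefix."""
    params = " ".join(f"--conf {key}={value}" for key, value in SPARK_CONF.items())
    return {
        "sparkSubmit": {
            "entryPoint": ENTRY_POINT,
            "entryPointArguments": [f"{bucket_uri.rstrip('/')}/wordcount_output/"],
            "sparkSubmitParameters": params,
        }
    }


def build_command(application_id: str, execution_role_arn: str, bucket_uri: str) -> list:
    """Return the AWS CLI invocation as an argument list."""
    return [
        "aws", "emr-serverless", "start-job-run",
        "--application-id", application_id,
        "--execution-role-arn", execution_role_arn,
        "--job-driver", json.dumps(build_job_driver(bucket_uri)),
    ]


# Placeholder values; substitute the stack outputs from the deployment.
command = build_command("<application-id>", "<role-arn>", "s3://<bucket-name>")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would submit the job, assuming the AWS CLI is installed and configured with credentials for the target account.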
Congrats, you created your first CDK app using DSF on AWS! Go ahead and explore all available constructs and examples.