
Quick start

❗ If you're new to AWS CDK, we recommend going through a few basic examples first.

The DSF on AWS library is available in TypeScript and Python; the code examples in this quick start use TypeScript.

In this quick start, we show you how to use DSF to deploy an EMR Serverless application, deploy an S3 bucket configured with AWS best practices, run a Spark word count application, and store the result in the created S3 bucket. You can find the full quick start example here.

The sections below take you through the steps of creating the CDK application and using it to deploy the infrastructure.

Create a CDK app

mkdir dsf-example && cd dsf-example
cdk init app --language typescript

We can now install DSF on AWS:

npm i @cdklabs/aws-data-solutions-framework --save

Create a data lake storage

We will now use AnalyticsBucket to create an encrypted storage layer for our data lake on AWS.

In lib/dsf-example-stack.ts

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dsf from '@cdklabs/aws-data-solutions-framework';
import { Key } from 'aws-cdk-lib/aws-kms';
import { Policy, PolicyStatement } from 'aws-cdk-lib/aws-iam';

export class DsfExampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create an encrypted S3 bucket following analytics best practices
    const storage = new dsf.storage.AnalyticsBucket(this, 'AnalyticsBucket', {
      encryptionKey: new Key(this, 'DataKey', {
        enableKeyRotation: true,
        removalPolicy: cdk.RemovalPolicy.DESTROY,
      }),
    });

    // The code from the following sections is added here, inside the constructor
  }
}

Create the EMR Serverless Application and execution role

We will now use SparkEmrServerlessRuntime. In this step, we create an EMR Serverless application and an execution IAM role, to which we grant read and write access to the S3 bucket created earlier.

In lib/dsf-example-stack.ts


// Use DSF on AWS to create the Spark EMR Serverless runtime
const runtimeServerless = new dsf.processing.SparkEmrServerlessRuntime(this, 'SparkRuntimeServerless', {
  name: 'WordCount',
});

// Define the policy allowing the execution role to read the data transformation script
const s3ReadPolicy = new Policy(this, 's3ReadPolicy', {
  statements: [
    new PolicyStatement({
      actions: ['s3:GetObject', 's3:ListBucket'],
      resources: ['arn:aws:s3:::*.elasticmapreduce/*', 'arn:aws:s3:::*.elasticmapreduce'],
    }),
  ],
});

// Use DSF on AWS to create the execution role for the EMR Serverless job
const executionRole = dsf.processing.SparkEmrServerlessRuntime.createExecutionRole(this, 'ProcessingExecRole');

// Provide access for the execution role to read the data transformation script
executionRole.attachInlinePolicy(s3ReadPolicy);

// Provide access for the execution role to read and write data in the created bucket
storage.grantReadWrite(executionRole);

Output resource IDs and ARNs

Last, we output the ARN of the execution role, the ARN and ID of the EMR Serverless application, and the URI of the created bucket. These values are passed to the AWS CLI when executing the start-job-run command.

In lib/dsf-example-stack.ts


new cdk.CfnOutput(this, 'EMRServerlessApplicationId', { value: runtimeServerless.application.attrApplicationId });
new cdk.CfnOutput(this, 'EMRServerlessApplicationARN', { value: runtimeServerless.application.attrArn });
new cdk.CfnOutput(this, 'EMRServerlessExecutionRoleARN', { value: executionRole.roleArn });
new cdk.CfnOutput(this, 'BucketURI', { value: `s3://${storage.bucketName}` });

Deploy the CDK app

If this is the first time you deploy an AWS CDK app into an environment (account/region), you first need to install a “bootstrap stack”.

cdk bootstrap

After the bootstrap completes, you can deploy the stack.

cdk deploy
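
The deployment prints the stack outputs defined earlier. If you need to retrieve them again later, you can query CloudFormation directly; a minimal example, assuming the stack keeps the default DsfExampleStack name generated by cdk init:

aws cloudformation describe-stacks \
  --stack-name DsfExampleStack \
  --query "Stacks[0].Outputs"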

Submit a job

In the command below, replace EMRServerlessApplicationId, EMRServerlessExecutionRoleARN, and BucketURI with the corresponding values from the stack outputs (note that BucketURI already includes the s3:// prefix).

aws emr-serverless start-job-run \
  --application-id EMRServerlessApplicationId \
  --execution-role-arn EMRServerlessExecutionRoleARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
      "entryPointArguments": [
        "BucketURI/wordcount_output/"
      ],
      "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }'
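
The start-job-run call returns a jobRunId. As a quick sanity check, you can poll the job status and, once it succeeds, list the word count output written to the bucket; the placeholders below are again your own values:

aws emr-serverless get-job-run \
  --application-id EMRServerlessApplicationId \
  --job-run-id JobRunId

aws s3 ls BucketURI/wordcount_output/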

Congrats, you created your first CDK app using DSF on AWS! Go ahead and explore all available constructs and examples.
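
When you are done experimenting, you can remove the deployed stack; depending on the removal policies configured, some resources such as the S3 bucket or KMS key may be retained and require manual cleanup.

cdk destroy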