Spark EMR Serverless job
An Amazon EMR Serverless Spark job orchestrated through an AWS Step Functions state machine.
Overview
The construct creates an AWS Step Functions state machine that submits a Spark job and orchestrates its lifecycle. The construct uses the AWS SDK service integrations to submit the job. The state machine can take a cron expression to trigger the job on a schedule. The schema below shows the state machine:
Usage
The example stack below shows how to use the SparkEmrServerlessJob construct. The stack also contains a SparkEmrServerlessRuntime to show how to create an EMR Serverless application and pass it to the Spark job as its runtime.
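TypeScript:
import * as cdk from 'aws-cdk-lib';
import { CfnOutput } from 'aws-cdk-lib';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import * as dsf from '@cdklabs/aws-data-solutions-framework';

class ExampleSparkJobEmrServerlessStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    // `runtime` (a SparkEmrServerlessRuntime) and `executionRole` are assumed to be
    // created earlier in this constructor; their definitions are not shown here.
    const nightJob = new dsf.processing.SparkEmrServerlessJob(this, 'PiJob', {
      applicationId: runtime.application.attrApplicationId,
      name: 'PiCalculation',
      executionRole: executionRole,
      executionTimeout: cdk.Duration.minutes(15),
      s3LogBucket: Bucket.fromBucketName(this, 'LogBucket', 'emr-job-logs-EXAMPLE'),
      s3LogPrefix: 'logs',
      sparkSubmitEntryPoint: 'local:///usr/lib/spark/examples/src/main/python/pi.py',
    });
    // Expose the ARN of the state machine that orchestrates the job.
    new CfnOutput(this, 'job-state-machine', {
      value: nightJob.stateMachine!.stateMachineArn,
    });
  }
}
Python:
import aws_cdk as cdk
from aws_cdk import CfnOutput
from aws_cdk.aws_s3 import Bucket
import cdklabs.aws_data_solutions_framework as dsf

class ExampleSparkJobEmrServerlessStack(cdk.Stack):
    def __init__(self, scope, id):
        super().__init__(scope, id)
        # `runtime` (a SparkEmrServerlessRuntime) and `execution_role` are assumed to be
        # created earlier in this constructor; their definitions are not shown here.
        night_job = dsf.processing.SparkEmrServerlessJob(self, "PiJob",
            application_id=runtime.application.attr_application_id,
            name="PiCalculation",
            execution_role=execution_role,
            execution_timeout=cdk.Duration.minutes(15),
            s3_log_bucket=Bucket.from_bucket_name(self, "LogBucket", "emr-job-logs-EXAMPLE"),
            s3_log_prefix="logs",
            spark_submit_entry_point="local:///usr/lib/spark/examples/src/main/python/pi.py"
        )
        # Expose the ARN of the state machine that orchestrates the job.
        CfnOutput(self, "job-state-machine",
            value=night_job.state_machine.state_machine_arn
        )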
Using the EMR Serverless StartJobRun parameters
The SparkEmrServerlessJobProps interface provides a simple abstraction to create an EMR Serverless job. For deeper control over the job configuration, you can also use the SparkEmrServerlessJobApiProps interface, which provides the same interface as the StartJobRun API from EMR Serverless.
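To make the difference between the two interfaces more concrete, the sketch below shows roughly what a StartJobRun-shaped request looks like for the Pi job used earlier. It is a minimal illustration in plain TypeScript, not the construct's actual implementation: `SimpleSparkJobSettings`, `StartJobRunRequest`, and `buildStartJobRunRequest` are hypothetical names introduced here, and the application ID and role ARN are placeholders. The field names follow the StartJobRun API in the PascalCase form used by Step Functions AWS SDK service integrations; check the SparkEmrServerlessJobApiProps reference for the exact property under which the construct expects this object.

```typescript
// Hypothetical helper types, not part of the library: a few simplified
// settings and the StartJobRun-style request they map to.
interface SimpleSparkJobSettings {
  name: string;
  applicationId: string;
  executionRoleArn: string;
  entryPoint: string;
  entryPointArguments?: string[];
  sparkSubmitParameters?: string;
  executionTimeoutMinutes?: number;
  logUri?: string; // S3 URI where job logs should be written
}

interface StartJobRunRequest {
  Name: string;
  ApplicationId: string;
  ExecutionRoleArn: string;
  ExecutionTimeoutMinutes?: number;
  JobDriver: {
    SparkSubmit: {
      EntryPoint: string;
      EntryPointArguments?: string[];
      SparkSubmitParameters?: string;
    };
  };
  ConfigurationOverrides?: {
    MonitoringConfiguration: {
      S3MonitoringConfiguration: { LogUri: string };
    };
  };
}

// Builds the request, adding optional fields only when they were provided.
function buildStartJobRunRequest(s: SimpleSparkJobSettings): StartJobRunRequest {
  const request: StartJobRunRequest = {
    Name: s.name,
    ApplicationId: s.applicationId,
    ExecutionRoleArn: s.executionRoleArn,
    JobDriver: { SparkSubmit: { EntryPoint: s.entryPoint } },
  };
  if (s.entryPointArguments !== undefined) {
    request.JobDriver.SparkSubmit.EntryPointArguments = s.entryPointArguments;
  }
  if (s.sparkSubmitParameters !== undefined) {
    request.JobDriver.SparkSubmit.SparkSubmitParameters = s.sparkSubmitParameters;
  }
  if (s.executionTimeoutMinutes !== undefined) {
    request.ExecutionTimeoutMinutes = s.executionTimeoutMinutes;
  }
  if (s.logUri !== undefined) {
    request.ConfigurationOverrides = {
      MonitoringConfiguration: {
        S3MonitoringConfiguration: { LogUri: s.logUri },
      },
    };
  }
  return request;
}

// Usage: the Pi job from the example stack, with placeholder identifiers.
const piJobRequest = buildStartJobRunRequest({
  name: 'PiCalculation',
  applicationId: 'EXAMPLE-APPLICATION-ID', // placeholder
  executionRoleArn: 'arn:aws:iam::111122223333:role/EXAMPLE-ROLE', // placeholder
  entryPoint: 'local:///usr/lib/spark/examples/src/main/python/pi.py',
  executionTimeoutMinutes: 15,
  logUri: 's3://emr-job-logs-EXAMPLE/logs',
});

console.log(JSON.stringify(piJobRequest, null, 2));
```

Writing the request out this way shows what the simplified props hide: the nesting of `JobDriver.SparkSubmit` and the monitoring configuration. Those nested fields are the ones you would set directly when you need settings the simplified interface does not expose.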