# Mountpoint S3 Express
This example demonstrates ultra-high-performance S3 Express One Zone storage, delivering single-digit-millisecond latency for high-throughput Spark analytics workloads.
## Overview
S3 Express One Zone is a high-performance storage class that delivers consistent single-digit millisecond data access for frequently accessed data. Combined with Mountpoint for S3, it provides the fastest possible S3 integration for Spark workloads.
## Key Features
- Ultra-Low Latency: Single-digit millisecond access times
- High Throughput: Up to 10x faster than standard S3
- Zone-Optimized: Co-located with compute for minimal network latency
- Cost Effective: Pay only for what you use with no minimum commitments
- POSIX Interface: Standard file system semantics via Mountpoint
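For orientation, mounting a directory bucket by hand with the standalone `mount-s3` CLI looks like the sketch below; the bucket name and mount path are illustrative and not part of this example's manifests:

```bash
# Illustrative: mount an S3 Express One Zone directory bucket with Mountpoint.
# mount-s3 recognizes directory buckets from the --<az-id>--x-s3 name suffix.
mount-s3 spark-express-123--usw2-az1--x-s3 /mnt/s3express --region us-west-2

# Any POSIX reader (including Spark local I/O) can now list and read files
ls /mnt/s3express
```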
## Prerequisites
Before deploying this example, ensure you have:
- ✅ Infrastructure deployed
- ✅ `kubectl` configured for your EKS cluster
- ✅ S3 Express One Zone bucket created in the same AZ as your EKS nodes
- ✅ IAM roles with S3 Express permissions (minimal policy example below)
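The exact policy depends on how your IAM roles are provisioned, but a minimal statement for S3 Express access is sketched below; the account ID and bucket name are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ExpressSession",
      "Effect": "Allow",
      "Action": "s3express:CreateSession",
      "Resource": "arn:aws:s3express:us-west-2:ACCOUNT:bucket/spark-express-123--usw2-az1--x-s3"
    }
  ]
}
```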
## Quick Deploy & Test
### 1. Create S3 Express One Zone Bucket
```bash
# Create an S3 Express One Zone directory bucket in the same AZ as your EKS nodes.
# Directory bucket names must embed the AZ *ID* (e.g. usw2-az1), not the AZ name.
export AZ="us-west-2a"
export AZ_ID=$(aws ec2 describe-availability-zones --region us-west-2 \
  --query "AvailabilityZones[?ZoneName=='${AZ}'].ZoneId" --output text)
export BUCKET_NAME="spark-express-${RANDOM}--${AZ_ID}--x-s3"

# Directory buckets are S3 Express One Zone by construction; no separate
# step is needed to "enable" the storage class.
aws s3api create-bucket \
  --bucket "$BUCKET_NAME" \
  --create-bucket-configuration "Location={Type=AvailabilityZone,Name=${AZ_ID}},Bucket={DataRedundancy=SingleAvailabilityZone,Type=Directory}" \
  --region us-west-2
```
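You can confirm the bucket was created with the dedicated directory-bucket listing API:

```bash
# Directory buckets do not appear in `aws s3api list-buckets`;
# they have their own listing call.
aws s3api list-directory-buckets --region us-west-2 \
  --query "Buckets[].Name" --output text
```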
### 2. Deploy Mountpoint for S3 Express
```bash
# Navigate to the data-stacks directory
cd data-stacks/spark-on-eks

# Substitute the bucket name into the manifest and apply it
export S3_EXPRESS_BUCKET=$BUCKET_NAME
envsubst < examples/mountpoint-s3-spark/mountpoint-s3express-daemonset.yaml | kubectl apply -f -
```
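The DaemonSet manifest ships with this example; if you are wiring the mount up yourself, the Mountpoint CSI driver consumes directory buckets through a static PersistentVolume along these lines (names and sizes are illustrative):

```yaml
# Illustrative static PV for the Mountpoint for S3 CSI driver (s3.csi.aws.com)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3express-pv
spec:
  capacity:
    storage: 1200Gi            # required by the API, ignored by the driver
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # static provisioning
  mountOptions:
    - allow-delete
    - region us-west-2
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3express-pv-handle               # must be unique per PV
    volumeAttributes:
      bucketName: spark-express-123--usw2-az1--x-s3 # your directory bucket
```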
### 3. Deploy Spark Application
```bash
# Deploy the S3 Express optimized Spark job
envsubst < examples/mountpoint-s3-spark/spark-s3express-job.yaml | kubectl apply -f -
```
### 4. Verify Performance
```bash
# Monitor the Spark application
kubectl get sparkapplications -n spark-team-a

# Check latency metrics in the driver logs
kubectl logs -n spark-team-a -l spark-role=driver | grep -i "s3.*time"

# View detailed performance metrics in the Spark History Server
kubectl port-forward -n spark-history svc/spark-history-server 18080:80
```
## Configuration Details
### S3 Express One Zone Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: s3express-config
  namespace: spark-team-a
data:
  # Directory bucket names embed the AZ ID, not the AZ name
  bucket: "spark-express-123--usw2-az1--x-s3"
  region: "us-west-2"
  availability-zone: "us-west-2a"
  endpoint: "https://s3express-control.us-west-2.amazonaws.com"
```
### Mountpoint S3 Express DaemonSet
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mountpoint-s3express
  namespace: kube-system
spec:
  selector:                      # required by the DaemonSet API
    matchLabels:
      app: mountpoint-s3express
  template:
    metadata:
      labels:
        app: mountpoint-s3express
    spec:
      containers:
        - name: mountpoint-s3express
          image: public.ecr.aws/mountpoint-s3/mountpoint-s3-csi-driver:latest
          args:
            - "--cache-size=20GB"
            - "--part-size=8MB"
            - "--max-concurrent-requests=64"
            - "--read-timeout=5s"
            - "--write-timeout=10s"
            - "--s3-express-one-zone"
          env:
            - name: S3_EXPRESS_BUCKET
              valueFrom:
                configMapKeyRef:
                  name: s3express-config
                  key: bucket
```
### Spark Application Optimization
```yaml
spec:
  sparkConf:
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.s3express.create.session": "true"
    "spark.hadoop.fs.s3a.s3express.session.duration": "300"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.multipart.size": "8388608"
    "spark.hadoop.fs.s3a.multipart.threshold": "8388608"
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "134217728"
```
## Performance Validation
### Latency Benchmarking
```bash
# Run a latency-sensitive workload
kubectl apply -f - <<EOF
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: s3express-latency-test
  namespace: spark-team-a
spec:
  type: Python
  mode: cluster
  image: public.ecr.aws/spark/spark-py:3.5.0
  mainApplicationFile: "s3a://${S3_EXPRESS_BUCKET}/scripts/latency-test.py"
  sparkConf:
    "spark.hadoop.fs.s3a.s3express.create.session": "true"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://${S3_EXPRESS_BUCKET}/spark-logs/"
  driver:
    cores: 2
    memory: "4g"
  executor:
    cores: 4
    instances: 8
    memory: "8g"
EOF
```
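The `scripts/latency-test.py` file referenced above is not reproduced here; a hypothetical version that times repeated small reads might look like:

```python
# Hypothetical latency-test.py: time repeated small reads against the bucket.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3express-latency-test").getOrCreate()

# Placeholder path; point this at a small object in your directory bucket
path = "s3a://spark-express-123--usw2-az1--x-s3/data/sample.parquet"

samples_ms = []
for _ in range(20):
    start = time.perf_counter()
    spark.read.parquet(path).limit(1).collect()   # small, latency-bound read
    samples_ms.append((time.perf_counter() - start) * 1000)

samples_ms.sort()
print(f"min={samples_ms[0]:.1f} ms  "
      f"p50={samples_ms[len(samples_ms) // 2]:.1f} ms  "
      f"max={samples_ms[-1]:.1f} ms")
spark.stop()
```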
### Throughput Testing
```bash
# Monitor I/O performance from the driver pod
kubectl exec -n spark-team-a <driver-pod> -- iostat -x 1 10

# Check S3 object-count metrics (a daily storage metric; request-level
# metrics require a metrics configuration on the bucket)
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name NumberOfObjects \
  --dimensions Name=BucketName,Value=$S3_EXPRESS_BUCKET Name=StorageType,Value=AllStorageTypes \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average
```
## Expected Results
- ✅ Ultra-Low Latency: Under 10 ms access times for hot data
- ✅ High Throughput: Up to 10x faster than standard S3
- ✅ Consistent Performance: Predictable latency under load
- ✅ Cost Optimization: Pay-per-request with no minimums
- ✅ Spark Integration: Seamless use from existing Spark code
## Troubleshooting
### Common Issues
**S3 Express session creation failures:**
```bash
# Check IAM permissions for S3 Express
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::ACCOUNT:role/spark-team-a \
  --action-names s3express:CreateSession \
  --resource-arns arn:aws:s3express:us-west-2:ACCOUNT:bucket/$S3_EXPRESS_BUCKET

# Verify the bucket exists and is reachable (there is no `describe-bucket` call)
aws s3api head-bucket --bucket $S3_EXPRESS_BUCKET
```
**High latency despite S3 Express:**
```bash
# Check AZ alignment between nodes and the bucket
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# Verify the S3 Express endpoint resolves
kubectl exec -n spark-team-a <pod-name> -- nslookup s3express-control.us-west-2.amazonaws.com

# Monitor network latency (note: many AWS endpoints do not answer ICMP)
kubectl exec -n spark-team-a <pod-name> -- ping -c 10 s3express-control.us-west-2.amazonaws.com
```
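If the nodes turn out to sit in a different AZ than the bucket, pin the Spark pods to the bucket's zone. With the Spark operator that is a `nodeSelector` on both driver and executor (zone value illustrative):

```yaml
spec:
  driver:
    nodeSelector:
      topology.kubernetes.io/zone: us-west-2a   # same AZ as the bucket
  executor:
    nodeSelector:
      topology.kubernetes.io/zone: us-west-2a
```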
**Performance degradation:**
```bash
# Check Mountpoint cache utilization
kubectl exec -n kube-system <mountpoint-pod> -- df -h /mnt/s3express-cache

# Monitor for throttling or concurrent request limits
kubectl logs -n kube-system -l app=mountpoint-s3express | grep -i "throttl\|limit"

# Confirm the directory bucket is still reachable
aws s3api head-bucket --bucket $S3_EXPRESS_BUCKET
```
## Advanced Configuration
### Multi-AZ Deployment
```yaml
# Deploy across multiple AZs for HA: one directory bucket per AZ,
# each named with its AZ ID
apiVersion: v1
kind: ConfigMap
metadata:
  name: s3express-multi-az
data:
  primary-bucket: "spark-express-123--usw2-az1--x-s3"
  secondary-bucket: "spark-express-456--usw2-az2--x-s3"
  tertiary-bucket: "spark-express-789--usw2-az3--x-s3"
```
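Jobs then need to pick the bucket co-located with the zone they are scheduled into. One approach, sketched below under the assumption that pods can reach IMDSv2, is to resolve the node's AZ ID at startup (the bucket mapping mirrors the ConfigMap above):

```bash
# Illustrative: select the directory bucket in this node's AZ via IMDSv2
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
AZ_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/availability-zone-id)

case "$AZ_ID" in
  usw2-az1) export S3_EXPRESS_BUCKET="spark-express-123--usw2-az1--x-s3" ;;
  usw2-az2) export S3_EXPRESS_BUCKET="spark-express-456--usw2-az2--x-s3" ;;
  usw2-az3) export S3_EXPRESS_BUCKET="spark-express-789--usw2-az3--x-s3" ;;
  *)        echo "no bucket mapped for ${AZ_ID}" >&2; exit 1 ;;
esac
```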
### Session Management
```yaml
# Optimize S3 Express session configuration
spec:
  sparkConf:
    "spark.hadoop.fs.s3a.s3express.create.session": "true"
    "spark.hadoop.fs.s3a.s3express.session.duration": "900"
    "spark.hadoop.fs.s3a.s3express.session.cache.size": "1000"
    "spark.hadoop.fs.s3a.s3express.session.refresh.interval": "600"
```
### Request Optimization
```yaml
# Tune for maximum performance
spec:
  sparkConf:
    "spark.hadoop.fs.s3a.connection.maximum": "500"
    "spark.hadoop.fs.s3a.threads.max": "64"
    "spark.hadoop.fs.s3a.max.total.tasks": "100"
    "spark.hadoop.fs.s3a.multipart.uploads.enabled": "true"
    "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer"
```
## Cost Optimization
### Request Pattern Analysis
```bash
# Monitor request patterns (AllRequests requires a CloudWatch request
# metrics configuration on the bucket)
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name AllRequests \
  --dimensions Name=BucketName,Value=$S3_EXPRESS_BUCKET \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum

# Analyze daily cost by service
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d '1 day ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
  --granularity DAILY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```
### Data Lifecycle Management
```yaml
# Configure automated cleanup of objects older than 7 days
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3express-cleanup
spec:
  schedule: "0 2 * * *"   # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: amazon/aws-cli:latest
              env:
                - name: S3_EXPRESS_BUCKET
                  value: "spark-express-123--usw2-az1--x-s3"   # set to your bucket
              command:
                - /bin/sh
                - -c
                - |
                  # Compute the cutoff first: command substitution does not
                  # expand inside a single-quoted JMESPath expression.
                  CUTOFF=$(date -u -d "7 days ago" +%Y-%m-%d)
                  aws s3api list-objects-v2 --bucket "$S3_EXPRESS_BUCKET" \
                    --query "Contents[?LastModified<'${CUTOFF}'].Key" \
                    --output text | xargs -I {} aws s3api delete-object --bucket "$S3_EXPRESS_BUCKET" --key {}
```
## Related Examples
- Mountpoint S3 - Standard S3 with Mountpoint
- S3 Tables - Iceberg integration with S3 Express
- Observability - Performance monitoring