
Mountpoint-S3 for Spark Workloads

When working with the SparkApplication Custom Resource Definition (CRD) managed by the SparkOperator, handling multiple dependency JAR files can become a significant challenge. Traditionally, these JAR files are bundled within the container image, leading to several inefficiencies:

  • Increased Build Time: Downloading and adding JAR files during the build process significantly inflates the build time of the container image.
  • Larger Image Size: Including JAR files directly in the container image increases its size, resulting in longer download times when pulling the image to execute jobs.
  • Frequent Rebuilds: Any updates or additions to the dependency JAR files necessitate rebuilding and redeploying the container image, further increasing operational overhead.

Mountpoint for Amazon S3 offers an effective solution to these challenges. As an open-source file client, Mountpoint-S3 allows you to mount an S3 bucket on your compute instance, making it accessible as a local virtual file system. It automatically translates local file system API calls into REST API calls on S3 objects, providing seamless integration with Spark jobs.

What is Mountpoint-S3?

Mountpoint-S3 is an open-source file client developed by AWS that translates file operations into S3 API calls, enabling your applications to interact with Amazon S3 buckets as if they were local disks. Mountpoint for Amazon S3 is optimized for applications that need high read throughput to large objects, potentially from many clients at once, and to write new objects sequentially from a single client at a time. It offers significant performance gains compared to traditional S3 access methods, making it ideal for data-intensive workloads or AI/ML training.

Mountpoint for Amazon S3 is optimized for high-throughput performance, largely due to its foundation on the AWS Common Runtime (CRT) library. The CRT library is a collection of libraries and modules designed to deliver high performance and low resource usage, specifically tailored for AWS services. Key features of the CRT library that enable high-throughput performance include:

  • Efficient I/O Management: The CRT library is optimized for non-blocking I/O operations, reducing latency and maximizing the utilization of network bandwidth.
  • Lightweight and Modular Design: The library is designed to be lightweight, with minimal overhead, allowing it to perform efficiently even under high load. Its modular architecture ensures that only the necessary components are loaded, further enhancing performance.
  • Advanced Memory Management: CRT employs advanced memory management techniques to minimize memory usage and reduce garbage collection overhead, leading to faster data processing and reduced latency.
  • Optimized Network Protocols: The CRT library includes optimized implementations of network protocols, such as HTTP/2, that are specifically tuned for AWS environments. These optimizations ensure rapid data transfer between S3 and your compute instances, which is critical for large-scale Spark workloads.

Using Mountpoint-S3 with EKS

For Spark workloads, we'll focus specifically on loading external JARs stored in S3 for Spark applications. We'll examine two primary deployment strategies for Mountpoint-S3:

  1. Leveraging the EKS Managed Addon CSI driver with Persistent Volumes (PV) and Persistent Volume Claims (PVC)
  2. Deploying Mountpoint-S3 at the Node level using either USERDATA scripts or DaemonSets.

The first approach is considered mounting at the Pod level because the PV it creates is made available to individual Pods. The second approach is considered mounting at the Node level because the S3 bucket is mounted on the host Node itself. Each approach is discussed in detail below, highlighting its respective strengths and considerations to help you determine the most effective solution for your specific use case.

| Metric | Pod Level | Node Level |
| --- | --- | --- |
| Access Control | Provides fine-grained access control through service roles and RBAC, limiting PVC access to specific Pods. This is not possible with host-level mounts, where the mounted S3 bucket is accessible to all Pods on the Node. | Simplifies configuration but lacks the granular control offered by Pod-level mounting. |
| Scalability and Overhead | Involves managing individual PVCs, which can increase overhead in large-scale environments. | Reduces configuration complexity but provides less isolation between Pods. |
| Performance Considerations | Offers predictable and isolated performance for individual Pods. | May lead to contention if multiple Pods on the same Node access the same S3 bucket. |
| Flexibility and Use Cases | Best suited for use cases where different Pods require access to different datasets or where strict security and compliance controls are necessary. | Ideal for environments where all Pods on a Node can share the same dataset, such as when running batch processing jobs or Spark jobs that require common dependencies. |

Resource Allocation

Before implementing the Mountpoint-S3 solutions described here, AWS cloud resources need to be allocated. To do so, deploy the Terraform stack by following the instructions below. After allocating the resources and setting up the EKS environment, you can explore the two approaches to utilizing Mountpoint-S3 in detail.

Deploy Solution Resources


Approach 1: Deploy Mountpoint-S3 on EKS at Pod level

Deploying Mountpoint-S3 at the Pod level involves using the EKS Managed Addon CSI driver with Persistent Volumes (PV) and Persistent Volume Claims (PVC) to mount an S3 bucket directly within a Pod. This method allows for fine-grained control over which Pods can access specific S3 buckets, ensuring that only the necessary workloads have access to the required data.

Once Mountpoint-S3 is enabled and the PV is created, the S3 bucket becomes a cluster-level resource, allowing any Pod to request access by creating a PVC that references the PV. To achieve fine-grained control over which Pods can access specific PVCs, you can use service roles within namespaces. By assigning specific service accounts to Pods and defining Role-Based Access Control (RBAC) policies, you can limit which Pods can bind to certain PVCs. This ensures that only authorized Pods can mount the S3 bucket, providing tighter security and access control compared to a host-level mount, where the hostPath is accessible to all Pods on the Node.

This approach can also be simplified by using the EKS managed add-on for the Mountpoint-S3 CSI driver. However, the managed add-on does not support taints/tolerations and therefore cannot be used with GPUs. Additionally, because the Pods do not share the mount, they do not share the cache, which leads to more S3 API calls.

For more information on how to deploy this approach, refer to the deployment instructions.
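To make the mechanics concrete, below is a minimal, hedged sketch of a statically provisioned PV and PVC for the Mountpoint for Amazon S3 CSI driver. The bucket name, resource names, and mount options are illustrative placeholders; refer to the deployment instructions for the exact manifests used in this solution.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-jars-pv
spec:
  capacity:
    storage: 1200Gi                 # ignored by the S3 CSI driver, but required by Kubernetes
  accessModes:
    - ReadWriteMany
  mountOptions:
    - allow-other                   # example Mountpoint option passed through to mount-s3
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-jars-volume    # must be unique across PVs
    volumeAttributes:
      bucketName: <S3_BUCKET_NAME>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-jars-pvc
  namespace: spark-team-a
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""              # empty value is required for static provisioning
  volumeName: s3-jars-pv
  resources:
    requests:
      storage: 1200Gi               # ignored, but required

The Spark driver and executor Pods would then reference s3-jars-pvc through a persistentVolumeClaim volume and mount it at a path such as /mnt/s3, treating the bucket contents as read-only dependency JARs.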

Approach 2: Deploy Mountpoint-S3 on EKS at Node level

Mounting an S3 bucket at the Node level can streamline the management of dependency JAR files for SparkApplications by reducing build times and speeding up deployment. It can be implemented using either USERDATA or a DaemonSet. USERDATA is the preferred method for implementing Mountpoint-S3; however, if you have static Nodes in your EKS cluster that you cannot bring down, the DaemonSet approach provides an alternative. Make sure you understand all of the security mechanisms that need to be enabled to utilize the DaemonSet approach before implementing it.

Approach 2.1: Using USERDATA

This approach is recommended for new clusters, or for clusters where auto-scaling is customized for specific workloads, because the user-data script runs when a Node is initialized. Using the script below, each Node in the EKS cluster that hosts the Pods is configured to mount the S3 bucket upon initialization. The script downloads, installs, and runs the Mountpoint-S3 package. A few arguments are set for this application, described below, and can be altered depending on the use case. More information about these and other arguments can be found here.

  • metadata-ttl: set to indefinite because the JAR files are meant to be read-only and will not change.
  • allow-other: set so that the Node can access the mounted volume when connecting via SSM.
  • cache: set to enable caching and limit the S3 API calls that need to be made by serving consecutive re-reads from the cache.
note

These same arguments can also be used in the DaemonSet approach. In addition to the arguments set in this example, there are a number of other options for additional logging and debugging.

When autoscaling with Karpenter, this method allows for more flexibility and performance. For example, when configuring Karpenter in the Terraform code, the user data for different types of Nodes can be unique, mounting different buckets depending on the workload. When Pods that need a certain set of dependencies are scheduled, taints and tolerations allow Karpenter to allocate the specific instance type with the matching user data, ensuring the correct bucket with the dependent files is mounted on the Node so that the Pods can access it. Note that the user-data script also depends on the OS the newly allocated Node is configured with. A hedged sketch of such a Karpenter configuration is shown after the USERDATA script below.

USERDATA script:

#!/bin/bash
yum update -y
yum install -y wget
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.rpm
yum install -y ./mount-s3.rpm
mkdir -p /mnt/s3
/opt/aws/mountpoint-s3/bin/mount-s3 --metadata-ttl indefinite --allow-other --cache /tmp <S3_BUCKET_NAME> /mnt/s3
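As referenced above, here is a hedged sketch of how a Karpenter configuration could pair a workload-specific user-data script with a taint, assuming Karpenter's v1beta1 API; the names, selectors, and IAM role are placeholders, and the exact schema depends on the Karpenter version deployed by the Terraform stack.

apiVersion: karpenter.k8s.io/v1beta1
kind: EC2NodeClass
metadata:
  name: spark-s3-deps
spec:
  amiFamily: AL2
  role: <KARPENTER_NODE_ROLE>                 # placeholder: the Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <CLUSTER_NAME>
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: <CLUSTER_NAME>
  userData: |
    #!/bin/bash
    # Abbreviated: install Mountpoint-S3 as in the USERDATA script above, then mount
    # the bucket holding this workload's dependencies.
    /opt/aws/mountpoint-s3/bin/mount-s3 --metadata-ttl indefinite --allow-other --cache /tmp <S3_BUCKET_NAME> /mnt/s3
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spark-s3-deps
spec:
  template:
    spec:
      nodeClassRef:
        name: spark-s3-deps
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: spark-s3-deps                  # hypothetical taint; Spark Pods must tolerate it
          value: "true"
          effect: NoSchedule

Spark Pods that need these dependencies would then carry a matching toleration (and typically a node selector), so Karpenter provisions Nodes from this NodePool with the correct bucket already mounted.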

Approach 2.2: Using DaemonSet

This approach is recommended for existing clusters. It is made up of two resources: a ConfigMap containing a script that installs and maintains the Mountpoint-S3 package on the Node, and a DaemonSet that runs a Pod on every Node in the cluster to execute that script on the Node.

The ConfigMap script runs a loop that checks the mount point every 60 seconds and remounts it if there are any issues. Several environment variables can be altered for the mount location, cache location, S3 bucket name, log file location, the URL of the package to install, and the location of the installed package. These variables can be left at their defaults; only the S3 bucket name is required to run.

The DaemonSet Pods copy the script onto the Node, alter its permissions to allow execution, and then run it. Each Pod installs util-linux in order to have access to nsenter, which allows the Pod to execute the script in the Node's namespaces so that the S3 bucket is mounted onto the Node directly by the Pod.

danger

The DaemonSet Pod requires the securityContext to be privileged, as well as hostPID, hostIPC, and hostNetwork to be set to true. Review the points below to understand why these settings are required for this solution and what their security implications are.

  1. securityContext: privileged
    • Purpose: privileged mode gives the container full access to all host resources, similar to root access on the host.
    • To install software packages, configure the system, and mount the S3 bucket onto the host, your container will likely need elevated permissions. Without privileged mode, the container might not have sufficient permissions to perform these actions on the host filesystem and network interfaces.
  2. hostPID
    • Purpose: nsenter allows you to enter various namespaces, including the PID namespace of the host.
    • When using nsenter to enter the host’s PID namespace, the container needs access to the host’s PID namespace. Thus, enabling hostPID: true is necessary to interact with processes on the host, which is crucial for operations like installing packages or running commands that require host-level process visibility like mountpoint-s3.
  3. hostIPC
    • Purpose: hostIPC enables your container to share the host’s inter-process communication namespace, which includes shared memory.
    • If the nsenter commands or the script being run involve shared memory or other IPC mechanisms on the host, hostIPC: true is necessary. While it’s less common than hostPID, it’s often enabled alongside it when nsenter is involved, especially if the script needs to interact with host processes that rely on IPC.
  4. hostNetwork
    • Purpose: hostNetwork allows the container to use the host’s network namespace, giving the container access to the host’s IP address and network interfaces.
    • During the installation process, the script will likely need to download packages from the internet (e.g., from repositories hosting the mountpoint-s3 package). By enabling hostNetwork with hostNetwork: true, you ensure that the download processes have direct access to the host’s network interface, avoiding issues with network isolation.
warning

This sample code uses the spark-team-a namespace to run the job and host the DaemonSet. This is primarily because the Terraform stack already sets up IRSA for this namespace and allows the service account to access any S3 bucket. When using this in production, make sure to create your own separate namespace, service account, and IAM role that follows least-privilege permissions and IAM role best practices.
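As a point of reference only, here is a minimal sketch of what a dedicated namespace and an IRSA-annotated service account could look like; every name and the role ARN are placeholders, and the IAM role itself must be created separately and scoped to only the buckets the job needs.

apiVersion: v1
kind: Namespace
metadata:
  name: spark-jobs                  # hypothetical dedicated namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-job-sa                # hypothetical service account for the Spark job
  namespace: spark-jobs
  annotations:
    # IRSA: associate the service account with a least-privilege IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<LEAST_PRIVILEGE_ROLE>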

The ConfigMap and DaemonSet manifests used for this approach are shown below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: s3-mount-script
  namespace: spark-team-a
data:
  monitor_s3_mount.sh: |
    #!/bin/bash

    set -e # Exit immediately if a command exits with a non-zero status

    # ENVIRONMENT VARIABLES
    LOG_FILE="/var/log/s3-mount.log"
    S3_BUCKET_NAME="<S3_BUCKET_NAME>" # Replace with your S3 Bucket Name before applying to EKS cluster
    MOUNT_POINT="/mnt/s3"
    CACHE_DIR="/tmp"
    MOUNT_S3_BIN="/usr/bin/mount-s3"
    MOUNT_S3_URL="https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.rpm"

    # Function to install mount-s3
    install_mount_s3() {
      echo "$(date): Installing mount-s3" | tee -a $LOG_FILE
      yum update -y | tee -a $LOG_FILE
      yum install -y wget util-linux | tee -a $LOG_FILE
      wget $MOUNT_S3_URL -O /tmp/mount-s3.rpm | tee -a $LOG_FILE
      yum install -y /tmp/mount-s3.rpm | tee -a $LOG_FILE
    }

    # Function to mount S3 bucket
    mount_s3_bucket() {
      echo "$(date): Mounting S3 bucket: $S3_BUCKET_NAME to $MOUNT_POINT" | tee -a $LOG_FILE
      $MOUNT_S3_BIN --metadata-ttl indefinite --allow-other --cache $CACHE_DIR $S3_BUCKET_NAME $MOUNT_POINT | tee -a $LOG_FILE
      if [ $? -ne 0 ]; then
        echo "$(date): Failed to mount S3 bucket: $S3_BUCKET_NAME" | tee -a $LOG_FILE
        exit 1
      fi
    }

    # Ensure the mount point directory exists
    ensure_mount_point() {
      if [ ! -d $MOUNT_POINT ]; then
        echo "$(date): Creating mount point directory: $MOUNT_POINT" | tee -a $LOG_FILE
        mkdir -p $MOUNT_POINT
      fi
    }

    # Install mount-s3
    install_mount_s3

    # Continuous monitoring and remounting loop
    while true; do
      echo "$(date): Checking if S3 bucket is mounted" | tee -a $LOG_FILE
      ensure_mount_point
      if mount | grep $MOUNT_POINT > /dev/null; then
        echo "$(date): S3 bucket is already mounted" | tee -a $LOG_FILE
        if ! ls $MOUNT_POINT > /dev/null 2>&1; then
          echo "$(date): Transport endpoint is not connected, remounting S3 bucket" | tee -a $LOG_FILE
          fusermount -u $MOUNT_POINT || echo "$(date): Failed to unmount S3 bucket" | tee -a $LOG_FILE
          rm -rf $MOUNT_POINT || echo "$(date): Failed to remove mount point directory" | tee -a $LOG_FILE
          ensure_mount_point
          mount_s3_bucket
        fi
      else
        echo "$(date): S3 bucket is not mounted, mounting now" | tee -a $LOG_FILE
        mount_s3_bucket
      fi
      sleep 60 # Check every 60 seconds
    done
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: s3-mount-daemonset
  namespace: spark-team-a
spec:
  selector:
    matchLabels:
      name: s3-mount-daemonset
  template:
    metadata:
      labels:
        name: s3-mount-daemonset
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      volumes:
        - name: script
          configMap:
            name: s3-mount-script
        - name: host-root
          hostPath:
            path: /
            type: Directory
      restartPolicy: Always
      containers:
        - name: s3-mount
          image: amazonlinux:2
          volumeMounts:
            - name: script
              mountPath: /config
            - name: host-root
              mountPath: /host
              mountPropagation: Bidirectional
          securityContext:
            privileged: true
          command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "Starting s3-mount"
              yum install -y util-linux
              echo "Copying script to /usr/bin"
              cp /config/monitor_s3_mount.sh /host/usr/bin/monitor_s3_mount.sh
              chmod +x /host/usr/bin/monitor_s3_mount.sh
              echo "Verifying the copied script"
              ls -lha /host/usr/bin/monitor_s3_mount.sh
              echo "Running the script in Host space"
              nsenter --target 1 --mount --uts --ipc --net --pid ./usr/bin/monitor_s3_mount.sh
              echo "Done"

Executing Spark Job

Here are the steps to test the scenario using Approach 2 with the DaemonSet:

  1. Deploy Spark Operator Resources
  2. Prepare the S3 Bucket
    1. cd ${DOEKS_HOME}/analytics/terraform/spark-k8s-operator/examples/mountpoint-s3-spark/
    2. chmod +x copy-jars-to-s3.sh
    3. ./copy-jars-to-s3.sh
  3. Set-up Kubeconfig
    1. aws eks update-kubeconfig --name spark-operator-doeks
  4. Apply DaemonSet
    1. kubectl apply -f mountpoint-s3-daemonset.yaml
  5. Apply Spark Job sample (a hedged sketch of how such a job can reference the mounted JARs is shown after these steps)
    1. kubectl apply -f mountpoint-s3-spark-job.yaml
  6. View Job Running
    • There are several resources whose logs we can view while this SparkApplication CRD is running. View each set of logs in a separate terminal so you can follow all of them simultaneously.
      1. spark operator
        1. kubectl -n spark-operator get pods
        2. copy the name of the spark operator pod
        3. kubectl -n spark-operator logs -f <POD_NAME>
      2. spark-team-a Pods
        1. To get the logs for the driver and exec Pods of the SparkApplication, we first need to verify that the Pods are running. With -o wide we can see the Node each Pod is running on, and with -w we can watch status updates for each Pod.
        2. kubectl -n spark-team-a get pods -o wide -w
      3. driver Pod
        1. Once the driver Pod is in the running state, which will be visible in the previous terminal, we can get the logs for the driver Pod
        2. kubectl -n spark-team-a logs -f taxi-trip
      4. exec Pod
        1. Once the exec Pod is in the running state, which will be visible in the previous terminal, we can get the logs for the exec Pod. Make sure that taxi-trip-exec-1 is running before getting the logs; otherwise, use another exec Pod that is in the running state.
          1. kubectl -n spark-team-a logs -f taxi-trip-exec-1
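For orientation, the following is a minimal, illustrative sketch of how a SparkApplication like the one applied in step 5 can reference the JARs from the Node-level mount. The actual mountpoint-s3-spark-job.yaml in the repository may differ; the image, main class, application file, and service account shown here are placeholders.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: taxi-trip
  namespace: spark-team-a
spec:
  type: Scala
  mode: cluster
  image: <SPARK_IMAGE>                          # placeholder: your Spark image
  mainClass: <MAIN_CLASS>                       # placeholder: the job's main class
  mainApplicationFile: <MAIN_APPLICATION_JAR>   # placeholder: the application JAR
  sparkVersion: "3.3.1"
  deps:
    jars:
      # Dependency JARs resolved from the S3 bucket mounted on the Node at /mnt/s3
      - file:///mnt/s3/jars/hadoop-aws-3.3.1.jar
      - file:///mnt/s3/jars/aws-java-sdk-bundle-1.12.647.jar
  volumes:
    - name: s3-jars
      hostPath:
        path: /mnt/s3
        type: Directory
  driver:
    serviceAccount: spark-team-a                # assumption: the IRSA-enabled service account
    volumeMounts:
      - name: s3-jars
        mountPath: /mnt/s3
        readOnly: true
  executor:
    instances: 2
    volumeMounts:
      - name: s3-jars
        mountPath: /mnt/s3
        readOnly: true

Because the Node already has the bucket mounted at /mnt/s3, the driver and executors read the JARs through the hostPath mount instead of each downloading them from S3 individually.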

Verification

Once the job has finished running, you can see in the exec logs that the files were copied from the local Mountpoint-S3 location on the Node to the Spark Pod in order to do the processing:

24/08/13 00:08:46 INFO Utils: Copying /mnt/s3/jars/hadoop-aws-3.3.1.jar to /var/data/spark-5eae56b3-3999-4c2f-8004-afc46d1c82ba/spark-a433e7ce-db5d-4fd5-b344-abf751f43bd3/-14716855631723507720806_cache
24/08/13 00:08:46 INFO Utils: Copying /var/data/spark-5eae56b3-3999-4c2f-8004-afc46d1c82ba/spark-a433e7ce-db5d-4fd5-b344-abf751f43bd3/-14716855631723507720806_cache to /opt/spark/work-dir/./hadoop-aws-3.3.1.jar
24/08/13 00:08:46 INFO Executor: Adding file:/opt/spark/work-dir/./hadoop-aws-3.3.1.jar to class loader
24/08/13 00:08:46 INFO Executor: Fetching file:/mnt/s3/jars/aws-java-sdk-bundle-1.12.647.jar with timestamp 1723507720806
24/08/13 00:08:46 INFO Utils: Copying /mnt/s3/jars/aws-java-sdk-bundle-1.12.647.jar to /var/data/spark-5eae56b3-3999-4c2f-8004-afc46d1c82ba/spark-a433e7ce-db5d-4fd5-b344-abf751f43bd3/14156613201723507720806_cache
24/08/13 00:08:47 INFO Utils: Copying /var/data/spark-5eae56b3-3999-4c2f-8004-afc46d1c82ba/spark-a433e7ce-db5d-4fd5-b344-abf751f43bd3/14156613201723507720806_cache to /opt/spark/work-dir/./aws-java-sdk-bundle-1.12.647.jar

Additionally, when viewing the status of the spark-team-a Pods, you will notice that another Node comes online. This Node is optimized to run the SparkApplication, and as soon as it comes online the DaemonSet Pod will also spin up and start running on the new Node, so that any Pods run on that new Node will also have access to the S3 bucket. Using AWS Systems Manager Session Manager (SSM), you can connect to any of the Nodes and verify that the mountpoint-s3 package has been downloaded and installed by running:

  • mount-s3 --version

The largest advantage of using Mountpoint-S3 at the Node level for multiple Pods is that the data can be cached, allowing other Pods to access the same data without having to make their own API calls. Once the karpenter-spark-compute-optimized Node is allocated, you can use Session Manager (SSM) to connect to the Node and verify that the files are cached on the Node when the job is run and the volume is mounted. You can see the cache at:

  • sudo ls /tmp/mountpoint-cache/

Conclusion

By leveraging the CRT library, Mountpoint for Amazon S3 can deliver the high throughput and low latency needed to efficiently manage and access large volumes of data stored in S3. This allows dependency JAR files to be stored and managed externally from the container image, decoupling them from the Spark jobs. Additionally, storing JARs in S3 enables multiple Pods to consume them, leading to cost savings as S3 provides a cost-effective storage solution compared to larger container images. S3 also offers virtually unlimited storage, making it easy to scale and manage dependencies.

Mountpoint-S3 offers a versatile and powerful way to integrate S3 storage with EKS for data and AI/ML workloads. Whether you choose to deploy it at the Pod level using PVs and PVCs, or at the Node level using USERDATA or DaemonSets, each approach has its own set of advantages and trade-offs. By understanding these options, you can make informed decisions to optimize your data and AI/ML workflows on EKS.