FSx for Lustre

Use Amazon FSx as persistent storage with Kubeflow on AWS

This guide describes how to use Amazon FSx as Persistent storage on top of an existing Kubeflow deployment.

Note: For Terraform deployment users, some steps that should be skipped will have a note indicating such below.

1.0 Prerequisites

For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The FSx CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the deployment options and general prerequisites for more information.

Check that you have the necessary prerequisites.

Important: You must make sure you have an OIDC provider for your cluster and that it was added from eksctl >= 0.56 or if you already have an OIDC provider in place, then you must make sure you have the tag alpha.eksctl.io/cluster-name with the cluster name as its value. If you don’t have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags.

At this point, you have likely cloned the necessary repository and checked out the right branch. Save this path to help us navigate to different paths in the rest of this guide.

export GITHUB_ROOT=$(pwd)
export GITHUB_STORAGE_DIR="$GITHUB_ROOT/deployments/add-ons/storage/"

Make sure the following environment variables are set.

export CLUSTER_NAME=<clustername>
export CLUSTER_REGION=<clusterregion>

Based on your setup, export the name of the user namespace you are planning to use.

export PVC_NAMESPACE=kubeflow-user-example-com

Choose a name for the FSx claim that we will create. In this guide, we will use this variable as the name for the PV as well the PVC.

export CLAIM_NAME=<fsx-claim>

2.0 Setup FSx for Lustre

Setup for Manifest deployments

You can either use Automated or Manual setup. We currently only support Static provisioning for FSx.

Setup for Terraform deployments

Follow the Manual setup. We currently only support Static provisioning for FSx.

2.1 [Option 1] Automated setup

Important: Terraform deployment users should not follow these Automated setup instructions and should follow the Manual setup instructions.

The script automates all the manual resource creation steps but is currently only available for Static Provisioning option.
It performs the required cluster configuration, creates an FSx file system and it also takes care of creating a storage class for static provisioning. Once done, move to section 3.0.

Run the following commands from the tests/e2e directory:

cd $GITHUB_ROOT/tests/e2e

Install the script dependencies

pip install -r requirements.txt

Run the automated script as follows.

Note: If you want the script to create a new security group for FSx, specify a name for SECURITY_GROUP_TO_CREATE. On the other hand, if you want to use an existing Security group, you can specify that name too.

export SECURITY_GROUP_TO_CREATE=$CLAIM_NAME

python utils/auto-fsx-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --fsx_file_system_name $CLAIM_NAME --fsx_security_group_name $SECURITY_GROUP_TO_CREATE

The script above takes care of creating the PersistentVolume (PV) which is a cluster scoped resource. In order to create the PersistentVolumeClaim (PVC) you can either use the yaml file provided in this directory or use the Kubeflow UI directly but the PVC needs to be in the user namespace you will be accessing it from.

yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml
yq e '.spec.volumeName = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml

kubectl apply -f $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml

Advanced customization

The script applies some default values for the file system name, performance mode etc. If you know what you are doing, you can see which options are customizable by executing python utils/auto-fsx-setup.py --help.

2.2 [Option 2] Manual setup

If you prefer to manually setup each component then you can follow this manual guide.

1. Driver install and IAM configuration

Important: Skip this step if you are using a Terraform deployment since EFS CSI driver is installed by default unless you set enable_aws_fsx_csi_driver = false.

1. Install the FSx CSI Driver

We recommend installing the FSx CSI Driver v0.9.0 directly from the the aws-fsx-csi-driver GitHub repository as follows:

kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=tags/v0.9.0"

You can confirm that FSx CSI Driver was installed using the following command:

kubectl get csidriver -A

NAME              ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
fsx.csi.aws.com   false            false            Persistent   14s

2. Create the IAM Policy for the CSI Driver

The CSI driver’s service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account fsx-csi-controller-sa with an IAM Role which has the required permissions.

Create the policy using the json file provided as follows:

aws iam create-policy \
    --policy-name Amazon_FSx_Lustre_CSI_Driver \
    --policy-document file://fsx-for-lustre/fsx-csi-driver-policy.json

Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows:

eksctl create iamserviceaccount \
    --name fsx-csi-controller-sa \
    --namespace kube-system \
    --cluster $CLUSTER_NAME \
    --attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/Amazon_FSx_Lustre_CSI_Driver \
    --region $CLUSTER_REGION \
    --approve \
    --override-existing-serviceaccounts

You can verify by describing the specified service account to check if it has been correctly annotated:

kubectl describe -n kube-system serviceaccount fsx-csi-controller-sa

2. Create an instance of the FSx Filesystem

Please refer to the official AWS FSx CSI documentation for detailed instructions on creating an FSx filesystem.

Note: For this guide, we assume that you are creating your FSx Filesystem in the same VPC as your EKS Cluster.

3. Static provisioning

Using this sample from official Kubeflow Docs

Use the AWS Console to get the filesystem id of the FSx volume you want to use. You could also use the following command to list all the volumes available in your region. Either way, make sure that file_system_id is set.

aws fsx describe-file-systems --query "FileSystems[*].FileSystemId" --output text --region $CLUSTER_REGION

export file_system_id=<fsx-id-to-use>

Once you have the filesystem id, Use the following command to retrieve DNSName, and MountName values.

export dns_name=$(aws fsx describe-file-systems --file-system-ids $file_system_id --query "FileSystems[0].DNSName" --output text --region $CLUSTER_REGION)

export mount_name=$(aws fsx describe-file-systems --file-system-ids $file_system_id --query "FileSystems[0].LustreConfiguration.MountName" --output text --region $CLUSTER_REGION)

Now edit the fsx-for-lustre/static-provisioning/pv.yaml to replace <file_system_id>, <dns_name>, and <mount_name> with your values.

yq e '.spec.csi.volumeHandle = env(file_system_id)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pv.yaml
yq e '.spec.csi.volumeAttributes.dnsname = env(dns_name)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pv.yaml
yq e '.spec.csi.volumeAttributes.mountname = env(mount_name)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pv.yaml

The PersistentVolume is a cluster scoped resource but the PersistentVolumeClaim needs to be in the namespace you will be accessing it from. Replace the kubeflow-user-example-com namespace specified the below with the namespace for your kubeflow user and edit the fsx-for-lustre/static-provisioning/pvc.yaml file accordingly.

yq e '.spec.volumeName = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml

Now create the required PersistentVolume and PersistentVolumeClaim resources as -

kubectl apply -f $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pv.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/fsx-for-lustre/static-provisioning/pvc.yaml

2.3 Check your setup

Use the following commands to ensure all resources have been deployed as expected and the PersistentVolume is correctly bound to the PersistentVolumeClaim

kubectl get pv

NAME    CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                 STORAGECLASS   REASON   AGE
fsx-pv  1200Gi     RWX            Recycle          Bound    kubeflow-user-example-com/fsx-claim                           11s

kubectl get pvc -n $PVC_NAMESPACE

NAME        STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fsx-claim   Bound    fsx-pv   1200Gi     RWX                           83s

3.0 Using FSx storage in Kubeflow

In the following two sections we will be using this PVC to create a notebook server with Amazon FSx mounted as the workspace volume, download training data into this filesystem and then deploy a TFJob to train a model using this data.

3.1 Connect to the Kubeflow dashboard

Once you have everything setup, Port Forward as needed and Login to the Kubeflow dashboard. At this point, you can also check the Volumes tab in Kubeflow and you should be able to see your PVC is available for use within Kubeflow. For more details on how to access your Kubeflow dashboard, refer to one of the deployment option guides based on your setup. If you used the vanilla deployment, see Connect to your Kubeflow cluster.

3.2 Note about permissions

This step may not be necessary but you might need to specify some additional directory permissions on your worker node before you can use these as mount points. By default, new Amazon FSx file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. If your containers are not running as root, you must change the Amazon FSx file system permissions to allow other users to modify the file system. The set-permission-job.yaml is an example of how you could set these permissions to be able to use the fsx as your workspace in your kubeflow notebook. Modify it accordingly if you run into similar permission issues with any other job pod.

yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.spec.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml

kubectl apply -f $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml

3.2 Using FSx volume as workspace or data volume for a notebook server

Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. For our example here, we are using the AWS-optimized Tensorflow 2.6 CPU image provided in the Notebook configuration options (public.ecr.aws/kubeflow-on-aws/notebook-servers/jupyter-tensorflow). Additionally, use the existing PVC as the workspace volume at the default mount point /home/jovyan. The server might take a few minutes to come up.

In case the server does not start up in the expected time, do make sure to check:

The Notebook Controller Logs
The specific notebook server instance pod’s logs

3.3 Using FSx volume for a TrainingJob using TFJob Operator

The following section re-uses the PVC and the Tensorflow Kubeflow Notebook created in the previous steps to download a dataset to the FSx Volume. Then we spin up a TFjob which runs a image classification job using the data from the shared volume. Source: https://www.tensorflow.org/tutorials/load_data/images

Note: The following steps are run from the terminal on your gateway node connected to your EKS cluster and not from the Kubeflow Notebook to test the PVC allowed sharing of data as expected.

1. Download the dataset to the FSx Volume

In the Kubeflow Notebook created above, use the following snippet to download the data into the /home/jovyan/.keras directory (which is mounted onto the FSx Volume).

import pathlib
import tensorflow as tf
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

2. Build and push the Docker image

In the training-sample directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the $IMAGE_URI to your registry and specify an appropriate tag:

export IMAGE_URI=<dockerimage:tag>
cd training-sample

# You will need to login to ECR for the following steps
docker build -t $IMAGE_URI .
docker push $IMAGE_URI
cd -

3. Configure the TFjob spec file

Once the docker image is built, replace the <dockerimage:tag> in the tfjob.yaml file, line #17.

yq e '.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = env(IMAGE_URI)' -i training-sample/tfjob.yaml

Also, specify the name of the PVC you created:

export CLAIM_NAME=fsx-claim
yq e '.spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i training-sample/tfjob.yaml

Make sure to run it in the same namespace as the claim:

yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i training-sample/tfjob.yaml

4. Create the TFjob and use the provided commands to check the training logs

At this point, we are ready to train the model using the training-sample/training.py script and the data available on the shared volume with the Kubeflow TFJob operator.

kubectl apply -f training-sample/tfjob.yaml

In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs.

kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE
kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f

4.0 Cleanup

This section cleans up the resources created in this guide. To clean up additional deployment-specific resources, see the uninstall section in the deployment guide for the deployment option of your choice.

4.1 Clean up the TFJob

kubectl delete tfjob -n $PVC_NAMESPACE image-classification-pvc

4.2 Delete the Kubeflow Notebook

Log in to the dashboard to stop and/or terminate any Kubeflow Notebooks that you created for this session or use the following commands:

kubectl delete notebook -n $PVC_NAMESPACE <notebook-name>

kubectl delete pod -n $PVC_NAMESPACE $CLAIM_NAME

4.3 Delete PVC, PV, and SC in the following order

kubectl delete pvc -n $PVC_NAMESPACE $CLAIM_NAME
kubectl delete pv fsx-pv

4.4 Delete the FSx filesystem

aws fsx delete-file-system --file-system-id $file_system_id

Make sure to delete any other resources that you have created such as security groups via the AWS Console or using the AWS CLI.

5.0 Known issues

When you re-run the eksctl create iamserviceaccount to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In this case, you may need to do one or both of the following: 1. Delete the CloudFormation stack associated with this add-on role. 2. Delete the fsx-csi-controller-sa service account and then re-run the required steps. If you used the auto-script, you can re-run it by specifying the same filesystem-name so that a new one is not created.

When using an FSx volume in a Kubeflow Notebook, the same PVC claim can be mounted to the same Notebook only once as either the workspace volume or the data volume. Create two seperate PVCs on your FSx volume if you need to attach it twice to the Notebook.

Last modified April 24, 2023: Minor validation and docs update (#695) (bba1bc91)