EFS
This guide describes how to use Amazon EFS as Persistent storage on top of an existing Kubeflow deployment.
Note: For Terraform deployment users, some steps that should be skipped will have a note indicating such below.
1.0 Prerequisites
For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The FSx CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the deployment options and general prerequisites for more information.
- Check that you have the necessary prerequisites.
Important: You must make sure you have an OIDC provider for your cluster and that it was added from
eksctl
>=0.56
or if you already have an OIDC provider in place, then you must make sure you have the tagalpha.eksctl.io/cluster-name
with the cluster name as its value. If you don’t have the tag, you can add it via the AWS Console by navigating to IAM->Identity providers->Your OIDC->Tags.
- At this point, you have likely cloned the necessary repository and checked out the right branch. Save this path to help naviagte to different paths in the rest of this guide.
export GITHUB_ROOT=$(pwd)
export GITHUB_STORAGE_DIR="$GITHUB_ROOT/deployments/add-ons/storage/"
- Make sure that the following environment variables are set.
export CLUSTER_NAME=<clustername>
export CLUSTER_REGION=<clusterregion>
- Based on your setup, export the name of the user namespace you are planning to use.
export PVC_NAMESPACE=kubeflow-user-example-com
- Choose a name for the EFS claim that we will create. In this guide we will use this variable as the name for the PV as well the PVC.
export CLAIM_NAME=<efs-claim>
2.0 Set up EFS
You can either use Automated or Manual setup to set up the resources required. If you choose the manual route, you get another choice between static and dynamic provisioning, so pick whichever suits you. On the other hand, for the automated script we currently only support dynamic provisioning. Whichever combination you pick, be sure to continue picking the appropriate sections through the rest of this guide.
2.1 [Option 1] Automated setup
The script automates all the manual resource creation steps but is currently only available for Dynamic Provisioning option.
It performs the required cluster configuration, creates an EFS file system and it also takes care of creating a storage class for dynamic provisioning. Once done, move to section 3.0.
- Run the following commands from the
tests/e2e
directory:
cd $GITHUB_ROOT/tests/e2e
- Install the script dependencies.
pip install -r requirements.txt
- Run the appropriate automated script command for your deployment type.
Note: If you want the script to create a new security group for EFS, specify a name here. On the other hand, if you want to use an existing Security group, you can specify that name too. We have used the same name as the claim we are going to create.
export SECURITY_GROUP_TO_CREATE=$CLAIM_NAME
python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --efs_file_system_name $CLAIM_NAME --efs_security_group_name $SECURITY_GROUP_TO_CREATE
export SECURITY_GROUP_TO_CREATE=$CLAIM_NAME
python utils/auto-efs-setup.py --region $CLUSTER_REGION --cluster $CLUSTER_NAME --efs_file_system_name $CLAIM_NAME --efs_security_group_name $SECURITY_GROUP_TO_CREATE --skip-driver-installation
- The script above takes care of creating the
StorageClass (SC)
which is a cluster scoped resource. In order to create thePersistentVolumeClaim (PVC)
you can either use the yaml file provided in this directory or use the Kubeflow UI directly. The PVC needs to be in the namespace you will be accessing it from. Replace thekubeflow-user-example-com
namespace specified the below with the namespace for your kubeflow user and edit theefs/dynamic-provisioning/pvc.yaml
file accordingly.
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
Advanced customization
The script applies some default values for the file system name, performance mode etc. If you know what you are doing, you can see which options are customizable by executing python utils/auto-efs-setup.py --help
.
2.2 [Option 2] Manual setup
If you prefer to manually setup each component then you can follow this manual guide. As mentioned, it you have two options between Static and Dynamic provisioing later in step 4 of this section.
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
1. Driver install and IAM configuration
Important: Skip this step if you are using a Terraform deployment since EFS CSI driver is installed by default unless you set
enable_aws_efs_csi_driver = false
.
1.1 Install the EFS CSI driver
We recommend installing the EFS CSI Driver v1.5.4 directly from the the aws-efs-csi-driver github repo as follows:
kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=tags/v1.5.4"
You can confirm that EFS CSI Driver was installed into the default kube-system namespace for you. You can check using the following command:
kubectl get csidriver
NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE
efs.csi.aws.com false false Persistent 5d17h
1.2. Create the IAM Policy for the CSI driver
The CSI driver’s service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account efs-csi-controller-sa
with an IAM Role which has the required permissions.
- Download the IAM policy document from GitHub as follows.
curl -o iam-policy-example.json https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/v1.5.4/docs/iam-policy-example.json
- Create the policy.
aws iam create-policy \
--policy-name AmazonEKS_EFS_CSI_Driver_Policy \
--policy-document file://iam-policy-example.json
- Create an IAM role and attach the IAM policy to it. Annotate the Kubernetes service account with the IAM role ARN and the IAM role with the Kubernetes service account name. You can create the role using eksctl as follows:
eksctl create iamserviceaccount \
--name efs-csi-controller-sa \
--namespace kube-system \
--cluster $CLUSTER_NAME \
--attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/AmazonEKS_EFS_CSI_Driver_Policy \
--approve \
--override-existing-serviceaccounts \
--region $CLUSTER_REGION
- You can verify by describing the specified service account to check if it has been correctly annotated:
kubectl describe -n kube-system serviceaccount efs-csi-controller-sa
2. Manually create an instance of the EFS filesystem
Please refer to the official AWS EFS CSI Document for detailed instructions on creating an EFS filesystem.
Note: For this guide, we assume that you are creating your EFS Filesystem in the same VPC as your EKS Cluster.
Choose between dynamic and static provisioning
In the following section, you have to choose between setting up dynamic provisioning or setting up static provisioning.
3. [Option 1] Dynamic provisioning
- Use the
$file_system_id
you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit thedynamic-provisioning/sc.yaml
file by chaning<YOUR_FILE_SYSTEM_ID>
with yourfs-xxxxxx
file system id. You can also change it using the following command :
file_system_id=$file_system_id yq e '.parameters.fileSystemId = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml
- Create the storage class using the following command :
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml
- Verify your setup by checking which storage classes are created for your cluster. You can use the following command
kubectl get sc
- The
StorageClass
is a cluster scoped resources but thePersistentVolumeClaim
needs to be in the namespace you will be accessing it from. Let’s edit the pvc.yaml accordingly
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml
Note : The StorageClass
is a cluster scoped resource which means we only need to do this step once per cluster.
3. [Option 2] Static Provisioning
Using this sample, we provided the required spec files in the sample subdirectory. However, you can create the PVC another way.
- Use the
$file_system_id
you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the last line of the static-provisioning/pv.yaml file to specify thevolumeHandle
field to point to your EFS filesystem. Replace$file_system_id
if it is not already set.
file_system_id=$file_system_id yq e '.spec.csi.volumeHandle = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
- The
PersistentVolume
andStorageClass
are cluster scoped resources but thePersistentVolumeClaim
needs to be in the namespace you will be accessing it from. Replace thekubeflow-user-example-com
namespace specified the below with the namespace for your kubeflow user and edit thestatic-provisioning/pvc.yaml
file accordingly.
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
- Now create the required persistentvolume, persistentvolumeclaim and storageclass resources as -
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/sc.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/efs/static-provisioning/pvc.yaml
2.3 Check your setup
Use the following commands to ensure all resources have been deployed as expected and the PersistentVolume is correctly bound to the PersistentVolumeClaim
# Only for Static Provisioning
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
efs-pv 5Gi RWX Retain Bound kubeflow-user-example-com/efs-claim efs-sc 5d16h
# Both Static and Dynamic Provisioning
kubectl get pvc -n $PVC_NAMESPACE
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
efs-claim Bound efs-pv 5Gi RWX efs-sc 5d16h
3.0 Using EFS storage in Kubeflow
In the following two sections we will be using this PVC to create a notebook server with Amazon EFS mounted as the workspace volume, download training data into this filesystem and then deploy a TFJob to train a model using this data.
3.1 Connect to the Kubeflow dashboard
Once you have everything setup, Port Forward as needed and Login to the Kubeflow dashboard. At this point, you can also check the Volumes
tab in Kubeflow and you should be able to see your PVC is available for use within Kubeflow.
For more details on how to access your Kubeflow dashboard, refer to one of the deployment option guides based on your setup. If you used the vanilla deployment, see Connect to your Kubeflow cluster.
3.2 Changing the default Storage Class
After installing Kubeflow, you can change the default Storage Class from gp2
to the efs storage class you created during the setup. For instance, if you followed the automatic or manual steps, you should have a storage class named efs-sc
. You can check your storage classes by running kubectl get sc
.
This is can be useful if your notebook configuration is set to use the default storage class (it is the case by default). By changing the default storage class, when creating workspace volumes for your notebooks, it will use your EFS storage class automatically. This is not mandatory as you can also manually create a PVC and select the efs-sc
class via the Volume UI but can facilitate the notebook creation process and automatically select this class when creating volume in the UI. You can also decide to keep using gp2
for workspace volumes and keep the EFS storage class for datasets/data volumes only.
To learn more about how to change the default Storage Class, you can refer to the official Kubernetes documentation.
For instance, if you have a default class set to gp2
and another class efs-sc
, then you would need to do the following :
- Remove
gp2
as your default storage class
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
- Set
efs-sc
as your default storage class
kubectl patch storageclass efs-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Note: As mentioned, make sure to change your default storage class only after you have completed your Kubeflow deployment. The default Kubeflow components may not work well with a different storage class.
3.3 Note about permissions
This step may not be necessary but you might need to specify some additional directory permissions on your worker node before you can use these as mount points. By default, new Amazon EFS file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. If your containers are not running as root, you must change the Amazon EFS file system permissions to allow other users to modify the file system. The set-permission-job.yaml is an example of how you could set these permissions to be able to use the efs as your workspace in your kubeflow notebook. Modify it accordingly if you run into similar permission issues with any other job pod.
yq e '.metadata.name = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
yq e '.spec.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
kubectl apply -f $GITHUB_STORAGE_DIR/notebook-sample/set-permission-job.yaml
Dynamic provisioning default owner
For dynamic provisioning (manual and automated setup), we already set the default Kubeflow Notebook user (Jovyan) as owner of the EFS file system by default.
Changing the default values
You can always change the uid
and gid
used for the setup.
For the manual setup, you need to edit the uid
and gid
in the storage class inside dynamic-provisioning/sc.yaml
.
For the automated setup, you can specify the uid
and gid
as arguments to the script, see Advanced Customization for more details on the different parameters that are available.
3.4 Use existing EFS volume as workspace or data volume for a Notebook
Spin up a new Kubeflow notebook server and specify the name of the PVC to be used as the workspace volume or the data volume and specify your desired mount point. We’ll assume you created a PVC with the name efs-claim
via Kubeflow Volumes UI or via the manual setup step Static Provisioning. For our example here, we are using the AWS Optimized Tensorflow 2.6 CPU image provided in the Notebook configuration options (public.ecr.aws/kubeflow-on-aws/notebook-servers/jupyter-tensorflow
). Additionally, use the existing efs-claim
volume as the workspace volume at the default mount point /home/jovyan
. The server might take a few minutes to come up.
In case the server does not start up in the expected time, do make sure to check:
- The Notebook Controller Logs
- The specific notebook server instance pod’s logs
3.6 Use EFS volume for a TrainingJob using TFJob Operator
The following section re-uses the PVC and the Tensorflow Kubeflow Notebook created in the previous steps to download a dataset to the EFS Volume. Then we spin up a TFjob which runs a image classification job using the data from the shared volume. Source: https://www.tensorflow.org/tutorials/load_data/images
Note: The following steps are run from the terminal on your gateway node connected to your EKS cluster and not from the Kubeflow Notebook to test the PVC allowed sharing of data as expected.
1. Download the dataset to the EFS Volume
In the Kubeflow Notebook created above, use the following snippet to download the data into the /home/jovyan/.keras
directory (which is mounted onto the EFS Volume).
import pathlib
import tensorflow as tf
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
fname='flower_photos',
untar=True)
data_dir = pathlib.Path(data_dir)
2. Build and push the Docker image
In the training-sample
directory, we have provided a sample training script and Dockerfile which you can use as follows to build a docker image. Be sure to point the $IMAGE_URI
to your registry and specify an appropriate tag.
export IMAGE_URI=<dockerimage:tag>
cd training-sample
# You will need to login to ECR for the following steps
docker build -t $IMAGE_URI .
docker push $IMAGE_URI
cd -
3. Configure the TFjob spec file
Once the docker image is built, replace the <dockerimage:tag>
in the tfjob.yaml
file, line #17.
yq e '.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image = env(IMAGE_URI)' -i training-sample/tfjob.yaml
Also, specify the name of the PVC you created.
yq e '.spec.tfReplicaSpecs.Worker.template.spec.volumes[0].persistentVolumeClaim.claimName = env(CLAIM_NAME)' -i training-sample/tfjob.yaml
Make sure to run it in the same namespace as the claim:
yq e '.metadata.namespace = env(PVC_NAMESPACE)' -i training-sample/tfjob.yaml
4. Create the TFjob and use the provided commands to check the training logs
At this point, we are ready to train the model using the training-sample/training.py
script and the data available on the shared volume with the Kubeflow TFJob operator as -
kubectl apply -f training-sample/tfjob.yaml
In order to check that the training job is running as expected, you can check the events in the TFJob describe response as well as the job logs.
kubectl describe tfjob image-classification-pvc -n $PVC_NAMESPACE
kubectl logs -n $PVC_NAMESPACE image-classification-pvc-worker-0 -f
4.0 Cleanup
This section cleans up the resources created in this guide. To clean up additional deployment-specific resources, see the uninstall section in the deployment guide for the deployment option of your choice.
4.1 Clean up the TFJob
kubectl delete tfjob -n $PVC_NAMESPACE image-classification-pvc
4.2 Delete the Kubeflow Notebook
Login to the dashboard to stop and/or terminate any kubeflow notebooks you created for this session or use the following command:
kubectl delete notebook -n $PVC_NAMESPACE <notebook-name>
Use the following command to delete the permissions job:
kubectl delete pod -n $PVC_NAMESPACE $CLAIM_NAME
4.3 Delete PVC, PV, and SC in the following order
kubectl delete pvc -n $PVC_NAMESPACE $CLAIM_NAME
kubectl delete pv efs-pv
kubectl delete sc efs-sc
4.4 Delete the EFS mount targets, filesystem, and security group
Use the steps in this AWS Guide to delete the EFS filesystem that you created.
5.0 Known issues
- When you re-run the
eksctl create iamserviceaccount
to create and annotate the same service account multiple times, sometimes the role does not get overwritten. In this case, you may need to do one or both of the following: a. Delete the CloudFormation stack associated with this add-on role. b. Delete theefs-csi-controller-sa
service account and then re-run the required steps. If you used the auto-script, you can re-run it by specifying the samefilesystem-name
such that a new one is not created.