Launch a job with FSx for Lustre
What is FSx¶
Amazon FSx provides native compatibility with third-party file systems, with feature sets for workloads such as high-performance computing (HPC), machine learning, and electronic design automation (EDA). You don't have to worry about managing file servers and storage: Amazon FSx automates time-consuming administration tasks such as hardware provisioning, software configuration, patching, and backups. Amazon FSx provides FSx for Lustre for compute-intensive workloads.
Please note the following when using FSx on Scale-Out Computing on AWS
- FSx is supported natively (Linux clients, security groups, and backend configuration are automatically managed by Scale-Out Computing on AWS)
- You can launch an ephemeral FSx filesystem for your job
- You can connect to an existing FSx filesystem
- You can dynamically adjust the storage capacity of your FSx filesystem
- Exported files (if any) from FSx to S3 are stored under s3://<YOUR_BUCKET_NAME>/<CLUSTER_ID>-fsxoutput/job-<JOB_ID>/ by default (you can change this if needed)
Scale-Out Computing on AWS automatically determines the actions to take based on the fsx_lustre value you specify during job submission:
- If the value is yes/true/on, a standard FSx for Lustre filesystem is provisioned
- If the value starts with s3:// (or is a plain bucket name string), SOCA will try to mount the S3 bucket automatically as part of the FSx deployment
- If the value starts with fs-xxx, SOCA will try to mount the existing FSx filesystem automatically
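The dispatch rules above can be sketched as a small helper. This is an illustrative sketch only; the function name and return labels are hypothetical and not SOCA's actual code:

```python
def classify_fsx_lustre(value):
    """Map a -l fsx_lustre submission value to the action SOCA takes (sketch)."""
    v = value.strip().lower()
    if v in ("yes", "true", "on"):
        return "provision_ephemeral"        # new FSx for Lustre, no S3 backend
    if value.startswith("fs-"):
        return "mount_existing"             # mount an existing FSx filesystem
    if value.startswith("s3://") or value:  # s3:// URI or plain bucket name
        return "provision_with_s3_backend"  # new FSx backed by the S3 bucket
    return "none"
```

Note that the fs- check must run before the generic string check, since an existing filesystem ID is also a plain string.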
How to provision an ephemeral FSx¶
To provision an FSx for Lustre filesystem without an S3 backend, simply specify -l fsx_lustre=True
at job submission.
If -l fsx_lustre_size
is not set, the default storage provisioned is 1200 GB (1.2 TB). The FSx is mounted under /fsx by default; you can change this value by referring to the section at the end of this document.
How to provision an ephemeral FSx with S3 backend¶
Pre-requisite¶
S3 Backend
This section is only required if you are planning to use S3 as a data backend for FSx
You need to give Scale-Out Computing on AWS permission to map the S3 bucket you want to mount on FSx. To do that, add a new inline policy to the scheduler IAM role. The scheduler IAM role can be found in the IAM console and is named <SOCA_AWS_STACK_NAME>-Security-<UUID>-SchedulerIAMRole-<UUID>.
To create an inline policy, select your IAM role, click "Add Inline Policy":
Select the "JSON" tab.
Finally, copy/paste the JSON policy listed below (make sure to adjust your bucket name), then click "Review" and "Create Policy".
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowAccessFSxtoS3",
"Effect": "Allow",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::<YOUR_BUCKET_NAME>",
"arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
]
}
]
}
To validate that your policy is effective, access the scheduler host and run the following command:
## Example when IAM policy is not correct
user@host: aws s3 ls s3://<YOUR_BUCKET_NAME>
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
## Example when IAM policy is valid (output will list content of your bucket)
user@host: aws s3 ls s3://<YOUR_BUCKET_NAME>
2019-11-02 04:26:27 2209 dataset1.txt
2019-11-02 04:26:39 10285 dataset2.csv
Warning
This permission gives the scheduler host access to your S3 bucket, so you should limit access to this host to approved users only. DCV sessions and other compute nodes will not have access to the S3 bucket.
Setup¶
For this example, let's say I have datasets available on S3 and I want to access them for my simulation.
Submit a job using -l fsx_lustre=s3://<YOUR_BUCKET_NAME>. The bucket will then be mounted on all nodes provisioned for the job under the /fsx mountpoint.
user@host: qsub -l fsx_lustre=s3://<YOUR_BUCKET_NAME> -- /bin/sleep 600
This command will provision a new 1200 GB (smallest capacity available) FSx filesystem for your job:
Your job will automatically start as soon as both your FSx filesystem and compute nodes are available. Your filesystem will be available on all nodes allocated to your job under /fsx
user@host: df -h /fsx
Filesystem Size Used Avail Use% Mounted on
200.0.170.60@tcp:/fsx 1.1T 4.4M 1.1T 1% /fsx
## Verify the content of your bucket is accessible
user@host: ls -ltr /fsx
total 1
-rwxr-xr-x 1 root root 2209 Nov 2 04:26 dataset1.txt
-rwxr-xr-x 1 root root 10285 Nov 2 04:26 dataset2.csv
You can change the ImportPath / ExportPath by using the following syntax: -l fsx_lustre=<BUCKET>+<EXPORT_PATH>+<IMPORT_PATH>
.
If <IMPORT_PATH>
is not set, it defaults to the bucket root level.
The default <EXPORT_PATH>
is <BUCKET>/<CLUSTER_ID>-fsxoutput/job-<JOB_ID>
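The <BUCKET>+<EXPORT_PATH>+<IMPORT_PATH> syntax and its defaults can be sketched as follows. This is a hypothetical illustration of the parsing rules described above, not SOCA's actual implementation; the function name and default arguments are assumptions:

```python
def parse_fsx_lustre_value(value, cluster_id="soca-cluster", job_id="1"):
    """Split a fsx_lustre value into (bucket, export_path, import_path) - sketch."""
    parts = value.split("+")
    bucket = parts[0]
    # Default export path: <BUCKET>/<CLUSTER_ID>-fsxoutput/job-<JOB_ID>
    export_path = parts[1] if len(parts) > 1 else f"{bucket}/{cluster_id}-fsxoutput/job-{job_id}"
    # Default import path: the bucket root level
    import_path = parts[2] if len(parts) > 2 else bucket
    return bucket, export_path, import_path
```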
Your FSx filesystem will automatically be terminated when your job completes. Refer to this link to learn how to interact with FSx data repositories.
How to connect to a permanent/existing FSx¶
If you already have a running FSx filesystem, you can mount it using the -l fsx_lustre
parameter.
user@host: qsub -l fsx_lustre=<MY_FSX_DNS> -- /bin/sleep 60
To retrieve your FSx DNS name, select your filesystem in the FSx console and open the "Network & Security" tab.
Warning
- Make sure your FSx is running in the same VPC as Scale-Out Computing on AWS
- Make sure your FSx security group allows traffic from/to the Scale-Out Computing on AWS ComputeNodes security group
- If you mount an existing filesystem with "fsx_lustre", any "fsx_lustre_size" value is ignored, since capacity only applies to newly provisioned filesystems.
Change FSx capacity¶
Use -l fsx_lustre_size=<SIZE_IN_GB>
to specify the size of your FSx filesystem. Please note the following:
- If not specified, Scale-Out Computing on AWS deploys the smallest possible capacity (1200 GB)
- Valid sizes (in GB) are 1200, 2400, 3600 and increments of 3600
user@host: qsub -l fsx_lustre_size=3600 -l fsx_lustre=s3://<YOUR_S3_BUCKET> -- /bin/sleep 600
This command will mount a 3.6 TB FSx filesystem on all nodes provisioned for your simulation.
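The capacity rule above (1200, 2400, 3600, then increments of 3600) can be expressed as a small check. The helper name is hypothetical; this is only a sketch of the validation described in this section:

```python
def is_valid_fsx_lustre_size(size_gb):
    """Return True if size_gb is a valid FSx for Lustre capacity (sketch)."""
    if size_gb in (1200, 2400):
        return True
    # 3600 and every multiple of 3600 above it are valid
    return size_gb >= 3600 and size_gb % 3600 == 0
```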
How to change the mountpoint¶
By default, Scale-Out Computing on AWS mounts FSx on /fsx.
If you need to change this value, edit /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ComputeNodePostReboot.sh
and update the value of FSX_MOUNTPOINT.
...
if [[ $SOCA_AWS_fsx_lustre != 'false' ]]; then
echo "FSx request detected, installing FSX Lustre client ... "
FSX_MOUNTPOINT="/fsx" ## <-- Update mountpoint here
mkdir -p $FSX_MOUNTPOINT
...
Learn about the other storage options on Scale-Out Computing on AWS¶
Click here to learn about the other storage options offered by Scale-Out Computing on AWS.
Troubleshooting and most common errors¶
Like any other parameter, FSx options can be debugged using /apps/soca/$SOCA_CONFIGURATION/cluster_manager/logs/<QUEUE_NAME>.log
[Error while trying to create ASG: Scale-Out Computing on AWS does not have access to this bucket.
Update IAM policy as described on the documentation]
Resolution: Scale-Out Computing on AWS does not have access to this S3 bucket. Update your IAM role with the policy listed above
[Error while trying to create ASG: fsx_lustre_size must be: 1200, 2400, 3600, 7200, 10800]
Resolution: The fsx_lustre_size value you specified is not valid. Use 1200, 2400, 3600, or an increment of 3600.