Introduction
The AIoEKS foundational infrastructure lives in the `infra/base` directory. This directory contains the base infrastructure and all of the modules used to compose an environment that supports experimentation, AI/ML training, LLM inference, model tracking, and more.
The directory contains a `variables.tf` with all of the parameters used to enable or disable the desired modules (set to `false` by default). This makes it possible to deploy a bare environment with Karpenter plus GPU and AWS Neuron `NodePool`s, enabling accelerator use and further customization.
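For example, a minimal deployment that leaves every module disabled (a bare cluster with Karpenter and the accelerator NodePools) only needs the cluster-level variables from the Variables section below; the values here are illustrative, not defaults:

```hcl
# Illustrative terraform.tfvars for a bare AIoEKS environment.
# All module toggles default to false, so only cluster-level
# settings need values (these are example values).
name                = "ai-stack-dev"
region              = "us-west-2"
eks_cluster_version = "1.32"
vpc_cidr            = "10.1.0.0/21"
```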
The reference `jark-stack` deploys an environment that facilitates quick AI/ML development by enabling JupyterHub for experimentation, the KubeRay operator for training and inference using Ray clusters, Argo Workflows for workflow automation, and storage controllers and volumes. This allows deploying the notebooks, training, and inference blueprints in the `blueprints` folder.
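A sketch of the variable file such an environment implies, using toggles from the Variables section below (the shipped jark-stack configuration may enable a different set):

```hcl
# Example toggles approximating the jark-stack environment
# (an illustration, not the exact shipped configuration).
enable_jupyterhub         = true # notebooks for experimentation
enable_kuberay_operator   = true # Ray clusters for training and inference
enable_argo_workflows     = true # workflow automation
enable_aws_efs_csi_driver = true # shared storage for notebooks and models
```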
Other blueprints use the same base infrastructure and selectively enable other components based on the needs of the blueprint.
Resources
Each stack inherits the `base` stack's components. These components include:
- VPC with subnets in 2 availability zones
- EKS cluster with a core node group of 2 nodes to run the minimum infrastructure
- Karpenter for autoscaling with CPU, GPU, and AWS Neuron NodePools
- GPU/Neuron device drivers
- GPU/Neuron monitoring agents
Variables
Deployment
Variable Name | Description | Default |
---|---|---|
name | The name of the Kubernetes cluster | ai-stack |
region | The region for the cluster | us-east-1 |
eks_cluster_version | The version of EKS to use | 1.32 |
vpc_cidr | The CIDR used for the VPC | 10.1.0.0/21 |
secondary_cidr_blocks | Secondary CIDR for the VPC | 100.64.0.0/16 |
enable_aws_cloudwatch_metrics | Enable the AWS CloudWatch Metrics addon | false |
bottlerocket_data_disk_snapshot_id | Attach a snapshot ID to the deployed nodes | "" |
enable_aws_efs_csi_driver | Enable the AWS EFS CSI driver | false |
enable_aws_efa_k8s_device_plugin | Enable the AWS EFA device plugin | false |
enable_aws_fsx_csi_driver | Enable the AWS FSx CSI driver | false |
deploy_fsx_volume | Deploy a simple FSx volume | false |
enable_amazon_prometheus | Enable Amazon Managed Prometheus | false |
enable_amazon_emr | Set up Amazon EMR | false |
enable_kube_prometheus_stack | Enable the Kube Prometheus addon | false |
enable_kubecost | Enable Kubecost | false |
enable_argo_workflows | Enable Argo Workflows | false |
enable_argo_events | Enable Argo Events | false |
enable_mlflow_tracking | Enable MLflow Tracking | false |
enable_jupyterhub | Enable JupyterHub | false |
enable_volcano | Enable Volcano | false |
enable_kuberay_operator | Enable KubeRay | false |
huggingface_token | Hugging Face token to use in environment | DUMMY_TOKEN_REPLACE_ME |
enable_rayserve_ha_elastic_cache_redis | Enable Ray Serve high availability using ElastiCache Redis | false |
enable_torchx_etcd | Enable etcd for TorchX | false |
enable_mpi_operator | Enable the MPI Operator | false |
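As an illustration of how these toggles compose, a deployment that adds observability on top of the defaults might set the following (variable names are from the table above; enabling all of them is just an example):

```hcl
# Example: enable monitoring addons on top of the base deployment.
enable_amazon_prometheus      = true # Amazon Managed Prometheus
enable_kube_prometheus_stack  = true # in-cluster Prometheus/Grafana stack
enable_aws_cloudwatch_metrics = true # CloudWatch metrics addon
enable_kubecost               = true # cost visibility
```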
JupyterHub
Variable Name | Description | Default |
---|---|---|
jupyter_hub_auth_mechanism | Which authentication mechanism to use for JupyterHub (dummy, cognito, or oauth) | dummy |
cognito_custom_domain | Cognito domain prefix for Hosted UI authentication endpoints | eks |
acm_certificate_domain | Domain name used for the ACM certificate | "" |
jupyterhub_domain | Domain name for JupyterHub (only used for cognito or oauth) | "" |
oauth_jupyter_client_id | OAuth client ID for JupyterHub; only used for oauth | "" |
oauth_jupyter_client_secret | OAuth client secret; only used for oauth | "" |
oauth_username_key | OAuth field for the username (e.g. preferred_username); only needed for oauth | "" |
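For example, a Cognito-backed JupyterHub configuration might look like the following (the domain values are placeholders, not defaults):

```hcl
# Example: JupyterHub behind the Cognito Hosted UI (placeholder values).
enable_jupyterhub          = true
jupyter_hub_auth_mechanism = "cognito"
cognito_custom_domain      = "my-eks-auth"            # Hosted UI domain prefix
acm_certificate_domain     = "example.com"            # domain of the ACM certificate
jupyterhub_domain          = "jupyterhub.example.com" # where JupyterHub is served
```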
Custom Stacks
With the variables above, it's very easy to compose a new environment tailored to your own needs. A custom
folder is
available in the infra
folder with a simple blueprint.tfvars
. By adding the variables above with the appropriate
value, you are able to customize which addons you would like deployed to create an environment to support your
preferences. Once the variables are added, run the install.sh
at the root of infra/custom
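A hypothetical `infra/custom/blueprint.tfvars` might look like this (the cluster name and the selected addons are only an example):

```hcl
# infra/custom/blueprint.tfvars -- example selection of addons
name                    = "my-ai-stack"
region                  = "us-east-1"
enable_jupyterhub       = true
enable_kuberay_operator = true
enable_mlflow_tracking  = true
```

Running the `install.sh` script in `infra/custom` then applies this configuration.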