Introduction

The AIoEKS foundational infrastructure lives in the infra/base directory. This directory contains the base infrastructure and the modules used to compose an environment that supports experimentation, AI/ML training, LLM inference, model tracking, and more.

The directory contains a variables.tf with the parameters used to enable or disable individual modules (all set to false by default). This makes it possible to deploy a bare environment with just Karpenter and GPU and AWS Neuron NodePools, enabling accelerator use and further customization.
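
For illustration, each of these toggles is declared as a boolean Terraform variable in variables.tf. A representative sketch (the exact declarations live in the file itself):

```hcl
# Representative sketch of the enable/disable flags declared in
# infra/base/variables.tf. Both variables appear in the tables below;
# all such flags default to false.
variable "enable_jupyterhub" {
  description = "Enable JupyterHub"
  type        = bool
  default     = false
}

variable "enable_kuberay_operator" {
  description = "Enable KubeRay"
  type        = bool
  default     = false
}
```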

The reference jark-stack deploys an environment that facilitates quick AI/ML development by enabling JupyterHub for experimentation, the KubeRay operator for training and inference with Ray clusters, Argo Workflows for workflow automation, and storage controllers and volumes. This supports deploying the notebook, training, and inference blueprints in the blueprints folder.
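
Conceptually, the jark-stack is a set of these flags switched on. A minimal sketch of the equivalent tfvars (the actual reference stack may enable more than shown here):

```hcl
# Sketch of tfvars flags matching the jark-stack components described above.
enable_jupyterhub         = true   # experimentation
enable_kuberay_operator   = true   # Ray clusters for training and inference
enable_argo_workflows     = true   # workflow automation
enable_aws_efs_csi_driver = true   # storage controller (illustrative choice)
```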

Other blueprints use the same base infrastructure and selectively enable other components based on the needs of the blueprint.

Resources

Each stack inherits the base stack's components. These components include:

  • VPC with subnets in 2 availability zones
  • EKS cluster with one core node group of 2 nodes to run the minimum infrastructure
  • Karpenter for autoscaling with CPU, GPU, and AWS Neuron NodePools
  • GPU/Neuron device drivers
  • GPU/Neuron monitoring agents
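
At the Terraform level, the base stack wires these pieces together roughly as follows. This is a simplified, illustrative sketch using the public terraform-aws-modules modules and the documented defaults; the actual composition lives in infra/base:

```hcl
# Simplified sketch of the base composition; values mirror the defaults in the
# Variables section below, and module arguments are illustrative, not the
# actual infra/base code.
module "vpc" {
  source          = "terraform-aws-modules/vpc/aws"
  name            = "ai-stack"
  cidr            = "10.1.0.0/21"                  # vpc_cidr default
  azs             = ["us-east-1a", "us-east-1b"]   # subnets in 2 AZs
  private_subnets = ["10.1.0.0/24", "10.1.1.0/24"]
  public_subnets  = ["10.1.2.0/24", "10.1.3.0/24"]
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "ai-stack"
  cluster_version = "1.32"                         # eks_cluster_version default
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    # The 2-node core node group that runs the minimum infrastructure.
    core = {
      min_size     = 2
      max_size     = 2
      desired_size = 2
    }
  }
}

# Karpenter, the CPU/GPU/Neuron NodePools, device drivers, and monitoring
# agents are layered on top of the cluster created here.
```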

Variables

Deployment

| Variable Name | Description | Default |
|---------------|-------------|---------|
| `name` | The name of the Kubernetes cluster | `ai-stack` |
| `region` | The region for the cluster | `us-east-1` |
| `eks_cluster_version` | The version of EKS to use | `1.32` |
| `vpc_cidr` | The CIDR used for the VPC | `10.1.0.0/21` |
| `secondary_cidr_blocks` | Secondary CIDR for the VPC | `100.64.0.0/16` |
| `enable_aws_cloudwatch_metrics` | Enable the AWS CloudWatch Metrics addon | `false` |
| `bottlerocket_data_disk_snapshot_id` | Attach a snapshot ID to the deployed nodes | `""` |
| `enable_aws_efs_csi_driver` | Enable the AWS EFS CSI driver | `false` |
| `enable_aws_efa_k8s_device_plugin` | Enable the AWS EFA device plugin | `false` |
| `enable_aws_fsx_csi_driver` | Enable the FSx CSI driver | `false` |
| `deploy_fsx_volume` | Deploy a simple FSx volume | `false` |
| `enable_amazon_prometheus` | Enable Amazon Managed Prometheus | `false` |
| `enable_amazon_emr` | Set up Amazon EMR | `false` |
| `enable_kube_prometheus_stack` | Enable the Kube Prometheus stack addon | `false` |
| `enable_kubecost` | Enable Kubecost | `false` |
| `enable_argo_workflows` | Enable Argo Workflows | `false` |
| `enable_argo_events` | Enable Argo Events | `false` |
| `enable_mlflow_tracking` | Enable MLflow tracking | `false` |
| `enable_jupyterhub` | Enable JupyterHub | `false` |
| `enable_volcano` | Enable Volcano | `false` |
| `enable_kuberay_operator` | Enable KubeRay | `false` |
| `huggingface_token` | Hugging Face token to use in the environment | `DUMMY_TOKEN_REPLACE_ME` |
| `enable_rayserve_ha_elastic_cache_redis` | Enable RayServe high availability using ElastiCache | `false` |
| `enable_torchx_etcd` | Enable etcd for TorchX | `false` |
| `enable_mpi_operator` | Enable the MPI Operator | `false` |
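
For example, a tfvars file that overrides a few of these deployment defaults might look like this (values are illustrative):

```hcl
# Illustrative overrides of the deployment variables above.
name                          = "my-ai-stack"
region                        = "us-west-2"
enable_aws_cloudwatch_metrics = true
enable_kube_prometheus_stack  = true
enable_kuberay_operator       = true
```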

JupyterHub

| Variable Name | Description | Default |
|---------------|-------------|---------|
| `jupyter_hub_auth_mechanism` | Which authentication mechanism to use for JupyterHub [`dummy` \| `cognito` \| `oauth`] | `dummy` |
| `cognito_custom_domain` | Cognito domain prefix for Hosted UI authentication endpoints | `eks` |
| `acm_certificate_domain` | Domain name used for the ACM certificate | `""` |
| `jupyterhub_domain` | Domain name for JupyterHub (only used for `cognito` or `oauth`) | `""` |
| `oauth_jupyter_client_id` | OAuth client ID for JupyterHub (only used for `oauth`) | `""` |
| `oauth_jupyter_client_secret` | OAuth client secret (only used for `oauth`) | `""` |
| `oauth_username_key` | OAuth field for the username, e.g. `preferred_username` (only needed for `oauth`) | `""` |
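
As a sketch, enabling JupyterHub behind Cognito authentication combines these variables along these lines (the domain values are placeholders):

```hcl
# Illustrative JupyterHub + Cognito configuration; domain values are placeholders.
enable_jupyterhub          = true
jupyter_hub_auth_mechanism = "cognito"
cognito_custom_domain      = "eks"
acm_certificate_domain     = "example.com"
jupyterhub_domain          = "jupyterhub.example.com"
```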

Custom Stacks

With the variables above, it's easy to compose a new environment tailored to your own needs. A custom folder is available in the infra folder with a simple blueprint.tfvars. By setting the variables above to the appropriate values, you can choose which addons are deployed and compose an environment that fits your requirements. Once the variables are set, run the install.sh at the root of infra/custom.
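
As an illustration, a minimal infra/custom/blueprint.tfvars might enable just a couple of addons (values are illustrative):

```hcl
# Example infra/custom/blueprint.tfvars; pick whichever flags suit your environment.
name                   = "my-custom-stack"
enable_jupyterhub      = true
enable_mlflow_tracking = true
```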