Introduction

The AI on EKS foundational infrastructure lives in the infra/base directory. This directory contains the base infrastructure and its modules, which can be composed into an environment that supports experimentation, AI/ML training, LLM inference, model tracking, and more.

The directory contains a variables.tf with the parameters used to enable or disable individual modules (all set to false by default). This makes it possible to deploy a bare environment with Karpenter plus GPU and AWS Neuron NodePools, ready for accelerator use and further customization.
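Each module toggle follows the same pattern: a boolean variable that defaults to false. A minimal sketch of how such a toggle might look (the exact layout of variables.tf is an assumption, but enable_jupyterhub is a real variable from the tables below):

```hcl
# Sketch of a module toggle as it might appear in infra/base variables.tf.
# Setting the default to false keeps the base deployment bare until a
# blueprint or tfvars file opts in.
variable "enable_jupyterhub" {
  description = "Enable JupyterHub"
  type        = bool
  default     = false
}
```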

The reference jark-stack deploys an environment for quick AI/ML development by enabling JupyterHub for experimentation, the KubeRay operator for training and inference with Ray clusters, Argo Workflows for workflow automation, and storage controllers and volumes. This supports deploying the notebook, training, and inference blueprints in the blueprints folder.
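Based on the components listed above, the jark-stack's toggles might look like the following tfvars fragment. This is illustrative only: the flag names come from the Variables section of this page, but the reference stack may enable additional modules.

```hcl
# Illustrative flags for a JARK-style environment; the reference
# jark-stack may enable more modules than shown here.
enable_jupyterhub       = true
enable_kuberay_operator = true
enable_argo_workflows   = true
```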

Other blueprints use the same base infrastructure and selectively enable other components based on the needs of the blueprint.

Overview

AI on EKS provides comprehensive infrastructure solutions for deploying AI/ML workloads on Amazon EKS. Choose from pre-configured solutions optimized for training, inference, or general-purpose AI/ML workloads.

Training Infrastructure

Infrastructure solutions optimized for AI/ML model training workloads:

  • JARK Stack on EKS - Complete stack for AI workloads with NVIDIA GPUs, including JupyterHub, Argo Workflows, and Ray
  • JupyterHub on EKS - Interactive development environment for data science and ML

Inference Infrastructure

Infrastructure solutions optimized for AI/ML model inference workloads:

Miscellaneous

Additional infrastructure solutions and utilities:

Getting Started

  1. Choose Your Use Case: Select training or inference based on your workload requirements
  2. Deploy Infrastructure: Follow the deployment guide for your chosen solution
  3. Deploy Workloads: Use the Blueprints to deploy your AI/ML workloads
  4. Optimize: Apply Guidance best practices

Architecture Patterns

All infrastructure solutions follow these core principles:

  • Modular Design: Compose solutions from reusable modules
  • Best Practices: Built-in security, observability, and scalability
  • Cloud Native: Leverage Kubernetes and AWS services
  • Validated: Tested and validated for enterprise workloads

Resources

Each stack inherits the base stack's components. These components include:

  • VPC with subnets in 2 availability zones
  • EKS cluster with 1 core nodegroup with 2 nodes to run the minimum infrastructure
  • Karpenter for autoscaling with CPU, GPU, and AWS Neuron NodePools
  • GPU/Neuron device drivers
  • GPU/Neuron monitoring agents

Variables

Deployment

| Variable Name | Description | Default |
| --- | --- | --- |
| name | The name of the Kubernetes cluster | ai-stack |
| region | The region for the cluster | us-east-1 |
| eks_cluster_version | The version of EKS to use | 1.32 |
| vpc_cidr | The CIDR used for the VPC | 10.1.0.0/21 |
| secondary_cidr_blocks | Secondary CIDR for the VPC | 100.64.0.0/16 |
| enable_database_subnets | Whether or not to enable the database subnets | false |
| enable_aws_cloudwatch_metrics | Enable the AWS CloudWatch Metrics addon | false |
| bottlerocket_data_disk_snapshot_id | Attach a snapshot ID to the deployed nodes | "" |
| enable_aws_efs_csi_driver | Enable the AWS EFS CSI driver | false |
| enable_aws_efa_k8s_device_plugin | Enable the AWS EFA device plugin | false |
| enable_aws_fsx_csi_driver | Enable the FSx CSI driver | false |
| deploy_fsx_volume | Deploy a simple FSx volume | false |
| fsx_pvc_namespace | Namespace to provision the FSx PVC | default |
| enable_amazon_prometheus | Enable Amazon Managed Prometheus | false |
| enable_amazon_emr | Set up Amazon EMR | false |
| enable_kube_prometheus_stack | Enable the Kube Prometheus addon | false |
| enable_kubecost | Enable Kubecost | false |
| enable_ai_ml_observability_stack | Enable the AI/ML observability addon | false |
| enable_argo_workflows | Enable Argo Workflows | false |
| enable_argo_events | Enable Argo Events | false |
| enable_argocd | Enable the ArgoCD addon | false |
| enable_mlflow_tracking | Enable MLflow Tracking | false |
| enable_jupyterhub | Enable JupyterHub | false |
| enable_volcano | Enable Volcano | false |
| enable_kuberay_operator | Enable KubeRay | false |
| huggingface_token | Hugging Face token to use in the environment | DUMMY_TOKEN_REPLACE_ME |
| enable_rayserve_ha_elastic_cache_redis | Enable RayServe high availability using ElastiCache | false |
| enable_torchx_etcd | Enable etcd for TorchX | false |
| enable_mpi_operator | Enable the MPI Operator | false |
| enable_aibrix_stack | Enable the AIBrix stack | false |
| aibrix_stack_version | AIBrix stack version | v0.2.1 |
| enable_aws_load_balancer_controller | Enable the AWS Load Balancer Controller | true |
| enable_service_mutator_webhook | Enable the service-mutator webhook for the AWS Load Balancer Controller | false |
| enable_ingress_nginx | Enable the ingress-nginx addon | true |
| enable_cert_manager | Enable Cert Manager | false |
| enable_slurm_operator | Enable the Slinky Slurm Operator (with Cert Manager) | false |

JupyterHub

| Variable Name | Description | Default |
| --- | --- | --- |
| jupyter_hub_auth_mechanism | Which authorization mechanism to use for JupyterHub [dummy \| cognito \| oauth] | dummy |
| cognito_custom_domain | Cognito domain prefix for Hosted UI authentication endpoints | eks |
| acm_certificate_domain | Domain name used for the ACM certificate | "" |
| jupyterhub_domain | Domain name for JupyterHub (only used for cognito or oauth) | "" |
| oauth_jupyter_client_id | OAuth client ID for JupyterHub (only used for oauth) | "" |
| oauth_jupyter_client_secret | OAuth client secret (only used for oauth) | "" |
| oauth_username_key | OAuth field for the username (e.g. preferred_username); only needed for oauth | "" |
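As an illustration, a Cognito-backed JupyterHub deployment might combine these variables as follows. The domain values are placeholders, not working defaults:

```hcl
# Hypothetical Cognito configuration; replace the domains with ones you own
# and for which an ACM certificate can be issued.
enable_jupyterhub          = true
jupyter_hub_auth_mechanism = "cognito"
cognito_custom_domain      = "eks"
acm_certificate_domain     = "example.com"
jupyterhub_domain          = "jupyterhub.example.com"
```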

Custom Stacks

With the variables above, it's easy to compose a new environment tailored to your needs. A custom folder is available in the infra folder with a simple blueprint.tfvars. Add the variables above with the appropriate values to customize which addons are deployed. Once the variables are set, run the install.sh at the root of infra/custom.
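For example, a minimal blueprint.tfvars for an environment combining KubeRay and MLflow tracking might look like the following. The selection of flags is illustrative; any variable from the tables above can be added:

```hcl
# blueprint.tfvars in the infra/custom folder (illustrative flag selection)
name                    = "my-ai-stack"
region                  = "us-east-1"
enable_kuberay_operator = true
enable_mlflow_tracking  = true
```

After saving the file, running install.sh from infra/custom deploys the environment with only these addons enabled.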