
JARK on EKS

warning

Deployment of ML models on EKS requires access to GPU or Neuron instances. If your deployment isn't working, it's often due to missing access to these resources. Also, some deployment patterns rely on Karpenter autoscaling and static node groups; if nodes aren't initializing, check the Karpenter or node group logs to resolve the issue.

info

These instructions only deploy the JARK cluster as a base. If you are looking to deploy specific models for inference or training, please refer to the Gen AI section of this site for end-to-end instructions.

What is JARK?

JARK is a powerful stack composed of JupyterHub, Argo Workflows, Ray, and Kubernetes, designed to streamline the deployment and management of generative AI models on Amazon EKS. This stack brings together some of the most effective tools in the AI and Kubernetes ecosystem, offering a robust solution for training, fine-tuning, and running inference on large generative AI models.

Key Features and Benefits

JupyterHub: Provides a collaborative environment for running notebooks, crucial for model development and prompt engineering.

Argo Workflows: Automates the entire AI model pipeline—from data preparation to model deployment—ensuring a consistent and efficient process.

Ray: Scales AI model training and inference across multiple nodes, making it easier to handle large datasets and reduce training time.

Kubernetes: Powers the stack by providing the necessary orchestration to run, scale, and manage containerized AI models with high availability and resource efficiency.

Why Use JARK?

The JARK stack is ideal for teams and organizations looking to simplify the complex process of deploying and managing AI models. Whether you're working on cutting-edge generative models or scaling existing AI workloads, JARK on Amazon EKS offers the flexibility, scalability, and control you need to succeed.


Ray on Kubernetes

Ray is an open-source framework for building scalable and distributed applications. It is designed to make it easy to write parallel and distributed Python applications by providing a simple and intuitive API for distributed computing. It has a growing community of users and contributors, and is actively maintained and developed by the Ray team at Anyscale, Inc.

Figure: Ray Cluster key concepts (source: https://docs.ray.io/en/latest/cluster/key-concepts.html)

To run Ray in production across multiple machines, users must first deploy a Ray cluster. A Ray cluster consists of a head node and worker nodes, which can be autoscaled using the built-in Ray autoscaler.

Deploying a Ray cluster on Kubernetes, including on Amazon EKS, is supported via the KubeRay operator, which provides a Kubernetes-native way to manage Ray clusters. Installing KubeRay involves deploying the operator and the CRDs for the RayCluster, RayJob, and RayService custom resources, as documented in the KubeRay installation guide.
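As a sketch, the KubeRay operator can be installed with Helm along the following lines (the version pin and the pod label selector below are assumptions; check the KubeRay releases page and your cluster for the current values):

```shell
# Add the KubeRay Helm repository and refresh the local index.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the operator; this also installs the RayCluster, RayJob,
# and RayService CRDs. The version is an assumption -- pin whichever
# release you have validated.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1

# Confirm the operator pod is running (label selector may vary by chart version).
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
```

Once the operator is running, Ray clusters are created declaratively by applying RayCluster custom resources.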

Deploying Ray on Kubernetes can provide several benefits:

  1. Scalability: Kubernetes allows you to scale your Ray cluster up or down based on your workload requirements, making it easy to manage large-scale distributed applications.

  2. Fault tolerance: Kubernetes provides built-in mechanisms for handling node failures and ensuring high availability of your Ray cluster.

  3. Resource allocation: With Kubernetes, you can easily allocate and manage resources for your Ray workloads, ensuring that they have access to the necessary resources for optimal performance.

  4. Portability: By deploying Ray on Kubernetes, you can run your workloads across multiple clouds and on-premises data centers, making it easy to move your applications as needed.

  5. Monitoring: Kubernetes provides rich monitoring capabilities, including metrics and logging, making it easy to troubleshoot issues and optimize performance.

Overall, deploying Ray on Kubernetes can simplify the deployment and management of distributed applications, making it a popular choice for many organizations that need to run large-scale machine learning workloads.
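As a concrete illustration of the declarative model, a minimal RayCluster custom resource might look like the following sketch. The name, image tag, resource requests, and replica counts are illustrative assumptions, not values from this guide:

```yaml
# Minimal RayCluster sketch for the KubeRay operator.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-raycluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # illustrative tag
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
  workerGroupSpecs:
    - groupName: workers
      replicas: 2        # autoscaled between min and max below
      minReplicas: 1
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
```

Applying this manifest with `kubectl apply -f` creates a head pod and a worker group that the Ray autoscaler can grow or shrink within the declared bounds.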

Before moving forward with the deployment, please make sure you have read the pertinent sections of the official documentation.

Figure: Ray on Kubernetes (source: https://docs.ray.io/en/latest/cluster/kubernetes/index.html)

Deploying the Solution

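The collapsed section here contains the full deployment steps. As a hedged sketch, deployments from the Data on EKS repository typically follow this pattern; the directory path below is an assumption, so check the repository layout before running:

```shell
# Clone the Data on EKS repository.
git clone https://github.com/awslabs/data-on-eks.git

# The blueprint directory is an assumption -- verify it in the repo.
cd data-on-eks/ai-ml/jark-stack/terraform

# Choose the AWS region for the deployment.
export AWS_REGION=us-west-2

# Run the blueprint's install script, which wraps terraform init/apply.
./install.sh
```

The install script provisions the EKS cluster and the JARK add-ons (JupyterHub, Argo Workflows, and the KubeRay operator) via Terraform.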

Verify Deployment

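The collapsed section contains the verification steps. A minimal sanity check looks like the following; the namespace names are assumptions and may differ in your deployment:

```shell
# Point kubectl at the new cluster (cluster name/region are assumptions).
aws eks update-kubeconfig --region us-west-2 --name jark-stack

# Nodes should be in Ready state.
kubectl get nodes

# All pods across namespaces should be Running or Completed.
kubectl get pods -A

# Spot-check the JARK components (namespace names may vary).
kubectl get pods -n jupyterhub
kubectl get pods -n argo-workflows
```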

Clean Up

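The collapsed section contains the tear-down steps. Data on EKS blueprints generally ship a cleanup script alongside the install script; the path below is the same assumption as in the deployment step:

```shell
# From the same blueprint directory used for the install.
cd data-on-eks/ai-ml/jark-stack/terraform

# Destroys the add-ons and the EKS cluster (wraps terraform destroy).
./cleanup.sh
```

Always run cleanup from the same directory (and Terraform state) used for the install, otherwise resources may be left behind and continue to incur cost.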