Welcome to the blog section of Kubeflow on AWS!

In this section, we share posts, articles, and videos that illustrate how some of our customers have used the AWS distribution of Kubeflow to optimize large-scale, compute-intensive machine learning workloads. We also provide ML infrastructure recommendations and best practices, and we announce upcoming Kubeflow on AWS training sessions and available recordings.

Want to try the latest version of Kubeflow on AWS? Follow the steps in the Kubeflow on AWS documentation.

Read from our curated list of posts or watch a recorded session.

Read

A curated list of blog posts illustrating how to build flexible, scalable ML workflows on Kubernetes using the AWS distribution of Kubeflow. Find case studies, use cases, best practices, benchmarks, and more.

  1. Enabling hybrid ML workflows on Amazon EKS and Amazon SageMaker with one-click Kubeflow on AWS deployment - This post discusses Kubeflow on AWS v1.6.1 features and highlights three important integrations that are bundled on one platform to offer:

    • An Infrastructure as Code (IaC) one-click solution that automates the end-to-end installation of Kubeflow, including EKS cluster creation.
    • Support for running machine learning workloads on Amazon SageMaker from Kubeflow.
    • Enhanced monitoring and observability for ML workloads using Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana.
  2. Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on AWS. This blog post is based on Kubeflow v1.4 and is listed here for educational purposes. The AWS distribution of Kubeflow may have changed since then, so we recommend installing the latest version.

  3. Find out how Athenahealth uses Kubeflow on AWS to build and streamline an end-to-end data science workflow that preserves essential tooling, optimizes operational efficiency, increases data scientist productivity, and sets the stage for extending their ML capabilities more easily.

  4. Build a hybrid distributed training architecture using Kubeflow on AWS and Amazon SageMaker. This post further illustrates how you can use open-source libraries in your deep learning training script and still keep it compatible with both Kubernetes and SageMaker in a platform-agnostic way.

  5. Use Amazon SageMaker Operators for Kubernetes (ACK) to train and deploy machine learning models.

Watch

This section curates a list of recorded virtual workshops, demos, and general presentations illustrating how to leverage Kubeflow on AWS and SageMaker to build, run, and monitor scalable ML workflows on Kubernetes.

  1. Train and deploy your deep learning models with AWS distribution of Kubeflow integrated with Amazon SageMaker - This workshop covers the Kubeflow on AWS v1.6 architecture and uses the one-click solution now provided in the AWS distribution of Kubeflow to automate end-to-end deployment of Amazon EKS and Kubeflow. It then uses this automated setup to demonstrate how Amazon SageMaker can be used from Kubeflow Pipelines through the latest SageMaker components, which now support SageMaker ACK Operators, to run PyTorch-based distributed model training.

  2. AWS Virtual Workshop “Distributed Training using PyTorch with Kubeflow on AWS and AWS DLC” - Demonstrates how integrating Kubeflow on AWS with AWS Deep Learning Containers and Amazon Elastic File System (Amazon EFS) lets you build and train PyTorch-based deep learning models on both Amazon SageMaker and Amazon Elastic Kubernetes Service (Amazon EKS) with flexibility and scale in a hybrid architecture. Attended by over 50 customers. August 23, 2022

  3. re:Inforce Session “Hybrid distributed training architecture using Kubeflow and SageMaker” - Attended by over 900 customers. July 27, 2022

  4. Live at re:Mars Session “Deploy and scale your ML workflow on Kubernetes with Kubeflow on AWS”. You will be prompted to log in to LinkedIn. July 2022

  5. re:Mars Session “AWS Distribution of Kubeflow supporting Kubeflow v1.4” - Kubeflow on AWS value proposition and demo. July 8, 2022.

Last modified September 1, 2023: v1.7.0-aws-b1.0.3 website changes (#791) (7faf1a5)