Observability

Observability for AI/ML workloads requires a holistic view of multiple hardware and software components alongside multiple sources of data such as logs, metrics, and traces. Piecing these components together is challenging and time-consuming; therefore, we leverage the AI/ML observability stack available on GitHub to bootstrap this environment.

Architecture

[Architecture diagram]

What's Included

  • Prometheus
  • OpenSearch
  • FluentBit
  • Kube State Metrics
  • Metrics Server
  • Alertmanager
  • Grafana
  • Pod/Service monitors for AI/ML workloads
  • AI/ML Dashboards

Why

Understanding the performance of AI/ML workloads is challenging: Is the GPU getting data fast enough? Is the CPU the bottleneck? Is the storage fast enough? These questions are hard to answer in isolation. The more of the picture you can see, the easier it is to identify performance bottlenecks.

How

The JARK infrastructure already comes with this architecture enabled by default. If you would like to add it to your own infrastructure, ensure that the following two variables are set to true in blueprint.tfvars:

enable_argocd                    = true
enable_ai_ml_observability_stack = true

The first variable deploys ArgoCD, which is used to deploy the observability architecture; the second deploys the architecture itself.
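As a sketch, assuming a standard Terraform workflow for the blueprint (your setup may use a wrapper script instead):

# Hypothetical workflow: re-apply the blueprint after enabling the two flags
terraform init
terraform apply -var-file=blueprint.tfvars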

Usage

The architecture is deployed entirely into the monitoring namespace. To access Grafana, port-forward the Grafana service as shown below, then open http://localhost:3000 and log in with username admin and password prom-operator. Refer to the security section in the README to see how to change the username/password.
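For example:

# Forward the Grafana service to localhost:3000
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80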

Training

Ray training job logs and metrics will be automatically collected by the Observability architecture and can be found in the training dashboard.

Example

A full example of this can be found in the AI/ML observability repo. We will also be updating the Blueprints here to make use of this architecture.

Inference

Ray inference metrics should be automatically picked up by the observability infrastructure and can be found in the inference dashboard. To instrument your inference workloads for logging, you will need to add a few items:

FluentBit Config

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-config
  namespace: default
data:
  fluent-bit.conf: |-
    [INPUT]
        Name tail
        Path /tmp/ray/session_latest/logs/*
        Tag ray
        Path_Key true
        Refresh_Interval 5
    [FILTER]
        Name modify
        Match ray
        Add POD_LABELS ${POD_LABELS}
    [OUTPUT]
        Name stdout
        Format json

Deploy this into the namespace in which you intend to run your inference workload. You only need one per namespace; it tells the FluentBit sidecar how to output the logs.
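For example, assuming the ConfigMap above is saved as fluentbit-config.yaml (the filename is illustrative):

# Apply the ConfigMap into the namespace running the inference workload
kubectl apply -n default -f fluentbit-config.yaml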

FluentBit Sidecar

We will need to add a sidecar to the Ray inference service so that FluentBit can write the logs to STDOUT:

- name: fluentbit
  image: fluent/fluent-bit:3.2.2
  env:
    - name: POD_LABELS
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['ray.io/cluster']
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 128Mi
  volumeMounts:
    - mountPath: /tmp/ray
      name: ray-logs
    - mountPath: /fluent-bit/etc/fluent-bit.conf
      subPath: fluent-bit.conf
      name: fluentbit-config

Add this section to the containers list in your workerGroupSpecs, as sketched below.
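A minimal placement sketch (field values such as the group name and Ray image are illustrative):

workerGroupSpecs:
  - groupName: worker-group          # illustrative name
    template:
      spec:
        containers:
          - name: ray-worker         # your existing Ray container
            image: rayproject/ray:latest
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs       # shared with the sidecar
          - name: fluentbit          # the sidecar from above
            image: fluent/fluent-bit:3.2.2
            # env/resources/volumeMounts as shown in the previous block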

FluentBit Volume

Finally, we need to add the ConfigMap volume to our volumes section:

- name: fluentbit-config
  configMap:
    name: fluentbit-config
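The sidecar also mounts a ray-logs volume; in a typical KubeRay setup this is an emptyDir shared with the Ray container. A sketch of the full volumes section, assuming that convention:

volumes:
  - name: ray-logs            # shared log volume, assumed to be an emptyDir
    emptyDir: {}
  - name: fluentbit-config    # the ConfigMap defined earlier
    configMap:
      name: fluentbit-config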

vLLM Metrics

vLLM also outputs useful metrics like Time to First Token, throughput, latencies, cache utilization, and more. To access these metrics, we need to add a route to our pod for the metrics path:

# Imports
import re
from prometheus_client import make_asgi_app
from fastapi import FastAPI
from starlette.routing import Mount

app = FastAPI()

class Deployment:
    def __init__(self, **kwargs):
        ...
        # Mount the Prometheus client app on /metrics
        route = Mount("/metrics", make_asgi_app())
        # Workaround for 307 Redirect for /metrics
        route.path_regex = re.compile('^/metrics(?P<path>.*)$')
        app.routes.append(route)

This will allow the deployed monitor to collect the vLLM metrics and display them in the inference dashboard.
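To sanity-check locally, you can port-forward the serving pod and fetch the endpoint (the pod name and port here are illustrative):

# Hypothetical check: forward the Ray Serve HTTP port and read the metrics
kubectl port-forward pod/<your-serve-pod> 8000:8000
curl http://localhost:8000/metrics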

Example

A full example of this can be found in the AI/ML observability repo. We will also be updating the Blueprints here to make use of this architecture.