Observability

Observability for AI/ML workloads requires a holistic view of multiple hardware and software components alongside multiple sources of data such as logs, metrics, and traces. Piecing these components together is challenging and time-consuming; therefore, we leverage the AI/ML observability stack available on GitHub to bootstrap this environment.

Architecture

[Architecture diagram]

What's Included

  • Prometheus
  • OpenSearch
  • FluentBit
  • Kube State Metrics
  • Metrics Server
  • Alertmanager
  • Grafana
  • Pod/Service monitors for AI/ML workloads
  • AI/ML Dashboards

Why

Understanding the performance of AI/ML workloads is challenging: Is the GPU getting data fast enough? Is the CPU the bottleneck? Is the storage fast enough? These questions are hard to answer in isolation. The more of the picture you can see, the easier it is to identify performance bottlenecks.

How

The JARK infrastructure already comes with this architecture enabled by default. If you would like to add it to your own infrastructure, ensure that the following two variables are set to true in blueprint.tfvars:

enable_argocd                    = true
enable_ai_ml_observability_stack = true

The first variable deploys ArgoCD, which is used to deploy the observability architecture; the second deploys the architecture itself.
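As a sketch, assuming a standard Terraform workflow for the blueprint (your setup may use a wrapper script instead):

# Hypothetical workflow: re-apply the blueprint after enabling the two flags
terraform init
terraform apply -var-file=blueprint.tfvars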

Usage

The architecture is deployed entirely into the monitoring namespace. To access Grafana, port-forward the Grafana service as shown below, then open http://localhost:3000 and log in with username admin and password prom-operator. Refer to the security section in the README to see how to change the username/password.
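For example:

# Forward the Grafana service to localhost:3000
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80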

Training

Ray training job logs and metrics will be automatically collected by the Observability architecture and can be found in the training dashboard.

Example

A full example of this can be found in the AI/ML observability repo. We will also be updating the Blueprints here to make use of this architecture.

Inference

Ray inference metrics should be automatically picked up by the observability infrastructure and can be found in the inference dashboard. To instrument your inference workloads for logging, you will need to add a few items:

FluentBit Config

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-config
  namespace: default
data:
  fluent-bit.conf: |-
    [INPUT]
        Name tail
        Path /tmp/ray/session_latest/logs/*
        Tag ray
        Path_Key true
        Refresh_Interval 5
    [FILTER]
        Name modify
        Match ray
        Add POD_LABELS ${POD_LABELS}
    [OUTPUT]
        Name stdout
        Format json

Deploy this into the namespace in which you intend to run your inference workload. You only need one per namespace; it tells the FluentBit sidecar how to output the logs.
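For example, assuming the ConfigMap above is saved as fluentbit-config.yaml (the filename is illustrative):

# Apply the ConfigMap into the namespace running the inference workload
kubectl apply -n default -f fluentbit-config.yaml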

FluentBit Sidecar

We will need to add a sidecar to the Ray inference service so that FluentBit can write the logs to STDOUT:

- name: fluentbit
  image: fluent/fluent-bit:3.2.2
  env:
    - name: POD_LABELS
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['ray.io/cluster']
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 128Mi
  volumeMounts:
    - mountPath: /tmp/ray
      name: ray-logs
    - mountPath: /fluent-bit/etc/fluent-bit.conf
      subPath: fluent-bit.conf
      name: fluentbit-config

Add this section to the containers list in your workerGroupSpecs, as sketched below.
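A minimal placement sketch (field values such as the group name and Ray image are illustrative):

workerGroupSpecs:
  - groupName: worker-group          # illustrative name
    template:
      spec:
        containers:
          - name: ray-worker         # your existing Ray container
            image: rayproject/ray:latest
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs       # shared with the sidecar
          - name: fluentbit          # the sidecar from above
            image: fluent/fluent-bit:3.2.2
            # env/resources/volumeMounts as shown in the previous block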

FluentBit Volume

Finally, we need to add the ConfigMap volume to our volumes section:

- name: fluentbit-config
  configMap:
    name: fluentbit-config
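The sidecar also mounts a ray-logs volume; in a typical KubeRay setup this is an emptyDir shared with the Ray container. A sketch of the full volumes section, assuming that convention:

volumes:
  - name: ray-logs            # shared log volume, assumed to be an emptyDir
    emptyDir: {}
  - name: fluentbit-config    # the ConfigMap defined earlier
    configMap:
      name: fluentbit-config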

vLLM Metrics

vLLM also outputs useful metrics like Time to First Token, throughput, latencies, cache utilization, and more. To access these metrics, we need to add a route to our pod for the metrics path:

# Imports
import re
from prometheus_client import make_asgi_app
from fastapi import FastAPI
from starlette.routing import Mount

app = FastAPI()

class Deployment:
    def __init__(self, **kwargs):
        ...
        # Mount the Prometheus client app on /metrics
        route = Mount("/metrics", make_asgi_app())
        # Workaround for 307 Redirect for /metrics
        route.path_regex = re.compile('^/metrics(?P<path>.*)$')
        app.routes.append(route)

This will allow the deployed monitor to collect the vLLM metrics and display them in the inference dashboard.
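To sanity-check locally, you can port-forward the serving pod and fetch the endpoint (the pod name and port here are illustrative):

# Hypothetical check: forward the Ray Serve HTTP port and read the metrics
kubectl port-forward pod/<your-serve-pod> 8000:8000
curl http://localhost:8000/metrics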

Example

A full example of this can be found in the AI/ML observability repo. We will also be updating the Blueprints here to make use of this architecture.