
Envoy AI Gateway

Organizations deploying AI applications face a fundamental challenge: no single model serves all needs. Developers may choose Claude for long-context analysis, OpenAI for reasoning tasks, and DeepSeek for cost-sensitive workloads. The problem is that each model provider uses different APIs. Without centralized control, teams can't easily switch providers, get visibility into utilization, or enforce quotas.

Envoy AI Gateway is an open source project that solves this challenge by providing a single, scalable OpenAI-compatible endpoint that routes to multiple supported LLM providers. It gives Platform teams cost controls and observability, while developers never touch provider-specific SDKs.

Key objectives of Envoy AI Gateway

  • Provide a unified layer for routing and managing LLM/AI traffic
  • Support automatic failover mechanisms to ensure service reliability
  • Ensure end-to-end security, including upstream authorization for LLM/AI traffic
  • Implement a policy framework to support usage limiting use cases
  • Foster an open-source community to address GenAI-specific routing and quality of service needs

Envoy Gateway Fundamentals

If you're already familiar with Envoy Gateway, you can skip this section.

Because Envoy AI Gateway builds on the standard Kubernetes Gateway API and Envoy Gateway extensions, it's worth familiarizing yourself with the underlying Envoy Gateway primitives:

  • GatewayClass - Defines which controller manages the Gateway. Envoy AI Gateway uses the same GatewayClass as Envoy Gateway.
  • Gateway - The entry point for traffic. A Gateway resource defines listeners (HTTP/HTTPS ports). When you create a Gateway, Envoy Gateway deploys the actual Envoy proxy pods and a corresponding Kubernetes Service (typically a LoadBalancer). This is comparable to a Network Load Balancer (although technically you'd still attach an NLB in front of Envoy Gateway to accept traffic from outside the Kubernetes cluster).
  • HTTPRoute - Rules for routing HTTP traffic based on hostnames, paths, or headers. Conceptually, this is similar to Ingress rules or ALB listener rules.
  • Backend - A Kubernetes Service or an external endpoint.
  • BackendTrafficPolicy - Configures connection behavior like timeouts, retries, and rate limiting of an HTTPRoute.
  • ClientTrafficPolicy - Configures how the Envoy proxy server behaves with downstream clients.
  • EnvoyExtensionPolicy - A way to extend Envoy's traffic processing capabilities.
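
For orientation, a minimal Gateway and HTTPRoute pair looks roughly like the sketch below. The resource names are illustrative and the listener port is an assumption; the blueprint's gateway.yaml is the authoritative configuration.

kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: envoy-gateway      # the GatewayClass managed by Envoy Gateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
    - name: example-gateway            # attach the route to the Gateway above
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: example-service        # a plain Kubernetes Service
          port: 80
EOF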

Envoy AI Gateway introduces the following CRDs:

  • AIGatewayRoute - Defines unified API and routing rules for AI traffic
  • AIServiceBackend - Represents individual AI service backends like Bedrock
  • BackendSecurityPolicy - Configures authentication for backend access
  • BackendTLSPolicy - Defines TLS parameters for backend connections
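
As a rough illustration of how these CRDs fit together, the sketch below shows an AIServiceBackend wrapping an Envoy Gateway Backend, and an AIGatewayRoute rule that matches on the x-ai-eg-model header. Field names follow the aigateway.envoyproxy.io/v1alpha1 API and may differ between releases; the blueprint's model-backends.yaml and multi-model-routing/ai-gateway-route.yaml are the authoritative versions.

kubectl apply -f - <<'EOF'
# An AIServiceBackend declares which API schema a backend speaks and wraps a Backend resource.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: qwen3
spec:
  schema:
    name: OpenAI                       # this backend speaks the OpenAI API
  backendRef:
    name: qwen3-backend                # the Backend pointing at the qwen3 Service
    kind: Backend
    group: gateway.envoyproxy.io
---
# An AIGatewayRoute attaches to the Gateway and routes on the x-ai-eg-model header.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: example-ai-route
spec:
  schema:
    name: OpenAI                       # clients talk to the gateway with the OpenAI schema
  targetRefs:
    - name: ai-gateway                 # the Gateway created by gateway.yaml
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: Qwen/Qwen3-1.7B
      backendRefs:
        - name: qwen3                  # the AIServiceBackend defined above
EOF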

[Architecture diagram: envoy.png]

This blueprint deploys Envoy AI Gateway on Amazon EKS and supports two use cases:

  • Multi-model routing
  • Rate limiting

Deploying the Solution


Deploying self-hosted models

Envoy AI Gateway currently supports self-hosted models that speak the OpenAI API schema. We will deploy two models using the AI on EKS Inference Charts.

1. Create Hugging Face Token Secret

Create a Kubernetes secret with your Hugging Face token:

kubectl create secret generic hf-token --from-literal=token=your_huggingface_token

2. Deploy Pre-configured Models

Choose from the available pre-configured models and deploy:

warning

These deployments require GPU/Neuron resources, which must be enabled in your cluster and cost more than CPU-only instances.

# Add helm chart repository
helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts/
helm repo update
# Model 1: Deploy qwen3 model
helm install qwen3-1.7b ai-on-eks/inference-charts -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/refs/heads/main/charts/inference-charts/values-qwen3-1.7b-vllm.yaml \
--set nameOverride=qwen3 \
--set fullnameOverride=qwen3 \
--set inference.serviceName=qwen3

# Model 2: Deploy gpt oss model
helm install gpt-oss ai-on-eks/inference-charts -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/refs/heads/main/charts/inference-charts/values-gpt-oss-20b-vllm.yaml \
--set nameOverride=gpt-oss \
--set fullnameOverride=gpt-oss \
--set inference.serviceName=gpt-oss

Verify that the models are up and running:

kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
gpt-oss-67c5bcdd5c-lx9vh   1/1     Running   0          44h
qwen3-b5fdf6bd5-cxkwd      1/1     Running   0          44h
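
Optionally, you can query a model Service directly before putting the gateway in front of it. This is a sketch: the Service name comes from inference.serviceName above, while port 8000 is assumed (the common vLLM default), so confirm it with kubectl get svc qwen3 first.

# Port-forward the qwen3 Service locally (port 8000 is an assumption; verify with kubectl get svc qwen3)
kubectl port-forward svc/qwen3 8000:8000 &

# Ask the vLLM server which models it serves
curl -s http://localhost:8000/v1/models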

Enabling access to AWS Bedrock models

Access to all Amazon Bedrock foundation models is enabled by default with the correct AWS Marketplace permissions. For more details on managing access to models in Amazon Bedrock, review Access Amazon Bedrock foundation models.

Review Supported foundation models in Amazon Bedrock for a list of foundation models available via Amazon Bedrock and relevant model information including Model ID.
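
As a quick sanity check that the model used later in this blueprint is visible to your account and region, you can list the available Anthropic foundation models with the AWS CLI (us-west-2 matches the blueprint's default region):

# List Anthropic foundation model IDs available in us-west-2
aws bedrock list-foundation-models \
  --region us-west-2 \
  --by-provider anthropic \
  --query 'modelSummaries[].modelId'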

Multi-model routing

Route requests to different AI models based on the x-ai-eg-model header. This header lets Envoy AI Gateway identify the matching route configured in the gateway and forward client traffic to the relevant backend Kubernetes Service. In this case, that is a Service exposing either a self-hosted model or an Amazon Bedrock model.

Deploy common gateway infrastructure

cd ../../blueprints/gateways/envoy-ai-gateway
kubectl apply -f gateway.yaml
serviceaccount/ai-gateway-dataplane-aws created
gatewayclass.gateway.networking.k8s.io/envoy-gateway created
envoyproxy.gateway.envoyproxy.io/ai-gateway created
gateway.gateway.networking.k8s.io/ai-gateway created
clienttrafficpolicy.gateway.envoyproxy.io/ai-gateway-buffer-limit created
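
The Gateway resource causes Envoy Gateway to provision the Envoy proxy pods and a LoadBalancer Service. Once provisioning completes, the gateway's external address can be read from the Gateway status:

# ELB hostname assigned to the ai-gateway Gateway (may take a minute to appear)
kubectl get gateway ai-gateway -o jsonpath='{.status.addresses[0].value}'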

Configure model backends

kubectl apply -f model-backends.yaml
backend.gateway.envoyproxy.io/gpt-oss-backend created
aiservicebackend.aigateway.envoyproxy.io/gpt-oss created
backend.gateway.envoyproxy.io/qwen3-backend created
aiservicebackend.aigateway.envoyproxy.io/qwen3 created
backend.gateway.envoyproxy.io/bedrock-backend created
aiservicebackend.aigateway.envoyproxy.io/bedrock created
backendsecuritypolicy.aigateway.envoyproxy.io/bedrock-policy created
backendtlspolicy.gateway.networking.k8s.io/bedrock-tls created

Configure model routes

kubectl apply -f multi-model-routing/ai-gateway-route.yaml
aigatewayroute.aigateway.envoyproxy.io/multi-model-route created

Test

python3 multi-model-routing/client.py

Expected Output:

🚀 AI Gateway Multi-Model Routing Test
============================================================
Gateway URL: http://k8s-envoygat-envoydef-xxxxxxxxxx-xxxxxxxxxxxxxxxx.elb.us-west-2.amazonaws.com

=== Testing Qwen3 1.7B ===
Status Code: 200
✅ SUCCESS: Qwen3 - [response content]

=== Testing Self-hosted GPT ===
Status Code: 200
✅ SUCCESS: GPT - [response content]

=== Testing Bedrock Claude ===
Status Code: 200
✅ SUCCESS: Bedrock Claude - [response content]

🎯 Final Results:
• Qwen3 1.7B: ✅ PASS
• GPT OSS 20B: ✅ PASS
• Bedrock Claude: ✅ PASS

📊 Summary: 3/3 models working
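
You can also exercise the routes by hand with curl. The same OpenAI-style request is steered to a different backend purely by changing the model value, which the gateway uses to populate the x-ai-eg-model header it routes on. This is a sketch: GATEWAY_URL is the address read from the Gateway status earlier, and the prompt is illustrative.

GATEWAY_URL=$(kubectl get gateway ai-gateway -o jsonpath='{.status.addresses[0].value}')

# Self-hosted qwen3 via the OpenAI-compatible route
curl -s "http://${GATEWAY_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-1.7B", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'

# Same request shape, routed to Amazon Bedrock by switching the model name
curl -s "http://${GATEWAY_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic.claude-3-haiku-20240307-v1:0", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'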

Rate limiting

Token-based rate limiting with automatic tracking for AI workloads.

Features:

  • Token-based rate limiting (input, output, and total tokens)
  • User-based rate limiting using x-user-id header
  • Redis backend for distributed rate limiting (automatically deployed)
  • Configurable limits per user per time window
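
Under the hood, this is expressed as an Envoy Gateway BackendTrafficPolicy whose rate-limit rule charges each request a cost taken from the token-usage metadata the AI gateway emits. The sketch below is abbreviated and hedged: the limit values are illustrative, and the metadata namespace and key are assumptions that must match the llmRequestCosts configured on the route; the files applied in the next step are the authoritative versions.

kubectl apply -f - <<'EOF'
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-token-limit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id      # one rate-limit bucket per distinct user id
                  type: Distinct
          limit:
            requests: 1000             # illustrative: 1000 tokens per user per hour
            unit: Hour
          cost:
            request:
              from: Number
              number: 0                # don't charge on the request itself...
            response:
              from: Metadata           # ...charge the token count reported after the response
              metadata:
                namespace: io.envoy.ai_gateway   # assumption: default AI gateway metadata namespace
                key: llm_total_token             # assumption: key defined via llmRequestCosts on the route
EOF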

Configure rate limiting

kubectl apply -f rate-limiting/ai-gateway-route.yaml
kubectl apply -f rate-limiting/ai-gateway-rate-limit.yaml
kubectl apply -f rate-limiting/backend-traffic-policy.yaml

Test rate limiting

python3 rate-limiting/client.py
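
Alternatively, a quick manual check is to send repeated requests under a single x-user-id and watch the status codes flip from 200 to 429 once that user's token budget is exhausted (a sketch; GATEWAY_URL as before, and the model name matches the self-hosted qwen3 route):

GATEWAY_URL=$(kubectl get gateway ai-gateway -o jsonpath='{.status.addresses[0].value}')

# Repeated requests for the same user; expect 200s until the per-user token limit is hit, then 429s
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "request ${i}: HTTP %{http_code}\n" \
    "http://${GATEWAY_URL}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "x-user-id: demo-user" \
    -d '{"model": "Qwen/Qwen3-1.7B", "messages": [{"role": "user", "content": "Count to ten."}]}'
done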

Configuration Details

Routing Configuration

The AI Gateway routes requests based on the x-ai-eg-model header:

Header Value                             Backend   Endpoint                  Model Type
Qwen/Qwen3-1.7B                          qwen3     /v1/chat/completions      Self-hosted
openai/gpt-oss-20b                       gpt-oss   /v1/chat/completions      Self-hosted
anthropic.claude-3-haiku-20240307-v1:0   bedrock   /anthropic/v1/messages    AWS Bedrock

Bedrock Integration Details

  • Authentication: Pod Identity (automatically configured via installation script)
  • Schema: AWSAnthropic for native Bedrock support
  • Endpoint: /anthropic/v1/messages (Anthropic Messages API format)
  • Region: Configurable in backend-security-policy.yaml (default: us-west-2)


Important Notes

  • Multi-Model Routing: Requires deployed AI model services and AWS Bedrock access
  • Rate Limiting: Requires actual AI models that return real token usage data and Redis for storage
  • Bedrock Integration: Requires AWS Bedrock API access, proper IAM setup, and Pod Identity configuration
  • Authentication: Pod Identity for Bedrock is automatically configured when deploying via the installation script

These are working configuration examples that demonstrate AI Gateway capabilities with real AI model deployments and AWS Bedrock integration.