Replication Cluster — Verification and Testing

The default deployment is a Valkey replication topology: one primary that accepts writes, and N replicas that asynchronously replicate the primary's keyspace and serve reads. The chart ships four Services: three for client traffic and one for metrics. Pick the right client Service for read versus write paths.

Endpoints

| Service | Type | Selector | Use for |
| --- | --- | --- | --- |
| valkey | ClusterIP | primary only | Writes |
| valkey-read | ClusterIP | all 4 pods (primary + 3 replicas) | Reads (load-balanced) |
| valkey-headless | ClusterIP None | all pods, per-pod DNS | Cluster-internal replication, application clients that need stable per-pod addressing |
| valkey-metrics | ClusterIP | all pods | Prometheus scrape on :9121 |

The chart wires the Service-to-pod selector via the app.kubernetes.io/component=master|replica label, so the valkey Service follows the primary even after a manual failover (REPLICAOF NO ONE flips the label).

Verification

1. Pods are 2/2 Running

kubectl -n valkey get pods -o wide

Each pod should report 2/2 Running. The two containers are valkey (the data plane) and metrics (the oliver006/redis_exporter sidecar).

2. Replication health from the primary

PASS=$(kubectl -n valkey get secret valkey-auth -o jsonpath='{.data.default}' | base64 -d)

kubectl -n valkey exec valkey-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning INFO replication

Expected output:

# Replication
role:master
connected_slaves:3
min_slaves_good_slaves:3
slave0:ip=valkey-1.valkey-headless.valkey.svc.cluster.local,port=6379,state=online,offset=...,lag=0,type=replica
slave1:ip=valkey-2.valkey-headless.valkey.svc.cluster.local,port=6379,state=online,offset=...,lag=0,type=replica
slave2:ip=valkey-3.valkey-headless.valkey.svc.cluster.local,port=6379,state=online,offset=...,lag=0,type=replica
master_failover_state:no-failover
master_replid:<40-char hex>

Three properties to assert:

  • role:master on valkey-0.
  • connected_slaves matches replica.replicas from helm-values/valkey.yaml.
  • Every slave reports state=online with lag no greater than replica.minReplicasMaxLag (default 10s).
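
The three assertions can be scripted rather than checked by eye. The following is a minimal sketch, not something the chart ships: check_replication is a hypothetical helper that reads INFO replication output on stdin and takes the expected replica count and the maximum tolerated lag (in seconds) as arguments. It assumes the slaveN line format shown in the expected output above.

```shell
# check_replication EXPECTED_REPLICAS MAX_LAG_SECONDS
# Hypothetical helper: reads `INFO replication` output on stdin and checks
# role, connected_slaves, and per-replica state/lag.
#
# Usage (assumes $PASS is set as in step 2):
#   kubectl -n valkey exec valkey-0 -c valkey -- \
#     valkey-cli -a "$PASS" --no-auth-warning INFO replication | check_replication 3 10
check_replication() {
  expected="$1"
  max_lag="$2"
  # valkey-cli output carries CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -v expected="$expected" -v max_lag="$max_lag" '
    /^role:/             { role = substr($0, 6) }
    /^connected_slaves:/ { connected = substr($0, 18) + 0 }
    /^slave[0-9]+:/ {
      n = split($0, fields, ",")
      state = ""; lag = -1
      for (i = 1; i <= n; i++) {
        if (fields[i] ~ /^state=/) state = substr(fields[i], 7)
        if (fields[i] ~ /^lag=/)   lag = substr(fields[i], 5) + 0
      }
      if (state != "online" || lag < 0 || lag > max_lag) bad++
    }
    END {
      if (role != "master")      { print "FAIL: role is " role; exit 1 }
      if (connected != expected) { print "FAIL: connected_slaves=" connected; exit 1 }
      if (bad > 0)               { print "FAIL: " bad " replica(s) offline or lagging"; exit 1 }
      print "OK"
    }'
}
```

The function prints OK and returns 0 when all three properties hold; otherwise it prints the first failed property and returns non-zero, which makes it usable in a CI step.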

3. AZ spread

kubectl -n valkey get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' | \
while read p n; do
  az=$(kubectl get node "$n" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "$p $az"
done

With replica.replicas=3 (4 pods total) on a 3-AZ cluster, expect three pods across distinct AZs and the fourth in the AZ with the most spare capacity. The chart's topologySpreadConstraints use whenUnsatisfiable: DoNotSchedule, so any pod failing to find an AZ stays Pending rather than colocating.
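
To turn the pod/AZ listing into a quick pass/fail signal, the pods per zone can be counted and the difference between the fullest and emptiest zone reported. This is a sketch under assumptions: az_skew is a hypothetical helper, it expects "pod zone" pairs on stdin (the format the loop above prints), and it only sees zones that host at least one pod, so a completely empty AZ does not show up in the skew.

```shell
# az_skew: hypothetical helper. Reads "pod zone" lines on stdin, prints the
# number of pods per zone and the skew (max pods per zone minus min).
az_skew() {
  awk '
    NF >= 2 { count[$2]++ }
    END {
      min = -1; max = 0
      for (z in count) {
        printf "%s %d\n", z, count[z]
        if (count[z] > max) max = count[z]
        if (min < 0 || count[z] < min) min = count[z]
      }
      if (min < 0) min = 0   # no input at all
      printf "skew %d\n", max - min
    }' | LC_ALL=C sort
}
```

For the 4-pod, 3-AZ layout described above, the expected result is one zone with 2 pods, two zones with 1 pod each, and a skew of 1.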

4. ServiceMonitor is registered

kubectl -n monitoring get servicemonitor valkey -o yaml | head -30

kube-prometheus-stack discovers any ServiceMonitor whose namespaceSelector matches its release namespace. Confirm Prometheus is scraping :9121:

kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
sleep 2  # give the port-forward a moment to bind before querying
curl -s 'http://localhost:9090/api/v1/targets?state=active' | \
  jq '.data.activeTargets[] | select(.labels.job=="valkey") | .health'
# "up" (× number of pods)

Smoke Test

Apply the bundled smoke-test Job, which writes against the primary service and reads back from the load-balanced read service:

kubectl apply -f data-stacks/valkey-on-eks/examples/valkey-replication-smoke.yaml
kubectl -n valkey logs -f job/valkey-smoke

Expected log output:

=== Valkey replication smoke test ===
Write target: valkey.valkey.svc.cluster.local:6379
Read target: valkey-read.valkey.svc.cluster.local:6379

--- PING primary and read service ---
primary: PONG
read : PONG

--- WRITE three keys to the primary ---
SET user:1:profile -> OK
SET user:42:profile -> OK
SET user:9999:profile -> OK

--- READ keys back from valkey-read (with brief retry for lag) ---
GET user:1:profile -> value-for-user:1:profile
GET user:42:profile -> value-for-user:42:profile
GET user:9999:profile -> value-for-user:9999:profile

--- Replication health (from primary) ---
role:master
connected_slaves:3
min_slaves_good_slaves:3
slave0:ip=...,state=online,...
slave1:ip=...,state=online,...
slave2:ip=...,state=online,...

smoke test PASSED

The Job retries each read up to five times (1-second backoff) to absorb transient replication lag. If any GET still misses after the retries, the Job exits non-zero, which signals a real replication problem worth investigating.
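
The retry behaviour can be reproduced in your own scripts with a small loop. The sketch below is not the bundled Job's source; retry_read is a hypothetical helper, and it assumes the wrapped command prints nothing on a miss (which is how valkey-cli is assumed to render a nil reply when its output is not a terminal).

```shell
# retry_read ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it prints a non-empty value or the
# attempts are exhausted. Prints the value and returns 0 on success, 1 otherwise.
#
# Usage (assumes $PASS is set):
#   retry_read 5 1 valkey-cli -h valkey-read.valkey.svc.cluster.local \
#     -a "$PASS" --no-auth-warning GET user:1:profile
retry_read() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    value=$("$@") || value=""
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
    # Back off only when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A fixed delay keeps the sketch simple; with a 1-second delay and five attempts, a read is given up to roughly four seconds to appear on the replica.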

The Job is idempotent; the keys are overwritten on each run. It auto-cleans 1 hour after completion (ttlSecondsAfterFinished: 3600).

Smoke test from your application's namespace

The bundled Job runs in the valkey namespace. To test from your application's namespace, copy the manifest and change metadata.namespace plus the AUTH secretKeyRef to point at a same-namespace copy of the secret. The DNS names (valkey.valkey.svc.cluster.local, valkey-read.valkey.svc.cluster.local) are valid from any namespace as long as your NetworkPolicy allows egress on TCP 6379 to the valkey namespace.

Read/Write Split for Application Clients

The two-Service pattern is intentional. Configure your client library:

| Client library | Configuration |
| --- | --- |
| redis-py / redis-py-cluster | Redis(host="valkey.valkey.svc.cluster.local", ...) for writes; second Redis(host="valkey-read...", ...) for reads. |
| ioredis | Use Redis (single endpoint); set enableReadyCheck: false and route reads through a separate connection pool to valkey-read. |
| Lettuce (Java) | RedisStaticMasterReplicaClient-style connections; primary URI to valkey, replica URIs to valkey-read, set ReadFrom.REPLICA_PREFERRED. |
| Jedis | JedisPooled for primary; JedisPooled for replicas via valkey-read. Multi-pool routing is application-side. |
| go-redis (v9+) | Two redis.NewClient instances: one pointed at valkey for writes, one at valkey-read for reads. (redis.NewFailoverClient expects Sentinel, which this deployment does not run.) |
| valkey-glide | Native primary/replica routing: point at valkey-headless and let the client handle role discovery via REPLICAOF introspection. |

The valkey-read Service is a kube-proxy round-robin across all four pods. The primary is included because it serves reads just as the replicas do. To exclude the primary from reads, switch your client to valkey-headless and filter by the role field of INFO replication; most cluster-aware libraries do this automatically.
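
For clients or scripts that have to do the role filtering themselves, the logic is small. The sketch below uses two hypothetical helpers: info_role extracts the role from INFO replication output, and replica_hosts keeps only the hosts whose role is slave (the value a replica reports in that field). The per-pod host names in the usage comment follow the valkey-headless pattern shown earlier.

```shell
# info_role: hypothetical helper. Prints the role (master or slave) found in
# `INFO replication` output read from stdin.
info_role() {
  tr -d '\r' | sed -n 's/^role://p'
}

# replica_hosts: hypothetical helper. Reads "host role" pairs on stdin and
# prints only the hosts whose role is slave.
replica_hosts() {
  awk '$2 == "slave" { print $1 }'
}

# Usage (assumes $PASS is set and valkey-cli can reach the per-pod DNS names):
#   for ord in 0 1 2 3; do
#     host="valkey-$ord.valkey-headless.valkey.svc.cluster.local"
#     role=$(valkey-cli -h "$host" -a "$PASS" --no-auth-warning INFO replication | info_role)
#     echo "$host $role"
#   done | replica_hosts
```

The resulting host list has to be refreshed after every failover, because the pod that reports role master changes.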

Manual Primary Failover

To promote a replica to primary (e.g., before restarting valkey-0 for an upgrade):

PASS=$(kubectl -n valkey get secret valkey-auth -o jsonpath='{.data.default}' | base64 -d)

# Promote valkey-1 to primary
kubectl -n valkey exec valkey-1 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning REPLICAOF NO ONE

# Re-point remaining replicas at the new primary
for ord in 2 3; do
  kubectl -n valkey exec valkey-${ord} -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning REPLICAOF \
    valkey-1.valkey-headless.valkey.svc.cluster.local 6379
done

# Verify
kubectl -n valkey exec valkey-1 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning INFO replication | grep -E '^(role|connected_slaves):'
# role:master
# connected_slaves:2 (will become 3 once valkey-0 catches up)

The valkey Service follows the primary via the chart's component-label selector, so client traffic re-routes automatically once the replica's app.kubernetes.io/component label flips to master. The chart writes a 0-second TTL valkey.conf reload on REPLICAOF NO ONE — no pod restart required.

To restore the original layout (valkey-0 as primary), run the same REPLICAOF sequence in reverse.

Warning

Replication is asynchronous, so writes accepted by the old primary in the seconds before the failover may not have reached the new primary. For a planned failover, pause writes for ~5 seconds before issuing REPLICAOF NO ONE; min_slaves_good_slaves provides only a partial guard. For an unplanned failover (primary pod crash), the data loss window is bounded by replica.minReplicasMaxLag.
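
For a planned failover, the pause can be backed by a check that the replicas have actually caught up. The sketch below is an assumption-laden illustration, not part of the chart: replicas_caught_up is a hypothetical helper that reads the primary's INFO replication output and compares each replica's offset= value with the primary's master_repl_offset field.

```shell
# replicas_caught_up: hypothetical helper. Reads the primary's
# `INFO replication` output on stdin. Returns 0 when every replica offset
# equals master_repl_offset, 1 when a replica is behind, 2 when the input
# has no master_repl_offset line.
#
# Usage (assumes $PASS is set and writes are paused):
#   kubectl -n valkey exec valkey-0 -c valkey -- \
#     valkey-cli -a "$PASS" --no-auth-warning INFO replication | replicas_caught_up
replicas_caught_up() {
  tr -d '\r' | awk '
    /^master_repl_offset:/ { master = substr($0, 20) + 0; seen = 1 }
    /^slave[0-9]+:/ {
      n = split($0, f, ",")
      for (i = 1; i <= n; i++)
        if (f[i] ~ /^offset=/) offsets[++k] = substr(f[i], 8) + 0
    }
    END {
      if (!seen) { print "no master_repl_offset in input"; exit 2 }
      for (j = 1; j <= k; j++)
        if (offsets[j] != master) { print "behind by " (master - offsets[j]); exit 1 }
      print "caught up"
    }'
}
```

Offsets only converge once writes have stopped, so a non-zero result while traffic is still flowing is expected; re-run the check after the pause.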

Resilience Behavior

| Failure | Behavior | Recovery |
| --- | --- | --- |
| Replica pod crash | StatefulSet recreates the pod; replica re-syncs (full or partial PSYNC depending on backlog) | Automatic |
| Primary pod crash | No automatic failover (this is replication mode, not Sentinel/cluster). Writes are rejected (min_slaves_good_slaves stops gating; clients see LOADING or connection refused). | Manual REPLICAOF NO ONE on a replica |
| Node failure (Karpenter-provisioned) | Karpenter provisions a replacement node in the same AZ when possible; pod is rescheduled with the same PVC | Automatic, ~2–5 min |
| AZ failure | Pods in the failed AZ stay Pending. The two healthy AZs continue to serve reads from in-sync replicas. Writes via the primary continue if the primary is in a healthy AZ. | Manual failover if the primary was in the failed AZ |
| Cross-AZ replication lag spike | min_slaves_good_slaves drops; primary may reject writes if minReplicasToWrite is set | Wait for replicas to catch up; investigate network |

For automatic primary failover, the Valkey project's options are Sentinel (not yet in the official chart) or cluster mode (tracked at valkey-helm #18). Until either ships, the recommended pattern is to monitor redis_master_link_up per replica (the exporter uses the redis_* prefix) and trigger REPLICAOF NO ONE from your alerting pipeline.
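
If the alerting pipeline triggers the promotion, it also has to decide which replica to promote. One common heuristic, which is an assumption here and not something the chart prescribes, is to pick the replica that has received the most data, judged by the slave_repl_offset field of each replica's INFO replication output. pick_candidate below is a hypothetical helper that makes that choice from "pod offset" pairs.

```shell
# pick_candidate: hypothetical helper. Reads "pod offset" pairs on stdin and
# prints the pod with the highest offset (first one wins on a tie).
# Lines without an offset (for example, unreachable pods) are skipped.
# Returns 1 when no usable line was seen.
#
# Usage (assumes $PASS is set):
#   for ord in 1 2 3; do
#     off=$(kubectl -n valkey exec valkey-$ord -c valkey -- \
#       valkey-cli -a "$PASS" --no-auth-warning INFO replication |
#       tr -d '\r' | sed -n 's/^slave_repl_offset://p')
#     echo "valkey-$ord $off"
#   done | pick_candidate
pick_candidate() {
  awk '
    NF >= 2 && (winner == "" || $2 + 0 > best) { best = $2 + 0; winner = $1 }
    END { if (winner == "") exit 1; print winner }'
}
```

The printed pod is the one to run REPLICAOF NO ONE on; the remaining replicas are then re-pointed as in the manual failover procedure.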

Observability

The data stack ships kube-prometheus-stack in the monitoring namespace. Both Valkey clusters (replication and cluster mode) are scraped automatically via their ServiceMonitors — no extra wiring needed. The exporter is oliver006/redis_exporter (sidecar in each pod), so metrics use the redis_* prefix even though the server is Valkey.

Key metrics and alerts

| Metric | Meaning | Suggested alert |
| --- | --- | --- |
| redis_up | Exporter could scrape the pod | == 0 for 2m, severity critical |
| redis_master_link_up | Replica's link to primary is healthy | == 0 for 2m, severity critical |
| redis_master_last_io_seconds_ago | Seconds since the replica last received from the primary | > 30 for 5m, severity warning |
| redis_connected_slaves | Replicas the primary sees as connected | < replica.replicas for 5m, severity warning |
| redis_memory_used_bytes / redis_memory_max_bytes | Memory pressure | > 0.9 for 10m, severity warning |
| redis_evicted_keys_total | Keys evicted under memory pressure | increase > 0 for 5m, severity warning |

Verify the scrape is healthy

# Both Valkey clusters should return redis_up=1.
kubectl -n monitoring exec prometheus-prometheus-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=redis_up' | jq '.data.result[].metric | {pod, namespace}'

Expect one entry per Valkey pod, e.g. {pod: valkey-0, namespace: valkey} and (if cluster mode is also installed) {pod: valkey-cluster-0, namespace: valkey-cluster}.

Access Grafana

Grafana is deployed as monitoring/monitoring-grafana (ClusterIP). The data stack creates an admin secret with a randomly generated password.

# 1. Pull the admin password from the cluster.
kubectl -n monitoring get secret grafana-admin-secret \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo

# 2. Port-forward Grafana to localhost.
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80

Open http://localhost:3000 and log in:

  • User: admin
  • Password: the value printed by step 1

The Prometheus data source is already wired. Open Explore (compass icon) and run redis_up to confirm metrics are flowing. Useful starting queries:

# Active connections per pod
redis_connected_clients{namespace="valkey"}

# Throughput in ops/sec
sum by (pod) (rate(redis_commands_processed_total{namespace="valkey"}[1m]))

# Cache hit rate
sum(rate(redis_keyspace_hits_total{namespace="valkey"}[5m]))
/ sum(rate(redis_keyspace_hits_total{namespace="valkey"}[5m]) + rate(redis_keyspace_misses_total{namespace="valkey"}[5m]))

# Replication lag in seconds (per replica)
redis_master_last_io_seconds_ago{namespace="valkey"}

# Memory headroom
redis_memory_used_bytes{namespace="valkey"} / redis_memory_max_bytes{namespace="valkey"}

Import the bundled dashboard

A starter dashboard ships at data-stacks/valkey-on-eks/examples/grafana-valkey-dashboard.json. There are two ways to load it, plus a community dashboard as a fallback.

Option 1 — auto-import via ConfigMap (recommended for repeatable installs).

The Grafana sidecar (grafana-sc-dashboard, label selector grafana_dashboard=1) watches every namespace and auto-imports any matching ConfigMap.

kubectl create configmap valkey-grafana-dashboard \
  --namespace monitoring \
  --from-file=valkey.json=data-stacks/valkey-on-eks/examples/grafana-valkey-dashboard.json
kubectl label configmap valkey-grafana-dashboard \
  --namespace monitoring \
  grafana_dashboard=1

The dashboard appears under Dashboards → Browse within ~30 seconds. Use the namespace template variable at the top to switch between valkey (replication) and valkey-cluster (cluster mode).

Option 2 — manual import via the UI.

Grafana → Dashboards → New → Import → paste the JSON file's contents → pick the Prometheus data source → Import. This is faster for ad-hoc inspection but does not survive a Grafana restart.

Option 3 — community dashboard. If the bundled dashboard panels show "No data" because of a metric-prefix mismatch on older exporter versions, dashboard 11835 ("Redis Dashboard for Prometheus Redis Exporter 1.x") is a known-good fallback for the same exporter family.

Direct Prometheus access

For PromQL queries against raw metrics or to inspect target health:

kubectl -n monitoring port-forward svc/prometheus-prometheus 9090:9090
# http://localhost:9090

Check Status → Targets and filter on valkey — both valkey/valkey/0 (replication) and valkey-cluster/valkey-cluster/0 (cluster mode, if installed) should show UP.