Valkey Cluster Mode on EKS

Cluster mode shards data across multiple primaries using hash-slot partitioning and gossip-based failure detection. This stack ships a local Helm chart (data-stacks/valkey-on-eks/examples/cluster-mode-helm-chart/) that deploys a production-grade cluster onto the data-on-eks Valkey NodePool, with no operator and no external chart dependency.

The chart will be retired in favor of the upstream valkey-io/valkey-helm chart once cluster mode lands there (valkey-helm #18).

This guide covers:

When to use cluster mode and how it differs from replication mode.
How the cluster communicates internally — gossip, replication, slot routing, failover.
Choosing instance types, storage, and scale.
Deployment, verification, and day-2 operations.

When to Use Cluster Mode

Use cluster mode when at least one of these holds:

Working set > single node memory. Your data exceeds what a single r7g.16xlarge (~512 GiB) can hold, or you want to keep per-node memory pressure down for fork-time CoW headroom during BGSAVE.
Write throughput > single primary. Replication mode has one primary; all writes serialize through it. Cluster mode shards writes across N primaries.
Multi-key operations are NOT a hard requirement. Cluster mode rejects multi-key commands (MGET, MSET, RENAME, transactions, Lua) when keys span different slots — the application must use hash tags ({user:1}:profile and {user:1}:settings hash to the same slot) or accept per-key operations.

If your dataset fits a single large-memory node and the workload is read-heavy, stay on replication mode. Cluster mode adds operational complexity that's only worth it for scale.

Factor	Replication Mode	Cluster Mode
Topology	1 primary + N replicas	3+ primaries × 1+ replica each (minimum 6 pods)
Write scaling	✗ — single primary	✓ — sharded by hash slot
Multi-key ops	✓ — single keyspace	Same-slot only (use hash tags)
Failover	Manual `REPLICAOF NO ONE`	Automatic via gossip (~10s detect + promote)
Min nodes (HA)	2	6 (3 primaries + 3 replicas)
Operational surface	Lower	Higher (gossip, slot rebalancing, MEET/FORGET)
Chart support	✓ Official `valkey-io/valkey-helm`	This stack's local chart until #18 lands

Matches your EC2 setup

If you're migrating a self-managed Valkey or Redis cluster from EC2, this table maps the knobs you tune today to where they live in this chart. Most defaults are already aligned with the AWS reference architecture.

Your EC2 knob	This chart's value	Notes
Cluster mode (sharded)	`replicaCount` + `replicasPerPrimary`	Default `6 = 3 primaries × (1+1)`
Valkey version (e.g., 8.0.6 → 9)	`image.tag`	Default `9.0.2`
Primary/replica ratio (read-heavy: more replicas)	`replicasPerPrimary: 2` (or higher)	Set total `replicaCount` to `primaries × (1+N)`
`maxmemory = 75% of host RAM`	`valkeyConfig: maxmemory 12gb` + `resources.limits.memory: 16Gi`	12 / 16 = 75% — same ratio
`maxmemory-policy noeviction` (datastore)	`valkeyConfig: maxmemory-policy noeviction`	Default; cluster mode prefers write rejection over eviction
RDB-only persistence	Comment out `appendonly yes` in `valkeyConfig`	Saves replication / network bandwidth; see RDB-only override
gp3 EBS, generous IOPS	`persistence.storageClass: valkey-gp3`	Ships 6000 IOPS / 500 MiB/s vs gp3 baseline 3000 / 125
4,000 IOPS / instance baseline	`valkey-gp3` exceeds with 6,000 IOPS	50% headroom for BGSAVE bursts
Network-optimized (n-suffix) for multi-KB values	`tuning.networkOptimized: true`	Pins pods to `r7gn`/`r8gn`/`m7gn` via hard `nodeAffinity`
Valkey 7+ I/O threading	`valkeyConfig: io-threads 4` + `io-threads-do-reads yes`	Default in this chart. See I/O threads
Live upgrade (rolling)	`helm upgrade` + StatefulSet rolling restart	preStop runs `CLUSTER FAILOVER` to a replica → ~1 s write blip per primary
Primary + replica in different AZs (HA)	`topology.azAwareBootstrap: true`	Default; replica pairing verified at bootstrap
Dedicated primary per node (noisy neighbor)	`tuning.strictHostAntiAffinity: true`	Default is soft anti-affinity; flip to hard for strict isolation
EC2 instance type (R7i / R8i x86)	`r7g` / `r8g` (Graviton)	This stack is Graviton-only — ~22% lower latency, ~25% cheaper
EC2 size up to `12xlarge`	NodePool `instance-size` allows up to `16xlarge`	Headroom of one size above your current peak

Architecture

┌─────────────────────────── EKS Cluster ────────────────────────────┐
│                                                                    │
│  AZ us-west-2a            AZ us-west-2b           AZ us-west-2c    │
│  ┌──────────────┐         ┌──────────────┐        ┌──────────────┐ │
│  │ valkey-c-0   │         │ valkey-c-2   │        │ valkey-c-1   │ │
│  │ primary      │         │ primary      │        │ primary      │ │
│  │ slots 0-5460 │         │ slots 10923- │        │ slots 5461-  │ │
│  │              │         │       16383  │        │       10922  │ │
│  └──────────────┘         └──────────────┘        └──────────────┘ │
│         ▲                        ▲                       ▲         │
│         │ async replication      │                       │         │
│         │ (TCP 6379, PSYNC)      │                       │         │
│  ┌──────┴───────┐         ┌──────┴───────┐        ┌──────┴───────┐ │
│  │ valkey-c-3   │         │ valkey-c-5   │        │ valkey-c-4   │ │
│  │ replica of   │         │ replica of   │        │ replica of   │ │
│  │ valkey-c-1   │         │ valkey-c-0   │        │ valkey-c-2   │ │
│  └──────────────┘         └──────────────┘        └──────────────┘ │
│                                                                    │
│  Cluster bus (gossip) — TCP 16379, full mesh between all 6 pods   │
│  Headless Service: valkey-cluster-headless                         │
│   ├─ valkey-cluster-0.valkey-cluster-headless.<ns>.svc...:6379     │
│   ├─ valkey-cluster-1...                                           │
│   └─ valkey-cluster-5...    (per-pod stable DNS, hostname-aware)   │
└────────────────────────────────────────────────────────────────────┘

Three concepts make cluster mode work — slots, gossip, replication. Each pod owns a piece of each.

How the cluster communicates internally

Operators need this model to debug effectively. Two TCP ports per pod, three different protocols on top.

Hash slots and client routing (port 6379)

Every key is hashed into one of 16,384 slots via CRC16(key) mod 16384. Each primary owns a contiguous range; with three shards the default split is:

Shard	Primary pod (in our test cluster)	Slot range
0	`valkey-cluster-0`	0 – 5460
1	`valkey-cluster-2`	10923 – 16383
2	`valkey-cluster-1`	5461 – 10922

When a cluster-aware client (Lettuce, Jedis, ioredis, redis-py ≥ 4.1, go-redis, valkey-glide) connects:

It issues CLUSTER SHARDS (or CLUSTER NODES on older clients) to learn the slot → primary map.
For each command, it computes the slot from the key and routes the connection directly to the owning primary.
If the slot ownership has changed since the cached map (e.g., during a reshard), the targeted primary returns MOVED <slot> <new-host>:<port> or ASK <slot> <new-host>:<port>. The client updates its map and retries.

MOVED is permanent (slot has moved); ASK is transient (slot is mid-migration, the next single request should go to the new host). A non-cluster-aware client will hit MOVED on every cross-slot command and fail.

Hash tags let you co-locate keys onto the same slot. {user:42}:profile and {user:42}:orders share slot CRC16("user:42") % 16384, so MGET {user:42}:profile {user:42}:orders works inside a single shard.

Gossip — the cluster bus (port 16379)

Every pod opens a TCP connection to every other pod on port 16379. This is the cluster bus. With 6 pods that's 15 connections; with 100 pods it's 4,950. The bus carries small binary messages — never application data.

What gossip does:

Failure detection. Each pod sends PING to a few random peers every second. The recipient replies PONG. If a pod doesn't get a PONG within cluster-node-timeout (default 15s), it marks the peer as PFAIL (possible failure) and gossips that opinion. When the majority of primaries agree a peer is PFAIL, it transitions to FAIL and replicas of the dead primary start a failover election.
Topology propagation. Every gossip exchange piggybacks the sender's view of a random subset of nodes. New nodes joined via CLUSTER MEET get discovered transitively — you only need to introduce one new pod to one existing pod, the rest learn via gossip.
Configuration epoch. Each primary has a monotonic epoch. When a replica promotes to primary, its epoch increments. The cluster uses epochs to resolve conflicting slot-ownership claims after a partition heals.

"For example in a 100 node cluster with a node timeout set to 60 seconds, every node will try to send 99 pings every 30 seconds … 330 pings per second in the total cluster." — Valkey Cluster Specification

Gossip is intentionally cheap. Raising cluster-node-timeout halves heartbeat traffic at the cost of doubling failure-detection latency. Default 15s is fine up to ~500 pods. Above 500, bump to 30s.

Replication — primary → replica (port 6379, PSYNC)

Each replica maintains a single long-lived TCP connection to its primary on the client port. The connection runs the PSYNC protocol:

Initial full sync: replica connects, primary takes a BGSAVE snapshot (or uses a diskless transfer), streams it over the wire, then catches up via the replication backlog.
Partial resync: if the replica disconnects briefly, on reconnect it tells the primary its current replication offset. If the offset is still in the primary's backlog (default repl-backlog-size 10mb), the primary streams just the diff. Otherwise it falls back to full sync.

Auth: the replica authenticates to the primary using primaryauth (Valkey 8.0 rename of masterauth) from /data/conf/auth.conf. This chart uses requirepass + primaryauth with the same password — replicas connect as the default user. No separate ACL file.

Replication is asynchronous in Valkey by default. A write SET k v on a primary returns OK to the client before the replica has received it. If the primary dies in that 1-2 ms window, the unacknowledged write is lost. For stronger consistency use WAIT N timeout (blocks until N replicas ack) — but understand cluster mode tolerates async lag deliberately, and WAIT only helps within a single shard.

Hostname-aware addressing

This chart sets cluster-preferred-endpoint-type hostname and announces each pod's per-pod DNS name (valkey-cluster-N.valkey-cluster-headless.<ns>.svc.cluster.local) via cluster-announce-hostname. Result: when a pod restarts and gets a new IP, the cluster's gossip table updates the IP under the same hostname, and clients re-resolve via DNS rather than getting stuck on a stale IP.

Choosing instance types

Valkey is RAM-bound. The right defaults look almost exactly like what ElastiCache offers under the hood for Valkey nodes.

What ElastiCache uses (managed)

ElastiCache exposes cache.* SKUs that map to standard EC2 families. For Valkey, the recommended families are:

Family	Use case
`cache.r7g`, `cache.r8g`	Default for Valkey — memory-optimized, Graviton, ~8 GiB RAM per vCPU
`cache.m7g`, `cache.m6g`	General-purpose — smaller datasets (under 8 GiB/shard), CPU-bound clients
`cache.r6gd`	Memory-optimized with local NVMe for data tiering (hot/warm tier inside the node)
`cache.c7gn`	Network-optimized — proxies, counters, rate-limiters with very high QPS
`cache.t4g`	Dev/test only — burstable CPU credits are incompatible with Valkey's single-threaded command loop

AWS announced for ElastiCache on r7g: "up to 28% increased throughput, up to 21% improved P99 latency, up to 25% higher networking bandwidth" vs r6g.

What to use on self-managed EKS

R-family (memory-optimized) is the default. The data-on-eks Valkey NodePool (nodepool-valkey.yaml) is preconfigured for r7g and r8g Graviton instances.

Graviton vs x86. Valkey upstream officially supports ARM64 and ran its own 1M RPS benchmark on c7g.16xlarge. Independent benchmarks measure ~22% lower latency on Graviton vs equivalent x86 SKUs. Fork/BGSAVE is also cheaper on Graviton because its TLB / page-walker handles COW patterns more efficiently. Use Graviton unless you have a hard x86 dependency (e.g., a kernel module).

Size per shard (reserve ~25% for fork-time CoW headroom per AWS BGSAVE best practice):

Working set per shard	Suggested instance	Pod memory request/limit	`maxmemory`
12 GiB	`r7g.large` (16 GiB)	12 Gi / 16 Gi	`12gb`
24 GiB	`r7g.xlarge` (32 GiB)	24 Gi / 30 Gi	`24gb`
64 GiB	`r7g.2xlarge` (64 GiB)	48 Gi / 60 Gi	`48gb`
128 GiB	`r7g.4xlarge` (128 GiB)	96 Gi / 120 Gi	`96gb`
256 GiB	`r7g.8xlarge` (256 GiB)	192 Gi / 240 Gi	`192gb`
> 256 GiB	Shard horizontally instead — fork on a 512 GiB heap is operationally painful even with overcommit + THP disabled.

The pod-memory / maxmemory ratio above is 75% by design — same as the AWS reference and what most production EC2 Valkey installs settle on. The remaining 25% absorbs:

Fork-time CoW during BGSAVE / AOF rewrite (worst case ≈ 1× working set if every page is dirtied during the fork, typically much less).
Memory fragmentation — Valkey 9 with jemalloc averages 10–20% over raw key-value size; spikes higher during long-running mixed read/write workloads. Track mem_fragmentation_ratio from INFO memory — > 1.5 is a red flag.
Client buffers and replication backlog (default repl-backlog-size 10mb per shard; ~32 MiB per client for client-output-buffer-limit normal).

If mem_fragmentation_ratio stays > 1.5 for hours, your effective working-set memory is 30–40% below maxmemory. Either step up an instance size or schedule a MEMORY PURGE (returns freed pages to the OS — Valkey 7+) during a maintenance window.

Always set requests.memory == limits.memory; otherwise the node OOM-killer can target the BGSAVE child during a fork. The chart enforces this in the default values.yaml.

Avoid:

T-family (burstable). The CPU credits model is incompatible with Valkey's single-threaded event loop and unpredictable under spikes. AOF is unsupported on T2 per AWS docs.
Spot instances for primaries. The 2-minute interruption notice is not enough for a clean shard hand-off; replica resync after each interruption burns network bandwidth. Spot is acceptable only for read-replica overflow pools.

I/O threads and CPU sizing

Valkey's command-execution loop is single-threaded — "core command execution remains sequential" per the project docs. But the socket I/O layer (read parsing, write serialization, TLS) can be parallelized via io-threads since Valkey 7. The chart defaults to:

io-threads 4
io-threads-do-reads yes

This is a major win for the workloads where Valkey CPU is the bottleneck: multi-KB values, TLS termination, very high QPS. The Valkey project's 1 billion RPS demo used 8 I/O threads on a c7g.16xlarge. Tuning guidance:

Pod vCPU	Recommended `io-threads`
2 (large)	`2`
4 (xlarge)	`2`
8 (2xlarge)	`4` (chart default)
16 (4xlarge)	`8`
32+ (8xlarge+)	`8` (diminishing returns above 8 — the main thread saturates)

Cap at roughly vCPU - 2 so the main thread and the BGSAVE child have headroom. Override via valkeyConfig:

valkeyConfig: |
  io-threads 8
  io-threads-do-reads yes

CPU sizing rule of thumb: 4–8 vCPU per shard is the sweet spot. Above 16 vCPU the single-threaded main loop saturates and adding cores returns diminishing throughput; scale out (add shards) instead. The Valkey latency docs warn: "BGSAVE or BGREWRITEAOF … must never run on the same core as the main event loop" — keep at least 2 vCPU free of the io-threads pool.

Network bandwidth

Sustained replication + gossip + client traffic adds up fast. A 3-replica shard at 100k ops/sec, 1 KB values pushes ~800 Mbps of client traffic alone; full-sync of a 64 GiB replica can saturate a NIC for minutes. Plan for 2× peak steady-state:

Sustained throughput	Recommended family
< 2.5 Gbps	`r7g.xlarge` (baseline 1.876 Gbps, burst 12.5 Gbps)
2.5 – 10 Gbps	`r7g.4xlarge` (baseline 7.5 Gbps)
> 10 Gbps OR multi-KB values	`r7gn` / `r8gn` / `m7gn` network-optimized Graviton (up to 200 Gbps)

Opting into n-suffix (network-optimized) instances

The data-on-eks Valkey NodePool ships with r7gn, r8gn, and m7gn in its instance-family list, but Karpenter only schedules onto them when the pod explicitly requests them — otherwise it picks the cheaper r7g/r8g. The chart provides a one-line toggle:

# values override
tuning:
  networkOptimized: true

This injects a hard nodeAffinity on karpenter.k8s.aws/instance-family in [r7gn, r8gn, m7gn]. The trade-off:

Cost: n-suffix is ~25–30% more expensive than its non-n peer.
Benefit: 25 → 200 Gbps NIC ceiling, larger PPS budget, and ENA Express (where supported) cuts cross-AZ p99 in half.

Pick n-suffix when at least one of these holds:

Value size is multi-KB (the customer pattern this knob exists for).
Replication full-resync of large shards (> 64 GiB) repeatedly saturates the NIC.
Cross-AZ sustained traffic exceeds 10 Gbps.

If none of those hold, stay on the default r7g/r8g — the n-suffix capacity is unused at typical Valkey workloads.

Enable VPC CNI prefix delegation (ENABLE_PREFIX_DELEGATION=true) for dense Valkey deployments — each ENI gets /28 prefixes (16 IPs) instead of secondary IPs, and r7g.16xlarge jumps from 234 to 737 pods per node. The data-on-eks infra stack ships this enabled by default.

Cost

Compute Savings Plans (3-yr) are the recommended baseline — ~66% discount with flexibility to move between sizes/families. AWS itself recommends them over RIs "because they offer similar savings with more flexibility."
For a 6-pod r7g.2xlarge cluster in us-west-2: ~$17,650/year on-demand → ~$6,000/year on a 3-yr Compute SP.
Don't run primaries on Spot. Replica overflow pools, yes. Primaries, no.

Choosing storage

The per-pod PVC carries AOF, RDB, and nodes.conf. Its specs drive AOF write latency, BGSAVE / rewrite throughput, and PSYNC full-resync speed.

EBS gp3 is the right default (not local NVMe)

A common question when migrating from EC2: keep EBS, or switch to instance-attached NVMe (r6gd/r7gd/r8gd)? For nearly every Valkey cluster-mode deployment, EBS gp3 wins — Valkey is RAM-and-network-bound, and EBS gives you resilience that NVMe can't.

	EBS gp3 / `valkey-gp3`	Local NVMe (`r7gd`/`r8gd`)
Latency p99	1–3 ms	50–150 µs (10–20× faster)
Pod restart / node replacement	✅ volume re-attaches	❌ data lost; full PSYNC over network
Reshard / scale-out	Data follows the PVC	Every moved slot triggers a PSYNC
Multi-pod correlated failure	✅ EBS volumes survive	❌ if primary + replica both lose NVMe inside `cluster-node-timeout`, that shard is gone
Cost	~$58/mo per 100 GiB at 6 k IOPS / 500 MiB/s	Bundled in EC2 hourly rate

ElastiCache itself uses local NVMe (r6gd) only for data tiering, not primary persistence — match that pattern: EBS for the data-of-record; NVMe (if at all) for pure caches.

If you must use NVMe (you've measured EBS to be the bottleneck, which is rare): add r6gd/r7gd to the NodePool, install a local-path provisioner, set persistence.storageClass: local-storage. AZ-aware replicas are the only recovery layer in that mode.

gp3 sizing by working set

"gp3 volumes deliver a consistent baseline of 3,000 IOPS and 125 MiB/s, scaling to 16,000 IOPS and 1,000 MiB/s. They do not use burst performance — they sustain provisioned IOPS indefinitely." — AWS EBS docs

Working set per shard	PVC size	gp3 IOPS	gp3 throughput	StorageClass
≤ 12 GiB	64 GiB	3,000 (baseline)	125 MiB/s (baseline)	default `gp3`
12 – 64 GiB	200 GiB	6,000	500 MiB/s	`valkey-gp3` (this stack ships it)
64 – 256 GiB	800 GiB	12,000	750 MiB/s	custom gp3
> 256 GiB OR p99-sensitive	1+ TiB	16,000 / 1,000 MiB/s (gp3 ceiling)	—	step up to io2 Block Express (~3–5× gp3 cost, sub-ms p99)

PVC sized at ~3× working set: current AOF + rewritten AOF + RDB must coexist during a rewrite.

This stack ships storageclass-valkey-gp3.yaml (gp3, 6,000 IOPS, 500 MiB/s, xfs). Apply once per cluster, then set persistence.storageClass: valkey-gp3 in chart values. Never use EFS/FSx — NFS fsync latency + rename semantics have caused real nodes.conf corruption.

This stack also ships volumeattributesclass-valkey-gp3-perf.yaml — a VolumeAttributesClass (VAC) that lets you increase IOPS and throughput on existing PVCs without restarting pods or reprovisioning volumes. Apply it alongside the StorageClass:

kubectl apply -f data-stacks/valkey-on-eks/examples/storageclass-valkey-gp3.yaml
kubectl apply -f data-stacks/valkey-on-eks/examples/volumeattributesclass-valkey-gp3-perf.yaml

To apply the performance tier to running Valkey pods (no restarts, no cluster impact):

kubectl get pvc -n valkey-cluster --no-headers -o custom-columns="NAME:.metadata.name" | while read pvc; do
  kubectl patch pvc "$pvc" -n valkey-cluster \
    --type='merge' \
    -p '{"spec":{"volumeAttributesClassName":"valkey-gp3-perf"}}'
done

This discovers all PVCs in the namespace rather than assuming a fixed replica count — works for 6-replica clusters, 9-replica clusters, or any other size.

For the full reference — prerequisites, cooldown rules, scaling across hundreds of clusters — see In-Place EBS Volume Modification for Stateful Workloads.

Mount tuning: xfs > ext4 for shards > 50 GiB (better sustained-sequential-write); noatime,nodiratime saves a metadata write per AOF read.

Persistence mode — pick one

Mode	Durability	When
AOF + RDB (chart default)	up to 1 s data loss on crash	Production data of record
RDB-only	up to ~15 min data loss; lower steady-state I/O, no AOF-rewrite bandwidth spikes	Network-bound workloads where the replica IS your durability layer
No persistence (`appendonly no`, `save ""`, `persistence.enabled: false`)	Pod restart = empty shard	Pure caches reconstructible upstream

Switch to RDB-only by overriding valkeyConfig (the chart's repl-diskless-sync and io-threads defaults still apply):

valkeyConfig: |
  maxmemory 12gb
  maxmemory-policy noeviction
  appendonly no
  save 900 1
  save 300 10
  save 60 10000

Snapshot bandwidth caveat for AOF + RDB. On a 64 GiB working set with auto-aof-rewrite-percentage 100, an AOF rewrite writes ~64 GiB while the original AOF and new RDB coexist — at valkey-gp3's 500 MiB/s that's ~2 minutes of EBS-saturating writes, p99 spikes 5–10×. Mitigations: RDB-only (above), bump to the gp3 ceiling (16k IOPS / 1 GiB/s), or schedule manual BGREWRITEAOF off-peak after auto-aof-rewrite-percentage 0.

Cluster scaling limits

"The cluster's key space is split into 16384 slots, effectively setting an upper limit for the cluster size of 16384 primary nodes (however, the suggested max size of nodes is on the order of ~ 1000 nodes)." — Valkey Cluster Specification

The ~1000-node guidance is a gossip-protocol limit, not a slot limit. AWS ElastiCache caps at 500 nodes per cluster (83–500 shards) for Valkey ≥ 5.0.6 — docs. The Valkey project demonstrated 2,000 nodes (1000 shards × 1 replica) driving 1 billion RPS — beyond the recommended ceiling but possible.

Practical buckets:

Bucket	Shards	Pods (1 replica each)	Working set	What gets hard
Small	3–10	6–20	< 100 GB	Trivial. Default `cluster-node-timeout: 15s` works.
Medium	10–50	20–100	100 GB – 1 TB	Client `CLUSTER NODES` startup storms — use clients that cache `CLUSTER SHARDS` (jedis 4.4+, lettuce 6.2+, go-redis v9, redis-py 5+).
Large	50–200	100–400	1–5 TB	Gossip ~1–5% CPU/pod. Bump `cluster-node-timeout` to 30 s. Reshard at 30–60 s/GB on legacy migration.
Very large	200–500	400–1000	5–25 TB	Gossip 5–15% CPU/pod. PDB `maxUnavailable: 1` → rolling upgrade serial at ~90 s/pod (500 pods = 12+ h). ENI/IP planning critical. Shard at the application layer instead.
Beyond	> 500	> 1000	> 25 TB	Unsupported scale outside the valkey.io benchmark team. Always split into multiple clusters.

The break point is ~200 shards — beyond that, a single reshard or rolling upgrade routinely exceeds a maintenance window. Tenant-shard at the application layer instead.

Deployment

The chart deploys alongside the replication-mode default in a separate namespace (valkey-cluster).

Prerequisites

Data-on-eks Valkey stack already deployed (the Valkey NodePool, the gp3 StorageClass, ArgoCD, and the existing replication-mode release are all expected).
helm 3.13+ and kubectl configured for the EKS cluster.
(Recommended for production-grade I/O) the valkey-gp3 StorageClass and valkey-gp3-perf VolumeAttributesClass applied — see storage section.

Quickstart

cd data-stacks/valkey-on-eks/examples
./install-cluster-mode.sh                              # default: 6 pods, ns=valkey-cluster
./install-cluster-mode.sh --replicas 9 \
   --replicas-per-primary 2                            # 3 primaries × 2 replicas each
./install-cluster-mode.sh --values my-values.yaml      # custom overrides
./install-cluster-mode.sh --dry-run                    # render templates only

The script wraps helm install/upgrade with the right defaults. To uninstall:

./uninstall-cluster-mode.sh           # keeps PVCs and the auth Secret
./uninstall-cluster-mode.sh --purge   # delete everything

What the chart deploys

Resource	Purpose
`StatefulSet/valkey-cluster`	6 pods (`podManagementPolicy: Parallel`), kernel-tuning init container, prepare-config init container, valkey + metrics containers
`Service/valkey-cluster-headless`	Headless service (`clusterIP: None`, `publishNotReadyAddresses: true`) — per-pod DNS + cluster bus discovery
`Secret/valkey-cluster-auth`	Auto-generated 32-char password; preserved across `helm upgrade` via `lookup`
`ConfigMap/valkey-cluster-config`	`valkey.conf` — cluster mode + persistence + the include for `/data/conf/auth.conf`
`ConfigMap/valkey-cluster-scripts`	`topology.sh`, `bootstrap.sh`, `readiness.sh`, `prestop.sh`
`Job/valkey-cluster-bootstrap`	Post-install / post-upgrade Hook — runs `valkey-cli --cluster create` once; idempotent on re-runs
`Role`, `RoleBinding`, `ClusterRole`, `ClusterRoleBinding`	Minimal RBAC for the bootstrap Job's kubectl init container (pod list, node read)
`ServiceAccount/valkey-cluster`	Pod identity (no IAM yet; reserved for future restore-from-S3)
`PodDisruptionBudget`	`maxUnavailable: 1`
`ServiceMonitor`	Prometheus scrape config for the redis_exporter sidecar
`NetworkPolicy` (opt-in)	Default-deny + explicit allow for intra-cluster, application namespaces, and Prometheus scrape

AZ-aware bootstrap (default ON)

The post-install Hook Job runs a topology.sh init container with kubectl that maps each pod to its node's topology.kubernetes.io/zone label. The main bootstrap container reads this map and pairs each replica with a primary in a different AZ, so any single-AZ outage can be survived by failover. Without this, valkey-cli --cluster create --cluster-replicas N would pair by hostname pattern only — which, for a single StatefulSet, often produces same-AZ pairs and defeats the HA guarantee.

To disable (single-AZ dev environments):

topology:
  azAwareBootstrap: false

Verifying the install

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth \
        -o jsonpath='{.data.default}' | base64 -d)

# Cluster info — expect all six lines
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster info | head -6
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_slots_pfail:0
# cluster_slots_fail:0
# cluster_known_nodes:6

# Topology
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster shards

Then run a smoke test:

# Cluster-aware writes/reads
kubectl -n valkey-cluster exec -i valkey-cluster-0 -c valkey -- \
  valkey-cli -c -a "$PASS" --no-auth-warning <<'CMD'
SET test:user:42 alice
SET test:order:9001 "$50"
GET test:user:42
GET test:order:9001
CMD

# Quick benchmark
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-benchmark -a "$PASS" --cluster \
    -h valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local \
    -p 6379 -t set,get -n 10000 -c 50 -q

Connecting clients

Cluster mode requires a cluster-aware client.

Library	Cluster-aware client class
`redis-py` ≥ 4.1	`redis.cluster.RedisCluster`
Lettuce (Java)	`RedisClusterClient`
Jedis	`JedisCluster`
ioredis	`Redis.Cluster`
`go-redis`	`redis.NewClusterClient`
valkey-glide	native cluster mode support

Connect via the headless service:

from redis.cluster import RedisCluster, ClusterNode

nodes = [
    ClusterNode(f"valkey-cluster-{i}.valkey-cluster-headless.valkey-cluster.svc.cluster.local", 6379)
    for i in range(6)
]
client = RedisCluster(
    startup_nodes=nodes,
    password=os.environ["VALKEY_PASSWORD"],
    decode_responses=True,
)
client.set("user:42:profile", "carol@example.com")
client.get("user:42:profile")

The cluster client calls CLUSTER SHARDS once at startup to learn the slot map, then routes commands directly to the owning primary. Writes go to primaries; default reads also go to primaries. To load-balance reads across replicas, call READONLY on the connection after open — most cluster clients expose a read_from_replicas=True flag for this.

Best practices

Authentication

The chart enables auth by default with requirepass + primaryauth sharing the same auto-generated 32-char password from a Kubernetes Secret. The Secret is preserved across helm upgrade via a lookup guard — running upgrade does NOT rotate the password (which would break the cluster, because running pods cache the password in env at startup).

To use your own externally-managed Secret:

auth:
  enabled: true
  existingSecret: my-valkey-secret      # must contain key `default`
  existingSecretPasswordKey: default

Strict AZ spread

Default:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: valkey-cluster

DoNotSchedule is critical — ScheduleAnyway lets two pods land in the same AZ, and combined with the AZ-aware bootstrap puts you in a state where one AZ has no replica. For dev/test in single-AZ clusters, switch to ScheduleAnyway.

Persistence (required for cluster mode)

appendonly yes
appendfsync everysec
save 900 1
save 300 10
save 60 10000
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Without persistence, a pod restart restores an empty dataset. The replica resyncs via PSYNC from the primary — fine in steady state, but during a controlled rolling restart of the entire cluster (e.g., AMI bump), the first pod back up must have its data on disk or that shard's keyspace is gone.

Kernel tuning (Transparent Huge Pages)

Valkey docs require THP disabled:

"echo never > /sys/kernel/mm/transparent_hugepage/enabled"

With THP enabled, a single-byte write by the BGSAVE parent copies a 2 MB page — ~500× write amplification. The chart's kernel-tuning init container handles this with a privileged container that runs as root for ~1 second:

kernelTuning:
  enabled: true        # default

It also sets net.core.somaxconn=65535 and vm.overcommit_memory=1.

If your namespace has the restricted PSA profile set, this container will be rejected. Two options:

Apply the same tuning at the node level via a separate DaemonSet that runs once at boot, then set kernelTuning.enabled: false.
Move Valkey to a namespace with baseline or no PSA enforcement.

PodDisruptionBudget — `maxUnavailable: 1`

pdb:
  enabled: true
  maxUnavailable: 1

Cluster mode tolerates one primary down because failover requires a majority of the remaining 2 primaries to agree (2/3 = quorum). maxUnavailable: 2 could remove 2 primaries simultaneously and stall failover.

For deeper safety during voluntary disruptions (node drains, Karpenter consolidation), pair the PDB with pod-deletion-cost annotations to bias eviction toward replicas first.

Strict one-pod-per-node anti-affinity

The chart's default podAntiAffinity is soft (preferredDuringScheduling…) — Kubernetes places Valkey pods on different nodes when capacity allows, but does not block scheduling when it doesn't. This matches typical production needs.

If your workload is sensitive to noisy-neighbor effects — co-tenant pods saturating the NIC, page-cache pressure during BGSAVE forks, or CPU contention with sidecars — flip to hard anti-affinity:

tuning:
  strictHostAntiAffinity: true

This replaces the soft preference with a requiredDuringSchedulingIgnoredDuringExecution rule on topologyKey: kubernetes.io/hostname. Two Valkey pods can never land on the same node. Trade-offs:

The Valkey NodePool must always have ≥ replicaCount nodes available. With Karpenter and r7g.large defaults that's 6 nodes; raise the Karpenter NodePool's CPU/memory limits to leave headroom for failover replacements.
A single un-schedulable pod (because no spare node exists) blocks rolling upgrades. Pair with the PDB above and monitor the StatefulSet's currentReplicas vs replicas.
For very small datasets where many shards fit on one node, hard anti-affinity is wasteful — stick with the default.

Most production deployments handling multi-KB values, TLS, or compliance-driven isolation want this on.

Network policy

Off by default in the chart. Turn on once you know which application namespaces need port 6379:

networkPolicy:
  enabled: true
  allowedFrom:
    - matchLabels:
        role: app
    - matchLabels:
        team: ml-platform

Intra-cluster traffic on 6379 + 16379 is always allowed (every Valkey pod talks to every other). DNS egress to kube-dns is allowed. Prometheus is permitted on :9121.

How failover works

Failover happens in two flavors: unplanned (a pod or node dies) and planned (you want to rotate the primary role, e.g., for a node upgrade).

Unplanned (automatic, gossip-driven)

T+0s    Primary pod-0 dies (kernel panic, AZ outage, OOM-kill)
T+0-15s Pod-3 (the replica) keeps trying to send REPLCONF heartbeats — no response
T+15s   `cluster-node-timeout` expires. Pod-3 marks pod-0 as PFAIL and gossips it.
T+15-30s Pods 1, 2, 4, 5 receive the PFAIL gossip. Each compares against their own
        view. When the majority of primaries (pod-1 and pod-2) agree, pod-0 is FAIL.
T+30-45s Pod-3 starts a failover election: it broadcasts a FAILOVER_AUTH_REQUEST to
        all primaries. Each primary checks its `lastVoteEpoch` and votes once per
        epoch. If pod-3 receives majority votes (>1 of 2 remaining primaries), it
        promotes.
T+45s   Pod-3 issues `CLUSTER FAILOVER TAKEOVER` and bumps its config epoch. Its
        slot range (0-5460) is now owned by pod-3 across the gossip table.
T+45-60s Existing client connections hit `MOVED 5403 pod-3:6379` on the next command
        for that slot range. Clients update their slot map. Writes resume.

End-to-end, expect 30–60 seconds of write unavailability for the affected shard in the default cluster-node-timeout: 15s configuration. Reads continue to the replica (now primary) once promoted.

To tune: lower cluster-node-timeout for faster detection (cost: higher gossip rate, more false positives under transient network blips). Don't go below 5s on multi-AZ deployments.

Planned (graceful, via CLUSTER FAILOVER)

Use this to rotate the primary role onto a specific replica — e.g., before a node upgrade or to rebalance load.

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)

# Find the replica you want to promote (e.g., valkey-cluster-3 replicates valkey-cluster-0)
kubectl -n valkey-cluster exec valkey-cluster-3 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster failover

# Verify
kubectl -n valkey-cluster exec valkey-cluster-3 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning role | head -1
# master

CLUSTER FAILOVER (without FORCE or TAKEOVER) coordinates with the primary: the replica catches up to the primary's offset, then both swap roles atomically. The window where writes can't proceed is typically < 1 second — much faster than the unplanned 30-60s.

The chart's preStop hook (/scripts/prestop.sh) does exactly this when a pod is being terminated: if the pod is currently a primary, it picks a healthy replica and runs CLUSTER FAILOVER, waits up to 10s for the role swap, then SHUTDOWN. This is what makes the rolling upgrade smooth — see the next section.

When failover stalls

cluster_state:fail after a multi-pod outage means the cluster lost quorum. Recovery:

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)

# 1. Wait for failed pods to come back. The chart's prepare-config init container
#    rewrites /data/conf/auth.conf and the bootstrap.sh in the StatefulSet's
#    main container restarts. Each pod loads its previous nodes.conf from PVC
#    and rejoins via gossip — no manual MEET needed.
kubectl -n valkey-cluster get pods

# 2. If pods rejoin but the cluster stays in fail state, identify the surviving
#    primaries and check their voting state:
for i in 0 1 2 3 4 5; do
  kubectl -n valkey-cluster exec valkey-cluster-$i -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning cluster info | grep cluster_state
done

# 3. As a last resort, force a slot takeover on a surviving primary that owns
#    the orphaned slot range:
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster failover takeover

FAILOVER TAKEOVER skips the consensus wait — only use when you've confirmed the original primary is permanently lost. If nodes.conf is corrupted across multiple PVCs simultaneously (rare; would require multiple PVC failures), the only path is restore from RDB backup per shard via the EC2 → EKS migration runbook.

Planned upgrades

Three independent upgrade dimensions — handle them one at a time. The cluster's cluster_state should stay :ok throughout each one.

Pre-flight checklist

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)

# 1. All 6 pods 2/2 Ready
kubectl -n valkey-cluster get pods

# 2. cluster_state:ok, all 16384 slots covered
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster info | head -6

# 3. Replication healthy on every replica
for i in 3 4 5; do
  kubectl -n valkey-cluster exec valkey-cluster-$i -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning info replication | grep -E '^(role|master_link_status):'
done

# 4. PDB allows 1 disruption
kubectl -n valkey-cluster get pdb valkey-cluster
# expect: maxUnavailable=1, disruptionsAllowed=1

If any pre-flight fails, fix before upgrading.

Chart bump (config-only)

For chart changes that don't touch the StatefulSet's pod template (e.g., a new ServiceMonitor label, a NetworkPolicy tweak):

helm upgrade valkey-cluster ./cluster-mode-helm-chart -n valkey-cluster

No pod restart. The post-upgrade Hook Job runs bootstrap.sh, sees cluster_state:ok already, and exits 0.

Valkey image version bump

For the underlying Valkey container image (e.g., 9.0.2 → 9.0.3):

# overrides.yaml
image:
  tag: "9.0.3"

helm upgrade valkey-cluster ./cluster-mode-helm-chart -n valkey-cluster -f overrides.yaml

The StatefulSet's pod template changes → rolling restart in reverse-ordinal order (pod 5 → 4 → 3 → 2 → 1 → 0), one pod at a time (enforced by PDB maxUnavailable: 1). For each pod:

K8s sends SIGTERM. preStop runs.
If the pod is a primary, prestop.sh issues CLUSTER FAILOVER to its replica. The replica's role flips to primary within ~1 second; clients see at most one MOVED redirect.
SHUTDOWN is sent; the server flushes AOF and exits.
Pod re-creates with the new image, runs the init containers (kernel tuning + prepare-config), then the main container.
New container starts — the existing /data/nodes.conf carries this pod's node ID. The cluster's other 5 pods recognize it via gossip.
Readiness probe waits for cluster_state:ok + cluster_slots_assigned:16384.
PDB releases — next pod can roll.

End-to-end for 6 pods at ~90s per pod: ~10 minutes. Tested in the chart development with sustained writes — 97% write success during the upgrade window, with the 3% gap landing in the seconds when the last primary (pod 0) failed over to its replica.

After upgrade, the master/replica roles are typically flipped vs the start (every primary's preStop triggers a failover to its replica). This is intentional and harmless; you can flip them back later via CLUSTER FAILOVER if you want a stable mapping for monitoring dashboards.

Karpenter AMI rollover

When the Karpenter NodePool's underlying AMI bumps (e.g., AL2023 security patch), the existing nodes drift and get replaced. Karpenter respects PDBs, so the cluster mode rollover behaves exactly like the image bump above — one pod evicted at a time, preStop does graceful failover, replica catches up before next pod.

To control timing:

# Disable Karpenter's automatic drift correction temporarily
kubectl -n karpenter patch nodepool valkey --type merge -p \
  '{"spec":{"disruption":{"budgets":[{"nodes":"0"}]}}}'

# … manually drain when ready …
kubectl drain ip-100-64-xxx --ignore-daemonsets --delete-emptydir-data

# Re-enable
kubectl -n karpenter patch nodepool valkey --type merge -p \
  '{"spec":{"disruption":{"budgets":[{"nodes":"1"}]}}}'

Rolling upgrade test (verification)

Before relying on the upgrade path in production, exercise it once with sustained traffic. A pattern that worked during chart development:

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)

# Terminal A — sustained workload
for i in $(seq 1 600); do
  kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
    valkey-cli -c -a "$PASS" --no-auth-warning SET "upgrade:probe:$i" "v" \
    && echo "$(date +%T) ok" || echo "$(date +%T) FAIL"
  sleep 0.5
done > /tmp/probe.log &

# Terminal B — monitor cluster health
while true; do
  kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning cluster info | grep cluster_state
  sleep 2
done

# Terminal C — trigger the upgrade
helm upgrade valkey-cluster ./cluster-mode-helm-chart -n valkey-cluster -f overrides.yaml

# After completion
grep -c FAIL /tmp/probe.log
# expect < 5% — failures concentrate at the moment the last primary fails over

Operational runbooks

Scale out (add a shard)

Cluster mode scales by adding shards (a primary + its replica), not by adding pods to existing shards. The chart's bootstrap is one-shot — for scale-out, follow this manual procedure:

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)

# 1. Scale the StatefulSet
kubectl -n valkey-cluster scale statefulset valkey-cluster --replicas=8

# 2. Wait for pods 6 and 7 to be Running (they'll be in "bootstrap-pending"
#    Readiness state — that's expected, they haven't been joined yet)
kubectl -n valkey-cluster get pods

# 3. Add pod 6 as a new primary
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning --cluster add-node \
    valkey-cluster-6.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
    valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379

# 4. Reshard — move ~4096 slots from existing primaries to pod 6
NEW_PRIMARY_ID=$(kubectl -n valkey-cluster exec valkey-cluster-6 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster myid)

kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning --cluster reshard \
    valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
    --cluster-from all --cluster-to "$NEW_PRIMARY_ID" \
    --cluster-slots 4096 --cluster-yes

# 5. Add pod 7 as a replica of pod 6, in a different AZ if possible
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning --cluster add-node \
    valkey-cluster-7.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
    valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
    --cluster-slave --cluster-master-id "$NEW_PRIMARY_ID"

Resharding is online — clients see brief MOVED / ASK redirects but no errors. Budget 30–60 seconds per GB of resharded data on legacy migration.

Scale in (remove a shard)

# 1. Drain slots off the shard being removed (move them to another primary)
DRAIN_ID=$(kubectl -n valkey-cluster exec valkey-cluster-6 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster myid)
DEST_ID=$(kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster myid)

kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning --cluster reshard \
    valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
    --cluster-from "$DRAIN_ID" --cluster-to "$DEST_ID" \
    --cluster-slots 16384 --cluster-yes

# 2. Remove pod 6 and 7 from the cluster
for ID in $DRAIN_ID $REPLICA_ID; do
  kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning --cluster del-node \
      valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
      "$ID"
done

# 3. Scale the StatefulSet down
kubectl -n valkey-cluster scale statefulset valkey-cluster --replicas=6

# 4. Clean up the orphaned PVCs (StatefulSet doesn't auto-delete them)
kubectl -n valkey-cluster delete pvc data-valkey-cluster-6 data-valkey-cluster-7

Recover from a wedged bootstrap

If the post-install Hook Job fails (e.g., a pod is stuck restarting), the Job stays in Failed state for debugging (the chart uses hook-delete-policy: before-hook-creation,hook-succeeded).

# Inspect logs
kubectl -n valkey-cluster logs -l app.kubernetes.io/component=bootstrap -c topology
kubectl -n valkey-cluster logs -l app.kubernetes.io/component=bootstrap -c bootstrap

# Fix the underlying issue (config error, image pull, etc.), then re-run helm
kubectl -n valkey-cluster delete job valkey-cluster-bootstrap
helm upgrade valkey-cluster ./cluster-mode-helm-chart -n valkey-cluster

bootstrap.sh is idempotent — if cluster_state:ok already, it exits 0 without re-running --cluster create.

Reset everything (dev / test)

./uninstall-cluster-mode.sh --purge     # removes PVCs + Secret + namespace
./install-cluster-mode.sh               # fresh install

Observability

The cluster-mode chart's ServiceMonitor is auto-discovered by kube-prometheus-stack (in the monitoring namespace), and the oliver006/redis_exporter sidecar is scraped on :9121. The general Grafana access pattern, dashboard import, and shared redis_* metrics are documented in the Replication → Observability section — the bundled dashboard's namespace template variable switches between valkey and valkey-cluster.

The cluster-specific metrics:

Metric	Meaning	Suggested alert
`redis_cluster_enabled`	Pod configured for cluster mode	`== 0`, critical (config drift)
`redis_cluster_slots_ok`	Slots in `OK` state across the cluster	`< 16384 for 2m`, critical
`redis_cluster_slots_pfail` + `redis_cluster_slots_fail`	Slots gossip suspects or has marked failed	`> 0 for 2m`, warning
`redis_cluster_known_nodes`	Total nodes gossip can see	`!= primaries × (1 + replicasPerPrimary) for 5m`, warning
`redis_cluster_size`	Primaries with at least one slot	`!= expected for 5m`, warning
`redis_cluster_slots_assigned`	Per-primary slot count — graph this during a reshard to watch slots migrate in real time	—

Useful cluster-scoped PromQL (in Grafana Explore):

# Cluster health
max(redis_cluster_slots_ok{namespace="valkey-cluster"})       # expect 16384
max(redis_cluster_size{namespace="valkey-cluster"})           # expect = primaries

# Per-shard throughput / memory / lag
sum by (pod) (rate(redis_commands_processed_total{namespace="valkey-cluster"}[1m]))
redis_memory_used_bytes{namespace="valkey-cluster"} / 1024 / 1024
redis_master_last_io_seconds_ago{namespace="valkey-cluster"}

For a live reshard demo, graph redis_cluster_slots_assigned per pod alongside the throughput query above — you'll see slot counts redistribute and the receiving primary's throughput spike.

Cross-region disaster recovery

For the region-loss scenario (single-AZ failures are handled by gossip-driven failover automatically — no DR runbook needed).

Valkey has no native cross-region replication. The supported pattern is periodic BGSAVE → S3 CRR → target-region restore, achievable with the bundled migration tooling:

RPO	RTO	Method
≤ snapshot interval (e.g., 1 h)	15–45 min	Cron BGSAVE → S3 CRR → restore in DR region
≤ 5 min	15–45 min	Same as above with tighter snapshot cadence
Near-zero	~30 s	App-layer dual-write to two regional clusters, DNS cutover — out of scope (app work, not a Valkey feature)

Prepare ahead of time

DR_REGION=us-east-1
DR_BUCKET="valkey-dr-snapshots-${DR_REGION}-$(aws sts get-caller-identity --query Account --output text)"
aws s3 mb "s3://${DR_BUCKET}" --region "${DR_REGION}"

# Enable S3 CRR on the primary-region valkey-migration bucket → DR_BUCKET.
# See AWS S3 CRR docs for the IAM role + replication-config.

# Pre-deploy the EKS infra in the DR region (no Valkey yet) so failover
# is just `helm install`.

Steady state — periodic snapshots

Cron job (Argo CronWorkflow / EKS CronJob / Lambda) in the primary region, per shard:

PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)
for shard in 0 1 2; do
  POD="valkey-cluster-${shard}"
  ROLE=$(kubectl -n valkey-cluster exec "$POD" -c valkey -- \
           valkey-cli -a "$PASS" --no-auth-warning role | head -1)
  [ "$ROLE" = "master" ] || continue        # skip if not the current primary

  kubectl -n valkey-cluster exec "$POD" -c valkey -- \
    valkey-cli -a "$PASS" --no-auth-warning BGSAVE
  # Wait for LASTSAVE to advance (see migration script), then kubectl cp
  # /data/dump.rdb out and aws s3 cp to <bucket>/cluster-mode/<shard>/dump.rdb.
done

S3 CRR replicates to the DR bucket within ~5 minutes.

Failover to DR

aws s3 ls "s3://${DR_BUCKET}/cluster-mode/" --recursive --region "${DR_REGION}"   # verify snapshots present
cd data-stacks/valkey-on-eks
AWS_REGION="${DR_REGION}" ./deploy.sh                                              # bring up EKS in DR
./install-cluster-mode.sh                                                          # install chart (empty cluster)

# Re-hydrate via valkey-cli --cluster import from a temp pod that can reach
# the source RDB on S3. (Per-shard initContainer RDB restore is on the
# chart's roadmap; until then, --cluster import is the cleanest path.)
# See: ./ec2-migration.md and ./cluster-mode.md#migration-from-replication-mode

# Verify + cut over
PASS=$(kubectl -n valkey-cluster get secret valkey-cluster-auth -o jsonpath='{.data.default}' | base64 -d)
kubectl -n valkey-cluster exec valkey-cluster-0 -c valkey -- \
  valkey-cli -a "$PASS" --no-auth-warning cluster info | head -3
# Update app DNS / LB to the DR cluster's endpoint.

Quarterly DR drill — snapshot, verify CRR copied, restore into a shadow namespace (valkey-cluster-dr), validate DBSIZE + sentinel keys, tear down. Record wall-clock time. The first drill always exposes one missing IAM permission or wrong-region path — catch it here, not during a real region loss.

Gotchas. AZ pairings derive from the DR cluster's topology, not preserved from source — correct. Hash slot identity IS preserved (slots 0–5460 stay on shard 0). Auth Secret in DR is a fresh password unless you pre-create it via auth.existingSecret. The shard-aware initContainer RDB restore for cluster mode is a chart enhancement on the roadmap.

Migration from Replication Mode

Already running this stack's replication mode and want to migrate to cluster mode without downtime?

Deploy the cluster-mode chart in the valkey-cluster namespace (the default). Both releases run in parallel.

Use valkey-cli --cluster import from a temporary client pod to live-migrate keys:

kubectl -n valkey-cluster run valkey-migrate --rm -it \
  --image=docker.io/valkey/valkey:9.0.2 --restart=Never -- /bin/sh

# Inside the pod:
valkey-cli --cluster import \
  valkey-cluster-0.valkey-cluster-headless.valkey-cluster.svc.cluster.local:6379 \
  --cluster-from valkey.valkey.svc.cluster.local:6379 \
  --cluster-from-pass "$REPLICATION_PASS" \
  --cluster-from-user default \
  --cluster-replace

Cut application traffic over to the cluster-mode endpoints (cluster-aware client library required).
Once stable, decommission the replication-mode release.

--cluster import uses DUMP / RESTORE for live key migration. Expect ~5–20 MB/s per source connection; parallelize by hash-slot range for very large datasets.

References

Valkey Cluster Specification — protocol-level reference for slot assignment, gossip, failover.
Valkey Cluster Tutorial — hands-on companion to the spec.
Valkey Administration Docs — THP, fork tuning, kernel guidance.
Valkey 1B RPS benchmark — scale ceiling demo (1000 shards).
valkey-helm PR #51 — the upstream PR being tracked for native cluster support.
valkey-helm #18 — official cluster-mode roadmap.
AWS ElastiCache supported node types — current instance families AWS recommends.
AWS gp3 EBS docs — IOPS/throughput knobs and pricing.
AWS VPC CNI Prefix Delegation — pod-density planning.

When to Use Cluster Mode​

Matches your EC2 setup​

Architecture​

How the cluster communicates internally​

Hash slots and client routing (port 6379)​

Gossip — the cluster bus (port 16379)​

Replication — primary → replica (port 6379, PSYNC)​

Hostname-aware addressing​

Choosing instance types​

What ElastiCache uses (managed)​

What to use on self-managed EKS​

I/O threads and CPU sizing​

Network bandwidth​

Opting into n-suffix (network-optimized) instances​

Cost​

Choosing storage​

EBS gp3 is the right default (not local NVMe)​

gp3 sizing by working set​

Persistence mode — pick one​

Cluster scaling limits​

Deployment​

Prerequisites​

Quickstart​

What the chart deploys​

AZ-aware bootstrap (default ON)​

Verifying the install​

Connecting clients​

Best practices​

Authentication​

Strict AZ spread​

Persistence (required for cluster mode)​

Kernel tuning (Transparent Huge Pages)​

PodDisruptionBudget — maxUnavailable: 1​

Strict one-pod-per-node anti-affinity​

Network policy​

How failover works​

Unplanned (automatic, gossip-driven)​

Planned (graceful, via CLUSTER FAILOVER)​

When failover stalls​

Planned upgrades​

Pre-flight checklist​

Chart bump (config-only)​

Valkey image version bump​

Karpenter AMI rollover​

Rolling upgrade test (verification)​

Operational runbooks​

Scale out (add a shard)​

Scale in (remove a shard)​

Recover from a wedged bootstrap​

Reset everything (dev / test)​

Observability​

Cross-region disaster recovery​

Prepare ahead of time​

Steady state — periodic snapshots​

Failover to DR​

Migration from Replication Mode​

References​