slime troubleshooting

Known gotchas when training with the slime backend.

For reproducibility, here’s the exact environment this integration was validated against:

Component      Version / SHA
Instance type  8 × NVIDIA H100 80GB HBM3
CUDA           12.9
PyTorch        2.9.1+cu129
Docker image   slimerl/slime@sha256:0100c933f1f63e7c4acdb9ec575e769839d59de4a648551e09e3fe0e7885631b (built 2026-04-28)
slime          commit f3e7bd7f3091d3be05c20977eefb31a785d6221d (2026-04-28)
SGLang         v0.5.9
Megatron-LM    commit 3714d81d418c9f1bca4594fc35f9e8289f652862 ⚠ see note

LinearCrossEntropyModule parallelism error on 32B (or any model with untied embeddings)

Symptom: During 32B training, the Megatron actor crashes with:

ValueError: Cannot determine parallelism type for module 'LinearCrossEntropyModule'
at weight 'output_layer.weight'.

Cause: The Megatron-LM bundled in slimerl/slime:latest is several hundred commits ahead of the sha pinned in slime’s docker README (3714d81d). Specifically, Megatron PR #3226 “Reapply fix Linear CE Fusion” (2026-02-04) replaced ColumnParallelLinear with a new LinearCrossEntropyModule that megatron-bridge’s AutoMapping doesn’t recognize. Models with tied embeddings (0.5B, 3B, 7B) skip this code path; models with --untie-embeddings-and-output-weights (32B and up) hit it.

Fix: Inside the container, check out the SHA pinned in slime’s docker README in /root/Megatron-LM:

cd /root/Megatron-LM
# Stash any image-local patches first (can be restored later with `git stash pop`)
git stash -u -m "slime local patches"
git checkout 3714d81d418c9f1bca4594fc35f9e8289f652862
# Clear __pycache__ directories compiled from the previously checked-out (newer) code
find . -name __pycache__ -type d -exec rm -rf {} + 2>/dev/null
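
To confirm the checkout took effect before relaunching training, the HEAD commit can be compared against the pinned SHA. The following is a minimal sketch, not part of slime: `read_head` and `is_pinned` are hypothetical helper names, and it assumes `git` is on the PATH inside the container and that Megatron-LM lives at /root/Megatron-LM as above.

```python
import subprocess

# Pin quoted in this troubleshooting entry.
PINNED_SHA = "3714d81d418c9f1bca4594fc35f9e8289f652862"


def read_head(repo_dir: str = "/root/Megatron-LM") -> str:
    """Return the full commit SHA that HEAD points at in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def is_pinned(head_sha: str, pinned: str = PINNED_SHA) -> bool:
    """True if head_sha equals the pin, or starts with an abbreviated pin."""
    head = head_sha.strip().lower()
    pin = pinned.strip().lower()
    # An empty pin would match everything, so reject it explicitly.
    return bool(pin) and head.startswith(pin)
```

Inside the container, `is_pinned(read_head())` should return `True` after the checkout; `False` suggests the image's newer Megatron-LM is still checked out.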

--norm-epsilon mismatch on Qwen2.5-32B-Instruct

Cause: slime’s scripts/models/qwen2.5-32B.sh hardcodes --norm-epsilon 1e-5 (matching Qwen2.5-32B base), but Qwen2.5-32B-Instruct uses 1e-6.

Fix: Edit the slime model script to --norm-epsilon 1e-6, or pass an override through SlimeRunner(extra_flags=["--norm-epsilon", "1e-6"]). 0.5B/3B/7B Instruct variants match their base-model norm epsilons, so this only affects 32B.
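
A mismatch like this can be caught before training by comparing the flag in the model script with the checkpoint's own config. The sketch below rests on two assumptions not stated above: that the checkpoint directory is in Hugging Face format with a `config.json` containing `rms_norm_eps`, and that the model script spells the flag literally as `--norm-epsilon <value>`. All function names are hypothetical.

```python
import json
import math
import re
from pathlib import Path
from typing import Optional

# Matches "--norm-epsilon 1e-5" or "--norm-epsilon=1e-5" in a shell script.
_FLAG_RE = re.compile(r"--norm-epsilon[\s=]+([0-9.eE+-]+)")


def script_norm_epsilon(script_text: str) -> Optional[float]:
    """Return the --norm-epsilon value found in script_text, or None."""
    match = _FLAG_RE.search(script_text)
    return float(match.group(1)) if match else None


def checkpoint_norm_epsilon(checkpoint_dir: str) -> float:
    """Read rms_norm_eps from config.json (assumed Hugging Face layout)."""
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return float(config["rms_norm_eps"])


def epsilons_match(script_path: str, checkpoint_dir: str) -> bool:
    """True if the script's flag agrees with the checkpoint's config."""
    script_eps = script_norm_epsilon(Path(script_path).read_text())
    if script_eps is None:
        return False  # flag not present, so nothing to compare
    return math.isclose(
        script_eps, checkpoint_norm_epsilon(checkpoint_dir), rel_tol=1e-9
    )
```

For example, `epsilons_match("scripts/models/qwen2.5-32B.sh", "<path to Qwen2.5-32B-Instruct>")` would be expected to return `False` with the unmodified script and `True` once the script is edited to 1e-6. The check does not see overrides passed through `extra_flags`.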