Attention Is All We Had — But Not What We Needed: Language generation without attention via iterative energy-based state refinement

We introduce CSM (Convergent State Machine), a language model with zero attention layers that uses energy-based iterative state refinement over 16 state vectors.

Key results:

- 66M and 150M models, zero attention anywhere
- 150M matches GPT-2 1.5B on MMLU within 0.3% (10x fewer params, 13x less data)
- Perplexity decreases monotonically with more iterations
- State dynamics scale with model size (66M → iter 15, 150M → iter 30+)
- Total training cost: under $50

Paper: Attention Is All We Had — But Not What We Needed: Convergent State Machine for Iterative Energy-Based Language Generation
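
For readers who want a concrete picture of what "iterative state refinement" with a state-change metric can look like, here is a minimal toy sketch. It is not the CSM implementation: the post does not specify the energy function, so the quadratic energy, the `targets`, the width `DIM`, the `step_size`, and the definition of delta as the Frobenius norm of the state change are all assumptions made purely for illustration. Only the count of 16 state vectors comes from the post.

```python
import math
import random

NUM_STATES = 16   # from the post: 16 state vectors
DIM = 8           # hypothetical width, chosen only for illustration


def energy(states, targets):
    """Toy quadratic energy: half the squared distance to the targets."""
    return 0.5 * sum(
        (s - t) ** 2
        for s_vec, t_vec in zip(states, targets)
        for s, t in zip(s_vec, t_vec)
    )


def refine(states, targets, step_size=0.2, iterations=15):
    """Run gradient descent on the toy energy and record delta per iteration.

    delta is the Frobenius norm of the change in the full state between
    consecutive iterations (an assumed definition, not the paper's).
    """
    deltas = []
    for _ in range(iterations):
        # The gradient of the toy energy with respect to s is (s - t).
        new_states = [
            [s - step_size * (s - t) for s, t in zip(s_vec, t_vec)]
            for s_vec, t_vec in zip(states, targets)
        ]
        delta = math.sqrt(sum(
            (n - s) ** 2
            for n_vec, s_vec in zip(new_states, states)
            for n, s in zip(n_vec, s_vec)
        ))
        deltas.append(delta)
        states = new_states
    return states, deltas


rng = random.Random(0)
targets = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_STATES)]
initial = [[0.0] * DIM for _ in range(NUM_STATES)]
final, deltas = refine(initial, targets)
print([round(d, 4) for d in deltas[:3]])
```

In this toy setting delta shrinks by a constant factor of `1 - step_size` per iteration, so it only illustrates the bookkeeping; a learned energy would not be expected to converge this uniformly.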

Looks very interesting, will check this out seriously.

Thank you. Happy to answer any questions about the architecture or results.


Interesting work. The CSM architecture confirms something I’ve been measuring from the other direction — and I think I can add data you might find useful.

What you found architecturally, I’ve been measuring empirically on standard Transformers.

Your ∆ metric (state change magnitude tracking convergence across iterations) is conceptually very close to what I call κ (kappa) — an inter-layer desynchronization index I measure on GPT-2, OPT, and Qwen during inference. Both capture internal dynamic stability. Both show that “how the model thinks” matters as much as “what the model outputs.”

Your finding that the 150M model sustains refinement 2× longer than the 66M (∆ > 0.19 at iteration 40) maps directly to what I found in my cross-model perturbation audits: architectural depth confers dynamic resilience. Qwen05 (24 layers) absorbs perturbations that destabilize GPT-2 (12 layers) by a factor of ~5×. The skeleton constrains the dynamics, but training selects the emergent dynamic identity — and I quantified this: R²_skeleton = 0.341 on dynamic variance. 66% of what a model does during inference is not predicted by its parameter count or layer structure alone.

Where I think I can add to your work:

  1. Perplexity improvement with iteration depth. You show monotonic PPL decrease. I found the same pattern with HYBRID_RECOVERY — a real-time adaptive stabilization protocol that reduces κ and restores dynamic readiness during generation. More stable internal dynamics → better next-token prediction. Same phenomenon, different method.

  2. The benchmark-iteration gap. You note that MMLU doesn’t improve with iterations despite PPL improving. I found the exact same dissociation: dynamic recovery does not guarantee semantic recovery. You can stabilize the model’s internal trajectory without improving multiple-choice accuracy. This is a structural finding, not a failure mode — the dynamic layer and the semantic layer are partially decoupled.

  3. Your scaling hypothesis. “If useful iteration range scales with model capacity, a 7B CSM should sustain iterations to ~100+.” I can partially validate this from the measurement side: Qwen05 (24 layers, 500M params) shows a nonlinear perturbation threshold at α≈0.75, while GPT-2 (12 layers, 124M) destabilizes at α≈0.10. Depth helps. But architecture matters more than depth alone — Qwen’s threshold is 7.5× higher, not just 2×.
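
The ratios behind the numbers quoted in this reply can be checked with a few lines of arithmetic. This sketch uses only the figures stated above (layer counts, parameter counts, α thresholds, and R²_skeleton); the variable names are made up for illustration.

```python
# Figures quoted in the reply above.
qwen = {"layers": 24, "params_m": 500, "alpha": 0.75}
gpt2 = {"layers": 12, "params_m": 124, "alpha": 0.10}

depth_ratio = qwen["layers"] / gpt2["layers"]
param_ratio = qwen["params_m"] / gpt2["params_m"]
threshold_ratio = qwen["alpha"] / gpt2["alpha"]

# Share of dynamic variance not explained by the "skeleton" regression.
r2_skeleton = 0.341
unexplained = 1 - r2_skeleton

print(f"depth ratio:     {depth_ratio:.2f}x")      # 2.00x
print(f"parameter ratio: {param_ratio:.2f}x")      # 4.03x
print(f"threshold ratio: {threshold_ratio:.2f}x")  # 7.50x
print(f"unexplained:     {unexplained:.3f}")       # 0.659
```

The threshold ratio (7.5×) exceeds both the depth ratio (2×) and the parameter ratio (about 4×), which is the basis for the claim that architecture matters beyond depth alone. The 0.659 figure is the "66%" mentioned earlier.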

I’ve published the full measurement framework as a 3-paper series on Zenodo (open access). The dataset is on HuggingFace as jeanbatuli/LLM-Interne-Dynamic. Paper 1 covers the four-regime taxonomy across 10 models and 158 runs. Paper 2 is the methodological audit across 17 models with variance decomposition and documented falsifications. Paper 3 covers the perturbation-recovery protocol and the dynamic-semantic dissociation finding.

If you’re interested, I’d be curious to see whether CSM’s ∆ dynamics correlate with κ/readiness metrics on a matched task. Same measurement framework, different architecture — that’s how we’d know whether “internal dynamic stability” is a universal property or architecture-specific.

Thank you Jean, this is exactly the kind of cross-validation
that strengthens both our findings.

The Δ-κ correspondence is striking — independent metrics
capturing the same internal dynamic stability from different
architectures. Your finding that the dynamic-semantic layers
are partially decoupled explains precisely why our MMLU
scores remain flat while perplexity improves monotonically
with iteration depth.

I’m very interested in measuring κ on CSM’s iteration
dynamics. A cross-architecture comparison would be
valuable for both our work.

I’ll read your 3-paper series carefully. Could you share
the DOIs?

Currently training a 300M CSM to test whether the useful
iteration range extends further with scale. Results within
24 hours.

Arunesh Dwivedi
VKD Industries

Arunesh — glad the Δ-κ correspondence resonated. It’s rare to find independent work converging on the same dynamic stability principle from opposite directions (architecture design vs. empirical measurement).

The DOIs for the 3-paper series:

  1. Four Dynamical Regimes in LLMs: An Empirical Phase Map
    10.5281/zenodo.20348878

  2. Methodological Audit of Trajectory Instability
    10.5281/zenodo.20361289

  3. Dynamic-Layer Controllability
    10.5281/zenodo.20400171

All three are open access on Zenodo. The dataset is on HuggingFace as jeanbatuli/LLM-Interne-Dynamic.

On your 300M CSM: I’d be very interested in whether the ∆ → 0 convergence point extends as predicted. My perturbation threshold data suggests the relationship isn’t purely linear with parameter count (Qwen05 at 500M shows a ~7.5× higher threshold than GPT-2 at 124M, not the ~4× that pure parameter scaling would predict). Architecture matters alongside scale.

If you can expose iteration-level hidden states during CSM inference, I can run the same κ/readiness pipeline on CSM that I use on Transformers. Same operator, different architecture. That’s the cleanest cross-validation we could ask for.

-- Jean-Denis

Thank you for the DOIs. Will read them.

Your finding that scaling is superlinear (7.5x threshold
from 4x params) is very interesting. CSM is designed for
iteration from the ground up, so the scaling might be
even stronger here.

300M model finishes training in ~4 hours. I’ll run the
iteration test right after and share the delta values at
each depth (3, 5, 10, 15, 20, 25, 30, 40, 45).

For κ: yes, I can give you the full state trajectory
at each iteration — 16 vectors at every step. Same input,
different depths. You can run your κ pipeline directly on it.
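
One possible way to package such a trajectory for exchange is sketched below. The layout (a JSON record holding one list of 16 vectors per iteration) and the helper name `pack_trajectory` are hypothetical and not an agreed format; the placeholder data stands in for real CSM states.

```python
import json

NUM_STATES = 16  # from the thread: 16 state vectors per iteration


def pack_trajectory(trajectory, depth):
    """Validate a trajectory and return a JSON-serializable record.

    trajectory: list with one entry per iteration; each entry is a list of
    NUM_STATES vectors (lists of floats) of equal length.
    """
    if depth < 1 or len(trajectory) != depth:
        raise ValueError(f"expected {depth} iterations, got {len(trajectory)}")
    dim = len(trajectory[0][0])
    for step, states in enumerate(trajectory):
        if len(states) != NUM_STATES:
            raise ValueError(f"iteration {step}: expected {NUM_STATES} vectors")
        if any(len(vec) != dim for vec in states):
            raise ValueError(f"iteration {step}: inconsistent vector width")
    return {"depth": depth, "num_states": NUM_STATES, "dim": dim,
            "states": trajectory}


# Placeholder data standing in for real CSM states (depth 3, width 4).
fake = [[[float(i + j)] * 4 for j in range(NUM_STATES)] for i in range(3)]
record = pack_trajectory(fake, depth=3)
encoded = json.dumps(record)
decoded = json.loads(encoded)
print(decoded["depth"], decoded["num_states"], decoded["dim"])  # 3 16 4
```

Running the same input at each depth and writing one record per depth would give the receiving side everything needed to recompute per-iteration metrics without access to the model.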

Will share the data once the model is ready.

Also, can you share your email? Or you can mail me at aruneshdwivedi87@gmail.com

Arunesh

Hey Jean, the data is ready, but I can’t send it here.

New Version

Hello, I sent you an email, but here is my address as well: jean.bosange@gmail.com