For now, I asked it to organize the points:
This is a very interesting project. I especially like that it is not presented as “yet another small Transformer,” but as a recurrent 3D substrate whose internal dynamics can be visualized, perturbed, and probed.
I cannot help with large-scale compute, and I am not suggesting that “just train v6.2 harder” should be the immediate next step. My impression is almost the opposite: before scaling the training, it may be more useful to separate the questions that are currently mixed together.
In particular, I think the project becomes much easier to evaluate if we distinguish these questions:
- Can this architecture produce language-like sequences at all?
- Does the 3D spatial structure actually matter?
- Which design choices are doing the work: dilation, fatigue, learned initial state, number of update steps, output-face readout, etc.?
- Are the reported spatial specializations functionally causal, or mainly visual/interpretive patterns?
- Is this best understood as a standalone language model, a recurrent memory substrate, a synthetic-data generator, an interpretable dynamical system, or an adapter-like module?
To me, the most interesting question may not be:
Can this beat Transformers?
but rather:
Does a local recurrent 3D system develop reproducible, causal internal organization when trained on language-like tasks?
That seems like a genuinely interesting research direction even if the model never becomes practically competitive as a language model.
1. A useful framing
I would frame the project less as:
“A new language model architecture that competes with Transformers”
and more as:
“A probeable 3D recurrent cellular substrate that can be trained on symbolic, semantic, and language-like tasks.”
That framing avoids making the project depend on beating GPT-like baselines, while preserving the interesting part: local communication, emergent spatial organization, recurrent computation, and visible internal dynamics.
This is also closer to how Neural Cellular Automata are usually studied. The classic reference is Growing Neural Cellular Automata, where the point is not just raw task performance, but how local learned update rules can produce stable, self-organizing, regenerative behavior.
There is also recent work connecting NCA-like dynamics to language model training, but in a different way: Training Language Models via Neural Cellular Automata uses NCA-generated spatiotemporal data as synthetic pre-pre-training data for language models. That paper is not doing exactly the same thing as this project, but it suggests that NCA dynamics may be useful as structured non-linguistic training signals, not only as standalone models.
So I think there are several possible interpretations of this project:
| Interpretation | What would be tested? |
| --- | --- |
| Standalone NCA language model | Can the 3D recurrent cube directly predict/generate language? |
| Recurrent memory substrate | Can the cube store and propagate information better than simpler recurrent baselines? |
| Synthetic pretraining generator | Can its dynamics produce useful structured data for other models? |
| Interpretable dynamics model | Do grammar/semantic/emotion-like regions emerge in a reproducible, causal way? |
| Adapter/refinement block | Can NCA-like local updates improve a Transformer/RNN/ConvLM as a component? |
I think the fourth interpretation, interpretable recurrent dynamics, is currently the most exciting one.
2. Suggested ablations
The first thing I would want is a small ablation table. Not necessarily huge training runs; just enough to clarify what is essential.
Possible variants:
| Variant | Question |
| --- | --- |
| Full v5 | Current reference point |
| No dilation | Is global coverage through dilation essential? |
| Dilation cycle changed | Is [1, 2, 4, 8] special, or just one reasonable schedule? |
| No synaptic fatigue | Does fatigue actually reduce repetition collapse? |
| Fatigue only at inference | Is it a training-time mechanism, inference-time heuristic, or both? |
| Fixed initial state | How much comes from the learned init_state? |
| Random initial state | Does the model rely on a learned “brain prior”? |
| Output face only | Is the opposite-face readout important? |
| Global pooled readout | Does reading from the whole cube improve or erase the spatial story? |
| Random output face | Is the z-axis information-flow interpretation robust? |
| Fewer update steps | Where does performance appear? |
| More update steps | Does extra recurrent computation help or degrade output? |
| 1D version | Is this just sequence convolution? |
| 2D version | Is 3D actually useful over a simpler spatial substrate? |
| 3D ConvNet without recurrence | Is recurrence doing real work? |
| Recurrent Conv3D without NCA framing | Is the “cellular” framing adding anything beyond a recurrent ConvNet? |
The goal would not be to “disprove” the model. The goal would be to locate the actual source of the effect.
For example, if removing synaptic fatigue causes much more “the the the” collapse, that is useful evidence. If global pooling beats output-face readout, that would weaken the “wave reaches the opposite face” story. If 2D performs similarly to 3D, then maybe the important thing is recurrence + convolution, not specifically a 3D cube. If the learned initial state is crucial, then the “brain DNA” idea becomes a real object of study rather than just a metaphor.
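To keep such a table honest, each variant should differ from the reference in exactly one factor. The sketch below shows one way to enumerate the variants and check that property. Every field name and default here (`fatigue`, `readout`, `update_steps=8`, the alternative dilation cycle `(1, 3, 9)`, and so on) is a hypothetical placeholder that mirrors the table, not the project's actual configuration.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CubeConfig:
    # Hypothetical field names and defaults; they mirror the ablation table,
    # not the project's real configuration.
    dilations: Tuple[int, ...] = (1, 2, 4, 8)
    fatigue: str = "train_and_inference"  # or "off", "inference_only"
    init_state: str = "learned"           # or "fixed", "random"
    readout: str = "output_face"          # or "pooled", "random_face"
    update_steps: int = 8                 # placeholder value
    spatial_dims: int = 3                 # 1, 2, or 3


def one_factor_variants(base: CubeConfig) -> Dict[str, CubeConfig]:
    """Return variants that each change exactly one field of `base`."""
    return {
        "no_dilation": replace(base, dilations=(1,)),
        "dilation_changed": replace(base, dilations=(1, 3, 9)),
        "no_fatigue": replace(base, fatigue="off"),
        "fatigue_inference_only": replace(base, fatigue="inference_only"),
        "fixed_init": replace(base, init_state="fixed"),
        "random_init": replace(base, init_state="random"),
        "pooled_readout": replace(base, readout="pooled"),
        "random_face_readout": replace(base, readout="random_face"),
        "fewer_steps": replace(base, update_steps=base.update_steps // 2),
        "more_steps": replace(base, update_steps=base.update_steps * 2),
        "1d": replace(base, spatial_dims=1),
        "2d": replace(base, spatial_dims=2),
    }


def changed_fields(base: CubeConfig, variant: CubeConfig) -> List[str]:
    """List the field names on which `variant` differs from `base`."""
    return [
        f.name
        for f in fields(base)
        if getattr(base, f.name) != getattr(variant, f.name)
    ]


base = CubeConfig()
for name, cfg in one_factor_variants(base).items():
    print(f"{name:24s} changes {changed_fields(base, cfg)}")
```

The non-recurrent and "no NCA framing" rows are architectural changes rather than configuration flags, so they are left out of this sketch.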
3. Suggested baselines
I would also suggest a few simple baselines with roughly similar parameter counts and the same data/tokenizer where possible.
Possible baselines:
| Baseline | Why it matters |
| --- | --- |
| Small GRU/LSTM | Minimal recurrent sequence baseline |
| Small Transformer | Standard language-modeling baseline |
| 1D ConvLM | Convolutional sequence baseline |
| Temporal CNN / TCN | Stronger non-attention sequence baseline |
| Recurrent Conv1D | Similar recurrence, no 3D substrate |
| Recurrent Conv2D | Spatial recurrence without full 3D |
| Recurrent Conv3D | Same broad compute family without the NCA interpretation |
| Neural GPU-like model | Classical recurrent convolutional algorithm-learning comparison |
The Neural GPU is especially relevant historically because it is a convolutional gated recurrent architecture that was studied for learning algorithmic sequence transformations. It is not the same as this project, but it is a useful comparison point for “local recurrent computation over a grid.”
I would not compare only against GPT-2 or modern Transformers. That comparison is too harsh and not very informative. A more useful question is:
Compared with other small recurrent/convolutional baselines, what does the 3D NCA-like substrate uniquely buy us?
4. Dynamics analysis
The internal dynamics are probably the most interesting part of the project. I would try to turn the qualitative “wave/eureka/decision” story into plots.
Useful measurements:
| Measurement | Purpose |
| --- | --- |
| Loss vs update step | Does performance really improve at a particular recurrent depth? |
| Entropy vs update step | Does the model become more confident during propagation? |
| Top-k distribution vs step | Does the predicted word sharpen over time? |
| Activation norm vs step | Does the cube stabilize, explode, or collapse? |
| Spatial activation center of mass | Does information actually move from input face to output face? |
| Mutual information with input tokens | Does input information propagate spatially over time? |
| Region-wise contribution to logits | Which regions causally affect output? |
| Seed-to-seed consistency | Does the same specialization reappear across runs? |
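As an illustration of the "spatial activation center of mass" row, here is a minimal NumPy sketch. It assumes the cube states have been saved as an array of shape `(steps, X, Y, Z, channels)` and that z is the input-to-output axis; both the layout and the function name are assumptions, not the project's actual format.

```python
import numpy as np


def z_center_of_mass(states):
    """Activation-weighted mean z index for every update step.

    states: array of shape (steps, X, Y, Z, channels) -- an assumed layout.
    Returns an array of shape (steps,).
    """
    energy = np.abs(states).sum(axis=-1)   # (steps, X, Y, Z)
    per_z = energy.sum(axis=(1, 2))        # (steps, Z)
    z_index = np.arange(states.shape[3])
    total = per_z.sum(axis=1)
    # Guard against an all-zero cube, which would otherwise divide by zero.
    return (per_z * z_index).sum(axis=1) / np.maximum(total, 1e-12)


# Toy check: a slab of activity that advances one z-layer per step.
steps, size = 5, 6
toy = np.zeros((steps, size, size, size, 2))
for t in range(steps):
    toy[t, :, :, t, :] = 1.0
print(z_center_of_mass(toy))
```

If the curve for real runs rises steadily from the input face toward the output face, that supports the "wave" description; a flat curve would suggest the cube fills up more or less uniformly.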
The reported “steps 6-7 eureka” is particularly interesting. I would want to see:
- Does the same step transition appear across many prompts?
- Does it appear across random seeds?
- Does it align with the dilation schedule?
- Does it still appear when the dilation cycle is changed?
- Does it appear on arithmetic/relations/language tasks equally?
- Does running more steps help, saturate, or degrade?
If the “eureka” phase is stable across prompts and seeds, that is much stronger than a single visualization.
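One way to make the step-transition question measurable is to read out logits after every update step and track loss, entropy, and top-1 probability. The sketch below assumes per-step logits for a single prediction are available as a `(steps, vocab)` array; how those are obtained from the model is not shown, and `stepwise_stats` is a hypothetical helper name.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def stepwise_stats(step_logits, target):
    """Per-step loss, entropy, and top-1 probability for one prediction.

    step_logits: (steps, vocab) logits read out after each update step.
    target: index of the correct next token.
    """
    logp = log_softmax(step_logits)
    p = np.exp(logp)
    loss = -logp[:, target]              # cross-entropy in nats
    entropy = -(p * logp).sum(axis=-1)   # predictive entropy in nats
    top1 = p.max(axis=-1)                # confidence of the argmax token
    return loss, entropy, top1


# Toy example: the target logit grows by 2 at each step.
toy_logits = np.zeros((4, 5))
toy_logits[:, 2] = 2.0 * np.arange(4)
loss, entropy, top1 = stepwise_stats(toy_logits, target=2)
print(np.round(loss, 3), np.round(entropy, 3), np.round(top1, 3))
```

Averaging these curves over many prompts and seeds, rather than plotting a single prompt, is what would show whether the drop around steps 6-7 is systematic.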
5. Interpretability/probing checklist
The reported spatial organization is the most exciting claim, but also the claim that needs the most care. Humans are very good at seeing meaning in visualizations. So I would try to convert each qualitative observation into a causal or statistical test.
Possible checks:
| Claim | Possible test |
| --- | --- |
| Region x=12 produces better language | Ablate x=12 and compare loss/generation quality |
| Region x=6 produces garbage | Patch x=6 into good generations or ablate it |
| Grammar is central | Train POS/syntax probes on cell states by location |
| Semantics is peripheral | Train semantic-category probes by location |
| Emotional words use z=12 | Compare activation maps for emotional vs neutral words |
| Semantic clusters exist | UMAP/PCA of cell states with word-category labels |
| Wave carries answer | Intervene on intermediate slices and measure output damage |
| Learned init_state contains specialization | Probe/visualize init_state before any input |
| Good/bad regions are stable | Repeat over seeds and datasets |
Some concrete interventions:
- Zero out one spatial region at a time.
- Add noise to one region at a time.
- Swap activations between two prompts.
- Patch the “good region” from one run into another.
- Freeze parts of the cube during training.
- Train linear probes per coordinate or per region.
- Compare probes against shuffled labels.
- Compare spatial maps across random seeds.
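For the probing items, a linear probe with a shuffled-label control could look like the sketch below. It swaps in a closed-form ridge regression onto one-hot labels as the probe (a simplification of the more usual logistic-regression probe), and assumes `features` holds the channel vector of one cell or region across many examples. All names are hypothetical.

```python
import numpy as np


def probe_accuracy(features, labels, num_classes, rng, ridge=1e-2, test_frac=0.25):
    """Held-out accuracy of a ridge-regression probe on one-hot labels."""
    n = features.shape[0]
    order = rng.permutation(n)
    n_test = int(n * test_frac)
    test_idx, train_idx = order[:n_test], order[n_test:]
    x = np.hstack([features, np.ones((n, 1))])   # append a bias column
    onehot = np.eye(num_classes)[labels]
    x_train, y_train = x[train_idx], onehot[train_idx]
    gram = x_train.T @ x_train + ridge * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ y_train)
    pred = (x[test_idx] @ weights).argmax(axis=1)
    return float((pred == labels[test_idx]).mean())


def probe_with_control(features, labels, num_classes, seed=0):
    """Return (accuracy on real labels, accuracy on permuted labels)."""
    rng = np.random.default_rng(seed)
    real = probe_accuracy(features, labels, num_classes, rng)
    shuffled = probe_accuracy(features, rng.permutation(labels), num_classes, rng)
    return real, shuffled


# Synthetic check: the class is linearly readable from feature 0 only.
data_rng = np.random.default_rng(1)
y = data_rng.integers(0, 2, size=1000)
x = data_rng.normal(size=(1000, 16))
x[:, 0] += 4.0 * y
print(probe_with_control(x, y, num_classes=2))
```

Running this once per coordinate or region gives a spatial map of probe accuracy; the gap between the real and shuffled numbers matters more than the raw accuracy.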
The key distinction is:
A region lighting up is not the same as a region being causally necessary.
If region ablation damages the relevant capability selectively, then the spatial specialization claim becomes much stronger.
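A region-ablation sweep can be sketched in a few lines. The function below zeroes one slab of the cube at a time and records how much the loss rises. It assumes the caller supplies `loss_fn`, a hypothetical callable that takes a (possibly ablated) state, runs any remaining update steps plus the readout, and returns a scalar loss; the toy loss here only stands in for that.

```python
import numpy as np


def slab_ablation_deltas(state, loss_fn, axis=0, width=1):
    """Loss increase caused by zeroing each slab along `axis`.

    state: cube state, e.g. shape (X, Y, Z, channels) -- an assumed layout.
    loss_fn: hypothetical callable mapping a state to a scalar loss.
    """
    base = loss_fn(state)
    deltas = []
    for start in range(0, state.shape[axis], width):
        ablated = state.copy()
        index = [slice(None)] * state.ndim
        index[axis] = slice(start, start + width)
        ablated[tuple(index)] = 0.0
        deltas.append(loss_fn(ablated) - base)
    return np.array(deltas)


# Toy stand-in: the "loss" depends only on slabs x=2 and x=3.
toy_state = np.ones((6, 4, 4, 2))


def toy_loss(s):
    return -float(s[2:4].sum())


print(slab_ablation_deltas(toy_state, toy_loss, axis=0))
```

Zero-ablation is the bluntest choice; replacing a slab with its mean activation or with activations from another prompt (patching) would be gentler variants of the same loop.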
6. Synthetic tasks before full natural language
Natural language is very hard to diagnose because many failure modes are entangled: tokenization, data size, frequency bias, repetition collapse, long-range dependency, objective mismatch, readout design, and recurrent depth.
Before focusing too much on open-ended text generation, I would test a ladder of synthetic tasks.
Suggested task ladder:
| Task | Capability tested |
| --- | --- |
| Copy | Can the cube preserve input? |
| Shift | Can information propagate directionally? |
| Reverse | Can it perform nontrivial sequence manipulation? |
| Parity | Can it aggregate global information? |
| Modular addition | Can it learn algorithmic rules? |
| Bracket matching / Dyck language | Can it model stack-like structure? |
| Associative recall | Can it bind keys and values? |
| Small symbolic grammar | Can it learn controlled next-token structure? |
| Character-level corpus | Can it model language without large word vocab issues? |
| Word-level small corpus | Can it handle sparse word prediction? |
This would clarify where the architecture fails. If it cannot solve copy/reverse/parity reliably, then weak natural-language generation is unsurprising. If it solves synthetic grammar but fails at word-level language, then the bottleneck may be vocabulary/readout/data rather than the recurrent substrate itself.
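The first few rungs of the ladder are cheap to generate. The sketch below uses only the standard library; the task definitions (for example, "shift" as a cyclic shift by one position) and the function names are my own assumptions about what such tasks could look like, not a fixed benchmark.

```python
import random


def make_example(task, length, rng, vocab=8):
    """Return (input_tokens, target_tokens) for one rung of the ladder."""
    seq = [rng.randrange(vocab) for _ in range(length)]
    if task == "copy":
        return seq, list(seq)
    if task == "shift":
        # One possible definition: cyclic shift right by one position.
        return seq, seq[-1:] + seq[:-1]
    if task == "reverse":
        return seq, seq[::-1]
    if task == "parity":
        bits = [tok % 2 for tok in seq]
        return bits, [sum(bits) % 2]
    raise ValueError(f"unknown task: {task}")


def random_dyck(pairs, rng):
    """Random balanced bracket string containing `pairs` bracket pairs."""
    out, open_count, remaining = [], 0, pairs
    while remaining or open_count:
        if remaining and (open_count == 0 or rng.random() < 0.5):
            out.append("(")
            open_count += 1
            remaining -= 1
        else:
            out.append(")")
            open_count -= 1
    return "".join(out)


def is_balanced(brackets):
    """Check that a string of '(' and ')' is a valid Dyck word."""
    depth = 0
    for ch in brackets:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


rng = random.Random(0)
for task in ("copy", "shift", "reverse", "parity"):
    print(task, make_example(task, 6, rng))
word = random_dyck(5, rng)
print(word, is_balanced(word))
```

Sweeping `length` beyond the training range gives a quick length-generalization check for each task.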
Related work like LifeGPT, AutomataGPT, and Learning Elementary Cellular Automata with Transformers studies the opposite direction (Transformers learning CA dynamics), but those papers are still useful because they suggest evaluation patterns for local-rule systems: forecasting, rule inference, intermediate-state prediction, and generalization to unseen dynamics.
7. Possible alternative roles for the model
I would not restrict the project to “standalone language model.” There are other ways the idea could be valuable.
A. Recurrent memory substrate
The cube could be a memory/update substrate that receives token embeddings and evolves for several steps. Then another model reads from it.
Questions:
- Does it store local context better than a simple recurrent state?
- Does it denoise representations?
- Does it preserve information over many update steps?
- Does it help on associative recall or algorithmic tasks?
B. Adapter/refinement module
NCA-like blocks could be used inside another model instead of replacing the whole model. For example, AdaNCA uses NCA-style adapters between Vision Transformer layers to improve robustness. That is vision, not language, but the architectural idea is relevant: NCA as a plug-in refinement module rather than the entire model.
Possible language analogues:
- Transformer + NCA adapter
- RNN + NCA memory
- ConvLM + NCA refinement block
- Decoder-only LM with local NCA hidden-state smoothing
- NCA block between attention and MLP layers
C. Synthetic dynamics generator
The cube may be more useful for generating structured non-linguistic trajectories than for directly generating language. This connects to Training Language Models via Neural Cellular Automata, where NCA-generated data is used as synthetic pre-pre-training data before natural-language training.
Questions:
- Can 3D NCA trajectories produce useful synthetic curricula?
- Does pretraining on those trajectories help small language models?
- Does the complexity of the NCA dynamics matter?
- Are 3D dynamics more useful than 1D/2D CA dynamics?
D. Interpretable dynamical system
Even if the model is weak as an LM, it may be valuable as a visible dynamical system trained on language-like tasks.
Questions:
- Does syntax-like information localize?
- Does semantic category information localize?
- Are localized regions stable across seeds?
- Are regions causally necessary?
- Can “thought over time” be measured through recurrent steps?
This seems like the most compelling direction to me.
8. What I would prioritize
If I were organizing the next steps without providing compute, I would prioritize:
- Minimal reproducibility
- Ablation table
- Baseline table
- Step-wise dynamics plots
- Region ablation
- Simple probes
- Synthetic task ladder
- Only then larger training
A possible short roadmap:
Phase 1: Reproduce and measure
- Run v5 inference.
- Record predictions by recurrent step.
- Plot loss/entropy/confidence over steps.
- Check repetition rate.
- Test more/fewer recurrent steps.
- Save activation maps for a fixed prompt set.
Phase 2: Ablate
- Remove or alter dilation.
- Remove fatigue.
- Compare output-face readout vs pooled readout.
- Compare learned vs fixed init state.
- Compare 1D/2D/3D variants if feasible.
Phase 3: Baseline
- Train/evaluate small GRU, small Transformer, 1D ConvLM, and recurrent Conv3D baselines under similar conditions.
- Use the same tokenizer/data where possible.
- Report validation loss, next-token accuracy, repetition rate, and parameter count.
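"Repetition rate" needs a concrete definition before it can be compared across models. The sketch below gives two simple candidates: the fraction of n-grams that duplicate an earlier n-gram, and the longest run of one repeated token. Both definitions are assumptions of mine; the project may already use a different metric.

```python
def repeated_ngram_rate(tokens, n=2):
    """Fraction of n-grams that duplicate another n-gram in the sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def longest_run(tokens):
    """Length of the longest run of one immediately repeated token."""
    best = run = 0
    prev = object()  # sentinel that equals no real token
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


collapsed = "the the the the".split()
healthy = "the cat sat on the mat".split()
print(repeated_ngram_rate(collapsed, n=1), longest_run(collapsed))
print(repeated_ngram_rate(healthy, n=2), longest_run(healthy))
```

Reporting the same metric on held-out reference text gives a floor, since natural language repeats some n-grams too.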
Phase 4: Probe
- POS probe.
- Semantic category probe.
- Emotion/neutral contrast.
- Region ablation.
- Activation patching.
- Seed consistency.
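Seed consistency can be summarized as the average pairwise correlation between spatial maps from independently trained runs. The sketch below assumes one map per seed (for example, probe accuracy or ablation damage per cell) stacked into a `(num_seeds, X, Y, Z)` array; the function name and layout are hypothetical.

```python
import numpy as np


def seed_consistency(maps):
    """Pairwise Pearson correlations between per-seed spatial maps.

    maps: array of shape (num_seeds, X, Y, Z) -- an assumed layout.
    Returns (correlation matrix, mean off-diagonal correlation).
    A perfectly constant map has zero variance and would yield NaN.
    """
    flat = np.asarray(maps, dtype=float).reshape(len(maps), -1)
    corr = np.corrcoef(flat)
    off_diagonal = corr[~np.eye(len(flat), dtype=bool)]
    return corr, float(off_diagonal.mean())


# Toy comparison: three seeds sharing one pattern vs three unrelated maps.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(8, 8, 8))
shared = np.stack([pattern + 0.1 * rng.normal(size=pattern.shape) for _ in range(3)])
unrelated = rng.normal(size=(3, 8, 8, 8))
print(seed_consistency(shared)[1], seed_consistency(unrelated)[1])
```

A cell-by-cell comparison only makes sense if the input and output faces pin down the cube's orientation. If the architecture is symmetric under some reflection or rotation, maps may need to be aligned before they are correlated.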
Phase 5: Reframe
Depending on results, decide whether the model is best pursued as:
- standalone NCA-LM,
- recurrent memory,
- interpretable dynamics system,
- synthetic data generator,
- or adapter module.
9. A compact experiment matrix
One possible experiment table:
| Experiment | Minimal output |
| --- | --- |
| Step sweep | loss/entropy/confidence vs recurrent step |
| Dilation ablation | validation loss + repetition rate |
| Fatigue ablation | repetition/collapse metrics |
| Init-state ablation | performance drop from learned to fixed/random init |
| Readout ablation | output face vs pooled vs random face |
| Region ablation | heatmap of loss increase by region |
| Probe map | spatial map of syntax/semantic probe accuracy |
| Seed repeat | whether specialization recurs |
| Baseline comparison | small Transformer/GRU/ConvLM/recurrent Conv3D |
| Synthetic tasks | copy/reverse/parity/Dyck/associative recall |
This would make the project much easier to discuss.
10. Suggested wording of the main contribution
If this were written as a more formal project note, I would avoid claiming:
“A 3D brain learned language.”
I would phrase it more conservatively:
“We explore a recurrent 3D neural cellular substrate for language-like prediction, and investigate whether local update dynamics produce reproducible spatial specialization.”
That keeps the interesting claim while making it more testable.
A stronger version, if supported by ablations, could be:
“Although not competitive with Transformer baselines as a language model, the system shows measurable recurrent phase transitions and spatially localized representations that can be probed and causally intervened upon.”
That would be a very interesting result.
11. Why I think this is worth exploring
The raw language modeling performance is not the main reason this is interesting. The interesting part is that the model gives you a spatial, recurrent, perturbable object.
Transformers are powerful, but their internal representations are not naturally laid out as a physical 3D substrate. Here, even if the model is much weaker, you can ask questions like:
- Where does information enter?
- How does it move?
- When does a prediction become confident?
- Which regions matter?
- Do regions specialize?
- Does specialization survive retraining?
- Can we damage a region and observe selective failure?
- Can we watch recurrence improve or destroy the answer?
That makes the project interesting as a small experimental system.
12. Final take
My current impression:
- I would not spend the next effort mainly on larger training.
- I would not frame it as a Transformer competitor.
- I would focus on ablations, baselines, and causal interpretability.
- I would test synthetic tasks before open-ended language.
- I would consider roles other than standalone LM: memory substrate, adapter, synthetic-data generator, or interpretable dynamical system.
The most valuable next contribution might simply be a clear evaluation map.
Something like:
“Here are the claims, here are the ablations that test each claim, here are the baselines, and here are the probes that would make the emergent-organization story stronger.”
That kind of organization could make the project much easier for others to engage with, even without anyone immediately providing large-scale compute.