I'm not an engineer. I just wanted to see if a 3D cube of cells could learn to talk

Hi everyone,

I want to share a project I’ve been working on for the past week. I’m not a machine learning engineer, I don’t have a
CS degree, and I had no idea if this would work. I just had a question: what if instead of Transformers, we used a 3D
grid of simple cells that only talk to their neighbors?

Like a brain made of tiny cells, where information travels as waves. No attention, no layers — just local
communication.

It kind of worked. And along the way, I found things I didn’t expect.

The idea

I built a Neural Cellular Automaton in 3D — a 16×16×16 cube (4,096 cells) where each cell can only see its 26
immediate neighbors. Information enters one face of the cube, propagates as waves through the interior, and the
prediction is read from the opposite face.

Think of it like dropping a pebble in a pond — the ripples carry the information.
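
To make the "26 immediate neighbors" figure concrete, here is a minimal sketch of the neighborhood in a 16×16×16 grid. It assumes no wrap-around at the faces, which the post does not state; the function name `neighbors` is only illustrative.

```python
from itertools import product

GRID = 16  # side length of the cube, as described in the post

# Every offset in a 3x3x3 block except the cell itself.
OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def neighbors(x, y, z, size=GRID):
    """In-bounds neighbor coordinates (assumption: no wrap-around at the faces)."""
    cells = []
    for dx, dy, dz in OFFSETS:
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < size and 0 <= ny < size and 0 <= nz < size:
            cells.append((nx, ny, nz))
    return cells


print(GRID ** 3)                 # total number of cells
print(len(neighbors(8, 8, 8)))   # interior cell
print(len(neighbors(0, 8, 8)))   # cell on a face
print(len(neighbors(0, 0, 0)))   # corner cell
```

Under that assumption, only interior cells have the full 26 neighbors; cells on the input and output faces see fewer.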

Phase 1: Can it do math?

I started simple: arithmetic. Addition, subtraction, multiplication, division.

With just 499K parameters (a Transformer would need millions), the model reached 98.4% accuracy on numbers it had
never seen during training. Not memorization — actual generalization. It learned the rules of arithmetic.

That gave me confidence. If a cube of cells can learn math, maybe it can learn something harder.

Phase 2: Does it understand relationships?

I taught it semantic relations: “dog is_a animal”, “Paris capital_of France”, “king opposite_of queen”. 100 relations,
thousands of pairs.

73.4% test accuracy. 87.5% generalization to novel combinations.

Then grammar + semantics together (184 relations): 93.5% overall. The Conv3d weights that learned math could also
learn world knowledge. Same brain, different skills.

Phase 3: Can it reason?

I tested transitive reasoning without training for it. If it knows “wolf is_a mammal” and “mammal produces milk”, can
it infer “wolf → milk”?

83.3% on novel chains it had never seen. wolf->mammal->milk, shark->fish->water, penguin->bird->fly. Reasoning emerged from
the structure.
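
One way to build such a held-out chain test can be sketched as follows. The triples are toy data: the wolf example comes from the text above, while the relation name `lives_in` and both helper functions are hypothetical. The idea is to compose two known facts and keep only chains whose start/end pair never appears as a direct fact.

```python
# Toy facts as (head, relation, tail) triples.
FACTS = [
    ("wolf", "is_a", "mammal"),
    ("mammal", "produces", "milk"),
    ("shark", "is_a", "fish"),
    ("fish", "lives_in", "water"),  # hypothetical relation name
]


def two_hop_chains(facts):
    """Compose (a, is_a, b) with any (b, r, c) into an (a, b, c) chain."""
    by_head = {}
    for head, rel, tail in facts:
        by_head.setdefault(head, []).append((rel, tail))
    chains = []
    for head, rel, tail in facts:
        if rel != "is_a":
            continue
        for _, end in by_head.get(tail, []):
            chains.append((head, tail, end))
    return chains


def held_out(chains, facts):
    """Keep chains whose (start, end) pair is not already a direct fact."""
    direct = {(head, tail) for head, _, tail in facts}
    return [c for c in chains if (c[0], c[2]) not in direct]


print(held_out(two_hop_chains(FACTS), FACTS))
```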

It also learned to answer questions: “capital of France?” → “Paris”. 85% accuracy on direct questions, 75% on novel
combinations.

Phase 4: Language (the hard part)

This is where it got interesting — and where I failed many times.

9 versions of text generation failed. Every single one collapsed to “the the the” or “the of in a”. The most common
English words dominated everything.

The breakthrough came with three changes:

  1. Dilated convolutions — cycle [1, 2, 4, 8] so each cell can “see” the entire grid in 4 steps
  2. Word embeddings — switching from characters to a 30K word vocabulary
  3. Synaptic fatigue — cells that fire too much get tired, preventing repetition
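
The "entire grid in 4 steps" claim can be checked with simple arithmetic. This is a sketch under two assumptions not stated in the post: the kernel is 3×3×3, and one dilation from the cycle is applied per update step, so each step moves information by at most that dilation along an axis.

```python
GRID = 16
DILATIONS = [1, 2, 4, 8]


def reach_after(steps, dilations=DILATIONS):
    """Max per-axis distance (in cells) a signal can travel after `steps` updates."""
    return sum(dilations[i % len(dilations)] for i in range(steps))


def steps_to_cover(size=GRID, dilations=DILATIONS):
    """Smallest number of updates after which any cell can influence any other."""
    steps = 0
    while reach_after(steps, dilations) < size - 1:
        steps += 1
    return steps


for s in range(1, 5):
    print(s, reach_after(s))          # cumulative reach per step
print(steps_to_cover())               # with the dilation cycle
print(steps_to_cover(dilations=[1]))  # without dilation
```

With these assumptions the cumulative reach is 1 + 2 + 4 + 8 = 15 cells, which is exactly the largest distance between two cells along one axis of a 16-cell grid.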

The current model (v5) generates coherent phrases:

“she started to play together again”
“the little girl wanted to play with her parents”
“he said that he was very happy”
“in the morning she went to the garden”

10.7% eval accuracy on 30K vocabulary. That’s not impressive by Transformer standards, but for a cellular automaton
with 35M parameters that processes everything through local 3D wave propagation? I think it’s something.

What surprised me (emergent phenomena)

This is the part that really blew my mind. I didn’t program any of this — it emerged from training:

  1. The brain developed hemispheres. Region x=12 produces good language. Region x=6 produces garbage. Just like
    biological brains have lateralization — but nobody told the model to do this.

  2. Three phases of thinking. Steps 1-5: chaos (activations are noisy). Steps 6-7: “eureka” (the model suddenly
    organizes). Steps 8-15: decision (converges to the answer). The eureka moment coincides with the dilated convolution
    cycle reaching global coverage.

  3. Grammar and semantics separated spatially. Grammar channels concentrate in the center of the grid, semantic
    channels in the periphery. Like Broca’s area (syntax) and Wernicke’s area (meaning) in the human brain. The model
    spontaneously organized this way.

  4. Semantic clustering. Animals, family members, nature words, and objects each form distinct spatial clusters in the
    grid. The cube organized its own “brain regions” by category.

  5. Emotions activate a specific highway. Emotional words light up depth layer z=12 more than neutral words. The model
    created an “emotion highway” through the cube.

  6. The wave is visible. You can literally watch information travel from z=0 (input) to z=15 (output) step by step. The
    answer arrives as a wave at step 7 — the earliest step where the signal reaches the output face.

88 documented discoveries

Over the course of this project, I documented 88 experimental findings. Some of the key ones:

  • Cross-entropy loss works better than knowledge distillation (7.4% vs 4.2%)
  • The model thinks in waves — visualized and confirmed
  • Arithmetic knowledge gets overwritten when you teach language (the Conv3d transforms completely)
  • With 10 inference techniques combined, the model produced “you are having fun” — a grammatically perfect sentence —
    without any retraining, just by manipulating the grid’s activity
  • The init_state (the brain’s “DNA”) already contains the seeds of specialization before any training

What this is NOT

I want to be clear about what this project is:

  • It’s not a competitor to Transformers. GPT-2 Small (124M params) would destroy this model on every benchmark.
  • It’s not a practical language model. You can’t use it for anything useful.
  • It’s not polished research. I’m one person experimenting, not a lab with peer review.

What I think it IS

  • Proof that a fundamentally different architecture can learn language structure. Not well, but it can.
  • Evidence that spatial organization matters. The brain developed regions, hemispheres, and highways that weren’t
    programmed.
  • An exploration of what “thinking” looks like when computation happens through waves in 3D space instead of matrix
    multiplications in 1D.
  • A fun project by someone who just wanted to try something different.

The model

I uploaded the v5 model (the best one) to HuggingFace:

  • 35.4M parameters, 68 MB
  • 30K word vocabulary
  • Includes model code, inference script, dictionary, and brain visualizations
  • Runs on CPU, no GPU needed
  • MIT license

What’s next?

Honestly, I’m not sure. I’ve been at this for about a week and I’m a bit burned out. v6 (knowledge distillation from
GPT-2) showed promise but needs much more training than I can afford. I’d love to see what happens with:

  • More training data and compute (v6.2 is ready but needs ~20h on a B200)
  • A Gradio Space where people can see the waves propagate in real-time
  • Someone with more ML experience taking a look at the architecture

If any of this is interesting to you, the code and all 88 findings are in the repo. I’d love to hear what you think.

Thanks for reading.

- Cristian

For now, I asked it to organize the points:


This is a very interesting project. I especially like that it is not presented as “yet another small Transformer,” but as a recurrent 3D substrate whose internal dynamics can be visualized, perturbed, and probed.

I cannot help with large-scale compute, and I am not suggesting that “just train v6.2 harder” should be the immediate next step. My impression is almost the opposite: before scaling the training, it may be more useful to separate the questions that are currently mixed together.

In particular, I think the project becomes much easier to evaluate if we distinguish these questions:

  1. Can this architecture produce language-like sequences at all?
  2. Does the 3D spatial structure actually matter?
  3. Which design choices are doing the work: dilation, fatigue, learned initial state, number of update steps, output-face readout, etc.?
  4. Are the reported spatial specializations functionally causal, or mainly visual/interpretive patterns?
  5. Is this best understood as a standalone language model, a recurrent memory substrate, a synthetic-data generator, an interpretable dynamical system, or an adapter-like module?

To me, the most interesting question may not be:

Can this beat Transformers?

but rather:

Does a local recurrent 3D system develop reproducible, causal internal organization when trained on language-like tasks?

That seems like a genuinely interesting research direction even if the model never becomes practically competitive as a language model.

1. A useful framing

I would frame the project less as:

“A new language model architecture that competes with Transformers”

and more as:

“A probeable 3D recurrent cellular substrate that can be trained on symbolic, semantic, and language-like tasks.”

That framing avoids making the project depend on beating GPT-like baselines, while preserving the interesting part: local communication, emergent spatial organization, recurrent computation, and visible internal dynamics.

This is also closer to how Neural Cellular Automata are usually studied. The classic reference is “Growing Neural Cellular Automata”, where the point is not just raw task performance, but how local learned update rules can produce stable, self-organizing, regenerative behavior.

There is also recent work connecting NCA-like dynamics to language model training, but in a different way: “Training Language Models via Neural Cellular Automata” uses NCA-generated spatiotemporal data as synthetic pre-pre-training data for language models. That paper is not doing exactly the same thing as this project, but it suggests that NCA dynamics may be useful as structured non-linguistic training signals, not only as standalone models.

So I think there are several possible interpretations of this project:

| Interpretation | What would be tested? |
|---|---|
| Standalone NCA language model | Can the 3D recurrent cube directly predict/generate language? |
| Recurrent memory substrate | Can the cube store and propagate information better than simpler recurrent baselines? |
| Synthetic pretraining generator | Can its dynamics produce useful structured data for other models? |
| Interpretable dynamics model | Do grammar/semantic/emotion-like regions emerge in a reproducible, causal way? |
| Adapter/refinement block | Can NCA-like local updates improve a Transformer/RNN/ConvLM as a component? |

I think the fourth interpretation — interpretable recurrent dynamics — is currently the most exciting one.

2. Suggested ablations

The first thing I would want is a small ablation table. Not necessarily huge training runs; just enough to clarify what is essential.

Possible variants:

| Variant | Question |
|---|---|
| Full v5 | Current reference point |
| No dilation | Is global coverage through dilation essential? |
| Dilation cycle changed | Is [1, 2, 4, 8] special, or just one reasonable schedule? |
| No synaptic fatigue | Does fatigue actually reduce repetition collapse? |
| Fatigue only at inference | Is it a training-time mechanism, inference-time heuristic, or both? |
| Fixed initial state | How much comes from the learned init_state? |
| Random initial state | Does the model rely on a learned “brain prior”? |
| Output face only | Is the opposite-face readout important? |
| Global pooled readout | Does reading from the whole cube improve or erase the spatial story? |
| Random output face | Is the z-axis information-flow interpretation robust? |
| Fewer update steps | Where does performance appear? |
| More update steps | Does extra recurrent computation help or degrade output? |
| 1D version | Is this just sequence convolution? |
| 2D version | Is 3D actually useful over a simpler spatial substrate? |
| 3D ConvNet without recurrence | Is recurrence doing real work? |
| Recurrent Conv3D without NCA framing | Is the “cellular” framing adding anything beyond a recurrent ConvNet? |

The goal would not be to “disprove” the model. The goal would be to locate the actual source of the effect.

For example, if removing synaptic fatigue causes much more “the the the” collapse, that is useful evidence. If global pooling beats output-face readout, that would weaken the “wave reaches the opposite face” story. If 2D performs similarly to 3D, then maybe the important thing is recurrence + convolution, not specifically a 3D cube. If the learned initial state is crucial, then the “brain DNA” idea becomes a real object of study rather than just a metaphor.
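
To compare fatigue and no-fatigue variants, "the the the" collapse needs a number attached to it. Here is a minimal sketch of two repetition metrics over generated tokens; the function names and the choice of metrics are my own suggestion, not something taken from the repo.

```python
def repetition_rate(tokens, n=1):
    """Fraction of n-grams that duplicate an n-gram already present in the sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def immediate_repeat_rate(tokens):
    """Fraction of positions where a token equals the token right before it."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return repeats / (len(tokens) - 1)


collapsed = "the the the the".split()
healthy = "she started to play together again".split()
print(repetition_rate(collapsed), immediate_repeat_rate(collapsed))
print(repetition_rate(healthy), immediate_repeat_rate(healthy))
```

Reporting both metrics for each ablation, on the same prompt set, would show whether fatigue changes the collapse rate or only changes which word gets repeated.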

3. Suggested baselines

I would also suggest a few simple baselines with roughly similar parameter counts and the same data/tokenizer where possible.

Possible baselines:

| Baseline | Why it matters |
|---|---|
| Small GRU/LSTM | Minimal recurrent sequence baseline |
| Small Transformer | Standard language-modeling baseline |
| 1D ConvLM | Convolutional sequence baseline |
| Temporal CNN / TCN | Stronger non-attention sequence baseline |
| Recurrent Conv1D | Similar recurrence, no 3D substrate |
| Recurrent Conv2D | Spatial recurrence without full 3D |
| Recurrent Conv3D | Same broad compute family without the NCA interpretation |
| Neural GPU-like model | Classical recurrent convolutional algorithm-learning comparison |

The Neural GPU is especially relevant historically because it is a convolutional gated recurrent architecture that was studied for learning algorithmic sequence transformations. It is not the same as this project, but it is a useful comparison point for “local recurrent computation over a grid.”

I would not compare only against GPT-2 or modern Transformers. That comparison is too harsh and not very informative. A more useful question is:

Compared with other small recurrent/convolutional baselines, what does the 3D NCA-like substrate uniquely buy us?
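
For matching baselines at "roughly similar parameter counts", a few closed-form counts are enough to size things before training. This is a rough sketch: the channel and hidden sizes below are hypothetical, since the thread does not give v5's layer widths. The formulas follow the standard layouts: a Conv3d layer has in × out × k³ weights plus one bias per output channel, and a single-layer GRU in the PyTorch layout has three gates with two bias vectors.

```python
def conv3d_params(c_in, c_out, k=3, bias=True):
    """Weights of a k*k*k 3D convolution, plus optional per-channel bias."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def gru_params(input_size, hidden_size):
    """Single-layer GRU, PyTorch layout: 3 gates, input and hidden weights, 2 biases."""
    return 3 * hidden_size * (input_size + hidden_size + 2)


def embedding_params(vocab, dim):
    """Lookup table with one `dim`-sized vector per vocabulary entry."""
    return vocab * dim


# Hypothetical sizes, chosen only to illustrate the orders of magnitude.
print(conv3d_params(64, 64))          # one 3x3x3 layer with 64 channels
print(gru_params(128, 256))           # one GRU layer
print(embedding_params(30_000, 256))  # 30K-word embedding table
```

With a 30K word vocabulary, the embedding and output layers can easily dominate the total, so it may be fairer to report embedding and non-embedding parameters separately for every model in the comparison.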

4. Dynamics analysis

The internal dynamics are probably the most interesting part of the project. I would try to turn the qualitative “wave/eureka/decision” story into plots.

Useful measurements:

| Measurement | Purpose |
|---|---|
| Loss vs update step | Does performance really improve at a particular recurrent depth? |
| Entropy vs update step | Does the model become more confident during propagation? |
| Top-k distribution vs step | Does the predicted word sharpen over time? |
| Activation norm vs step | Does the cube stabilize, explode, or collapse? |
| Spatial activation center of mass | Does information actually move from input face to output face? |
| Mutual information with input tokens | Does input information propagate spatially over time? |
| Region-wise contribution to logits | Which regions causally affect output? |
| Seed-to-seed consistency | Does the same specialization reappear across runs? |
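
As one example, the center-of-mass measurement could look like the sketch below. It assumes each snapshot is reduced to a mapping from (x, y, z) to a non-negative magnitude, for instance the absolute activation summed over channels. That representation and the function name are my assumptions, not the repo's format.

```python
def z_center_of_mass(activity):
    """Activation-weighted mean z coordinate of one snapshot.

    `activity` maps (x, y, z) -> non-negative magnitude.
    Returns None when the snapshot carries no activity at all.
    """
    total = sum(activity.values())
    if total == 0:
        return None
    return sum(z * a for (_, _, z), a in activity.items()) / total


# Toy snapshots: a 2x2 patch of activity that sits at depth z = t on step t.
snapshots = [
    {(x, y, t): 1.0 for x in range(2) for y in range(2)}
    for t in range(4)
]
print([z_center_of_mass(s) for s in snapshots])
```

If the wave picture is right, this value should climb from the input face toward the output face as the update steps proceed, and the curve should look similar across prompts.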

The reported “steps 6-7 eureka” is particularly interesting. I would want to see:

  • Does the same step transition appear across many prompts?
  • Does it appear across random seeds?
  • Does it align with the dilation schedule?
  • Does it still appear when the dilation cycle is changed?
  • Does it appear on arithmetic/relations/language tasks equally?
  • Does running more steps help, saturate, or degrade?

If the “eureka” phase is stable across prompts and seeds, that is much stronger than a single visualization.
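
A simple way to test that stability is to compute the output entropy at every update step, find the step with the sharpest drop, and count how often the same step wins across prompts and seeds. This is a sketch; it assumes logits can be read out after each step, and the entropy curves below are made-up numbers for illustration only.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpest_drop(entropies):
    """1-based step whose entropy fell the most relative to the previous step."""
    drops = [entropies[i - 1] - entropies[i] for i in range(1, len(entropies))]
    return drops.index(max(drops)) + 2


# Made-up entropy curves for three prompts (one value per update step).
runs = [
    [2.0, 2.0, 1.9, 0.5, 0.4],
    [2.1, 2.0, 2.0, 0.7, 0.6],
    [1.9, 1.2, 1.1, 1.0, 0.9],
]
print(Counter(sharpest_drop(r) for r in runs))
```

A histogram concentrated on one or two steps would support the "eureka" description; a flat histogram would suggest the transition is prompt-specific.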

5. Interpretability/probing checklist

The reported spatial organization is the most exciting claim, but also the claim that needs the most care. Humans are very good at seeing meaning in visualizations. So I would try to convert each qualitative observation into a causal or statistical test.

Possible checks:

| Claim | Possible test |
|---|---|
| Region x=12 produces better language | Ablate x=12 and compare loss/generation quality |
| Region x=6 produces garbage | Patch x=6 into good generations or ablate it |
| Grammar is central | Train POS/syntax probes on cell states by location |
| Semantics is peripheral | Train semantic-category probes by location |
| Emotional words use z=12 | Compare activation maps for emotional vs neutral words |
| Semantic clusters exist | UMAP/PCA of cell states with word-category labels |
| Wave carries answer | Intervene on intermediate slices and measure output damage |
| Learned init_state contains specialization | Probe/visualize init_state before any input |
| Good/bad regions are stable | Repeat over seeds and datasets |

Some concrete interventions:

  • Zero out one spatial region at a time.
  • Add noise to one region at a time.
  • Swap activations between two prompts.
  • Patch the “good region” from one run into another.
  • Freeze parts of the cube during training.
  • Train linear probes per coordinate or per region.
  • Compare probes against shuffled labels.
  • Compare spatial maps across random seeds.

The key distinction is:

A region lighting up is not the same as a region being causally necessary.

If region ablation damages the relevant capability selectively, then the spatial specialization claim becomes much stronger.
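
A slab-by-slab ablation map can be sketched like this. The cube state is represented as a dict from (x, y, z) to a value, and `loss_fn` is a hypothetical callable that would, in the real model, continue the forward pass from the modified state and return the loss. Here it is replaced by a toy function so the example runs on its own.

```python
def ablate_slab(state, axis, index):
    """Copy of `state` with every cell at `index` along `axis` set to zero."""
    return {pos: (0.0 if pos[axis] == index else val) for pos, val in state.items()}


def ablation_map(state, loss_fn, axis=0, size=16):
    """Loss increase from zeroing each slab along `axis`, relative to the intact state."""
    base = loss_fn(state)
    return [loss_fn(ablate_slab(state, axis, i)) - base for i in range(size)]


# Toy setup: a 4x4x4 cube of ones and a stand-in loss that only depends on x = 2.
SIZE = 4
state = {
    (x, y, z): 1.0
    for x in range(SIZE) for y in range(SIZE) for z in range(SIZE)
}


def toy_loss(s):
    """Stand-in loss: gets worse (higher) when activity at x = 2 is removed."""
    return -sum(v for (x, _, _), v in s.items() if x == 2)


print(ablation_map(state, toy_loss, axis=0, size=SIZE))
```

In the toy case only the x = 2 slab produces a loss increase. For the real model, the interesting outcome would be a map where ablating x=12 hurts language quality far more than ablating x=6.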

6. Synthetic tasks before full natural language

Natural language is very hard to diagnose because many failure modes are entangled: tokenization, data size, frequency bias, repetition collapse, long-range dependency, objective mismatch, readout design, and recurrent depth.

Before focusing too much on open-ended text generation, I would test a ladder of synthetic tasks.

Suggested task ladder:

| Task | Capability tested |
|---|---|
| Copy | Can the cube preserve input? |
| Shift | Can information propagate directionally? |
| Reverse | Can it perform nontrivial sequence manipulation? |
| Parity | Can it aggregate global information? |
| Modular addition | Can it learn algorithmic rules? |
| Bracket matching / Dyck language | Can it model stack-like structure? |
| Associative recall | Can it bind keys and values? |
| Small symbolic grammar | Can it learn controlled next-token structure? |
| Character-level corpus | Can it model language without large word vocab issues? |
| Word-level small corpus | Can it handle sparse word prediction? |

This would clarify where the architecture fails. If it cannot solve copy/reverse/parity reliably, then weak natural-language generation is unsurprising. If it solves synthetic grammar but fails at word-level language, then the bottleneck may be vocabulary/readout/data rather than the recurrent substrate itself.
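
The first few rungs of the ladder are cheap to generate. Here is a minimal sketch of target functions for copy, shift, reverse, parity, and a Dyck (balanced bracket) check; the names and the `make_example` helper are illustrative only.

```python
import random


def copy_task(seq):
    return list(seq)


def shift_task(seq, k=1, pad=0):
    """Shift right by k positions, padding on the left and keeping the length."""
    return ([pad] * k + list(seq))[:len(seq)]


def reverse_task(seq):
    return list(reversed(seq))


def parity_task(bits):
    """1 if the number of ones is odd, else 0."""
    return sum(bits) % 2


def is_dyck(s):
    """True if the string of '(' and ')' is balanced."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


def make_example(task, length, rng):
    """Random bit sequence of the given length, paired with its target."""
    bits = [rng.randint(0, 1) for _ in range(length)]
    return bits, task(bits)


rng = random.Random(0)  # fixed seed so the dataset is reproducible
print(make_example(reverse_task, 8, rng))
print(make_example(parity_task, 8, rng))
```

Each task gives an exact accuracy number, which makes it easier to say which capability breaks first than it is with open-ended generation.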

Related work like LifeGPT, AutomataGPT, and “Learning Elementary Cellular Automata with Transformers” studies the opposite direction (Transformers learning CA dynamics), but those papers are still useful because they suggest evaluation patterns for local-rule systems: forecasting, rule inference, intermediate-state prediction, and generalization to unseen dynamics.

7. Possible alternative roles for the model

I would not restrict the project to “standalone language model.” There are other ways the idea could be valuable.

A. Recurrent memory substrate

The cube could be a memory/update substrate that receives token embeddings and evolves for several steps. Then another model reads from it.

Questions:

  • Does it store local context better than a simple recurrent state?
  • Does it denoise representations?
  • Does it preserve information over many update steps?
  • Does it help on associative recall or algorithmic tasks?

B. Adapter/refinement module

NCA-like blocks could be used inside another model instead of replacing the whole model. For example, AdaNCA uses NCA-style adapters between Vision Transformer layers to improve robustness. That is vision, not language, but the architectural idea is relevant: NCA as a plug-in refinement module rather than the entire model.

Possible language analogues:

  • Transformer + NCA adapter
  • RNN + NCA memory
  • ConvLM + NCA refinement block
  • Decoder-only LM with local NCA hidden-state smoothing
  • NCA block between attention and MLP layers

C. Synthetic dynamics generator

The cube may be more useful for generating structured non-linguistic trajectories than for directly generating language. This connects to “Training Language Models via Neural Cellular Automata”, where NCA-generated data is used as synthetic pre-pre-training data before natural-language training.

Questions:

  • Can 3D NCA trajectories produce useful synthetic curricula?
  • Does pretraining on those trajectories help small language models?
  • Does the complexity of the NCA dynamics matter?
  • Are 3D dynamics more useful than 1D/2D CA dynamics?

D. Interpretable dynamical system

Even if the model is weak as an LM, it may be valuable as a visible dynamical system trained on language-like tasks.

Questions:

  • Does syntax-like information localize?
  • Does semantic category information localize?
  • Are localized regions stable across seeds?
  • Are regions causally necessary?
  • Can “thought over time” be measured through recurrent steps?

This seems like the most compelling direction to me.

8. What I would prioritize

If I were organizing the next steps without providing compute, I would prioritize:

  1. Minimal reproducibility
  2. Ablation table
  3. Baseline table
  4. Step-wise dynamics plots
  5. Region ablation
  6. Simple probes
  7. Synthetic task ladder
  8. Only then larger training

A possible short roadmap:

Phase 1: Reproduce and measure

  • Run v5 inference.
  • Record predictions by recurrent step.
  • Plot loss/entropy/confidence over steps.
  • Check repetition rate.
  • Test more/fewer recurrent steps.
  • Save activation maps for a fixed prompt set.

Phase 2: Ablate

  • Remove or alter dilation.
  • Remove fatigue.
  • Compare output-face readout vs pooled readout.
  • Compare learned vs fixed init state.
  • Compare 1D/2D/3D variants if feasible.

Phase 3: Baseline

  • Train/evaluate small GRU, small Transformer, 1D ConvLM, and recurrent Conv3D baselines under similar conditions.
  • Use the same tokenizer/data where possible.
  • Report validation loss, next-token accuracy, repetition rate, and parameter count.

Phase 4: Probe

  • POS probe.
  • Semantic category probe.
  • Emotion/neutral contrast.
  • Region ablation.
  • Activation patching.
  • Seed consistency.

Phase 5: Reframe

Depending on results, decide whether the model is best pursued as:

  • standalone NCA-LM,
  • recurrent memory,
  • interpretable dynamics system,
  • synthetic data generator,
  • or adapter module.

9. A compact experiment matrix

One possible experiment table:

| Experiment | Minimal output |
|---|---|
| Step sweep | loss/entropy/confidence vs recurrent step |
| Dilation ablation | validation loss + repetition rate |
| Fatigue ablation | repetition/collapse metrics |
| Init-state ablation | performance drop from learned to fixed/random init |
| Readout ablation | output face vs pooled vs random face |
| Region ablation | heatmap of loss increase by region |
| Probe map | spatial map of syntax/semantic probe accuracy |
| Seed repeat | whether specialization recurs |
| Baseline comparison | small Transformer/GRU/ConvLM/recurrent Conv3D |
| Synthetic tasks | copy/reverse/parity/Dyck/associative recall |

This would make the project much easier to discuss.

10. Suggested wording of the main contribution

If this were written as a more formal project note, I would avoid claiming:

“A 3D brain learned language.”

I would phrase it more conservatively:

“We explore a recurrent 3D neural cellular substrate for language-like prediction, and investigate whether local update dynamics produce reproducible spatial specialization.”

That keeps the interesting claim while making it more testable.

A stronger version, if supported by ablations, could be:

“Although not competitive with Transformer baselines as a language model, the system shows measurable recurrent phase transitions and spatially localized representations that can be probed and causally intervened upon.”

That would be a very interesting result.

11. Why I think this is worth exploring

The raw language modeling performance is not the main reason this is interesting. The interesting part is that the model gives you a spatial, recurrent, perturbable object.

Transformers are powerful, but their internal representations are not naturally laid out as a physical 3D substrate. Here, even if the model is much weaker, you can ask questions like:

  • Where does information enter?
  • How does it move?
  • When does a prediction become confident?
  • Which regions matter?
  • Do regions specialize?
  • Does specialization survive retraining?
  • Can we damage a region and observe selective failure?
  • Can we watch recurrence improve or destroy the answer?

That makes the project interesting as a small experimental system.

12. Final take

My current impression:

  • I would not spend the next effort mainly on larger training.
  • I would not frame it as a Transformer competitor.
  • I would focus on ablations, baselines, and causal interpretability.
  • I would test synthetic tasks before open-ended language.
  • I would consider roles other than standalone LM: memory substrate, adapter, synthetic-data generator, or interpretable dynamical system.

The most valuable next contribution might simply be a clear evaluation map.

Something like:

“Here are the claims, here are the ablations that test each claim, here are the baselines, and here are the probes that would make the emergent-organization story stronger.”

That kind of organization could make the project much easier for others to engage with, even without anyone immediately providing large-scale compute.

Thank you John6666 for such a detailed and well-structured response. Your roadmap makes a lot of sense — especially
starting with reproducibility and ablations before scaling up. I’m going to focus on:

  1. Ablation studies — isolating which components actually drive the effect
  2. Baseline comparisons with GRU/small Transformer at the same parameter count
  3. Verifying the “eureka” phase consistency across seeds and prompts
  4. Synthetic task ladder before full language modeling

I’ll share results as I go. Really appreciate the guidance.

i have a unique, extensive background in procedural audio and music (first procedural lyrical song 1994). i’m still making my first few steps in NN/ML and am “very interested” in this level of method. neurons in a circle. yes, some of us have very different ideas we can try also.

i just have a very bad problem. i’ve spent the last quarter century trying to develop public resources to promote egalitarianism and been in conflict with those who defend “proprietarianism”. i have fought battles and taken damage and am currently unable to do some things other people find very reasonable.

people do not write to transmit method. they write to qualify expectations of form.

what you’re saying here is fine, but i’m a c/c++ programmer, not a python script jockey, because i believe in not atomising everything.

eg. i code LERP in one line: a += w times (b - a); understanding w is angular frequency and equals a 3dB/octave drop at tau times hertz / samplerate. but when i read other people’s code, they write a LERP in four documents. but they don’t know about angular frequency. it took me sixteen years to find someone who described the cepstrum as fft(log10(fft())) instead of pages of nonsense. the source for RNN in C has got so many pointless documents that are only there to say what a gooner that lodge monkey is instead of impart method. can i find anyone who can tell me how to backpropagate an RNN instead of show me how much officious looking junk they can add until i can’t even find anything in hours? how about a crazy LLM. try that for three years. jesus my head. and i already get sleep dep from the lodge crew. it’s hard. almost west papua hard.
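
for anyone following along, the one-line LERP above can be sketched as a running smoother. this reads "tau" as 2π (an assumption on my part) and uses the usual small-angle approximation w ≈ 2π · f / samplerate, which only holds when f is well below the sample rate. the function names are illustrative.

```python
import math


def smoothing_coeff(freq_hz, sample_rate):
    """Small-angle approximation w = tau * f / fs, with tau = 2*pi (assumed reading)."""
    return math.tau * freq_hz / sample_rate


def lerp_step(a, b, w):
    """One update of  a += w * (b - a)."""
    return a + w * (b - a)


def smooth(samples, w, start=0.0):
    """Run the LERP update over a sequence, returning the smoothed values."""
    out, a = [], start
    for b in samples:
        a = lerp_step(a, b, w)
        out.append(a)
    return out


print(smoothing_coeff(100.0, 48_000.0))  # coefficient for a 100 Hz corner at 48 kHz
print(smooth([1.0, 1.0, 1.0], 0.5))      # step response with w = 0.5
```

applied per sample, the same update acts as a one-pole lowpass: the state moves a fraction w of the remaining distance toward the input each step.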

i have a very big gifted educated i.q. but i can’t make it through any more of that kind of nonsense so it’s very rare that i can find anything that actually tells anyone how to do what they are talking about. thanks for the information, and with any luck, i’ll be able to follow what you did with the BPTT by the end of today. maybe. i don’t really know if i can talk to anyone anymore, but i’d be interested to see what any of these methods do when someone builds them with sensitivity for emergence instead of referential correctness. there’s no right or wrong in reality.

but yes, much more interesting to see the output of a different form. edifying to the living. reminds me of Julius Smith’s finite difference models at CCRMA.