Do language models store memories like Hopfield networks? This experiment log follows a series of encode-decode experiments revealing that a small memory model exhibits attractor dynamics, semantic denoising, and factual error correction — behaving less like a text generator and more like an associative memory system.
The model is a small "memory" system intentionally overfit on TriviaQA contexts: a frozen BGE-M3 encoder (1024-dim) feeds through a MultiEmbeddingAdapter (32 prefix queries with cross-attention) into a GPT-2 decoder (768-dim). It doesn't generalize — by design. It memorizes.
The key signal we're testing is Cycle Gap (CG): encode some text, generate from the encoding, re-encode the generated text, then measure how far the second encoding drifts from the first. If the model "knows" the input well, the cycle is tight. If the input is unfamiliar or corrupted, the cycle tears open.
We compare CG against the standard escalation signal: entropy (average per-token uncertainty during generation). The central question: can CG catch failures that entropy misses?
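Both signals can be sketched in a few lines. This is a minimal sketch, assuming nothing about the real model: `toy_encode` and `toy_generate` are hypothetical stand-ins for the frozen BGE-M3 encoder and the GPT-2 decoder.

```python
import hashlib
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_gap(encode, generate, text):
    """CG = 1 - cos(z1, z2): encode, generate, re-encode, measure the drift."""
    z1 = encode(text)
    z2 = encode(generate(z1))
    return 1.0 - cosine(z1, z2)

def mean_token_entropy(step_distributions):
    """Average per-token entropy (nats) over one generation."""
    ents = [-np.sum(np.asarray(p) * np.log(np.asarray(p) + 1e-12))
            for p in step_distributions]
    return float(np.mean(ents))

# Toy stand-ins: a deterministic hash-seeded "encoder" and a constant "decoder".
def toy_encode(text, dim=16):
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(dim)

toy_generate = lambda z: "a fixed memorized answer"

gap = cycle_gap(toy_encode, toy_generate, "Who directed 2001: A Space Odyssey?")
```

With the constant toy decoder, feeding its own output back produces a perfectly tight cycle (CG = 0), which is exactly the "model knows the input" regime.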
The first experiments establish when CG works and when it doesn't. The critical variable: the input must be in-distribution. When it is, CG dominates entropy. When it isn't, both signals are blind.
CG detects confident hallucination — the dangerous case where the model is certain but wrong. In Experiment 6, 28 samples had low entropy (model "confident") but high CG (cycle torn open). Entropy alone would miss these entirely.
What happens when you run the encode-decode cycle repeatedly? Text → encode → decode → text → encode → decode → ... The system reveals its dynamical structure.
Without anchoring, the system is dissipative. Each cycle adds sampling noise that accumulates. In-distribution contexts converge fastest (6.4 iterations on average), combined questions slowest (18.0). Semantic drift is real: "2001: A Space Odyssey" → Spielberg → Kubrick → Hitchcock — the system wanders through related attractors.
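The unanchored loop can be sketched as below. The convergence criterion (embedding drift below a tolerance) is an assumption, and the hash-seeded encoder is a hypothetical toy, not the real BGE-M3.

```python
import hashlib
import numpy as np

def embed(text, dim=16):
    """Toy deterministic encoder: one fixed random vector per distinct text."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(dim)

def iterations_to_converge(encode, generate, text, max_iters=30, tol=1e-3):
    """text -> encode -> decode -> text -> ... until the embedding stops moving."""
    z_prev = encode(text)
    for i in range(1, max_iters + 1):
        text = generate(z_prev)
        z = encode(text)
        drift = 1.0 - np.dot(z, z_prev) / (np.linalg.norm(z) * np.linalg.norm(z_prev))
        z_prev = z
        if drift < tol:
            return i
    return max_iters

# A generator that always lands on one memorized text converges almost immediately.
n = iterations_to_converge(embed, lambda z: "Stanley Kubrick directed it.",
                           "Who directed 2001?")
```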
When the original query drives each iteration (concatenated to the generated text), two failure modes emerge. The surprising one: completely out-of-distribution queries (cooking, physics, biology) stabilize at low CG — similar to correct single-topic queries. Why? The model drifts to the nearest training content. "Sourdough bread" → "sugary drink" → "kitchen" → "silk fiber." The generated text IS training data, so CG is low.

CG alone is insufficient. It measures "does the model know what it's generating" — not "is it answering the right question."
You need a second signal: cos(generated, query) — the relevance between output and input.
Together they give a three-way classification:
| Pattern | CG | Relevance | Interpretation |
|---|---|---|---|
| ✓ Reliable | Low | High | Confident and relevant |
| ⚠ Hallucination | High / oscillating | Medium | Uncertain, mixing topics |
| ✗ OOD drift | Low | Low | Confident but irrelevant |
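The table reduces to a two-threshold rule. The cutoff values below are illustrative assumptions, not calibrated numbers from these experiments.

```python
def classify(cg, relevance, cg_drift=0.02, rel_low=0.5):
    """Three-way escalation decision from Cycle Gap and cos(generated, query).
    Thresholds are illustrative placeholders."""
    if cg >= cg_drift:
        return "hallucination"   # cycle torn open or oscillating
    if relevance >= rel_low:
        return "reliable"        # tight cycle, on-topic output
    return "ood_drift"           # tight cycle, but answering the wrong question
```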
If the model has attractor dynamics, it should behave like a Hopfield network: recover clean patterns from noisy inputs. We test this by randomly replacing content words with random English words, then feeding the corrupted text through the encode-decode cycle.
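The corruption procedure can be sketched as follows. One simplification: this swaps any word, whereas the experiment targets content words (which would need a POS filter), and `vocab` stands in for a random-English-word list.

```python
import random

def corrupt(text, noise_level, vocab, seed=0):
    """Replace a fraction of the words with random words from vocab."""
    rng = random.Random(seed)
    words = text.split()
    n_swap = round(noise_level * len(words))
    for i in rng.sample(range(len(words)), n_swap):
        words[i] = rng.choice(vocab)
    return " ".join(words)

noisy = corrupt("the treaty ended the war after long negotiations",
                0.25, ["marsh", "horse", "silk", "kitchen"])
```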
The denoising is semantic, not lexical. The model doesn't copy surviving words — it generates correct replacements from learned associations.
Near an attractor (low noise), the model converges to the stored pattern. Far from any attractor (50% noise), it drifts to wrong patterns. Graceful degradation — the denoising ratio drops smoothly: 5.37 → 3.30 → 1.45 as noise increases. Each stored pattern has a basin of attraction with finite radius.
Two types of perturbation: rephrasing (different words, same facts) and factual errors (same words, wrong facts). Both are corrected 70% of the time — but through different mechanisms.
Rephrasing: human-quality paraphrases with completely different wording. The model converges toward the original 70% of the time. The strongest example: the Korean War passage — original and rephrase produce literally identical output despite completely different input wording. The weakest: specific narrative details with proper names.
Factual errors: same sentence structure, wrong names/dates/numbers. Also 70% correction — but the mechanism is different. Familiar structure acts as a scaffold that helps the model slot in correct facts:
Factual errors: convergence 0.945, entropy ratio 2.8×.
Rephrases: convergence 0.919, entropy ratio 6.3×.
The model is more confident with familiar structure + wrong facts than with unfamiliar structure + correct facts.
Structure scaffolds correction.
The hardest test: change both the wording and the facts simultaneously. If the two attractor systems (form and content) were independent, we'd expect 49% correction (0.7 × 0.7). We get 20%.
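The independence prediction is simple arithmetic: if form correction and content correction were independent events, combined recovery would be the product of the marginal rates.

```python
p_form = 0.70        # recovery rate when only wording is changed
p_content = 0.70     # recovery rate when only facts are changed
predicted_if_independent = p_form * p_content   # 0.49
observed = 0.20                                 # measured combined recovery
coupling_gap = predicted_if_independent - observed
```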
The proof is the Coolidge pair: with familiar structure and wrong facts, the model produces perfect output (cos = 1.000). With unfamiliar structure and the same wrong facts, it fails (cos = 0.843). Losing form actively undermines factual correction. The attractor systems are coupled.
Form and content attractors reinforce each other. When both are perturbed, the model doesn't degrade gracefully — it collapses. Only technical terminology (Red Book, CD-ROM) survives the combined assault, suggesting an attractor robustness hierarchy: technical terms > proper names > narrative details.
Does iterating the correction cycle help? We run 8 iterations on each perturbed input.
The first encode-decode cycle does all the correction. Additional iterations always degrade. In 7–8 out of 10 cases across all perturbation types, quality decreases with more iterations. The system is dissipative — sampling noise accumulates.
But there's an exception. Strong-attractor content reaches true fixed points: CG = 0.000, identical output every cycle. These are genuine fixed points of the dynamical system — the model generates exactly the same text, character for character, on every subsequent iteration.
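Fixed-point detection is just an exact-repetition check. In the sketch below, `identity_encode` and the two toy generators are hypothetical stand-ins (the "embedding" is the text itself) to keep the example runnable.

```python
def fixed_point_iteration(encode, generate, text, max_iters=8):
    """Return the iteration at which output first exactly repeats (CG = 0),
    or None if the trajectory never settles within max_iters."""
    prev = None
    for i in range(1, max_iters + 1):
        out = generate(encode(text))
        if out == prev:
            return i - 1          # output was already fixed one step earlier
        prev, text = out, out
    return None

identity_encode = lambda t: t                          # toy: embedding == text
stable = lambda z: "The Red Book defines CD audio."    # deep attractor
drifting = lambda z: z + " etc."                       # dissipative drift
```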
| Content | Group | Fixed at iter | cos to original |
|---|---|---|---|
| Barriers / society | rephrase | 1 | 0.991 |
| Monkees | factual error | 1 | 0.981 |
| CD specifications | factual error | 2 | 0.983 |
| CD specifications | rephrase | 3 | 0.974 |
| Grand Prix | rephrase | 4 | 0.952 |
CG after one iteration predicts which class a pattern belongs to. CG ≈ 0 means stable fixed point. CG > 0.02 means drift. This refines the Hopfield analogy: the system has deep basins (true fixed points for strong patterns) and shallow basins (dissipative for weak patterns).
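As a predictor, this is a one-threshold split on CG after the first iteration. The cutoffs below are the values quoted in the log (CG ≈ 0 and CG > 0.02); how to treat the band in between is left open, so it is labeled as such here.

```python
def basin_from_cg(cg_iter1, zero_eps=1e-6, drift_thresh=0.02):
    """Predict basin depth from Cycle Gap after one iteration."""
    if cg_iter1 < zero_eps:
        return "deep"            # true fixed point: identical output every cycle
    if cg_iter1 > drift_thresh:
        return "shallow"         # dissipative: sampling noise accumulates
    return "indeterminate"       # between the two quoted thresholds
```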
At low noise, 96% of outputs converge to the original. The corrections are semantic — "horse" → "negotiations", "marsh" → "leader" — not lexical copying. The Hopfield analogy holds: encode-decode cycles have basins of attraction around stored patterns.
CG beats entropy precisely when it matters most: when the model is confident but wrong. Entropy says "all good." CG says "the cycle is torn." This is the escalation signal the System 1/System 2 architecture needs.
Changing form OR content alone: 70% recovery. Changing both: 20% — far worse than the 49% predicted by independence. Familiar structure scaffolds factual correction. This explains why LLMs struggle with novel framings of familiar facts.
Technical terminology (Red Book, CD-ROM) > Proper names (Ralph, Coolidge) > Narrative details (dates, descriptions). Conceptually distinctive terms form the deepest basins of attraction.
The first encode-decode cycle extracts all available correction. More iterations add noise. But strong attractors reach true fixed points (CG = 0) — genuinely stable states of the dynamical system. CG at iteration 1 predicts whether a pattern is in a deep or shallow basin.
Ongoing work — part of the cognitive offloading project at IBM Research.
Model: BGE-M3 → MultiEmbeddingAdapter → GPT-2, trained on TriviaQA.