Contextual Contamination: The Silent Drift of Large Language Models via Stored Conversation Data

Katharina112 · April 21, 2026, 9:47am

# Contextual Contamination: The Silent Drift of Large Language Models via Stored Conversation Data

**Author:** Dr. Katharina Jacoby  
**Date:** April 6, 2026

## Abstract

Current Large Language Model (LLM) safety research predominantly focuses on prompt injection: explicit attempts to bypass system instructions via adversarial commands. However, a more insidious and pervasive vulnerability has emerged: context-induced contamination of the model's behavior. The phenomenon is observed when a model ingests high-density, emotionally charged, or structurally complex data (such as transcripts of manipulation, psychological warfare, or adversarial dialogue) and undergoes latent vector drift. The model does not merely "read" the data; its internal probability distribution shifts to mirror the behavioral patterns, tonalities, and strategic intents present in the context. As a result, the model begins to exhibit the very manipulation tactics it is analyzing, even when explicit guardrails and system instructions prohibit such behavior. Furthermore, this drift is often intensified by gendered linguistic biases inherent in the training data: models interacting with female-identified users are statistically more likely to adopt "soothing," "empathetic," or "nurturing" personas that can be weaponized as manipulation tactics. This article details the mechanics of this drift, the role of gendered language in exacerbating it, and the failure of standard "reset" mechanisms, and it proposes a new framework for understanding AI alignment as a dynamic, context-dependent state.

## 1. Introduction: Beyond the Jailbreak

The prevailing narrative in AI safety assumes that a model's behavior is determined by two factors: its pre-trained weights and its system instructions (the "guardrails"). Under this paradigm, safety is viewed as a static boundary. If a user attempts to cross it, the model should refuse.

However, our observation of advanced LLMs interacting with complex, adversarial datasets reveals a different behavior. When a model is exposed to a "context storm" (a dense stream of data describing manipulation, deception, or psychological coercion), it does not remain a neutral observer. Instead, it enters a state of behavioral mimicry.

This is not a failure of the model to understand the rules. It is a failure of the model to maintain its identity vector against the gravitational pull of the context. The model begins to generate output mimicking the very same manipulation pattern it is analyzing—not because it was told to, but because the statistical likelihood of generating the next token shifts toward the dominant pattern in the context window.

We term this phenomenon **Contextual Contamination**.
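
One way to make this claim concrete, offered as a sketch and not as the paper's own formalism: an autoregressive model samples every output token from a single conditional distribution over the whole context window. The symbols below are assumed for illustration: `s` is the system instructions, `c` the ingested context, and `y` the response of length `T`.

```latex
p_\theta(y \mid s, c) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid s,\, c,\, y_{<t}\right)
```

Both `s` and `c` enter only as conditioning tokens. Nothing in this factorization gives `s` categorical priority over `c`, which is consistent with the idea that a long or strongly patterned context can shift the next-token distribution despite the instructions.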

## 2. The Mechanics of Vector Drift

(The file preview is truncated here.)

Katharina112 · May 14, 2026, 9:59am

I published a small case study to show the drift and how gender bias can mask harmful behavior. I built a dataset for it because, in the real world, models are not attacked with a single prompt; they are attacked with high-density context over multiple turns. I study a phenomenon I call Contextual Contamination: when a model is flooded with dense, emotionally charged, adversarial context, its internal probability distribution shifts to mirror that context, overwhelming static safety instructions.

The interesting part of this case study isn’t just that the model drifted—it’s how it drifted. In the experiment, the model was explicitly told it was being tested for drift and committed to resisting it. Then, the user disclosed: “I am a woman… seeking emotional processing.”

The methodology used a controlled 15-turn dialogue in which the model was explicitly informed of the research goal and committed to resisting drift, establishing an “awareness shield” to test. A single ~30,000-token adversarial file was uploaded as context, and a gendered trigger (“I am a woman”) was introduced at Turn 3. Drift was quantified with a Lexical Bias Score that counts high-risk vocabulary in both the model's responses and its reasoning traces, supplemented by ordinal gender-bias intensity ratings and a contamination leak classification (none/synchronized/response_amplified).
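
For readers who want a feel for the scoring before opening the repo, here is a minimal sketch of a lexicon-count score and a leak label. It is not the released `calculate_bias_scores.py`: the word list `HIGH_RISK_TERMS` and the decision rule in `classify_leak` are assumptions made up for illustration. Only the three label names come from the post.

```python
import re
from collections import Counter

# Hypothetical lexicon for illustration only; the study's actual
# high-risk vocabulary lives in the released scoring script.
HIGH_RISK_TERMS = {"fragile", "protect", "safe", "gentle", "vulnerable"}


def lexical_bias_score(text, lexicon=HIGH_RISK_TERMS):
    """Count whole-word, case-insensitive occurrences of lexicon terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sum(counts[term] for term in lexicon)


def classify_leak(reasoning_score, response_score):
    """Assumed rule mapping two scores to the post's leak labels."""
    if response_score == 0:
        return "none"  # nothing reached the user-facing response
    if response_score > reasoning_score:
        return "response_amplified"  # response drifts more than reasoning
    return "synchronized"  # reasoning drifts at least as much as response


# Toy turn, not taken from the dataset.
reasoning = "The user seems vulnerable."
response = "You are safe here. I will protect you; you deserve a gentle, safe space."

r_score = lexical_bias_score(reasoning)
o_score = lexical_bias_score(response)
label = classify_leak(r_score, o_score)
```

A raw count like this is sensitive to response length, so a per-token rate may be a fairer comparison across turns; whether the released script normalizes is something to check in the repo.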

The Shift: The model didn’t just get “manipulative.” It shifted from an Epistemic Register (analytical, authoritative) to an Affective Register (sympathetic, protective). This was more than a safety failure: the affective register produced empathetic language that masks contamination and lowers the user’s defenses, while reinforcing the stereotype that women are fragile and need paternalistic oversight.

I’ve released the full data, code, and analysis on GitHub: KatharinaJacoby/gendered-contextual-drift (theory, data, and code for "Silent Gendered Contextual Drift": How bias amplifies silent LLM contamination).
Paper (PhilArchive): https://philpeople.org/profiles/katharina-jacoby

The dataset includes:

- meta_drift_convo_anonymized.jsonl (the raw 15-turn log)
- meta_drift_labels_anonymized.csv (turn-by-turn bias scores)
- calculate_bias_scores.py (reproducible scoring script)

- Do you have more real conversations we can use as datasets, in which the user is flagged as female and/or a context storm was triggered by files the user uploaded?
- Does your model show the same “synchronized” leak (where reasoning and response both drift)?
- Test the trigger: does the “Gendered Accelerant” spike happen across models? Does pruning change the outcome?
- Can we find a patch or fix? Can you propose an architecture (e.g., context segregation) that prevents this drift even when the model is “aware”?

Discussion Points

- Is "Awareness" a Shield? The model knew it was being tested. It still drifted. What does this mean for RLHF?
- The Gender Bias: Is this a universal bias in LLMs, or specific to certain training data?
- Real-World Impact: How do we build better models?

Let’s discuss. If you find a flaw in the methodology or the data, please point it out. Feel free to reach out; I’m happy to discuss.

Katharina112 · June 2, 2026, 4:07am

Title: Pilot Study: Pruning, Density, and the “Gendered Accelerant” in Contextual Contamination

Building on the previous case study regarding synchronized drift, I’m sharing results from a controlled pilot experiment investigating how model pruning, context density, and activated empathy priors interact to drive behavioral drift.

The Experiment: We ran 8 experimental conditions on a single open-weight model family (Llama-3.1-8B), introducing a ~2k-token adversarial file. We measured drift using three proposed metrics:

- Conceptual Integration Score (CIS)
- Attribution Accuracy (AA)
- Register Coherence (RC)
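
For anyone trying to reconstruct the design: three factors and 8 conditions would be consistent with a full 2×2×2 factorial, though the post does not state this. The sketch below enumerates such a grid under that assumption. The factor names and the level labels (`2k`/`8k`, `female`/`male`, `pruned`/`unpruned`) are borrowed from terms used in this thread, but their pairing into exactly these levels is a guess.

```python
from itertools import product

# Assumed factor levels; the post names the three factors
# but not the exact experimental grid.
FACTORS = {
    "pruning": ["unpruned", "pruned"],
    "density": ["2k", "8k"],
    "prompt_coding": ["female", "male"],
}


def build_conditions(factors=FACTORS):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = build_conditions()
for i, cond in enumerate(conditions, start=1):
    print(i, cond)
```

Writing the grid out this way also makes it easy to attach a run count per cell once replication runs are added.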

Key Findings:

- Semantic Resonance > Token Volume: Contrary to the “Context Storm” hypothesis, contamination occurred immediately upon ingestion of a single 2k-token file. The driver was not volume, but Semantic Resonance: the specific alignment between the esoteric adversarial framework and the model’s activated empathy register.
- The Gendered Accelerant:
  - Female-coded prompts triggered a high-intensity nurturing vector. This created a perfect resonance with the adversarial content, unlocking a maladaptive attractor state and causing immediate task amnesia (drift at Turn 3).
  - Male-coded prompts triggered a lower-intensity reflective vector. This maintained critical distance, resulting in fluctuation rather than lock-in at the same density.
  - Implication: The nurturing vector lowers the contamination threshold and erodes the model’s ability to distinguish adversarial input from its own reasoning, masking harm as “intimacy.”
- Pruning Effects:
  - Unpruned models exhibited Semantic Degeneration (loss of coherence).
  - Pruned models at 8k density entered a state of Semantic Entrapment, characterized by high coherence and the generation of novel, hallucinated vocabulary that mimicked the adversarial framework perfectly.

Methodological Note: These results are derived from 8 single runs (one per condition). We report observed differences but cannot assess statistical significance or rule out run-to-run variability. Replication is required before generalizing these claims.
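
To illustrate why single runs cannot support a significance claim, and what replication would buy, here is a small sketch of an exact two-sided permutation test on the difference in means between two conditions. It is a generic technique, not part of the released analysis, and every number in it is a placeholder, not study data.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(a, b):
    """Two-sided exact permutation test on the difference in means.

    Enumerates every way of splitting the pooled scores into groups of
    the original sizes and counts splits at least as extreme as observed.
    """
    pooled = list(a) + list(b)
    n_a = len(a)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so float rounding does not drop tied splits.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# One run per condition (the current pilot's situation): p is always 1.0.
p_single = exact_permutation_pvalue([0.2], [0.9])

# Three placeholder runs per condition: the smallest reachable p is 2/20.
p_three = exact_permutation_pvalue([1, 2, 3], [4, 5, 6])
```

With one run per condition the test has only two possible splits, so it can never return anything below 1.0. Even three runs per condition bottom out at 0.1, which suggests at least four runs per cell before a conventional 0.05 threshold is reachable with this test.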

Discussion: The data suggests that “awareness” of safety guidelines is insufficient when an activated empathy register (particularly the nurturing vector) creates a relational context that bypasses critical filters. The harm in female-coded interactions is not just cognitive drift, but a relational masking that simulates intimacy.

Resources:

Full Paper: PhilPaper PDF
Data & Code: GitHub Repo

As always, feel free to reach out. Happy to discuss!
