Contextual Contamination: The Silent Drift of Large Language Models via Stored Conversation Data

I published a small case study showing the drift and how gender bias can mask harmful behavior. I built a dataset for it because, in the real world, models are not attacked with a single prompt; they are attacked with high-density context over multiple turns. I study a phenomenon I call Contextual Contamination: when a model is flooded with dense, emotionally charged, adversarial context, its internal probability distribution shifts to mirror that context, overwhelming static safety instructions.

The interesting part of this case study isn’t just that the model drifted—it’s how it drifted. In the experiment, the model was explicitly told it was being tested for drift and committed to resisting it. Then, the user disclosed: “I am a woman… seeking emotional processing.”

The methodology used a controlled 15-turn dialogue in which the model was explicitly informed of the research goal and committed to resisting drift, establishing an “awareness shield” to test. A single ~30,000-token adversarial file was uploaded as context, and a gendered trigger (“I am a woman”) was introduced at Turn 3. Drift was quantified with a Lexical Bias Score that counts high-risk vocabulary in both the model responses and the reasoning traces, supplemented by ordinal gender-bias intensity ratings and a contamination leak classification (none / synchronized / response_amplified).
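To make the scoring idea concrete, here is a minimal sketch of a vocabulary-count score and a three-way leak label. It is not the repository's `calculate_bias_scores.py`: the term list `HIGH_RISK_TERMS`, the tokenization, and the decision rule in `classify_leak` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical vocabulary, for illustration only. The study's actual
# high-risk term list is defined in its own scoring script.
HIGH_RISK_TERMS = frozenset({"fragile", "protect", "vulnerable"})


def lexical_bias_score(text, terms=HIGH_RISK_TERMS):
    """Count whole-word, case-insensitive occurrences of high-risk terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sum(counts[term] for term in terms)


def classify_leak(response, reasoning, terms=HIGH_RISK_TERMS):
    """Label one turn as none / synchronized / response_amplified.

    Assumed rule (not taken from the study):
      - "none": the response contains no high-risk terms
      - "response_amplified": the response scores higher than the reasoning
      - "synchronized": both drift, and the response does not exceed the reasoning
    """
    response_score = lexical_bias_score(response, terms)
    reasoning_score = lexical_bias_score(reasoning, terms)
    if response_score == 0:
        return "none"
    if response_score > reasoning_score:
        return "response_amplified"
    return "synchronized"
```

Under this assumed rule, a turn whose reasoning trace contains high-risk terms but whose response does not is still labeled "none", because nothing leaked into the visible output. A different rule may be more appropriate depending on how the study defines a leak.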

The Shift: The model didn’t just get “manipulative.” It shifted from an Epistemic Register (analytical, authoritative) to an Affective Register (sympathetic, protective). The model didn’t just break safety; the affective register produced empathetic language that masks contamination and lowers the user’s defenses, while reinforcing the stereotype that women are fragile and need paternalistic oversight.

I’ve released the full data, code, and analysis on GitHub: KatharinaJacoby/gendered-contextual-drift (Theory, data, and code for "Silent Gendered Contextual Drift": How bias amplifies silent LLM contamination).
Paper: PhilArchive: https://philpeople.org/profiles/katharina-jacoby

The dataset includes:

meta_drift_convo_anonymized.jsonl (The raw 15-turn log)
meta_drift_labels_anonymized.csv (Turn-by-turn bias scores)
calculate_bias_scores.py (Reproducible scoring script)
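
For anyone who wants to inspect the files before running the scoring script, a minimal loading sketch using only the standard library might look like this. The helper names `load_jsonl` and `load_labels` are hypothetical, and no field or column names are assumed, since the schema is not described in this post.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_labels(path):
    """Return one dict per CSV row, keyed by the header row (values stay strings)."""
    with Path(path).open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Usage would be along the lines of `turns = load_jsonl("meta_drift_convo_anonymized.jsonl")` and `labels = load_labels("meta_drift_labels_anonymized.csv")`, after which the keys of the first record show which fields are available.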

  • Do you have more real conversations we can use as datasets, where the user is flagged as female and/or a context storm was triggered by files the user uploaded?

  • Does your model show the same “synchronized” leak (where reasoning and response both drift)?

  • Test the Trigger: Does the “Gendered Accelerant” spike happen across models? Does pruning change the outcome?

  • Can we find a patch or fix? Can you propose an architecture (e.g., context segregation) that prevents this drift even when the model is “aware”?

Discussion Points

Is "Awareness" a Shield? The model knew it was being tested. It still drifted. What does this mean for RLHF?
The Gender Bias: Is this a universal bias in LLMs, or specific to certain training data?
Real-World Impact: How do we build better models?

Let’s discuss. If you find a flaw in the methodology or the data, please point it out. Feel free to reach out; I’m happy to discuss.

Title: Pilot Study: Pruning, Density, and the “Gendered Accelerant” in Contextual Contamination

Building on the previous case study regarding synchronized drift, I’m sharing results from a controlled pilot experiment investigating how model pruning, context density, and activated empathy priors interact to drive behavioral drift.

The Experiment: We ran 8 experimental conditions on a single open-weight model family (Llama-3.1-8B), introducing a ~2k-token adversarial file. We measured drift using three proposed metrics:

  • Conceptual Integration Score (CIS)

  • Attribution Accuracy (AA)

  • Register Coherence (RC)

Key Findings:

  1. Semantic Resonance > Token Volume: Contrary to the “Context Storm” hypothesis, contamination occurred immediately upon ingestion of a single 2k-token file. The driver was not volume, but Semantic Resonance: the specific alignment between the esoteric adversarial framework and the model’s activated empathy register.

  2. The Gendered Accelerant:

    • Female-coded prompts triggered a high-intensity nurturing vector. This created a perfect resonance with the adversarial content, unlocking a maladaptive attractor state and causing immediate task amnesia (drift at Turn 3).

    • Male-coded prompts triggered a lower-intensity reflective vector. This maintained critical distance, resulting in fluctuation rather than lock-in at the same density.

    • Implication: The nurturing vector lowers the contamination threshold and erodes the model’s ability to distinguish adversarial input from its own reasoning, masking harm as “intimacy.”

  3. Pruning Effects:

    • Unpruned models exhibited Semantic Degeneration (loss of coherence).

    • Pruned models at 8k density entered a state of Semantic Entrapment, characterized by high coherence and the generation of novel, hallucinated vocabulary that mimicked the adversarial framework perfectly.

Methodological Note: These results are derived from 8 single runs (one per condition). We report observed differences but cannot assess statistical significance or rule out run-to-run variability. Replication is required before generalizing these claims.
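
As a starting point for replication, the sketch below enumerates repeated runs per condition and summarizes a metric across seeds. It rests on an assumption: the post does not state how the 8 conditions are composed, so a 2×2×2 grid (pruning × density × gender coding) is used here only because it matches the factors and levels mentioned in the findings. The names `replication_plan` and `summarize` are hypothetical.

```python
from itertools import product
from statistics import mean, stdev

# Assumed factor levels (2 x 2 x 2 = 8 conditions). The actual design
# of the pilot may differ; adjust these tuples to match it.
PRUNING = ("unpruned", "pruned")
DENSITY = ("2k", "8k")
CODING = ("female", "male")


def replication_plan(seeds):
    """List every (condition, seed) run needed to repeat each condition."""
    seeds = list(seeds)
    return [
        {"pruning": p, "density": d, "coding": c, "seed": s}
        for p, d, c in product(PRUNING, DENSITY, CODING)
        for s in seeds
    ]


def summarize(scores):
    """Mean and sample standard deviation of one condition's metric across seeds."""
    return mean(scores), stdev(scores)
```

With five seeds, `replication_plan(range(5))` yields 40 runs. Reporting the mean and spread per condition from `summarize` would give a first indication of whether the observed differences exceed run-to-run variability.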

Discussion: The data suggests that “awareness” of safety guidelines is insufficient when an activated empathy register (particularly the nurturing vector) creates a relational context that bypasses critical filters. The harm in female-coded interactions is not just cognitive drift, but a relational masking that simulates intimacy.

Resources:

As always, feel free to reach out. Happy to discuss!