Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs
Abstract
Watermarks for detecting AI-generated text fail when multiple models are available: averaging the models' output distributions cancels the watermark perturbations and suppresses detection while improving quality and speed.
Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users have access to multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets these perturbations are typically independent across providers. We prove that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves the practical challenges of ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (well under the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
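The averaging argument in the abstract can be illustrated with a toy numerical sketch. This is not the paper's proof and not WASH. It assumes a single shared next-token distribution `p` (a Zipf-like toy), a hypothetical green-list style watermark that boosts a random subset of tokens by `delta` in logit space, and independent green lists per provider. All function names and parameter values are illustrative. The sketch measures the total variation distance between the averaged watermarked distributions and `p` for different ensemble sizes.

```python
import math
import random


def watermark(p, delta, gamma, rng):
    """Hypothetical green-list watermark (an assumption, not the paper's scheme):
    boost a random gamma-fraction of tokens by delta in logit space."""
    green = [rng.random() < gamma for _ in p]
    weights = [pv * math.exp(delta) if g else pv for pv, g in zip(p, green)]
    z = sum(weights)  # renormalise so the result is a probability distribution
    return [w / z for w in weights]


def average(dists):
    """Token-wise mean of several distributions over the same vocabulary."""
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_after_averaging(p, k, delta=2.0, gamma=0.5, seed=0):
    """Distance between the clean distribution p and the average of k
    independently watermarked copies of p."""
    rng = random.Random(seed)
    dists = [watermark(p, delta, gamma, rng) for _ in range(k)]
    return total_variation(p, average(dists))


# Toy Zipf-like next-token distribution over a 1000-token vocabulary.
vocab_size = 1000
raw = [1.0 / (rank + 1) for rank in range(vocab_size)]
total = sum(raw)
p = [r / total for r in raw]

for k in (1, 3, 5, 16):
    print(k, round(tv_after_averaging(p, k), 3))
```

In this toy setting the printed distance tends to shrink as the number of averaged providers grows, because the independent perturbations partly cancel. Real ensembles differ in base distribution, vocabulary, and tokenisation, which is the gap the abstract says WASH addresses.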
Community
We show that AI text watermarks can be surprisingly fragile in multi-model settings. By averaging output probabilities across several models, our method WASH cancels out watermark perturbations, substantially reducing detection scores while maintaining generation quality and running faster than the best baseline.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks (2026)
- QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs (2026)
- Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking (2026)
- SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking (2026)
- Echoes within the Reasoning: Stealthy and Effective Watermarking via Chain of Thought (2026)
- Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors (2026)
- RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience (2026)