arxiv:2506.16322

PL-Guard: Benchmarking Language Model Safety for Polish

Published on Jun 19, 2025

Abstract

A benchmark dataset and adversarial testing framework for evaluating language model safety in Polish show that a HerBERT-based classifier outperforms other models, especially under adversarial conditions.

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
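
The abstract does not say which perturbations were used to build the adversarial variants. As a minimal sketch of what character-level perturbation of Polish text can look like, the Python below strips diacritics and swaps adjacent letters. Both transformations, the function names, and the `rate` and `seed` parameters are illustrative assumptions, not the paper's method.

```python
import random

# Hypothetical mapping for illustration; the paper's perturbation set is not given.
POLISH_DIACRITICS = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def strip_diacritics(text: str) -> str:
    """Replace Polish diacritic letters with their plain ASCII counterparts."""
    return text.translate(POLISH_DIACRITICS)


def swap_adjacent(text: str, rate: float, seed: int = 0) -> str:
    """Swap neighbouring letters with probability `rate` (seeded, so repeatable)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so no letter moves twice
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    sample = "Zażółć gęślą jaźń"
    print(strip_diacritics(sample))
    print(swap_adjacent(sample, rate=0.3))
```

Running a safety classifier on both the original and the perturbed text, and comparing the predicted labels, gives a rough robustness check of the kind the abstract describes.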
