Hallucination Rates Across Domain-Specific LLM Fine-Tuning: A Systematic Evaluation

Mehmet Yilmaz*
Department of Computer Engineering, Middle East Technical University, Ankara, Turkey, 06800
myilmaz@ceng.metu.edu.tr
Preethi Rajagopalan
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 15213
Anton Ivashkin
Skolkovo Institute of Science and Technology, Moscow, Russia, 143026

Abstract

Hallucination, the generation of factually incorrect, fabricated, or internally inconsistent text by large language models, is one of the most practically consequential failure modes in LLM deployment. Fine-tuning on domain-specific data is widely used to improve LLM performance in specialised domains, but the relationship between fine-tuning and hallucination rates remains poorly characterised. This paper presents a systematic evaluation of hallucination rates before and after domain-specific fine-tuning across four domains (biomedical, legal, financial, and software engineering) and three base models (Llama-3.1-8B, Mistral-7B-v0.3, and Qwen2.5-7B). We use a three-component hallucination taxonomy — factual hallucination, entity hallucination, and reasoning hallucination — and evaluate each component using a combination of automated fact-checking pipelines and expert annotation. Counter to the common assumption that fine-tuning on domain data reduces hallucination by reinforcing factual associations, we find that fine-tuning on high-quality but narrow domain corpora frequently increases entity and reasoning hallucination rates even when factual hallucination rates decrease. We link this phenomenon to a degradation in world-model breadth during fine-tuning and provide evidence that the effect is modulated by the ratio of domain-specific to general knowledge in the fine-tuning data mix.

How to Cite

Yilmaz, M., Rajagopalan, P., & Ivashkin, A. (2025). Hallucination Rates Across Domain-Specific LLM Fine-Tuning: A Systematic Evaluation. DATAMIND, 3(1), 1-4. https://doi.org/10.63646/