Abstract
The proliferation of open-weight large language models (LLMs) has democratized access to advanced natural language processing while introducing significant challenges for content authentication and intellectual property protection. This study investigates watermark transferability across model architectures and evaluates defense mechanisms against watermark-removal attacks on open-weight LLMs. We propose a novel hierarchical watermarking framework that distributes signature information across multiple architectural layers, analogous to multi-source data integration systems. Our methodology combines statistical watermark detection with adversarial robustness testing to quantify watermark survival rates under various fine-tuning scenarios. Experimental results show that watermarks embedded with our hierarchical approach transfer across architecture boundaries to varying degrees, with an average retention rate of 73.4% under moderate fine-tuning. We analyze the relationship between model performance preservation and watermark detectability, finding a positive correlation: higher-performing architectures maintain stronger watermark signatures during transfer. Our multi-layered defense strategy, which combines redundant watermark embedding with adaptive verification mechanisms, achieves 91.2% detection accuracy against sophisticated removal attempts. These findings expose critical vulnerabilities in existing watermarking schemes and provide actionable insights for developing more robust authentication systems in the era of openly accessible LLMs.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright (c) 2026 Zhenyu Qiao (Author)