Abstract
The rapid proliferation of Large Language Models (LLMs) has created unprecedented challenges in content authentication and model attribution. Traditional statistical fingerprinting methods rely primarily on token-level distribution patterns, which are increasingly vulnerable to adversarial manipulation and model fine-tuning. This research introduces a multi-layered fingerprinting framework that extends beyond surface token distributions to incorporate syntactic structures and discourse-level signatures for robust LLM authentication. We propose a hierarchical feature extraction methodology that captures deep linguistic patterns, including dependency-parse structures, semantic coherence measures, and discourse relationship markers. Our experimental evaluation across multiple state-of-the-art LLMs demonstrates that syntactic and discourse-level features provide significantly higher discriminative power than traditional token-based approaches, achieving 94.7% authentication accuracy even under adversarial conditions. The framework introduces three key innovations: syntactic tree embedding for structural fingerprinting, discourse graph construction for rhetorical pattern analysis, and temporal coherence modeling for stylistic consistency verification. Results indicate that combining multiple linguistic levels yields robust signatures resistant to common evasion techniques while maintaining computational efficiency for real-world deployment. This work establishes foundational principles for next-generation LLM authentication systems that can adapt to evolving model architectures and generation strategies.
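To make the multi-layered idea concrete, the sketch below extracts candidate features at the three linguistic levels the framework targets: token-level lexical statistics, syntactic dependency-parse structure, and coarse discourse proxies. It is a minimal Python illustration, assuming spaCy with its small English pipeline (en_core_web_sm, installed via "python -m spacy download en_core_web_sm"); the specific feature names and the discourse-marker list are hypothetical placeholders, not the feature set used in the experiments.

    # Minimal sketch of multi-level fingerprint feature extraction.
    # Assumes spaCy and the "en_core_web_sm" pipeline are installed;
    # feature names and DISCOURSE_MARKERS are illustrative, not the
    # paper's actual feature set.
    from collections import Counter
    import statistics

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical set of connectives, used as a coarse proxy for
    # discourse-level rhetorical patterns.
    DISCOURSE_MARKERS = {"however", "therefore", "moreover", "thus",
                         "consequently", "furthermore", "nevertheless"}

    def token_depth(token):
        """Depth of a token in its dependency tree (root = 0)."""
        depth = 0
        while token.head is not token:  # spaCy roots are their own head
            token = token.head
            depth += 1
        return depth

    def extract_fingerprint_features(text: str) -> dict:
        doc = nlp(text)
        words = [t for t in doc if t.is_alpha]
        sents = list(doc.sents)

        # Token level: surface lexical statistics.
        type_token_ratio = len({t.lower_ for t in words}) / max(len(words), 1)

        # Syntactic level: dependency-label distribution and parse depth.
        dep_counts = Counter(t.dep_ for t in words)
        mean_depth = statistics.mean(token_depth(t) for t in doc) if len(doc) else 0.0

        # Discourse level: connective rate and sentence-length variability
        # as crude coherence/rhythm proxies.
        marker_rate = sum(t.lower_ in DISCOURSE_MARKERS for t in words) / max(len(words), 1)
        sent_lengths = [len(s) for s in sents]
        length_var = statistics.pvariance(sent_lengths) if len(sent_lengths) > 1 else 0.0

        return {
            "type_token_ratio": type_token_ratio,
            "mean_parse_depth": mean_depth,
            "dep_label_dist": {k: v / max(len(words), 1) for k, v in dep_counts.items()},
            "discourse_marker_rate": marker_rate,
            "sentence_length_variance": length_var,
        }

    if __name__ == "__main__":
        sample = ("However, the model's output remained consistent across prompts. "
                  "Therefore, we treated its phrasing as a stable signature.")
        print(extract_fingerprint_features(sample))

In a full pipeline, features from each level would be concatenated into a fingerprint vector and compared against per-model reference profiles; the multi-level combination is what provides the robustness to token-level evasion described above.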

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright (c) 2026 Yongjie Lin, Zihan Cui, Alexander Monroe