Abstract
Modern distributed systems generate massive volumes of heterogeneous monitoring data, primarily consisting of unstructured log messages and structured performance metrics. Traditional anomaly detection approaches analyze these data streams independently, failing to capture critical cross-modal correlations that indicate system failures. This paper proposes a novel multimodal fusion framework that leverages contrastive representation learning to unify log and metric analysis forcomprehensive system anomaly detection. Our approach employs dual encoders to extract semantic representations from log sequences and temporal patterns from metric time series, then aligns these representations in a shared embedding space through contrastive learning objectives. The framework learns to maximize agreement between temporally correlated log-metric pairs while distinguishing anomalous patterns from normal system behavior. Extensive experiments on three production datasets including HDFS, OpenStack, and real-world AIOps systems demonstrate that our method achieves F1-scores exceeding 96%, outperforming single-modality baselines by substantial margins ranging from 12% to 18%. The learned representations enable early anomaly detection with improved interpretability, providing operators with actionable insights for rapid incident response. Our framework processes monitoring data with an average latency of 180 milliseconds, making it suitable for real-time production deployments

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright (c) 2026 Sébastien Laurent, Anja Bergström (Author)