Abstract
Pedestrian re-identification in autonomous driving is affected by heterogeneous sensor conditions and modality-dependent noise. Building on CLIP-based uncertainty-aware modeling, this paper presents a cross-modal contrastive learning framework that aligns visual and textual pedestrian representations while accounting for sensor uncertainty. Modality-specific confidence weights are introduced to reduce the influence of unreliable features during representation learning. The method is evaluated on three autonomous driving datasets, including nuScenes-ReID and two large-scale urban benchmarks, comprising more than 120,000 pedestrian samples and 45,000 identities. Comparisons are conducted against recent vision-only and vision–language baselines, including PCB, MGN, TransReID, and CLIP-based ReID models. Experimental results show improvements over these baselines of 3.9%–5.0% in rank-1 accuracy and 4.2%–5.6% in mean average precision under low-visibility and adverse weather conditions.
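As a rough illustration of the kind of confidence-weighted cross-modal objective the abstract describes, the sketch below implements a symmetric CLIP-style InfoNCE loss in which per-sample confidence scores for each modality down-weight unreliable pairs. The function name, the multiplicative combination of confidences, and the temperature value are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_clip_loss(image_feats, text_feats,
                                  image_conf, text_conf,
                                  temperature=0.07):
    """Symmetric image-text InfoNCE loss with per-sample confidence weights.

    image_feats, text_feats: (N, D) embeddings from the visual / textual encoders.
    image_conf, text_conf:   (N,) confidence scores in [0, 1], e.g. derived from
                             predicted feature variance or sensor quality
                             (hypothetical source of the weights).
    """
    # L2-normalise so the dot product is a cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)

    # Per-sample cross-entropy in both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")

    # Joint confidence down-weights pairs where either modality is unreliable.
    w = image_conf * text_conf
    w = w / (w.sum() + 1e-8)

    return (w * (loss_i2t + loss_t2i)).sum() / 2


if __name__ == "__main__":
    torch.manual_seed(0)
    imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
    img_conf = torch.rand(8)   # stand-in for sensor-derived confidences
    txt_conf = torch.rand(8)
    print(confidence_weighted_clip_loss(imgs, txts, img_conf, txt_conf))
```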

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright (c) 2026 James R. Walker, Emily A. Thompson, Daniel K. Hughes (Authors)