2026
Thesis
Zero-Shot Action Recognition (ZSAR) aims to transfer visual-semantic consistency knowledge learned from seen actions to the recognition of unseen actions. However, existing methods predominantly focus on learning visual-semantic consistency in the seen domain, neglecting that distribution discrepancies between the seen and unseen domains lead to biased, non-causal feature learning. To address this issue, we propose an Unbiased Spatial–Temporal Atomic Fusion-based Zero-Shot Action Recognition method (USAF) that alleviates the influence of biased data on cross-modal mapping generalization. Specifically, we conceptualize the inference mechanism of ZSAR as a directed acyclic graph-based causal system to diagnose the cross-domain data bias induced by spatial and temporal confounders, and we design the model to follow its causal feature extraction pathway. We first decompose the video representation space into motion and object atomic spaces, allowing the extraction of fine-grained causal visual features that are robust to spatial confounders. Furthermore, we propose a Cross-modal Fusion & Matching Mechanism that dynamically fuses motion patterns and object cues from the atomic spaces and performs cross-modal matching against label semantics, thereby capturing discrepancies in temporal saliency distributions between the atomic spaces. Notably, to extract the local spatiotemporal dependencies needed to construct the motion atomic space, we propose a novel Short-term Spatial–Temporal Graph Convolutional Network. Extensive experiments demonstrate that USAF achieves state-of-the-art performance on multiple public action recognition benchmarks, significantly enhancing cross-modal mapping generalization from the seen to the unseen domain.
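To make the fusion-and-matching idea concrete, the following is a minimal sketch of how motion and object features from the two atomic spaces could be dynamically fused and then matched against label semantic embeddings. It is not the paper's actual implementation: the module name, feature dimensions, gating network, temporal pooling, and cosine-similarity scoring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusionMatching(nn.Module):
    """Hypothetical sketch of dynamic atomic-space fusion and cross-modal matching.

    Fuses per-frame motion and object features with learned gating weights,
    then scores the pooled video embedding against label semantic embeddings
    via cosine similarity. All dimensions and layer choices are assumptions.
    """

    def __init__(self, feat_dim=512, sem_dim=300):
        super().__init__()
        # Gating network: predicts per-frame weights for the two atomic spaces.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 2),
        )
        # Projection from visual feature space to the label semantic space.
        self.proj = nn.Linear(feat_dim, sem_dim)

    def forward(self, motion_feats, object_feats, label_embeds):
        # motion_feats, object_feats: (B, T, D) per-frame atomic-space features
        # label_embeds: (C, S) semantic embeddings of the C action labels
        w = F.softmax(self.gate(torch.cat([motion_feats, object_feats], dim=-1)), dim=-1)
        fused = w[..., :1] * motion_feats + w[..., 1:] * object_feats  # (B, T, D)
        video = fused.mean(dim=1)                                      # temporal pooling -> (B, D)
        video = F.normalize(self.proj(video), dim=-1)                  # (B, S)
        labels = F.normalize(label_embeds, dim=-1)                     # (C, S)
        return video @ labels.t()                                      # (B, C) similarity logits
```

Under this sketch, zero-shot inference amounts to replacing `label_embeds` with the semantic embeddings of unseen classes and taking the argmax over the resulting similarity scores, without retraining the visual branch.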