Cross-attentive multi-cue fusion for skeleton-based sign language recognition
Ogulcan Ozdemir, Inci M. Baytas, Lale Akarun
IEEE Access
Abstract
Sign language, the primary communication medium of the Deaf, uses visual cues from the upper body, hands, and face. Sign Language Recognition (SLR) aims to learn salient representations from these cues to bridge the communication gap between the Deaf and hearing communities. Existing Graph Neural Network-based SLR frameworks often represent sign videos as sequences of graphs formed by hand and body joints. However, relying solely on upper-body topology often leads to suboptimal solutions. This work shows that incorporating domain-specific hand topologies can single-handedly achieve state-of-the-art SLR performance. This motivates the need to fuse multiple visual cues to build robust and generalizable SLR frameworks. However, fusion is challenging because spatial and temporal dynamics differ across articulators. To address this, we propose a multi-cue cross-attention framework that enables interactions between hand and upper-body cues during fusion. We demonstrate how the proposed attention-based framework exposes distinct temporal patterns in visual cue representations extracted via a Spatio-Temporal Graph Convolutional Network (ST-GCN) and exploits them to learn sign language representations more effectively. Our experiments on two benchmark isolated sign language datasets, BosphorusSign22k and AUTSL, show that the proposed framework is on par with the state of the art in isolated SLR while highlighting the benefit of choosing domain-specific hand graph topologies and fusing multiple cues. Furthermore, our cross-attentive fusion of upper-body and hand cues improves recognition accuracy over hand-only models by around 1% and 3% on the respective datasets, while making recognition more interpretable and demonstrating the complementary interactions between visual cues.
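To make the fusion mechanism concrete, a minimal sketch of cross-attention between two cue streams is given below. It follows the standard scaled dot-product attention; the symbols H (per-frame hand-cue features from ST-GCN), B (per-frame upper-body-cue features), the projection matrices W_Q, W_K, W_V, and the key dimension d_k are illustrative assumptions, and the paper's exact formulation may differ.

% Illustrative sketch only: standard scaled dot-product cross-attention
% in which hand-cue features query upper-body-cue features.
\begin{align}
  Q &= H W_Q, \qquad K = B W_K, \qquad V = B W_V, \\
  \mathrm{CrossAttn}(H, B) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .
\end{align}

Under this sketch, the attention weights indicate which upper-body frames each hand frame attends to, which is one way the fusion can be inspected for interpretability.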