(Figure: the left is the original video; the right is produced by our T2S-GPT.)
This paper proposes an end-to-end cross-modal contrastive generative model (CCC) that generates semantically consistent sign language videos directly from text. To address the weaknesses of existing methods in semantic-action alignment and autoregressive error accumulation, the model adopts a three-stage framework. First, it learns discrete motion representations with a MotionVQVAE, extracting skeletal keypoints via RTMPose and reconstructing them into high-quality motion codes. Second, it designs a multimodal fusion mechanism that combines the German_Semantic_V3b text encoder with contrastive learning to align text and action features, using false-negative sample filtering and parallel task training to strengthen the fusion. Third, it builds a Transformer-based autoregressive generation module that reduces error accumulation through multimodal feature fusion and data-driven initialization. Experiments on the RWTH-PHOENIX-Weather 2014T dataset show that the method outperforms existing approaches on BLEU and ROUGE metrics, and ablation studies confirm the contribution of each module. Unlike traditional methods that rely on gloss annotations, the proposed approach generates sign language directly from text without intermediate gloss supervision.
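To make the three-stage design concrete, the following is a minimal PyTorch sketch of the core components: a VQ-style motion quantizer (stage 1), a symmetric contrastive loss for text-action alignment (stage 2), and a Transformer decoder for autoregressive motion-token generation (stage 3). All class names, dimensions, and hyperparameters here are illustrative assumptions rather than the authors' implementation; the RTMPose extraction, German_Semantic_V3b encoding, false-negative filtering, and data-driven initialization steps are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionQuantizer(nn.Module):
    """Stage 1 (VQ-VAE-style): map continuous pose features to discrete motion tokens."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, pose_feat):                      # pose_feat: (B, T, dim)
        B, T, D = pose_feat.shape
        flat = pose_feat.reshape(-1, D)
        dists = torch.cdist(flat, self.codebook.weight)        # (B*T, num_codes)
        codes = dists.argmin(dim=-1).reshape(B, T)             # discrete motion tokens
        quantized = self.codebook(codes)
        # Straight-through estimator so gradients flow back to the pose encoder.
        quantized = pose_feat + (quantized - pose_feat).detach()
        return codes, quantized


def contrastive_alignment_loss(text_emb, motion_emb, temperature=0.07):
    """Stage 2: symmetric InfoNCE loss pulling paired text/motion embeddings together."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class AutoregressiveSignDecoder(nn.Module):
    """Stage 3: Transformer decoder predicting the next motion token from text features."""
    def __init__(self, num_codes=1024, dim=256, num_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(num_codes, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, motion_tokens, text_memory):      # tokens: (B, T), memory: (B, L, dim)
        x = self.token_emb(motion_tokens)
        T = x.size(1)
        # Causal mask so each step only attends to earlier motion tokens.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(x, text_memory, tgt_mask=causal)
        return self.head(h)                             # next-token logits: (B, T, num_codes)
```

At inference time, such a decoder would be run step by step, feeding each predicted motion token back in and decoding the resulting token sequence through the quantizer's codebook and a motion decoder to produce the pose sequence for rendering.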