3D human pose estimation from monocular images is inherently challenging due to frequent occlusions, which introduce significant ambiguity in joint visibility. Regression-based methods are particularly sensitive to these ambiguities, often producing unstable and jittery pose estimates. To overcome these limitations, recent token-based methods discretize poses into structured representations that better capture joint dependencies. However, most existing approaches operate frame-wise, neglecting temporal continuity and consequently suffering from temporally inconsistent predictions. We therefore propose a spatio-temporal token-based framework for 3D human pose estimation that explicitly models both spatial and temporal dependencies. Specifically, a Spatio-Temporal Tokenizer decomposes 3D pose sequences into discrete spatial and temporal tokens via a dual-codebook design. To predict these tokens from 2D pose sequences, we further develop a token classifier based on a SemGCN–GraphGRU architecture, enabling effective temporal reasoning while preserving skeletal structure. Extensive experiments on the Human3.6M dataset demonstrate that our method achieves state-of-the-art performance among short-sequence methods while significantly reducing high-frequency jitter and producing smooth, physically plausible 3D pose sequences.
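To make the dual-codebook idea concrete, here is a minimal sketch of the token-lookup step: per-joint (spatial) and per-frame (temporal) features are each quantized against their own codebook by nearest-neighbor search. All names, dimensions, and the random "encoder outputs" are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature row to the index of its nearest codebook entry."""
    # (N, D) features vs. (K, D) codebook -> squared distances (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
T, J, D, K = 20, 17, 8, 64  # frames, joints, feature dim, codebook size (all hypothetical)

# Stand-ins for encoder outputs: per-joint and per-frame features
spatial_feats = rng.standard_normal((J, D))
temporal_feats = rng.standard_normal((T, D))

# Dual codebooks: one for spatial structure, one for temporal dynamics
spatial_codebook = rng.standard_normal((K, D))
temporal_codebook = rng.standard_normal((K, D))

s_tokens, _ = quantize(spatial_feats, spatial_codebook)
t_tokens, _ = quantize(temporal_feats, temporal_codebook)
print(s_tokens.shape, t_tokens.shape)  # -> (17,) (20,)
```

In a full VQ-style pipeline these discrete indices would be the prediction targets for the token classifier, with the frozen decoder mapping predicted tokens back to 3D poses.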
Figure 1: Overview of the proposed spatio-temporal tokenization framework. The codebook and decoder learned in Stage 1 are frozen during Stage 2 training to preserve the learned structural prior.
| Method (f = input frames) | MPJPE (mm) ↓ | P-MPJPE (mm) ↓ |
|---|---|---|
| PCT [GWW 23] (Single Frame) | 50.8 | 41.9 |
| Cai et al. [CFS 21] (f=7) | 45.6 | 35.5 |
| Ours (f=20) | 45.5 | 36.6 |
Figure 2:
@inproceedings{jeon2026token,
  title     = {Token-Based Dual-Codebook Learning for Robust 3D Pose Lifting},
  author    = {Jeon, Minsu and Lim, L. and Musialski, P.},
  booktitle = {Proceedings of Eurographics (Short Papers)},
  year      = {2026}
}