3D human pose estimation from monocular images is inherently challenging due to frequent occlusions, which introduce significant ambiguity in joint visibility. Regression-based methods are particularly sensitive to these ambiguities, often producing unstable and jittery pose estimates. To overcome these limitations, recent token-based methods discretize poses into structured representations that better capture joint dependencies. However, most existing approaches operate frame-wise, neglecting temporal continuity and consequently suffering from temporally inconsistent predictions. We therefore propose a spatio-temporal token-based framework for 3D human pose estimation that explicitly models both spatial and temporal dependencies. Specifically, a Spatio-Temporal Tokenizer decomposes 3D pose sequences into discrete spatial and temporal tokens via a dual-codebook design. To predict these tokens from 2D pose sequences, we further develop a token classifier based on a SemGCN–GraphGRU architecture, enabling effective temporal reasoning while preserving skeletal structure. Extensive experiments on the Human3.6M dataset demonstrate that our method achieves state-of-the-art performance among short-sequence methods while significantly reducing high-frequency jitter and producing smooth, physically plausible 3D pose sequences.
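To make the dual-codebook idea concrete, here is a minimal sketch of the token-lookup step: per-joint (spatial) and per-frame (temporal) features are each quantized against their own codebook by nearest-neighbor search. All names, dimensions, and the random "encoder outputs" are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature row to the index of its nearest codebook entry."""
    # (N, D) features vs. (K, D) codebook -> squared distances (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
T, J, D, K = 20, 17, 8, 64  # frames, joints, feature dim, codebook size (all hypothetical)

# Stand-ins for encoder outputs: per-joint and per-frame features
spatial_feats = rng.standard_normal((J, D))
temporal_feats = rng.standard_normal((T, D))

# Dual codebooks: one for spatial structure, one for temporal dynamics
spatial_codebook = rng.standard_normal((K, D))
temporal_codebook = rng.standard_normal((K, D))

s_tokens, _ = quantize(spatial_feats, spatial_codebook)
t_tokens, _ = quantize(temporal_feats, temporal_codebook)
print(s_tokens.shape, t_tokens.shape)  # -> (17,) (20,)
```

In a full VQ-style pipeline these discrete indices would be the prediction targets for the token classifier, with the frozen decoder mapping predicted tokens back to 3D poses.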
Figure 1: Overview of the proposed spatio-temporal tokenization framework. The codebook and decoder learned in Stage 1 are frozen during Stage 2 training to preserve the learned structural prior.
| Method (f = input frames) | MPJPE (mm) ↓ | P-MPJPE (mm) ↓ |
|---|---|---|
| PCT [GWW 23] (Single Frame) | 50.8 | 41.9 |
| Cai et al. [CFS 21] (f=7) | 45.6 | 35.5 |
| Ours (f=20) | 45.5 | 36.6 |
Figure 2:
@inproceedings{jeon2026token,
  title     = {Token-Based Dual-Codebook Learning for Robust 3D Pose Lifting},
  author    = {Jeon, Minsu and Lim, L. and Musialski, P.},
  booktitle = {Proceedings of Eurographics (Short Papers)},
  year      = {2026}
}