Gai, D., Feng, R., Min, W., Yang, X., Su, P., Wang, Q. and Han, Q., 2023. Spatiotemporal Learning Transformer for Video-Based Human Pose Estimation. IEEE Transactions on Circuits and Systems for Video Technology, 33 (9), 4564-4576.
Full text available as:
PDF: Spatiotemporal_Learning_Transformer_for_Video-Based_Human_Pose_Estimation.pdf - Accepted Version (6MB). Restricted to Repository staff only until 24 April 2025. Available under License Creative Commons Attribution Non-commercial.
Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk. Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.
DOI: 10.1109/TCSVT.2023.3269666
Abstract
Multi-frame human pose estimation has long been an appealing and fundamental problem in visual perception. Owing to frequent rapid motion and pose occlusion in videos, the task is extremely challenging. Current state-of-the-art methods model spatiotemporal features by fusing each frame in the local sequence equally, which weakens the target-frame information. In addition, existing approaches usually emphasize deep features while ignoring the detailed information carried by shallow feature maps, resulting in the loss of crucial features. To address these problems, we propose an effective framework, the spatiotemporal learning transformer for video-based human pose estimation (SLT-Pose), which consists of a Personalized Feature Extraction Module (PFEM), a Self-feature Refinement Module (SRM), a Cross-frame Temporal Learning Module (CTLM) and a Disentangled Keypoint Detector (DKD). Specifically, PFEM extracts and modulates individual frame features to adapt to varying human shapes, and integrates the single-frame features into spatiotemporal features. SRM then establishes global spatial correlations on the target frame to obtain a refined feature. Next, CTLM searches the spatiotemporal features for the information most closely related to the target frame, intensifying the interaction between the target frame and the local sequence using both shallow detailed and deep semantic representations. Finally, DKD extracts the disentangled characteristics of each joint and encodes the articulated joint pairs in the human body, enabling the model to predict keypoint heatmaps reasonably and accurately.
Extensive experiments on three human motion benchmarks, PoseTrack2017, PoseTrack2018, and Sub-JHMDB, demonstrate that SLT-Pose performs favorably against state-of-the-art approaches in terms of both objective evaluation and subjective visual quality.
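The cross-frame temporal learning described in the abstract lets the target frame select the most relevant information from the local sequence. A minimal sketch of that idea, assuming scaled dot-product attention with the target frame as query and the sequence frames as keys/values (the function names and shapes here are illustrative assumptions, not the authors' CTLM implementation):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_frame_attention(target_feat, seq_feats):
    """Hypothetical sketch: the target frame is the query and the local
    sequence supplies keys/values, so frames most similar to the target
    dominate the fused spatiotemporal feature (rather than all frames
    being fused equally).

    target_feat: (d,) feature vector of the target frame
    seq_feats:   (T, d) per-frame features of the local sequence
    """
    d = target_feat.shape[-1]
    scores = seq_feats @ target_feat / np.sqrt(d)  # (T,) similarity to the target
    weights = softmax(scores)                      # attention weights over frames
    fused = weights @ seq_feats                    # (d,) aggregated feature
    return fused, weights

# Toy sequence of T=3 frame descriptors; frame 1 resembles the target,
# so it receives the largest attention weight.
target = np.array([1.0, 0.0, 0.0, 0.0])
seq = np.array([[0.0, 1.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
fused, weights = cross_frame_attention(target, seq)
```

This contrasts with equal-weight fusion (`weights = np.full(T, 1/T)`), which, as the abstract notes, dilutes the target-frame information.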
| Item Type: | Article |
|---|---|
| ISSN: | 1051-8215 |
| Uncontrolled Keywords: | human pose estimation; transformer; video; spatiotemporal feature learning |
| Group: | Faculty of Media & Communication |
| ID Code: | 38864 |
| Deposited By: | Symplectic RT2 |
| Deposited On: | 10 Aug 2023 10:49 |
| Last Modified: | 31 May 2024 07:09 |