
WTM: Weighted Temporal Attention Module for Group Activity Recognition.

Yadav, S., Agrawal, P., Tiwari, K., Adeli, E., Pandey, H. and Akbar, S. A., 2022. WTM: Weighted Temporal Attention Module for Group Activity Recognition. In: IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN 2022), 18-23 July 2022, University of Padua, Italy. (In Press)

Full text available as:

PDF
Group_WCCI.pdf - Accepted Version
Restricted to Repository staff only until 24 July 2022.
Available under License Creative Commons Attribution Non-commercial.

7MB

Abstract

Group Activity Recognition requires spatiotemporal modeling of an exponential number of semantic and geometric relations among various individuals in a scene. Previous attempts model these relations by aggregating independently derived spatial and temporal features. This increases the modeling complexity and results in sparse information due to lack of feature correlation. In this paper, we propose Weighted Temporal Attention Mechanism (WTM), a representational mechanism that combines spatial and temporal features of a local subset of a visual sequence into a single 2D image representation, highlighting areas of a frame where actor motion is significant. Pairwise dense optical flow maps representing the temporal characteristic of individuals over a sequence are used as attention masks over raw RGB images through a multi-layer weighted aggregation. We demonstrate a strong correlation between spatial and temporal features, which helps localize actions effectively in a multi-person scenario. The simplicity of the input representation allows the model to be trained by 2D image classification architectures in a plug-and-play fashion, which outperforms its multi-stream and multi-dimensional counterparts. The proposed method achieves the lowest computational complexity in comparison to other works. We demonstrate the performance of WTM on two widely used public benchmark datasets, namely the Collective Activity Dataset (CAD) and the Volleyball Dataset, and achieve state-of-the-art accuracies of 95.1% and 94.6%, respectively. We also discuss the application of this method to other datasets and general scenarios. The code is being made publicly available.
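The abstract's core idea (optical flow maps used as attention masks over RGB frames, then combined by weighted aggregation into one 2D image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the max-normalisation of flow magnitude, the linearly increasing default weights, and the single-layer aggregation are all assumptions made for illustration. The sketch also assumes the dense flow fields have already been computed by some external estimator and are passed in as arrays.

```python
import numpy as np


def flow_attention_mask(flow, eps=1e-8):
    """Map one dense flow field (H, W, 2) to a mask (H, W) in [0, 1].

    Assumption: attention is the per-pixel flow magnitude divided by
    its maximum over the frame; the abstract does not specify this.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude / (magnitude.max() + eps)


def weighted_temporal_image(frames, flows, weights=None):
    """Collapse a short clip into one flow-attended image (H, W, 3).

    frames:  (T, H, W, 3) RGB frames.
    flows:   (T-1, H, W, 2) dense flow between consecutive frame pairs.
    weights: (T-1,) hypothetical per-pair weights; defaults to linearly
             increasing values so later pairs count more (an assumption).
    """
    frames = np.asarray(frames, dtype=np.float64)
    flows = np.asarray(flows, dtype=np.float64)
    if flows.shape[0] != frames.shape[0] - 1:
        raise ValueError("expected exactly one flow field per consecutive frame pair")

    if weights is None:
        weights = np.arange(1, flows.shape[0] + 1, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalise so the weights sum to 1

    out = np.zeros(frames.shape[1:], dtype=np.float64)
    for frame, flow, w in zip(frames[1:], flows, weights):
        mask = flow_attention_mask(flow)
        # Broadcast the (H, W) mask over the colour channels.
        out += w * frame * mask[..., None]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((4, 8, 8, 3))           # 4 synthetic RGB frames
    flow_fields = rng.normal(size=(3, 8, 8, 2))  # 3 synthetic flow fields
    image = weighted_temporal_image(clip, flow_fields)
    print(image.shape)
```

Because the result is an ordinary three-channel image, it could in principle be fed to any 2D image classifier, which is the plug-and-play property the abstract describes.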

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Video Action Recognition; Human Activity Recognition; Transformers; Temporal Attention; Consensus; Convolutional Neural Networks
Group: Faculty of Science & Technology
ID Code: 36997
Deposited By: Symplectic RT2
Deposited On: 30 May 2022 10:43
Last Modified: 30 May 2022 10:43
