Yadav, S., Agrawal, P., Tiwari, K., Adeli, E., Pandey, H. and Akbar, S. A., 2022. WTM: Weighted Temporal Attention Module for Group Activity Recognition. In: IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN 2022), 18-23 July 2022, University of Padua, Italy.
Full text available as:

PDF: Group_WCCI.pdf (Accepted Version), 7MB. Available under License Creative Commons Attribution Non-commercial.

Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk. Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.
Abstract
Group Activity Recognition requires spatiotemporal modeling of an exponential number of semantic and geometric relations among the various individuals in a scene. Previous attempts model these relations by aggregating independently derived spatial and temporal features, which increases modeling complexity and yields sparse information owing to the lack of feature correlation. In this paper, we propose the Weighted Temporal Attention Module (WTM), a representational mechanism that combines the spatial and temporal features of a local subset of a visual sequence into a single 2D image representation, highlighting the areas of a frame where actor motion is significant. Pairwise dense optical flow maps, representing the temporal characteristics of individuals over a sequence, are used as attention masks over raw RGB images through a multi-layer weighted aggregation. We demonstrate a strong correlation between spatial and temporal features, which helps localize actions effectively in a multi-person scenario. The simplicity of the input representation allows the model to be trained with 2D image classification architectures in a plug-and-play fashion, outperforming its multi-stream and multi-dimensional counterparts. The proposed method achieves the lowest computational complexity in comparison to other works. We demonstrate the performance of WTM on two widely used public benchmark datasets, namely the Collective Activity Dataset (CAD) and the Volleyball Dataset, and achieve state-of-the-art accuracies of 95.1% and 94.6% respectively. We also discuss the application of this method to other datasets and general scenarios. The code is being made publicly available.
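To make the aggregation concrete, the following is a minimal Python sketch, not the authors' released code, of how pairwise dense optical-flow magnitudes could serve as attention masks over raw RGB frames, with a weighted aggregation collapsing a short clip into a single 2D image. The function name `wtm_representation`, the ramp weighting scheme, and the Farneback flow parameters are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a WTM-style representation, assuming Farneback
# dense optical flow and a normalized ramp as the per-frame weights.
import cv2
import numpy as np

def wtm_representation(frames, weights=None):
    """frames: list of HxWx3 uint8 RGB frames; returns one HxWx3 float image."""
    if weights is None:
        # Assumed weighting: later frames contribute more (normalized ramp).
        weights = np.arange(1, len(frames), dtype=np.float32)
        weights /= weights.sum()

    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    out = np.zeros_like(frames[0], dtype=np.float32)

    for i, w in enumerate(weights):
        # Pairwise dense optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(
            grays[i], grays[i + 1], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mag = np.linalg.norm(flow, axis=2)   # per-pixel motion magnitude
        mask = mag / (mag.max() + 1e-6)      # normalize to [0, 1] attention mask
        # Attention-masked RGB frame, weighted and accumulated.
        out += w * frames[i + 1].astype(np.float32) * mask[..., None]

    return out  # single 2D image, ready for a standard 2D CNN classifier
```

The single-image output is what would allow plug-and-play training with ordinary 2D image classification architectures, as the abstract describes.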
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Uncontrolled Keywords: | Video Action Recognition; Human Activity Recognition; Transformers; Temporal Attention; Consensus; Convolutional Neural Networks |
| Group: | Faculty of Science & Technology |
| ID Code: | 36997 |
| Deposited By: | Symplectic RT2 |
| Deposited On: | 30 May 2022 10:43 |
| Last Modified: | 01 Sep 2022 12:33 |