
MS-KARD: A Benchmark for Multimodal Karate Action Recognition.

Yadav, S., Deshmukh, A., Gonela, R., Kera, S., Tiwari, K., Pandey, H. and Akbar, S. A., 2022. MS-KARD: A Benchmark for Multimodal Karate Action Recognition. In: IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN 2022), 18-23 July 2022, University of Padua, Italy.

Full text available as:

KarateNet_WCCI.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial.



Classifying complex human motion sequences is a major research challenge in the domain of human activity recognition. Currently, most popular datasets lack a specialized set of classes covering action sequences that are similar in terms of spatial trajectories. To recognize such complex action sequences with high inter-class similarity, such as those in karate, multiple streams are required. To fulfill this need, we propose MS-KARD, a Multi-Stream Karate Action Recognition Dataset that uses multiple vision perspectives as well as sensor data (accelerometer and gyroscope). It includes 1518 video clips along with their corresponding sensor data. Each video was shot at 30 fps and lasts around one minute, equating to a total of 2,814,930 frames and 5,623,734 sensor data samples. The dataset has been collected for 23 classes, including Jodan Zuki and Oi Zuki. The data acquisition setting combines 2 orthogonal web cameras and 3 wearable inertial sensors, recording vision and inertial data, respectively. The aim of this dataset is to aid research on recognizing human actions that have similar spatial trajectories. The paper describes the statistics of the dataset and the acquisition setting, and provides baseline performance figures using popular action recognizers. We propose an ensemble-based method, KarateNet, that performs decision-level fusion on the two input modalities (vision and sensor data) to classify actions. For the first stream, the RGB frames are extracted from the videos and passed into action recognition networks such as Temporal Segment Network (TSN) and Temporal Shift Module (TSM). For the second stream, the sensor data is converted into a 2-D image and fed into a Convolutional Neural Network (CNN). The reported results were obtained by fusing the two streams. We also report results on ablations that use fusion with various input settings. The dataset and code will be made publicly available.
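
The two-stream idea in the abstract can be illustrated with a minimal sketch. It is not the KarateNet implementation: the abstract does not say how the sensor image is laid out or how the stream decisions are combined, so the channels-by-time layout, the per-channel min-max scaling, the weighted averaging of softmax probabilities, and the names `sensor_window_to_image`, `fuse_decisions`, and `vision_weight` are all assumptions made here for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sensor_window_to_image(window):
    """Turn a window of inertial samples into a 2-D array (channels x time).

    `window` is a list of samples, each a list of channel readings
    (for example accelerometer and gyroscope axes). Each channel is
    min-max scaled to [0, 1]; a constant channel maps to all zeros.
    This layout is an assumption, not the paper's documented encoding.
    """
    image = []
    for channel in zip(*window):  # transpose: one row per channel
        low, high = min(channel), max(channel)
        span = high - low
        image.append([(v - low) / span if span else 0.0 for v in channel])
    return image


def fuse_decisions(vision_logits, sensor_logits, vision_weight=0.5):
    """Decision-level (late) fusion by weighted averaging of probabilities.

    Each stream is classified independently; only the per-class
    probabilities are combined. `vision_weight` is a hypothetical
    tuning parameter, not a value taken from the paper.
    """
    if len(vision_logits) != len(sensor_logits):
        raise ValueError("both streams must score the same set of classes")
    vision_probs = softmax(vision_logits)
    sensor_probs = softmax(sensor_logits)
    fused = [
        vision_weight * v + (1.0 - vision_weight) * s
        for v, s in zip(vision_probs, sensor_probs)
    ]
    predicted_class = fused.index(max(fused))
    return predicted_class, fused


if __name__ == "__main__":
    # Toy window: 3 time steps, 2 channels (values are made up).
    print(sensor_window_to_image([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]))
    # Toy scores for 3 classes from each stream (values are made up).
    print(fuse_decisions([2.0, 1.0, 0.0], [0.0, 3.0, 0.0]))
```

In this sketch each stream's classifier would be trained separately and only its output scores reach `fuse_decisions`, which is what distinguishes decision-level fusion from feature-level fusion, where intermediate representations are merged before classification.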

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Action recognition; Multimodal; Karate and martial arts; Sports and exercises; Deep learning; Vision and wearable
Group: Faculty of Science & Technology
ID Code: 36996
Deposited By: Symplectic RT2
Deposited On: 30 May 2022 10:33
Last Modified: 01 Sep 2022 12:32

