Noriy, K. A., Yang, X., Budka, M. and Zhang, J. J., 2023. CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition. CoRR, abs/23. (In Press)
Full text available as:
PDF: CLASP__Multilingual_contrastive_approach_for_speech_and_sound_representation_learning.pdf - Accepted Version (351kB). Restricted to Repository staff only. Available under License Creative Commons Attribution Non-commercial.
Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk. Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.
Abstract
In multilingual speech processing, accurately understanding and interpreting emotions is pivotal yet challenging due to the scarcity of labelled datasets. Recent strides in contrastive learning have catalysed self-supervised methodologies, enabling the harnessing of unlabelled data. In light of these advancements, we introduce CLARA, a groundbreaking multilingual framework, as our solution to reduce dependence on labelled data and improve generalisation across a diverse array of languages and conditions. CLARA guides models towards shared representations across languages, facilitating the cross-lingual transfer of speech and emotional understanding even in scenarios with limited target-language data. A cornerstone of our approach is capturing emotional subtleties within speech, overcoming the challenges posed by the subjective nature of perceptual assessments. Through self-supervised learning on a rich multilingual data corpus, we aim to derive speech representations imbued with emotive dimensions, unlocking new potential in emotion-aware multilingual speech processing. Our methodology leverages data augmentation techniques to broaden the dataset spectrum, incorporating visual understanding via textual embedding and augmenting language data from high-resource sources to low-resource languages, enabling CLARA to learn these representations across domains. Rigorous experimentation demonstrates our model's superior performance across various tasks, such as emotion recognition, multilingual language comprehension, audio classification, and retrieval benchmarks, especially in zero-shot and few-shot scenarios. Our model presents a compelling approach to obtaining shared and adaptable speech representations across languages and acoustic conditions while encoding latent emotional aspects. Additionally, we showcase the model's capability to adapt to low-resource languages, marking a significant stride in multilingual speech representation learning.
| Item Type: | Article |
|---|---|
| Group: | Faculty of Science & Technology |
| ID Code: | 40905 |
| Deposited By: | Symplectic RT2 |
| Deposited On: | 07 Apr 2025 11:22 |
| Last Modified: | 07 Apr 2025 11:22 |