Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions.

Taati, E., Budka, M., Neville, S. and Canniffe, J., 2024. Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions. In: Nguyen, N. T., Chbeir, R., Manolopoulos, Y., Fujita, H., Hong, T-P., Nguyen, L. M. and Wojtkiewicz, K., eds. Intelligent Information and Database Systems: ACIIDS 2024. Cham: Springer, 184-197.

Full text available as:

PDF
Optimizing_Neural_Topic_Modeling_Pipelines__preprint_.pdf - Accepted Version
Restricted to Repository staff only until 16 July 2025.

299kB

Official URL: https://doi.org/10.1007/978-981-97-4982-9

DOI: 10.1007/978-981-97-4982-9_15

Abstract

Gaining insights from large-scale document archives is a challenging task. Recent advances in natural language processing, specifically unsupervised topic modeling, allow for automated discovery of abstract “topics” that characterize groups of semantically related documents within textual corpora. Neural topic modeling has emerged as a scalable approach through integrating state-of-the-art sentence embedding models into modeling pipelines. This embedding-based architecture enables efficient processing of large datasets. However, topic quality, which is often tied to input data quality, particularly in the case of speech-to-text, remains an open issue. This study presents a comparative evaluation of various component configurations within a neural topic modeling pipeline, as applied to a corpus of telephony transcriptions. Incorporating four embedding models (E5, Instructor, MiniLM, and SGPT), three dimensionality reduction approaches (maintaining versus reducing original embeddings by Truncated-SVD and UMAP), and two clustering algorithms (K-Means and HDBSCAN), 48 topic modeling pipelines are evaluated. The experimental results reveal that placing a context-aware embedding model in the pipeline leads to significant improvement in topic coherence, while larger models tend to achieve better topic diversity. Based on the above, we also propose best practices for the model layout in the pipeline, considering coherence and topic diversity scores.
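The configuration space described in the abstract can be pictured as a grid over the three pipeline stages. The following is a minimal sketch of enumerating that grid, not the authors' code: the variable and function names are illustrative assumptions, and only the component names come from the abstract. The three lists in the abstract multiply to 24 combinations; the reported total of 48 presumably reflects an additional factor that the abstract does not spell out, and the sketch does not guess at it.

```python
from itertools import product

# Component names as listed in the abstract; the identifiers are illustrative.
EMBEDDERS = ["E5", "Instructor", "MiniLM", "SGPT"]
# "none" stands for keeping the original embeddings without reduction.
REDUCERS = ["none", "Truncated-SVD", "UMAP"]
CLUSTERERS = ["K-Means", "HDBSCAN"]


def pipeline_grid():
    """Return every (embedder, reducer, clusterer) combination as a tuple."""
    return list(product(EMBEDDERS, REDUCERS, CLUSTERERS))


configs = pipeline_grid()
for embedder, reducer, clusterer in configs:
    # Each line names one candidate pipeline: embed, optionally reduce, cluster.
    print(f"{embedder} -> {reducer} -> {clusterer}")
```

In an actual evaluation, each tuple would be mapped to concrete model objects and scored for coherence and topic diversity; that mapping is left out here because the abstract does not specify it.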

Item Type: Book Section
ISBN: 978-981-97-4981-2
Volume: 14795
ISSN: 0302-9743
Group: Faculty of Science & Technology
ID Code: 40537
Deposited By: Symplectic RT2
Deposited On: 24 Dec 2024 12:19
Last Modified: 24 Dec 2024 12:19
