Pre-Training LongT5 for Vietnamese Mass-Media Multi-Document Summarization

Rusnachenko, Nikolai; Le, T. A.; Nguyen, N. D.

Rusnachenko, N., Le, T. A. and Nguyen, N. D., 2024. Pre-Training LongT5 for Vietnamese Mass-Media Multi-Document Summarization. Journal of Mathematical Sciences (United States), 285 (1), 88-99.

Full text available as:

PDF
AINL_2023_longt5_summarization.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial.
485kB

Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk.

Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.

DOI: 10.1007/s10958-024-07435-z

Abstract

Multi-document summarization is the task of extracting the most salient information from a set of input documents. One of the main challenges in this task is the long-term dependency problem. For texts written in Vietnamese, this challenge is compounded by the syllable-based text representation and the lack of labeled datasets. Recent advances in machine translation have led to significant growth in the use of a related architecture known as the Transformer. Pretrained on large amounts of raw text, Transformers can capture deep knowledge of the texts. In this paper, we survey the findings of language model applications for text summarization problems, including important Vietnamese text summarization models. Based on this survey, we select LongT5, pretrain it from scratch, and then fine-tune it for the Vietnamese multi-document text summarization problem. We analyze the resulting model and experiment with multi-document Vietnamese datasets, including ViMs, VMDS, and VLSP2022. We conclude that a Transformer-based model pretrained on a large amount of unlabeled Vietnamese text achieves promising results, which improve further after fine-tuning on a small amount of manually summarized texts. The pretrained model used in the experiment section is available online at https://github.com/nicolay-r/ViLongT5.

Item Type:	Article
ISSN:	1072-3374
Group:	Faculty of Media & Communication (Until 31/07/2025)
ID Code:	41160
Deposited By:	Symplectic RT2
Deposited On:	10 Jul 2025 08:44
Last Modified:	08 Nov 2025 01:08
