Gong, Y. and Cosma, G., 2023. Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval. Pattern Recognition, 137, 109272.
Full text available as:
PDF (Published Version, Open Access, 3MB), available under a Creative Commons Attribution Non-Commercial No Derivatives licence.
Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk. Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.
DOI: 10.1016/j.patcog.2022.109272
Abstract
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function, which learns an objective margin between the similarities of relevant and irrelevant image–description embedding pairs. However, this margin is set as a fixed hyperparameter, ignoring the semantic differences among irrelevant image–description pairs. To address the challenge of measuring the optimal similarities between image–description pairs before the VSE network has been trained, this paper presents a novel approach that comprises two main parts: (1) finding the underlying semantics of image descriptions; and (2) a novel semantically-enhanced hard negatives loss function, in which the learning objective is dynamically determined from the optimal similarity scores between irrelevant image–description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to three benchmark datasets for cross-modal information retrieval tasks. The results show that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks.
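For concreteness, the fixed-margin hard negatives objective described in the abstract is commonly implemented as a max-of-hinges triplet loss over a batch similarity matrix. The sketch below, in PyTorch, shows that baseline and one plausible way a per-pair margin could be modulated by a semantic similarity score between irrelevant pairs; the abstract does not give the paper's exact formulation, so the `semantic_sim` argument, the function name, and the margin-scaling rule are all illustrative assumptions, not the authors' method.

```python
import torch

def hard_negative_loss(sim, margin=0.2, semantic_sim=None):
    """Max-of-hinges hard negatives loss over a similarity matrix.

    sim:          (B, B) tensor, sim[i, j] = similarity of image i and
                  description j; the diagonal holds the relevant pairs.
    margin:       the fixed hyperparameter used by standard VSE training.
    semantic_sim: optional (B, B) tensor in [0, 1]; a HYPOTHETICAL input
                  illustrating a dynamically determined objective that
                  shrinks the margin for semantically similar negatives.
    """
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                 # similarity of relevant pairs
    if semantic_sim is None:
        m = margin                              # fixed margin (baseline)
    else:
        m = margin * (1.0 - semantic_sim)       # assumed per-pair margin
    cost_s = (m + sim - pos).clamp(min=0)       # image -> negative captions
    cost_im = (m + sim - pos.t()).clamp(min=0)  # caption -> negative images
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    cost_s = cost_s.masked_fill(mask, 0)        # exclude relevant pairs
    cost_im = cost_im.masked_fill(mask, 0)
    # keep only the hardest negative in each direction, as in standard
    # hard negatives training
    return cost_s.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```

Passing `semantic_sim=None` recovers the conventional fixed-margin loss the abstract critiques; supplying a similarity matrix makes the hinge tighter for negatives whose descriptions are semantically close, which is the general behaviour the proposed semantically-enhanced loss aims for.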
| Item Type | Article |
|---|---|
| ISSN | 0031-3203 |
| Uncontrolled Keywords | Visual semantic embedding network; Cross-modal; Information retrieval; Hard negatives |
| Group | Faculty of Science & Technology |
| ID Code | 40889 |
| Deposited By | Symplectic RT2 |
| Deposited On | 31 Mar 2025 14:42 |
| Last Modified | 31 Mar 2025 14:42 |