Self-contained Entity Discovery from Captioned Videos

Melika Ayoughi, Pascal Mettes, Paul Groth

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame-caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT, based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.
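
As a rough illustration of stage (i) only, the sketch below builds a bipartite graph between visual entities and names from frame-caption pairs, weighting each edge by how often the two co-occur. It is a toy under stated assumptions, not the authors' implementation: the function names `build_bipartite_graph` and `assign_names`, the pre-clustered visual identifiers (`face_a`, `face_b`), the example names, and the majority-vote assignment are all hypothetical stand-ins. The article's agreement and prototype-refinement stages are not reproduced here.

```python
from collections import Counter, defaultdict


def build_bipartite_graph(pairs):
    """Count co-occurrences between visual entities and caption names.

    `pairs` is an iterable of (visual_ids, names) tuples, one per
    frame-caption pair. Both the input format and the use of raw
    co-occurrence counts as edge weights are assumptions for this sketch.
    """
    graph = defaultdict(Counter)
    for visual_ids, names in pairs:
        for visual_id in visual_ids:
            for name in names:
                graph[visual_id][name] += 1
    return graph


def assign_names(graph):
    """Naive baseline: give each visual entity its most frequent name.

    This majority vote is a stand-in, not the method from the article.
    """
    return {
        visual_id: counts.most_common(1)[0][0]
        for visual_id, counts in graph.items()
        if counts
    }


# Hypothetical toy data: each frame lists detected visual entities and
# the names mentioned in its caption.
pairs = [
    (["face_a"], ["Ross"]),
    (["face_a", "face_b"], ["Ross", "Rachel"]),
    (["face_b"], ["Rachel"]),
]

graph = build_bipartite_graph(pairs)
assignment = assign_names(graph)
print(assignment)
```

On this toy input each face co-occurs more often with one name than with the other, so the majority vote separates them. Real captions are far noisier, since a name can be mentioned while its entity is off screen, which is the kind of ambiguity the later stages described in the abstract are meant to address.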

Original language: English
Article number: 177
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 19
Issue number: 5s
State: Published - Jun 7 2023

Keywords

  • Entity discovery
  • multimodal video understanding
  • self-contained video recognition
