Learning a Joint Embedding Representation for Image Search using Self Supervised Means

Research output: Contribution to conference › Paper › peer-review

Abstract

Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the desired image (text-to-image search). Image-to-image search is generally implemented as a nearest-neighbour search in a dense image embedding space, where the embedding is derived from neural networks pre-trained on a large image corpus such as ImageNet. Text-to-image search can be implemented via traditional (TF-IDF or BM25 based) text search against image captions or image tags.
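
As a rough illustration of the image-to-image path described above, the sketch below derives unit-normalised embeddings from an ImageNet-pre-trained backbone and ranks a small corpus by cosine similarity. The backbone choice (ResNet-50), file names, and corpus are assumptions for illustration, not details from the presentation.

```python
# Sketch: image-to-image search as nearest-neighbour lookup in a dense
# embedding space from an ImageNet-pre-trained network (assumed backbone).
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# ImageNet-pre-trained backbone; drop the classification head to expose embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return a unit-normalised embedding for one image file."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = backbone(x)
    return F.normalize(v, dim=-1)

# Hypothetical corpus: rank by cosine similarity (vectors are normalised,
# so the dot product is the cosine score).
corpus_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
corpus = torch.cat([embed(p) for p in corpus_paths])
query = embed("query.jpg")
scores = (corpus @ query.T).squeeze(1)
top = scores.topk(k=2)
print([corpus_paths[i] for i in top.indices.tolist()])
```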

In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text-based (BM25), vector-based (ANN), and hybrid text-to-image search, as well as image-to-image search.
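
The contrastive fine-tuning step can be sketched roughly as follows with the Hugging Face CLIPModel; the checkpoint name, batch construction, and learning rate are assumptions rather than the presenters' actual setup. CLIPModel's return_loss option computes the symmetric contrastive (InfoNCE) loss over the in-batch image/text similarity matrix, which is what pulls matching image-caption pairs together and pushes mismatched pairs apart.

```python
# Sketch: contrastive fine-tuning of CLIP on image-caption pairs
# (hypothetical batch source and hyperparameters).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def training_step(images, captions):
    """One contrastive step over a batch of (PIL image, caption) pairs."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    # return_loss=True makes CLIPModel compute the symmetric contrastive
    # loss over all image/text pairings in the batch.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```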
Original language: American English
State: Published - Apr 27 2021

