UCSC-SOE-21-12: Self-Supervision for Scene Graph Embeddings

Brigit Schroeder, Adam Smith, Subarna Tripathi
11/18/2021 02:33 PM
Computer Science and Engineering
Scene graph embeddings are used in applications such as image retrieval, image generation, and image captioning. Many of the models for these tasks are trained on large datasets such as Visual Genome, but collecting these human-annotated datasets is costly and onerous. We seek to improve scene graph embedding representation learning by leveraging already available data (e.g., the scene graphs themselves) through self-supervision. In self-supervised learning, models are trained on pretext tasks that do not depend on manual labels and use only the existing data. However, self-supervised learning is largely unexplored in the area of image scene graphs. In this work, starting from a baseline scene graph embedding model trained on the pretext task of layout prediction, we propose several additional self-supervised pretext tasks. The impact of these additions is evaluated on a downstream retrieval task that was originally associated with the baseline model. Experimentally, we demonstrate that adding each task, both individually and cumulatively, improves on the retrieval performance of the baseline model, resulting in near saturation when all are combined.