Oliver Groth has won the Carl Zeiss Diplompreis for his Diplom thesis on the topic “Visual Phrase Grounding with Variable Supervision in an EM-RNN Framework”, written at CVLD and Stanford Vision Lab.
The localization of natural language phrases in images (also known as grounding) is an emerging task in computer vision with important applications in image description, scene understanding, and human-machine interaction. This work investigates the relation between grounding and the closely related captioning task and leverages this relationship to build a grounding model based on a recurrent neural network (RNN) architecture. Additionally, it extends traditional supervised RNN training with an expectation-maximization (EM) framework, which also accommodates unsupervised and semi-supervised learning scenarios. The proposed grounding model is trained and evaluated on the publicly available Flickr30k Entities dataset, and the viability of the semi-supervised approach is shown experimentally on the visual grounding task. The findings of this thesis are a promising step towards unified captioning and grounding models that are trainable with little supervision on paired image-text corpora.