In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.
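To make the idea of structured matching concrete, here is a minimal sketch of the discrete problem on a toy instance. It is not the formulation from the paper: the embeddings are hand-written vectors rather than neural network outputs, the only relation modeled is a hypothetical "left of" constraint with a fixed bonus, and the problem is solved by exhaustive search instead of a linear programming relaxation. The names `structured_match`, `left_of_pairs`, `region_x`, and `bonus` are all assumptions made for illustration.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors (used as the matching weight)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def structured_match(phrase_vecs, region_vecs, region_x, left_of_pairs, bonus=0.5):
    """Assign each phrase to a distinct region by exhaustive search.

    The score is the sum of phrase-region similarities plus a bonus for every
    phrase pair (a, b) in left_of_pairs whose assigned regions are ordered
    left to right. The relation term and its weight are assumptions for this
    sketch, not the objective used in the paper.
    """
    weights = [[cosine(p, r) for r in region_vecs] for p in phrase_vecs]
    best_score, best_assign = float("-inf"), None
    for assign in itertools.permutations(range(len(region_vecs)), len(phrase_vecs)):
        # Unary term: how well each phrase matches its assigned region.
        score = sum(weights[i][j] for i, j in enumerate(assign))
        # Pairwise term: reward assignments that agree with the text relation.
        for a, b in left_of_pairs:
            if region_x[assign[a]] < region_x[assign[b]]:
                score += bonus
        if score > best_score:
            best_score, best_assign = score, assign
    return best_assign, best_score


# Toy data: phrase 0 ("a man") is stated to be left of phrase 1 ("a dog").
phrase_vecs = [[1.0, 0.0], [0.0, 1.0]]
region_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]  # regions 0 and 2 both resemble phrase 0
region_x = [10.0, 50.0, 90.0]                       # horizontal centers of the regions
left_of_pairs = [(0, 1)]

assignment, score = structured_match(phrase_vecs, region_vecs, region_x, left_of_pairs)
print(assignment)
```

In this toy instance, similarity alone would pick region 2 for phrase 0, since it is the closest vector. The relation term shifts the choice to region 0, which lies to the left of the region matched to phrase 1. Exhaustive search is only feasible for a handful of phrases and regions, which is one motivation for relaxing the discrete problem.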
Textbook:
Szeliski, R., 2010. Computer vision: algorithms and applications. Springer Science & Business Media.
Papers:
Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C. and Berg, T.L., 2011. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J. and Lazebnik, S., 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2641-2649).
Wang, L., Li, Y. and Lazebnik, S., 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5005-5013).
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T. and Schiele, B., 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (pp. 817-834). Springer, Cham.
Karpathy, A., Joulin, A. and Fei-Fei, L., 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (pp. 1889-1897).
Karpathy, A. and Fei-Fei, L., 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K. and Darrell, T., 2016. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4555-4564).
Klein, B., Lev, G., Sadeh, G. and Wolf, L., 2014. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.
Vinyals, O., Toshev, A., Bengio, S. and Erhan, D., 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3156-3164).
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T. and Rohrbach, M., 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.