Alejandro Newell will present his FPO "Learning to Solve Structured Vision Problems" on Tuesday, March 15, 2022 at 10AM in Friend 125 and on Zoom (https://princeton.zoom.us/j/7228079201).

The members of Alejandro's committee are as follows:
Examiners: Jia Deng (advisor), Adam Finkelstein, Felix Heide
Readers: Olga Russakovsky, Szymon Rusinkiewicz

A copy of his thesis is available upon request. Please email gradinfo@cs.princeton.edu if you would like a copy of the thesis.

Everyone is invited to attend his talk.

Abstract follows below:

We want computer vision models to understand the rich world captured in images and video. This requires not just recognizing objects, but identifying their relationships and interactions. Combining contributions in both neural architecture and loss design, we expand the capacity of convolutional networks to express such interactions and solve a broad range of structured computer vision tasks.

We first introduce a convolutional network architecture for dense per-pixel prediction. We show how intermediate supervision and repeated processing across feature scales lead to better network performance, referring to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling during inference. We benchmark on the task of human pose estimation, achieving state-of-the-art performance.

Next, we introduce associative embedding, a method for supervising networks to solve detection and grouping tasks. A number of problems can be framed in this manner, including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually these problems are addressed with multi-stage pipelines; instead, we train a model to simultaneously output detections and group assignments. We can then extend the use of associative embeddings to define arbitrary graphs.
We demonstrate how to supervise embeddings such that a model both detects the objects in a scene and defines semantic relationships between pairs of objects.

Finally, we perform an investigation of self-supervision methods. Recent self-supervised losses rely on a learning signal similar to the loss we leverage in our associative embedding work, but it is unclear how useful these losses are for general-purpose visual feature pretraining. We investigate what factors play a role in the utility of such pretraining by evaluating self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. Our experiments highlight how self-supervision can be more or less useful depending on the amount of labeled data, the complexity of the data, and the target downstream task.

Together, the work in this thesis shows how to build and train better models while providing insights into what steps lead to the best performance across a wide variety of computer vision tasks.