Semantic uncertainty intervals for disentangled latent spaces

Swami Sankaranarayanan¹, Anastasios N. Angelopoulos², Stephen Bates², Yaniv Romano³, Phillip Isola¹
¹ MIT CSAIL, ² UC Berkeley, ³ Technion IIT, Israel

# Abstract

Meaningful uncertainty quantification in computer vision requires reasoning about semantic information, such as the hair color of the person in a photo or the location of a car on the street. To this end, recent breakthroughs in generative modeling allow us to represent semantic information in disentangled latent spaces, but providing uncertainties on the semantic latent variables has remained challenging. In this work, we provide principled uncertainty intervals that are guaranteed to contain the true semantic factors for any underlying generative model. The method (1) uses quantile regression to output a heuristic uncertainty interval for each element in the latent space and (2) calibrates these intervals so that they contain the true value of the latent for a new, unseen input. The endpoints of these calibrated intervals can then be propagated through the generator to produce interpretable uncertainty visualizations for each semantic factor. This technique reliably communicates semantically meaningful, principled, and instance-adaptive uncertainty in inverse problems like image super-resolution and image completion.

# Summary

Our work addresses the problem of directly giving uncertainty estimates on semantically meaningful image properties. We make progress on this problem by bringing techniques from quantile regression and distribution-free uncertainty quantification together with a disentangled latent space learned by a generative adversarial network (GAN). We call the coordinates of this latent space semantic factors, as each controls one meaningful aspect of the image, like age or hair color. Our method takes a corrupted image as input and predicts each semantic factor along with an uncertainty interval that is guaranteed to contain the true semantic factor. The intervals are wide when the model is unsure and narrow when it is confident. By propagating these intervals through the GAN coordinate-wise, we can visualize uncertainty directly in image space (shown left) without resorting to per-pixel intervals. The result of our procedure is a rich form of uncertainty quantification directly on the estimates of semantic properties of the image.
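The coordinate-wise propagation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `propagate_interval` and `toy_generator` are hypothetical names, and a small linear map stands in for the GAN generator.

```python
import numpy as np


def propagate_interval(generator, z_hat, z_lo, z_hi, dim):
    """Render the lower, point, and upper images for one semantic factor.

    Only coordinate `dim` is moved to its interval endpoints; every other
    coordinate stays at the point prediction z_hat.
    """
    z_lower = z_hat.copy()
    z_upper = z_hat.copy()
    z_lower[dim] = z_lo[dim]
    z_upper[dim] = z_hi[dim]
    return generator(z_lower), generator(z_hat), generator(z_upper)


def toy_generator(z):
    """Hypothetical stand-in for a GAN: a fixed linear map to 4 'pixels'."""
    weights = np.arange(12, dtype=float).reshape(4, 3)
    return weights @ z


# Point prediction and per-coordinate interval endpoints (made-up values).
z_hat = np.array([0.0, 1.0, -1.0])
half_width = np.array([0.1, 0.5, 0.2])
z_lo = z_hat - half_width
z_hi = z_hat + half_width

# Visualize the uncertainty of semantic factor 1 only.
img_lo, img_hat, img_hi = propagate_interval(
    toy_generator, z_hat, z_lo, z_hi, dim=1
)
```

With a real generator, `img_lo` and `img_hi` would be the two images that bracket the point prediction along a single semantic factor, such as hair color.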

# Method

Generative architectures such as StyleGAN and its variants have been shown to contain a disentangled latent space that factors into interpretable features such as hair color and expression. We use this property to obtain semantically meaningful uncertainty in the latent space of a generative model. Since the latent space is disentangled, the uncertainty intervals naturally factor into different interpretable dimensions. Thus, given an image of a face, for instance, we can obtain meaningful uncertainty estimates separately for different aspects of the face, such as hair shape or color. Our training consists of three core steps:

• Sample random latent codes $Z$ and the corresponding generated images $Y$ from the generative model. We apply a corruption model (image super-resolution / image masking) to $Y$ to generate our training data: $(X,Z)$ pairs, where $X$ is the set of corrupted images.
• Train an encoder network to predict the latent code along with a lower and upper quantile for each dimension of the latent code. The uncertainty interval around each latent dimension follows directly from the lower and upper quantile predictions.
• Calibrate the quantile intervals to provide coverage guarantees. The quantile intervals naturally provide a notion of uncertainty around each latent dimension. However, there is no guarantee of coverage, i.e. it is unclear whether the underlying true latent factor is contained in the predicted interval. The calibration procedure provides this coverage guarantee.
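The second and third steps can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation. The source does not specify how the intervals are calibrated, so three things here are assumptions: intervals are rescaled by a single factor `lam`, the factor is found by grid search, and a Hoeffding margin is added to the empirical risk. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss used to train a predictor of quantile level q."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))


def scaled_interval(z_hat, z_lo, z_hi, lam):
    """Stretch the heuristic interval around the point prediction by lam."""
    # Clip at zero so crossed quantiles cannot produce inverted intervals.
    lower = z_hat - lam * np.maximum(z_hat - z_lo, 0.0)
    upper = z_hat + lam * np.maximum(z_hi - z_hat, 0.0)
    return lower, upper


def miscoverage(z_true, lower, upper):
    """Fraction of latent coordinates that fall outside their interval."""
    return np.mean((z_true < lower) | (z_true > upper))


def calibrate(z_true, z_hat, z_lo, z_hi, alpha=0.1, delta=0.1, lambdas=None):
    """Smallest lam on the grid whose risk upper bound is at most alpha.

    Assumption: the bound is the empirical miscoverage on the calibration
    set plus a Hoeffding margin at confidence level 1 - delta.
    """
    if lambdas is None:
        lambdas = np.linspace(0.0, 10.0, 201)
    n = z_true.shape[0]  # number of calibration examples
    margin = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    for lam in lambdas:  # ascending, so the first hit is the smallest
        lower, upper = scaled_interval(z_hat, z_lo, z_hi, lam)
        if miscoverage(z_true, lower, upper) + margin <= alpha:
            return lam
    raise ValueError("no scaling factor on the grid reaches the target risk")


# Synthetic calibration set: the heuristic intervals are deliberately too narrow.
rng = np.random.default_rng(0)
n, d = 2000, 8
z_true = rng.normal(size=(n, d))
z_hat = z_true + 0.5 * rng.normal(size=(n, d))
z_lo, z_hi = z_hat - 0.3, z_hat + 0.3

risk_before = miscoverage(z_true, z_lo, z_hi)
lam = calibrate(z_true, z_hat, z_lo, z_hi, alpha=0.1, delta=0.1)
cal_lo, cal_hi = scaled_interval(z_hat, z_lo, z_hi, lam)
risk_after = miscoverage(z_true, cal_lo, cal_hi)
```

In this toy setup the uncalibrated intervals miss the true latent far more often than the 10% target, and calibration widens them until the bound on miscoverage falls below `alpha`. At test time, the same `lam` would be applied to the encoder's predictions for unseen inputs.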

# Producing semantic uncertainties

The image on the left shows an uncertainty prediction for a sample drawn from a generative model trained on the CLEVR dataset. The uncertainty factors naturally along the latent factors; we visualize shape and color here. The lower and upper quantile images yield similar colors, which is expected because color remains visible in the blurry input. The model predicts that both a cylinder and a sphere would be consistent with this blurry input. The calibrated quantiles cover the ground-truth color value, while the uncalibrated ones do not.

# Adaptivity to varying input corruption

[Figure panels: Image super-resolution; Image inpainting]

For the image super-resolution case, the corruption intensity varies within each set: the input image in the top row is not corrupted, while the input in the bottom row is under-sampled by 16x. In each set, the most diverse predictions appear in the bottom row, where the input is corrupted the most.

For the image inpainting case, a random mask is applied to the same input image in each row. When there is no mask (first row), both quantiles are extremely close to the pointwise prediction. As the masked region grows, the predicted intervals expand, as indicated by the variability in the quantile predictions.

# Quantitative results: Set sizes and Calibration

[Figure panels: Calibrated set sizes adapt to problem difficulty; Calibration procedure guarantees desired coverage]

# Acknowledgements

SS and PI acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.