
Dataset bias in manually collected datasets is a known problem in computer vision. Models trained on such datasets tend to underperform in the field. In safety-critical applications such as autonomous driving, these biases can lead to catastrophic errors, jeopardizing the safety of users and their surroundings. Being able to unpuzzle the bias within a given dataset, and across datasets, is an essential tool for building responsible AI. In this paper, we present deepPIC: deep Perceptual Image Clustering, a novel hierarchical clustering pipeline that leverages deep perceptual features to visualize and understand bias in unstructured and unlabeled datasets. It does so by effectively highlighting nuanced subcategories of information embedded within the data (such as multiple but repetitive shadow types) that are typically hard and/or expensive to annotate. Through experiments on a variety of image datasets, both open-source and internal, we demonstrate the effectiveness of deepPIC in (i) singling out errors in metadata from open-source datasets such as BDD100K; (ii) automatic nuanced metadata annotation; (iii) mining for edge cases; (iv) visualizing inherent bias both within and across multiple datasets; and (v) capturing synthetic data limitations; thus highlighting the wide range of applications this pipeline supports.
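The two-stage structure described above (a coarse stage-1 clustering C1, then per-cluster stage-2 sub-clustering) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic feature vectors stand in for deep perceptual features, which in practice would come from a pretrained CNN embedding, and KMeans is one plausible choice of clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for deep perceptual features; in practice these would be taken
# from a pretrained CNN (e.g. the penultimate layer of an ImageNet model).
# Two well-separated modes mimic day vs. night images.
day = rng.normal(loc=0.0, scale=1.0, size=(500, 128))
night = rng.normal(loc=6.0, scale=1.0, size=(500, 128))
features = np.vstack([day, night])

# Stage 1: coarse clustering (C1) separates the dominant appearance modes.
stage1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Stage 2: re-cluster each stage-1 cluster to surface nuanced subcategories
# (e.g. traffic density, shadow type) within a single appearance mode.
subclusters = {}
for c in range(stage1.n_clusters):
    members = features[stage1.labels_ == c]
    stage2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(members)
    subclusters[c] = stage2.labels_

print({c: len(set(labels)) for c, labels in subclusters.items()})
```

With real images, the stage-1 split would correspond to coarse conditions such as day vs. night (Fig. 3), while the stage-2 sub-clusters expose finer structure such as traffic density (Figs. 4 and a).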
All the clustering results included in the paper have been uploaded here with image thumbnails. We recommend zooming in using a touchpad for the best effect. Desktop users without a touchpad can zoom in by clicking the image when the zoom-in cursor appears, then click and hold and drag to pan across different parts of the image.
Figure 3 (in paper): deepPIC stage 1 output (C1) for 5000 BDD100K images. Note automatic segregation of day and night images into two distinct clusters.
Figure a: deepPIC stage 2 output for images assigned to C11 in Fig. 3. Note the shift from high-traffic-density city scenes at the bottom right to the relatively low-traffic-density highway scenes at the top right.
Figure 4 (in paper): deepPIC stage 2 output for images assigned to C10 in Fig. 3. Note the shift from low-traffic-density highway and residential scenes on the left to the relatively high-traffic-density city scenes on the right.
Figure 5 (in paper): deepPIC stage 1 output (C1) for 2000 images from an internal parking dataset. Note automatic segregation of indoor garage images (right cluster) from those captured in outdoor parking lots (left 3 clusters).
Figure 6 (in paper): Stage 2 deepPIC output for images assigned to C11 in Fig. 5. Sub-clusters 0, 1 and 2 on the left have overcast skies. In contrast, sub-clusters 3 and 4 on the right have sunny skies and a strong ego-vehicle shadow.
Figure 8 (in paper): Stage 1 clustering output from deepPIC applied to 10000 images from 5 different datasets: ApolloScape, BDD100K, CULane, Mapillary, and TuSimple.
Figure 9 (in paper): Visualizing the sim-to-real gap using C1, i.e. the stage 1 output from deepPIC applied to an equally split mix of 10k real, simulated and sim-to-real GAN translated parking images. The gradual progression of realism across the three sets of data confirms the efficacy of the data augmentation steps.
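One way to complement a visual inspection like Fig. 9 with a number is to cluster the mixed pool and measure each cluster's composition by data source: a large sim-to-real gap shows up as clusters dominated by a single source rather than a uniform mix. The sketch below is illustrative only, assuming synthetic stand-in features in which the GAN-translated set sits between the simulated and real sets, mimicking the gradual progression of realism; in practice the features would come from the same deep perceptual embedding used by deepPIC.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-in features for the three sources (assumption: GAN-translated
# images lie between simulated and real in feature space).
real = rng.normal(0.0, 1.0, size=(300, 64))
gan = rng.normal(2.0, 1.0, size=(300, 64))
sim = rng.normal(4.0, 1.0, size=(300, 64))
features = np.vstack([real, gan, sim])
source = np.array(["real"] * 300 + ["gan"] * 300 + ["sim"] * 300)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(features)

# Report the per-source composition of each cluster. Highly "pure"
# clusters indicate the sources remain separable, i.e. a residual gap.
for c in range(3):
    members = source[km.labels_ == c]
    counts = {s: int((members == s).sum()) for s in ("real", "gan", "sim")}
    print(c, counts)
```

If the GAN translation were closing the gap completely, real and GAN-translated images would co-occur in the same clusters instead of forming pure ones.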