The following subsections provide an extensive ablation study, varying both the makeup of the dataset and the CNN backbone used in our model.
Images-per-instance
It is important to distinguish the model’s instance classification accuracy for objects with many images from that for objects with fewer images, because as the number of images-per-instance increases, so does the amount of visual information available for the model to learn from. Given the advantages that more images per instance can afford, in addition to experimenting with the full dataset of 24,502 images (obtained by keeping only instances with 3 or more images, as described in the “Dataset” section), we also experiment with further restricting the minimum number of images per instance to 4, 5, and 6. As shown by the general increase in accuracy across the image-per-instance subsets in Fig. 4, instances with a higher number of images are classified more accurately, as intuitively expected. We also find that excluding instances with fewer images does not substantially improve the accuracy on the higher image-count subsets, demonstrated by the relatively close scores of each subset (the most inclusive training setup even scores highest on the subset with 9 images-per-instance). This indicates there is little risk to the overall performance of an object detector in including objects with lower image counts: the lower overall accuracy stems from the naturally poorer classification of less represented objects, not from a degradation in performance on the higher-count subsets. Although performance generally increases as a dataset becomes smaller (and thus more easily solved), the similar performance of the four training setups in Fig. 4 implies that even our strictest limitation of 6 images per instance does not reduce the dataset size enough for this effect to manifest. Furthermore, we verify that the high accuracies of our models do not come from a small number of easy or ‘solved’ instances disproportionately carrying the overall accuracy, as demonstrated by the relatively small differences in accuracy between the subsets with image-per-instance counts of 8, 9, and 10+.
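For illustration, the sketch below shows one way such minimum image-per-instance splits can be constructed; the record format and function name are hypothetical placeholders rather than names taken from our code.

```python
# Minimal sketch: build the image-per-instance ablation splits by dropping
# instances that have fewer than `min_images` photographs.
from collections import defaultdict

def filter_by_min_images(records, min_images):
    """records: iterable of (image_path, instance_id) pairs.
    Keeps only instances with at least `min_images` images."""
    per_instance = defaultdict(list)
    for path, instance_id in records:
        per_instance[instance_id].append(path)
    kept = {i: paths for i, paths in per_instance.items() if len(paths) >= min_images}
    return [(p, i) for i, paths in kept.items() for p in paths]

# The four dataset scenarios used in Fig. 4 (thresholds 3, 4, 5, 6):
# splits = {m: filter_by_min_images(all_records, m) for m in (3, 4, 5, 6)}
```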
The accuracy of each subset by image-per-instance count for our best model (from Fig. 5) under each dataset makeup. For example, the red triangular markers indicate the object prediction accuracy of our highest-performing model trained on the dataset with a minimum image-per-instance cutoff of three, further split into subsets of objects with exactly 3/4/5… images-per-instance. The dashed red line represents the overall accuracy, i.e., a weighted average of the accuracies of each subset.
CNN model
As the CNN backbone is the most important part of the model design, we experiment with state-of-the-art EfficientNet [5] and ResNet Rescaled (ResNet-RS) [6] models, with further ablation on Inception-v3 [7] and Inception-v4 [8] models. We find that all EfficientNet models perform significantly better than the other models in each training setup (Fig. 5), and that the larger EfficientNet models consistently achieve higher accuracy than the smaller ones. Although a higher image-per-instance threshold yields higher performance, as previously discussed, we also see that model size does not significantly change the performance gap between the different image-per-instance thresholds. This is depicted by the consistent spacing between the polygons in Fig. 5 across the different types of model. Instead, the change in accuracy between image-per-instance thresholds differs noticeably across the three architectures, i.e., the increase in performance from images-per-instance ≥ 3 to ≥ 6 is ~19%, ~11%, and ~8% for the ResNet-RS, EfficientNet, and Inception architectures respectively.
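For reference, the backbone ablation can be set up along the following lines, shown here with the timm library; the use of timm and the instance count are illustrative assumptions rather than a description of our exact training code.

```python
# Sketch of the backbone ablation: instantiate each candidate CNN with an
# identical classification head sized to the number of object instances,
# so that only the feature extractor varies between runs.
import timm

NUM_INSTANCES = 1000  # placeholder: set to the number of instances in the split

backbones = [
    "efficientnet_b0", "efficientnet_b4", "efficientnet_b6",  # EfficientNet family
    "resnetrs50", "resnetrs101",                              # ResNet-RS family
    "inception_v3", "inception_v4",                           # Inception family
]

models = {name: timm.create_model(name, num_classes=NUM_INSTANCES) for name in backbones}
# Each model is then trained and evaluated identically on each dataset scenario.
```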

Performance of models with respect to their size, i.e., the number of parameters. The four image-per-instance dataset scenarios shown here are the same four scenarios as in Fig. 4.
High performance trade-offs
The strongest-performing single model is EfficientNet-b4/3/4/6 for the image-per-instance dataset thresholds ≥ 3/4/5/6 respectively. However, to maintain an adequately large batch size for stable training, the larger EfficientNet variants require significantly more computational resources during training (full details in Supplementary Table 1). Furthermore, in line with standard practice for CNN-based image classification models [24], we find that an ensemble of models pushes accuracy even higher. Table 3 shows that an ensemble of the 5 best EfficientNet models gives a ~1–2% increase in top-1 accuracy, which can be exploited provided one is willing to pay for it with additional computational resources (~2.5 GB of VRAM for inference on a single image). Even in scenarios where the predicted object is incorrect, we find that the correct answer is often still among the next few guesses, i.e., the top-3, top-5, and top-10 accuracies of our models are significantly higher than our top-1 scores. We see in Table 3 that (for the best-performing single model) the more relaxed ‘image-per-instance ≥ 3/4’ dataset scenarios yield ~10%, ~13%, and ~17% improvements for top-3, top-5, and top-10 accuracy respectively. The top-3, top-5, and top-10 accuracies show less relative improvement for the stricter image-per-instance ≥ 5/6 scenarios (+~8%, +~10%, +~12%), as their baseline top-1 accuracies are already higher than those of image-per-instance ≥ 3/4.
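As an illustration, the sketch below shows one straightforward way to compute ensembled top-k accuracies by averaging the per-model softmax probabilities; it assumes a list of trained PyTorch models and a test DataLoader yielding (images, labels) batches, and is a simplified sketch rather than our exact evaluation code.

```python
# Softmax-averaged ensemble prediction with top-k accuracy evaluation.
import torch

@torch.no_grad()
def ensemble_topk_accuracy(models, loader, ks=(1, 3, 5, 10), device="cuda"):
    correct = {k: 0 for k in ks}
    total = 0
    for m in models:
        m.eval().to(device)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # Ensemble prediction: mean of each model's class probabilities.
        probs = torch.stack([m(images).softmax(dim=1) for m in models]).mean(dim=0)
        topk = probs.topk(max(ks), dim=1).indices        # (batch, max_k) class indices
        hits = topk.eq(labels.unsqueeze(1))              # (batch, max_k) boolean matches
        for k in ks:
            correct[k] += hits[:, :k].any(dim=1).sum().item()
        total += labels.size(0)
    return {k: correct[k] / total for k in ks}
```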
Subcategories of objects
Although the central question of this paper concerns instances, we explore how the instance classification accuracy differs across the subcategories of the dataset. We see in Table 4 that the Oriental and Egyptian subcategories score consistently above the overall average. Castle objects score significantly below average (−8% to −32%), and Fulling Mill objects score between −8% and +0.33%. The biggest variations in subcategory accuracy occur in the more relaxed image-per-instance ≥ 3/4 scenarios (+6% to −29%), whereas the more restrictive and generally higher-performing image-per-instance ≥ 6 scenario has much less variation overall (+1% to −8%). We note that the smaller subcategories experience the most substantial drop in accuracy, and that this further coincides with the average images-per-instance of each subcategory (calculated from Table 2b): Oriental ≈ 5.84, Egyptian ≈ 5.67, Fulling Mill ≈ 4.71, Castle ≈ 4.43. However, we cannot conclude that the larger size of a subcategory causes the increased performance, as the Egyptian subcategory (~18.93% of instances) scores higher than the much larger Oriental subset (~69.09% of instances). Conversely, we also cannot conclude that the relatively small size of the Fulling Mill (~5.94%) and Castle (~3.80%) subcategories causes their reduced performance relative to the overall accuracy, because the accuracy of these two smaller subcategories approaches the overall accuracy in the higher image-per-instance dataset scenarios. We instead hypothesize that instances of these subcategories are not as easily represented with fewer images-per-instance. To gauge the differences in image information between the subcategories, we apply t-SNE dimension reduction [25] to feature vectors extracted from the penultimate layer of the CNN in our best models for each image in the test set. This generates a 2D point for each image, which we can plot to observe any clusters the t-SNE reduction may have produced. We see from Fig. 6 that the resulting 2-dimensional t-SNE plot does not strongly cluster the images by subcategory, as the four colors (representing the different subcategories) are relatively evenly distributed. Instead, the points appear clustered into a large number of very small neighborhoods irrespective of their subcategory. This is unsurprising, as the CNN has been trained to distinguish images by instance rather than by subcategory. It is evidence that our model is not relying on features unique to each subcategory (e.g., Oriental) and is instead primarily using the distinctive features of each object instance, as intended. See Supplementary Figs. 2 and 3 in Section D of the supplementary materials for PCA and UMAP dimensionality reduction respectively.
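For reference, the following is a minimal sketch of this visualization step using scikit-learn and matplotlib; the variable names (`features` for the (N, D) array of penultimate-layer activations, `subcats` for the per-image subcategory labels) are placeholders rather than names from our code.

```python
# t-SNE projection of penultimate-layer CNN features, colored by subcategory.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, subcats):
    # Reduce the D-dimensional feature vectors to 2D points.
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    subcats = np.asarray(subcats)
    for name in np.unique(subcats):
        mask = subcats == name
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.title("t-SNE of penultimate-layer features by subcategory")
    plt.show()
```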

t-SNE dimension reduction [25] applied to the features generated from each image of the dataset, extracted from the penultimate layer of the CNNs used in our experiments.
Visualizing predictions with saliency maps
We use saliency maps [26] to see which regions of the input image were most influential in the network’s decision, allowing us to estimate how our neural network predicts which instance it believes an image belongs to. Given an image and the class a model has predicted for it, we can track the origin of the signals propagating through the network that led to that classification, i.e., we can highlight the image regions that most influenced the model’s instance classification choice. Figure 7 shows saliency maps generated by our best single model (EfficientNet-b6 on images-per-instance ≥ 6) overlaid on the original image for clarity. A higher-intensity (darker red) saliency indicates that the pixels were highly influential in the model’s decision. Our models do not demonstrate an over-reliance on any one feature in their predictions: Fig. 7a shows examples where the boundary of the object, i.e., its shape, led to correct classification, whereas Fig. 7b shows examples where the finer details on the surface of the objects are most salient to correct classification. We did not find any single salient feature that correlates with incorrect classifications. However, incorrectly classified objects often exhibit a more scattered saliency, as in Fig. 7c. Such saliency maps still appear to attend somewhat to the shape and details of the objects, although to a much lesser degree than their correctly classified counterparts in Fig. 7a, b. For example, the objects in Fig. 7c show regions of saliency spread thinly across the background, yet some saliency still follows the outline of the objects (middle and right objects) or their details (seen on the base of the leftmost object in Fig. 7c). This behavior is typical of a less confident prediction, where the model is still aware of the features of the object but is unable to exploit them confidently in classification.
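The sketch below illustrates one common gradient-based formulation of such a saliency map in PyTorch; it is a simplified illustration under that assumption, and the exact saliency technique of [26] that we use may differ in detail.

```python
# Vanilla gradient saliency: how strongly each input pixel influences the
# predicted class score. Assumes a trained PyTorch model and a preprocessed
# image tensor of shape (1, 3, H, W).
import torch

def saliency_map(model, image):
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    predicted = int(logits.argmax(dim=1))   # the instance the model believes it sees
    logits[0, predicted].backward()         # gradient of that class score w.r.t. the pixels
    # Pixel importance = largest absolute gradient across the three color channels.
    sal = image.grad.detach().abs().max(dim=1).values[0]
    return sal / (sal.max() + 1e-12)        # normalize to [0, 1] before overlaying
```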

Saliency maps [26] generated from our best model on images-per-instance ≥ 6 (EfficientNet-b6) to visualize which regions of the image are most influential in choosing the instance. The saliency map is overlaid on the original image for clarity. Darker red regions indicate a higher intensity score.