Visualizing Object Detections

Exciting new ways to troubleshoot and understand your object detection models

This is a guest post written by Eric Hofesmann


Recent years have seen a spike in computer vision (CV) applications, namely self-driving cars, robotics, medical imaging, and many others. One CV task that links many of these applications is object detection. The purpose of object detection is to identify what certain things are in an image and where they are. To do that, you need to train a model that takes an image as input and returns a set of boxes showing the locations and types of the objects it contains.

Object detection is generally more complex than image classification (where an image is given as input and a single label is applied to the image as a whole). The addition of “locality”, identifying not just what is in the image but also where it is, requires significantly more, and more richly structured, ground truth annotations and model predictions per image. This complexity can make it increasingly difficult for data scientists to explore their model outputs and gain insights into the weak points of their models.
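To make this concrete, a classification label is a single value per image, while COCO-style detection annotations are a list of objects, each carrying a class id and a `[top-left-x, top-left-y, width, height]` bounding box. All values below are made up for illustration:

```python
# A classification label: one value for the whole image
classification_label = "car"

# COCO-style detection annotations: one entry per object, each with a
# class id and a [top-left-x, top-left-y, width, height] bounding box
# (all values here are illustrative, not from a real dataset)
detection_annotations = [
    {"category_id": 3, "bbox": [120.0, 45.5, 210.0, 160.0], "iscrowd": 0},
    {"category_id": 1, "bbox": [400.0, 80.0, 55.0, 130.0], "iscrowd": 0},
]

print(len(detection_annotations))  # number of annotated objects -> 2
```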

In order to improve the performance of a model, you first need to analyze the data to see where it is failing, so that you can retrain it with a new training scheme or an augmented dataset that addresses the issues. This is generally easy for classification models: you just separate out the misclassified images and look at them. For object detection, however, you need to regenerate images with hundreds of bounding boxes across dozens of classes drawn on them in order to analyze your data properly. This is often done through painstaking custom scripting that is fairly rigid once the images are generated; thresholding or removing certain boxes would require you to regenerate the entire set of images. Luckily, there are some tools out there that can help alleviate this pain.

FiftyOne is a new tool developed by Voxel51 to tackle this challenge. It allows users to easily load, filter, and explore entire image datasets with ground truth and predicted labels through a fast and responsive GUI. This blog will showcase previous tools used to visualize bounding boxes and then demonstrate how FiftyOne extends this functionality for dataset-wide analysis and hands-on model evaluation.

Standard bounding box visualization methods

In the current state of the CV community, there is a lack of bounding box visualization tools. As a result, most CV/ML engineers are left to roll their own custom solutions from scratch. Most of these efforts wind up with the same basic functionality: static boxes with labels and scores. However, this method does not scale to the size of modern image datasets, nor is it easy to write the numerous scripts required to modify the drawn boxes as you explore different aspects of your data. There are some libraries that assist in drawing bounding boxes on images and modifying those boxes afterwards. Two notable examples are TensorFlow's visualization utilities and Weights & Biases.

TensorFlow Visualization Utilities

TensorFlow provides a utility in its object detection library that contains functions to load images and draw bounding boxes on them using the Python Imaging Library (PIL). If you have a single image that you want to visualize, and your data is already in TensorFlow, then this utility will be helpful. It lets you avoid writing custom scripts to draw bounding boxes and provides some basic customization of what is displayed, like colors and a list of strings to print on the boxes. While this utility can be useful for quickly drawing bounding boxes on images that are already in memory, it does not allow you to change the boxes you are looking at or easily choose a new image to annotate without significant scripting on the user's end.

Image 1: 000000174482.jpg with ground truth from the MSCOCO 2017 validation set visualized with TensorFlow visualization utils.

The code snippet below demonstrates that it is relatively easy to draw bounding boxes on an image, assuming you have already selected the images you want and formatted the detections correctly.

import os
import random

from PIL import Image
from utils import visualization_utils as vis_utils

for img_name, annots in image_annotation_pairs:
    img_path = os.path.join(data_path, img_name)
    img = Image.open(img_path)
    for annot in annots:
        tlx, tly, w, h = annot["bbox"]
        cat = annot["category_id"]
        caption = categories[cat]
        score = str(random.random())[:4]
        vis_utils.draw_bounding_box_on_image(img, tly, tlx, tly + h, tlx + w,
                display_str_list=[caption, score],
                use_normalized_coordinates=False)
    img.save(os.path.join(output_path, img_name))

Weights & Biases

The machine learning developer tool Weights & Biases also provides bounding box visualization functionality. It allows you to load your images and detections in a specified format and visualize them in its dashboard. Where this diverges from TensorFlow's visualization utils is that the dashboard provides useful controls for filtering the bounding boxes of a small set of images by any provided scalar metrics. Even though this tool is mostly designed for visualizing a few images at a time, the ability to choose which boxes are drawn in real time can save the user a lot of scripting to regenerate images.

Image 2: 000000397133.jpg with ground truth from the MSCOCO 2017 validation set visualized with Weights & Biases.

The following code was used to generate the above example. In order to add images and detections to Weights & Biases, you have to parse your data and convert it into their format, though it is easy to use and pretty lightweight.

import json
import os
import random

import wandb

wandb.init(project="detection-visualization")

labels_file = "/path/to/coco/annotations/instances_val2017.json"
with open(labels_file) as f:
    labels = json.load(f)

categories = {c["id"]: c["name"] for c in labels["categories"]}

img_metadata = [(i["id"], i["file_name"]) for i in labels["images"][:10]]

data_path = "/path/to/coco/data/val2017/"
imgs = []
for img_id, img_name in img_metadata:
    wandb_boxes = {"predictions": {"box_data": [], "class_labels": categories}}
    annots = [a for a in labels["annotations"] if a["image_id"] == img_id]
    for annot in annots:
        tlx, tly, w, h = annot["bbox"]
        cat = annot["category_id"]
        curr_box = {}
        curr_box["position"] = {"minX": tlx, "maxX": tlx + w, "minY": tly, "maxY": tly + h}
        curr_box["class_id"] = cat
        curr_box["box_caption"] = categories[cat]
        curr_box["domain"] = "pixel"
        curr_box["scores"] = {"score": random.random()}
        wandb_boxes["predictions"]["box_data"].append(curr_box)

    img_path = os.path.join(data_path, img_name)
    imgs.append(wandb.Image(img_path, boxes=wandb_boxes))

wandb.log({"Example": imgs})

Other tools

Most tools that provide bounding box visualization functionality are designed with annotation in mind. Some examples are the Computer Vision Annotation Tool (CVAT), Labelbox, LabelImg, Scalabel, LabelMe, and many others. They are focused on allowing users to easily draw and modify bounding boxes. They are not designed, however, for loading thousands of images with both ground truth and predicted bounding boxes for the purpose of evaluation. This blog focuses on bounding box visualization tools that are designed with model analysis in mind, though they are few and far between. To this end, TensorFlow's visualization utilities and Weights & Biases are more in line with the goal of model and dataset analysis for the machine learning engineer.

What’s missing?

Both of these tools provide fairly low-level interfaces that help visualize bounding boxes for a handful of preselected images. However, if you want to understand where your object detection model is failing, you will need to look through orders of magnitude more images. Additionally, you most likely need to filter your dataset many times and in many ways. You need to slice and dice it and look at specific examples, like first seeing how well you are detecting trucks and then refiltering to see how well you are detecting cars with large bounding box areas. In this case, the onus is on you to write scripts that find notable images to visualize, and then more scripts to load those detections into these tools. This is where FiftyOne comes in. The data-first mindset behind FiftyOne resulted in a tool that lets you get to know your data like never before.


FiftyOne is a powerful machine learning tool developed by Voxel51 to assist machine learning scientists and data engineers in exploring large image datasets. FiftyOne allows you to easily load, filter, and search through your data and labels. We are going to demonstrate how to evaluate the model Faster-RCNN on the MSCOCO object detection dataset.

Image 3: MSCOCO validation images displayed in FiftyOne.


Using a virtual environment is recommended if following this example. Steps to create a virtual environment are provided in the FiftyOne docs. The following code snippets require the installation of torch, torchvision, and FiftyOne.

FiftyOne is currently in beta; you can sign up today to receive a personal access token to use it. Detailed instructions can be found in the FiftyOne docs. The basic commands are as follows:

pip install --upgrade pip setuptools wheel
pip install --index fiftyone
pip install ipython torch torchvision


FiftyOne provides easy access to PyTorch and TensorFlow dataset zoos through the `fiftyone.zoo` package. The validation split of COCO can be loaded in two lines:

import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("coco-2017", "validation")

We can now use the FiftyOne App to explore the images and labels of the validation split of COCO by creating a new session.

import fiftyone as fo
session = fo.launch_app(dataset=dataset)
Image 4: MSCOCO validation images displayed in FiftyOne with ground truth detections shown.

Add and evaluate Faster-RCNN detections

Faster-RCNN detections can be calculated and added to every sample of the dataset in a new field.

import json
import os

import torch
import torchvision
from PIL import Image
from torchvision.transforms import functional as func

# Load the pretrained PyTorch model
# Run the model on GPU if it is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(device)
model.eval()

# Load the classes to map the model predictions back to object categories
labels_path = os.path.expanduser("~/fiftyone/coco-2017/validation/labels.json")
with open(labels_path, "r") as labels_file:
    classes = json.load(labels_file)["classes"]

# Run inference on each Sample and add the detections to our Dataset
with torch.no_grad():
    for sample in dataset:
        image = Image.open(sample.filepath)
        image = func.to_tensor(image).to(device)
        c, h, w = image.shape

        preds = model([image])[0]

        labels = preds["labels"].cpu().detach().numpy()
        scores = preds["scores"].cpu().detach().numpy()
        boxes = preds["boxes"].cpu().detach().numpy()

        detections = []
        for label, score, box in zip(labels, scores, boxes):
            # Compute relative bounding box coordinates in
            # [top-left-x, top-left-y, width, height] format
            x1, y1, x2, y2 = box
            rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]

            detections.append(
                fo.Detection(
                    label=classes[label],
                    bounding_box=rel_box,
                    confidence=float(score),
                )
            )

        sample["faster_rcnn"] = fo.Detections(detections=detections)
        sample.save()

Our detections can easily be thresholded by filtering the `confidence` attribute of our `faster_rcnn` detections field. We can then clone this filtered field for easy access.

from fiftyone import ViewField as F

faster_rcnn_75 = dataset.filter_detections(
    "faster_rcnn", F("confidence") > 0.75
)

dataset.clone_field(
    "faster_rcnn", "faster_rcnn_75", samples=faster_rcnn_75
)

FiftyOne supports the evaluation of loaded predictions and can automatically compute per-sample true positives, false positives, and false negatives following pycocotools evaluation.

import fiftyone.utils.cocoeval as fouc

fouc.evaluate_detections(dataset, "faster_rcnn_75", "ground_truth")
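
Under the hood, this style of evaluation matches predictions to ground truth at a series of IoU (intersection over union) thresholds from 0.5 through 0.95. As a standalone reference, not FiftyOne's internal code, a minimal IoU computation for `[top-left-x, top-left-y, width, height]` boxes can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over union of two [top-left-x, top-left-y, w, h] boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b

    # Width and height of the intersection rectangle (zero if disjoint)
    inter_w = max(0.0, min(ax1 + aw, bx1 + bw) - max(ax1, bx1))
    inter_h = max(0.0, min(ay1 + ah, by1 + bh) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [0, 0, 10, 10]))  # identical boxes -> 1.0
print(iou([0, 0, 10, 10], [5, 0, 10, 10]))  # half-overlapping boxes -> 1/3
```

A prediction counts as a true positive at a given threshold only if its IoU with a same-class ground truth box meets that threshold, which is why the counts below shrink as the threshold grows.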


The numbers of true positives, false positives, and false negatives at each IoU threshold are stored under `ground_truth_eval` in our prediction field `faster_rcnn_75`.

{'true_positives': {'0_5': 10,
  '0_55': 10,
  '0_6': 10,
  '0_65': 10,
  '0_7': 8,
  '0_75': 6,
  '0_8': 6,
  '0_85': 3,
  '0_8999999999999999': 2,
  '0_95': 1},
 'false_positives': {...},
 'false_negatives': {...}}
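
Given a per-sample dict in this format, simple aggregates can be computed directly. For example, per-sample precision at IoU 0.75 (the false positive counts below are illustrative placeholders, not results from the run above):

```python
# Per-sample evaluation results in the format shown above
# (counts are illustrative placeholders, not real results)
ground_truth_eval = {
    "true_positives": {"0_5": 10, "0_75": 6, "0_95": 1},
    "false_positives": {"0_5": 2, "0_75": 6, "0_95": 11},
}

tp = ground_truth_eval["true_positives"]["0_75"]
fp = ground_truth_eval["false_positives"]["0_75"]
precision = tp / (tp + fp)
print(precision)  # 6 / (6 + 6) -> 0.5
```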


Deep models are often regarded as black boxes that you shove data into and magically get results from. While it often seems that way, there is a reason for every choice a model makes, and it usually starts and ends with the data. Looking at the overall performance of a model rarely helps in understanding the nuances of how it reaches its predictions. The right way to improve a model starts with building intuition about how it processes images, and the only way to build this intuition is by looking through many, many predictions.

FiftyOne can help narrow down which images and predictions you should look at to build intuition about your model the fastest. It then provides an easy-to-use App to quickly view images and predictions, and lets you poke and prod at your data in any way you see fit.

Best and worst samples

Running evaluation and marking your model’s detections as true and false positives allows you to quickly filter, search, and sort by samples where your model has the most true or false positives. These fields can be used to see the samples where your model performed the best and the worst. To find the best samples, sort by the number of true positives.

session.view = dataset.sort_by("tp_iou_0_75", reverse=True)
Image 5: MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN true positives (i.e. the best performing samples).

More interestingly, to find the worst samples, the ones that most significantly impact the performance of your model, look at those with the most false positives.

session.view = dataset.sort_by("fp_iou_0_75", reverse=True)
Image 6: MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN false positives (i.e. the worst performing samples).

Filtering samples

Any field or object attribute can be used to sort your dataset. For example, you can write a single line of code that uses bounding box coordinates to calculate area and filter out large boxes. Small objects are often blurrier and thus more difficult to detect than large objects, so seeing how your model performs on them can tell you whether you should prioritize small objects during training.

from fiftyone import ViewField as F

# Only show boxes with area < 0.005
small_boxes_view = dataset.filter_detections(
    "faster_rcnn", F("bounding_box")[2] * F("bounding_box")[3] < 0.005
)

session.view = small_boxes_view
Image 7: MSCOCO validation images with Faster-RCNN detections displaying only small bounding boxes with an area < 0.005.

MSCOCO includes the attribute “iscrowd”, which indicates whether a bounding box contains multiple objects of the same type. We can use FiftyOne to easily write a filter that shows only samples containing a crowd of objects.

from fiftyone import ViewField as F

# Only show images that contain at least one detection for which
# iscrowd == True
crowded_images_view = dataset.match(
    F("ground_truth.detections").filter(
        F("attributes.iscrowd.value") == 1
    ).length() > 0
)

session.view = crowded_images_view
Image 8: MSCOCO validation images that contain a ground truth object tagged “iscrowd”.


How can this help you improve your model? Take the example above of sorting by false positives. If we browse through the samples with the largest number of incorrect predictions, a pattern emerges: the samples with the most false positives are often very crowded scenes.

session.view = dataset.sort_by("fp_iou_0_75", reverse=True)
Image 9: MSCOCO validation images displayed in FiftyOne sorted by samples with the most Faster-RCNN false positives (i.e. the worst performing samples).

Let's additionally filter to only show samples with “iscrowd” objects and then sort those from worst to best.

from fiftyone import ViewField as F

# Show images that contain at least one detection for which
# iscrowd == True, sorted with most false positives first
sorted_crowded_images_view = dataset.match(
    F("ground_truth.detections").filter(
        F("attributes.iscrowd.value") == 1
    ).length() > 0
).sort_by(
    "fp_iou_0_75", reverse=True
)

session.view = sorted_crowded_images_view
Image 10: MSCOCO validation images that contain a ground truth object tagged “iscrowd” and sorted by the number of false positive predictions made by Faster-RCNN.

In MSCOCO, when there are many objects in a crowd, only a few of them are individually annotated and a big box is drawn around the group to indicate that it is a crowd. Any prediction inside this box with the same class as the box is automatically assumed to be correct. However, there are many instances where the crowd box is not properly marked with the “iscrowd” attribute, and so the predictions, while mostly correct, result in a much lower mAP for the sample and the dataset as a whole. For example, most predictions in the image with broccoli shown below are correct, but the false positive count is high because only one prediction was matched with the crowd box. If you are creating a dataset, this is a great way to see where you need to go back and fix your annotations.
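
The special matching rule for crowd boxes can be sketched in plain Python. In pycocotools-style evaluation, overlap with an “iscrowd” box is measured against the detection's own area rather than the union, so any same-class prediction lying inside the crowd region matches it strongly. This is a simplified standalone sketch, not the actual pycocotools implementation:

```python
def crowd_overlap(det, crowd):
    """Overlap of a detection with a crowd box, both in
    [top-left-x, top-left-y, width, height] format. For iscrowd ground
    truth, the intersection is divided by the detection's own area
    instead of the union (simplified sketch of pycocotools behavior)."""
    dx, dy, dw, dh = det
    cx, cy, cw, ch = crowd
    inter_w = max(0.0, min(dx + dw, cx + cw) - max(dx, cx))
    inter_h = max(0.0, min(dy + dh, cy + ch) - max(dy, cy))
    inter = inter_w * inter_h
    det_area = dw * dh
    return inter / det_area if det_area > 0 else 0.0

# A small detection fully inside a large crowd box overlaps it completely,
# so it would be counted as correct if the crowd box were tagged properly
print(crowd_overlap([10, 10, 5, 5], [0, 0, 100, 100]))  # -> 1.0
```

This is exactly why a crowd box missing its “iscrowd” tag is so damaging: without the tag, standard IoU matching is used instead, and most predictions inside the box become false positives.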

Image 11: MSCOCO validation image that contains a ground truth object missing the “iscrowd” tag. Left – the ground truth bounding box. Right – the Faster-RCNN predictions.

An interesting observation is that these crowd boxes seem to have been left in when training Faster-RCNN as well. We can see that Faster-RCNN predictions also include a box drawn around groups of objects. This would generally not be the desired behavior of a model being used to detect individual objects at inference time. To fix this, the training scheme of Faster-RCNN on MSCOCO should be changed so that the crowd box itself is not treated as an object to detect, but is instead used only to adjust the loss so that predictions falling inside it are counted as correct.

Image 12: MSCOCO validation image that contains a ground truth object missing the “iscrowd” tag. The ground truth bounding box and the matched Faster-RCNN prediction are displayed. All other Faster-RCNN predictions were marked as incorrect since the ground truth was missing the “iscrowd” tag.

Observations like this, noticing that the model is predicting the crowd box itself, would be nearly impossible to make just by looking at the mAP of a model across an entire dataset. Even looking at individual sample statistics would be ineffective at surfacing these nuances. The best way to increase your understanding of your detection model is to go through and visualize its outputs directly on images. FiftyOne gives you this ability and provides the tools needed to really get to know your model.

Possible Extension… Quantitative model improvement