COCO Evaluation metrics explained

As we saw in a previous article about Confusion Matrices, evaluation metrics are essential for assessing the performance of computer vision models. In this article, we will take a closer look at the COCO Evaluation Metrics, and in particular those that can be found on the Picsellia platform.

In order to better understand the following sections, let’s have a quick reminder of some evaluation metrics:

Reminders and Definitions

TP, FP, FN, TN

  • TP (True Positive): occurs when a model correctly predicts a positive outcome.
  • TN (True Negative): occurs when a model correctly predicts a negative outcome.
  • FP (False Positive): occurs when a model predicts a positive outcome when it should have been negative.
  • FN (False Negative): occurs when a model predicts a negative outcome when it should have been positive.

Precision

Precision indicates how many of the predicted positive cases are correct. It quantifies the ratio of true positives to the total number of positive predictions and is computed as follows:
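Precision = TP / (TP + FP)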

Precision reveals the model's ability to make accurate positive predictions. A high precision value indicates that when the model predicts a positive outcome, it is often correct.

Recall

Recall, also known as sensitivity or true positive rate, measures the ratio of true positives to the total number of actual positive samples and is calculated as follows:
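Recall = TP / (TP + FN)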

Recall focuses on the model's ability to correctly identify positive samples from the entire pool of positive instances.

Intersection over Union

Commonly used in computer vision for tasks like object detection, instance segmentation, and image segmentation, IoU is a metric that evaluates the extent of overlap between two bounding boxes, providing a measure of how well a predicted object aligns with its ground-truth counterpart. IoU makes it possible to quantify the precision and recall of detection algorithms. A higher IoU score implies a more accurate prediction.

Mathematically, IoU is defined as the ratio of the area of intersection to the area of the union of the predicted bounding box and the ground-truth bounding box:
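IoU = Area of Intersection / Area of Union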

IoU is a number between 0 and 1.

  • If IoU = 0, it means that there is no overlap between the boxes.
  • If IoU = 1, it means that they overlap completely.
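As an illustration, here is a minimal Python sketch of how IoU can be computed for two axis-aligned boxes. The (x1, y1, x2, y2) box format and the box_iou name are assumptions made for this example, not a specific library API:

```python
def box_iou(box_a, box_b):
    """Compute the IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero when the boxes do not overlap
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    # Union = sum of both areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two partially overlapping boxes
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.14
```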

You can also read our article on the Confusion Matrix to learn more about precision, recall, true positives and negatives, and false positives and negatives.

Now that we are clear on these metrics, let's go deeper and look at Average Precision and Average Recall.

Average Precision (AP) and Average Recall (AR) for object detection

1) Predictions and Annotations: Imagine an image with predictions made by your object detection model: bounding boxes around objects, each with a confidence score. You also have ground-truth annotations that specify the actual object positions and classes.

Here is an evaluation of an experiment on Picsellia’s platform: the green boxes are the ground truth and the red ones are the model’s predictions.

2) IoU Calculation: Calculate the Intersection over Union for each prediction-annotation pair. IoU measures the overlap between the prediction and the annotation by taking the ratio of the intersection area to the union area, as explained above.

3) IoU Threshold Selection: In COCO, average precision is evaluated at several IoU thresholds: 0.50, 0.75, and the range 0.50-0.95 (averaged over thresholds from 0.50 to 0.95 in steps of 0.05).

4) Precision and Recall Calculation: For each IoU threshold, classify the predictions into true positives, false positives, and false negatives, then compute precision and recall.

Example of classifying predictions at IoU = 0.5 (source: https://learnopencv.com/)

Let’s evaluate our car example:

  • For IoU=0.5

4 True Positives
0 False Positives
1 False Negative

Precision = 4 / (4 + 0) = 1
Recall = 4 / (4 + 1) = 0.8

  • For IoU=0.75

2 True Positives
2 False Positives
1 False Negative

Precision = 2 / (2 + 2) = 0.5
Recall = 2 / (2 + 1) ≈ 0.66

You can visualize all these data and more on Picsellia!
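To make the counting above concrete, here is a minimal sketch of step 4, reusing the box_iou helper from the earlier example. The greedy matching (each prediction, in descending confidence order, is matched to the best still-unmatched ground-truth box) is a simplification of the full COCO matching procedure:

```python
def evaluate_at_iou(predictions, ground_truth_boxes, iou_threshold=0.5):
    """Classify predictions into TP/FP at a given IoU threshold and return precision and recall.

    predictions: list of (box, confidence_score); ground_truth_boxes: list of boxes.
    """
    # Process predictions from most to least confident
    predictions = sorted(predictions, key=lambda p: p[1], reverse=True)
    matched_gt = set()
    tp, fp = 0, 0

    for box, _score in predictions:
        # Find the best still-unmatched ground-truth box for this prediction
        best_iou, best_gt = 0.0, None
        for i, gt_box in enumerate(ground_truth_boxes):
            if i in matched_gt:
                continue
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_gt = iou, i

        if best_iou >= iou_threshold:
            tp += 1
            matched_gt.add(best_gt)
        else:
            fp += 1

    fn = len(ground_truth_boxes) - len(matched_gt)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```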

5) Precision-Recall Curve: For each IoU threshold, build a precision-recall curve by varying the confidence threshold of the model, plotting precision on the y-axis and recall on the x-axis. Each point on the curve corresponds to a specific confidence threshold.

6) Area Under the Curve (AUC) Calculation: Calculate the area under the precision-recall curve for each IoU threshold. This gives the Average Precision (AP) value for each threshold.

  • AP at IoU 0.50: the area under the precision-recall curve built with the 0.50 IoU threshold.
  • AP at IoU 0.50-0.95: the mean of the AP values obtained over the range of IoU thresholds from 0.50 to 0.95, in steps of 0.05.
  • AP at IoU 0.75: the area under the precision-recall curve built with the 0.75 IoU threshold.
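As a rough sketch of step 6, a COCO-style AP can be approximated from arrays of precision and recall values by interpolating precision at 101 evenly spaced recall levels. This is a simplified, stand-alone version, not the exact pycocotools implementation:

```python
import numpy as np

def average_precision(recalls, precisions):
    """Approximate COCO-style AP: precision interpolated at 101 recall levels."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)

    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        # Best precision achieved at a recall of at least r (0 if unreachable)
        reachable = recalls >= r
        ap += precisions[reachable].max() if reachable.any() else 0.0
    return ap / 101


# Toy precision-recall points
print(average_precision([0.2, 0.4, 0.8], [1.0, 0.9, 0.6]))
```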

Mean Average Precision (mAP): If you have several object categories, calculate the AP for each category and then take the mean to get the mAP.
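In practice, these numbers are usually produced with the official pycocotools library rather than by hand. Here is a minimal sketch, assuming your ground truth and detections are stored in COCO-format JSON files (the file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Load COCO-format ground truth and detection results (placeholder paths)
coco_gt = COCO("annotations/instances_val.json")
coco_dt = coco_gt.loadRes("detections.json")

# Evaluate bounding boxes; use iouType="segm" to evaluate instance segmentation masks
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.5:.95], AP@.5, AP@.75, AP by size, AR@1/10/100, AR by size
```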

On Picsellia, you have access to a table that summarizes your evaluation data. Let’s have a closer look.

Here's what each component means:

  • GT Objects is the number of objects annotated by the user, and Eval Objects is the number of objects predicted by the model.
  • We have already seen what the 50, 50-95, and 75 Average Precision columns correspond to.
  • Small, Medium, and Large Data: These categories represent different ranges of object sizes. Objects are grouped according to their bounding box (or mask) area. In COCO:
  • Small: objects with an area below 32² pixels (e.g., small animals, small objects).
  • Medium: objects with an area between 32² and 96² pixels (e.g., medium-sized vehicles, people).
  • Large: objects with an area above 96² pixels (e.g., large vehicles, large structures).
  • In Average Recall we also have the columns Det1, Det10, and Det100: these refer to the maximum number of detections per image taken into account. The notation "DetX" means that only the X highest-confidence predictions per image are kept when computing the recall.

    For example:
  • Det1: recall is computed considering only the single highest-confidence prediction per image.
  • Det10: recall is computed considering at most the 10 highest-confidence predictions per image.
  • Det100: recall is computed considering at most the 100 highest-confidence predictions per image.

By evaluating AP and AR 50-95 across small, medium, and large data, you can gain insight into how well your model performs across different object sizes and levels of overlap. This approach provides a more complete understanding of your model's strengths and weaknesses in different scenarios.

What about Instance segmentation and Image segmentation? 

All of these metrics are used not only for object detection but also for instance segmentation and image segmentation.

Instance Segmentation:

In instance segmentation, the goal is to not only detect objects but also segment each instance of the object separately. This involves predicting both the bounding box and the pixel-level mask for each instance. The evaluation metrics are extended to account for both detection and segmentation aspects.

  • Average Precision (AP): AP for instance segmentation considers both the accuracy of bounding box predictions and the quality of pixel-level masks. It is calculated by comparing the predicted masks to the ground truth masks using IoU. A prediction is considered correct if both the bounding box and mask sufficiently overlap with the ground truth. The AP is computed over a range of IoU thresholds and averaged.
  • Average Recall (AR): Similar to AP, AR for instance segmentation considers both detection and segmentation. It measures how well the model recalls instances of objects at various IoU thresholds. A prediction is considered a true positive if its IoU with the ground truth exceeds a certain threshold. AR is also computed over a range of IoU thresholds and averaged.
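For masks, the IoU used in these computations is taken over pixels rather than bounding-box areas. Here is a minimal sketch with NumPy, assuming the predicted and ground-truth masks are binary arrays of the same shape:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """IoU between two binary masks (boolean NumPy arrays of the same shape)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0
```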

Image Segmentation:

The goal of Image segmentation is to assign a class label to each pixel in an image. The primary metric used is IoU, which also quantifies the similarity between the predicted segmentation mask and the ground truth mask for each class.

For image segmentation, IoU is used to measure the quality of pixel-wise predictions. It calculates the ratio of the intersection of the predicted and ground truth regions to their union. IoU is calculated separately for each class, and then the average IoU across all classes is often reported.
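Here is a minimal sketch of per-class IoU and mean IoU for image segmentation, assuming the predicted and ground-truth label maps are integer NumPy arrays of the same shape:

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Compute IoU for each class present in the ground truth, plus the mean IoU."""
    ious = {}
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        if target_c.sum() == 0:
            continue  # class absent from the ground truth, skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        ious[c] = float(intersection) / float(union)

    mean_iou = sum(ious.values()) / len(ious) if ious else 0.0
    return ious, mean_iou
```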

It's important to note that the exact details of how these metrics are computed may vary slightly depending on the specific segmentation task, dataset, and evaluation protocol. 

In both instance segmentation and image segmentation, these metrics help quantify the accuracy and quality of the model's predictions, providing valuable insight into its performance.

Going further with IoU

IoU serves as a fundamental metric for evaluating the performance of computer vision models, helping researchers and practitioners in fine-tuning their algorithms. While IoU provides valuable insights, it's important to note that it has certain limitations, such as being sensitive to object scale and aspect ratio. This sensitivity can lead to misleading results, especially for objects with different shapes and sizes.

To overcome this limitation, variants of IoU have been proposed, such as GIoU (Generalized Intersection over Union) and DIoU (Distance Intersection over Union). 

These variants incorporate additional geometric information to provide a more robust evaluation of bounding box overlap, mitigating the effect of scale and aspect ratio.
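As an illustration, GIoU extends IoU with a penalty based on the smallest box enclosing both boxes. A minimal sketch, reusing the (x1, y1, x2, y2) format and the box_iou helper from the earlier example:

```python
def box_giou(box_a, box_b):
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered by the union."""
    iou = box_iou(box_a, box_b)

    # Smallest axis-aligned box enclosing both boxes
    enclosing = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) * \
                (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1]))

    # Union area of the two boxes
    intersection = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])) * \
                   max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # GIoU can be negative for boxes that are far apart
    return iou - (enclosing - union) / enclosing
```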

