I am a novice in the computer vision field, and while working on an object detection project I realized that there are a few ways to represent bounding boxes. Which representation you need depends on the model (or framework) you pick for your particular problem. For example, the Region-based Convolutional Neural Network family, which includes R-CNN, Fast R-CNN, Faster R-CNN and Mask R-CNN, typically uses COCO-style bounding boxes. A COCO bounding box is represented as X, Y, W, H, where X, Y are the coordinates of the top-left corner, W is the width, and H is the height of the rectangle.

Another common convention is the PASCAL VOC format, which represents a box as X1, Y1, X2, Y2, where X1, Y1 are the coordinates of the top-left corner and X2, Y2 are the coordinates of the bottom-right corner. (The famous YOLO family is often lumped in here, but its own label files actually use normalized center coordinates, like the format described next.)
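Since the corner format and the top-left-plus-size format differ only by an addition, it helps to keep explicit conversion helpers around. This is a minimal sketch (the function names are my own, not from any library):

```python
def coco_to_voc(box):
    """COCO (x, y, w, h) -> PASCAL VOC (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def voc_to_coco(box):
    """PASCAL VOC (x1, y1, x2, y2) -> COCO (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)
```

Note that the two conversions are exact inverses, so a round trip should always return the original box.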

While doing some research I came across yet another format: the DETR Transformer takes Xc, Yc, W, H, where Xc and Yc are the coordinates of the center of the bounding box, W is the width, and H is the height, all normalized by the image size.
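Converting to and from the normalized center format additionally requires the image size. Again, a minimal sketch with hypothetical function names:

```python
def voc_to_center(box, img_w, img_h):
    """PASCAL VOC (x1, y1, x2, y2) -> normalized (xc, yc, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return ((x1 + w / 2) / img_w, (y1 + h / 2) / img_h,
            w / img_w, h / img_h)

def center_to_voc(box, img_w, img_h):
    """Normalized (xc, yc, w, h) -> PASCAL VOC (x1, y1, x2, y2)."""
    xc, yc, w, h = box
    w, h = w * img_w, h * img_h
    x1 = xc * img_w - w / 2
    y1 = yc * img_h - h / 2
    return (x1, y1, x1 + w, y1 + h)
```

Because the center format is normalized to [0, 1], a quick sanity check is that every value should fall in that range; a value larger than 1 is a strong hint you passed in pixel coordinates by mistake.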

This seems like a small detail, but it took me quite some time to spot the issue when I trained a Mask R-CNN model. If you don’t use the correct format, everything goes wrong. The problem is particularly insidious because it doesn’t raise an error during data processing. Training completes without errors, and your only hint that something is amiss is that the results don’t look good (and since you don’t know what the results are supposed to look like, you may not guess that the bounding boxes were to blame).
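One cheap defense is to validate boxes before training ever starts. The helper below is a hypothetical sketch assuming VOC-style corner boxes or COCO-style top-left-plus-size boxes; a box encoded in the wrong format will often fail these checks (for example, a "corner" box whose second pair is smaller than the first):

```python
def check_boxes(boxes, img_w, img_h, fmt="voc"):
    """Raise AssertionError if any box looks wrong for the claimed format.

    A wrongly formatted box that would otherwise pass silently through
    training frequently shows up here as degenerate or out of bounds.
    """
    for b in boxes:
        if fmt == "voc":
            x1, y1, x2, y2 = b
            assert x2 > x1 and y2 > y1, f"degenerate box {b}"
            assert 0 <= x1 and x2 <= img_w, f"box outside image width: {b}"
            assert 0 <= y1 and y2 <= img_h, f"box outside image height: {b}"
        elif fmt == "coco":
            x, y, w, h = b
            assert w > 0 and h > 0, f"non-positive size {b}"
            assert 0 <= x and x + w <= img_w, f"box outside image width: {b}"
            assert 0 <= y and y + h <= img_h, f"box outside image height: {b}"
```

Running this once over your dataset takes seconds and would have saved me a full training run.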

I hope this short post can help you pick the right type of bounding boxes for your model.