Notes on YOLO



YOLO is fast and simple. It is especially good if we are interested mainly in the existence of an object rather than its exact location.




24 convolutional layers (features layer): $448 \times 448 \to 7 \times 7$
$ \Downarrow $
2 fully-connected layers
$ \Downarrow $
Outputs: an $S \times S \times (B \times 5 + C)$ tensor, where
$S$ is the grid size (the image is divided into $S \times S$ grid cells),
$B$ is the number of bounding boxes predicted per grid cell,
$C$ is the number of classes (or, as it is called in the paper, the number of conditional class probabilities $Pr(Class_i | Object)$),
the number $5$ stands for 4 bounding box location parameters $(x, y, w, h)$ and 1 bounding box confidence (objectness prediction) $p = Pr(Object) \times IOU_{pred}^{truth}$

Each output contains $S \times S$ predictions, i.e. each grid cell predicts only one object. In other words, YOLOv1 enforces spatial diversity in making predictions.

For example, to evaluate on PASCAL VOC, YOLOv1 uses a $7 \times 7$ grid, $2$ bounding boxes, and $20$ classes. So the network outputs a $7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30$ tensor, which is also shown in the image above.
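The shape arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula $S \times S \times (B \times 5 + C)$; the function name `yolo_v1_output_shape` is made up for illustration and does not come from any YOLO codebase.

```python
def yolo_v1_output_shape(s, b, c):
    """Shape of the YOLOv1 output tensor: S x S x (B * 5 + C).

    s: grid size, b: boxes per grid cell, c: number of classes.
    The 5 covers (x, y, w, h) plus one confidence score per box.
    """
    return (s, s, b * 5 + c)


# PASCAL VOC setting from the text: S = 7, B = 2, C = 20
voc_shape = yolo_v1_output_shape(s=7, b=2, c=20)
print(voc_shape)  # (7, 7, 30)
```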


Improvements on YOLOv1:

  • Remove fully-connected layers. As a result, YOLOv2 allows different input image sizes and higher resolution detection (because you no longer have to shrink the image).
  • Convolutional with Anchor Boxes. Instead of predicting the coordinates of bounding boxes directly, YOLOv2 takes advantage of the idea of anchors and predicts five coordinates ($t_x, t_y, t_w, t_h, t_o$) for each anchor. $$ b_x = \sigma(t_x) + c_x $$ $$ b_y = \sigma(t_y) + c_y $$ $$ b_w = p_w e^{t_w} $$ $$ b_h = p_h e^{t_h} $$ $$ Pr(Object) \times IOU(b, Object) = \sigma(t_o) $$ Here $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image, and $(p_w, p_h)$ are the width and height of the anchor.

  • Dimension Clusters. In order to determine the sizes of the anchors, YOLOv2 runs k-means clustering on the training set bounding boxes to automatically find good priors (an anchor is also called a prior in the paper). Instead of the usual Euclidean distance, YOLOv2 defines its own so-called IOU distance: $$ d(box, centroid) = 1 - IOU(box, centroid) $$
  • Fine-Grained Features. YOLOv2 predicts detections on a $13 \times 13$ feature map. While this is sufficient for larger objects, localizing smaller objects may benefit from finer-grained features. YOLOv2 therefore adds a passthrough layer that brings in features from an earlier layer at $26 \times 26$ resolution.
  • Multi-Scale Training. Since YOLOv2 removes the fully-connected layers, it can accept images of different sizes. So instead of fixing the input image size, it can be trained with different input dimensions, as long as they are multiples of 32, e.g. {320, 352, …, 608}.
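The anchor decoding equations, the IOU distance, and the multi-scale size list from the bullets above can be sketched in plain Python. This is a minimal illustration rather than the Darknet implementation: the function names are made up, and the IOU for clustering assumes that both boxes are given as (width, height) and share the same center, a common simplification when clustering box shapes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw predictions (t_x, t_y, t_w, t_h, t_o) to a box.

    cell:  (c_x, c_y), offset of the grid cell from the image's top-left corner
    prior: (p_w, p_h), width and height of the anchor
    All outputs are in grid-cell units.
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # anchor size scaled by exp(t)
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)     # Pr(Object) * IOU(b, Object)
    return b_x, b_y, b_w, b_h, confidence


def iou_wh(box, centroid):
    """IOU of two (w, h) boxes, assuming they share the same center."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def iou_distance(box, centroid):
    """d(box, centroid) = 1 - IOU(box, centroid)."""
    return 1.0 - iou_wh(box, centroid)


# Input sizes for multi-scale training: multiples of 32 from 320 to 608
train_sizes = list(range(320, 608 + 1, 32))
```

With all raw predictions at zero, `decode_box((0, 0, 0, 0, 0), (3, 4), (2.0, 5.0))` places the box center in the middle of cell (3, 4) and leaves the anchor size unchanged. In k-means, `iou_distance` would replace the Euclidean distance when assigning each training box to its nearest centroid.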


  • Feature extraction: Darknet-19, which reduces the image dimension by a factor of 32, for example: $416 \times 416 \to 13 \times 13$
  • Some intermediate layers
  • Outputs: $ S \times S \times (B \times (4 + 1 + C)) $. YOLOv2 moves the class prediction from the grid-cell level to the box level, which changes the shape of the output tensor.
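As a quick check of the YOLOv2 output shape, the sketch below combines the downsampling factor of 32 with the formula $S \times S \times (B \times (4 + 1 + C))$. The function name is hypothetical, and the example values of 5 anchors and 20 classes are assumptions chosen for illustration (a PASCAL VOC style setup), not values stated in these notes.

```python
def yolo_v2_output_shape(input_size, num_anchors, num_classes, stride=32):
    """Shape of the YOLOv2 output: S x S x (B * (4 + 1 + C)).

    input_size must be a multiple of the stride (32 for Darknet-19).
    """
    if input_size % stride != 0:
        raise ValueError("input_size must be a multiple of the stride")
    s = input_size // stride
    # Each anchor carries 4 box parameters, 1 objectness score, C class scores
    return (s, s, num_anchors * (4 + 1 + num_classes))


# Assumed example: 416 x 416 input, 5 anchors, 20 classes
print(yolo_v2_output_shape(416, num_anchors=5, num_classes=20))  # (13, 13, 125)
```

Because the class scores are now repeated per anchor, the channel count grows with $B \times C$ instead of $C$ alone as in YOLOv1.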


YOLO9000 uses classification data to train a detection network. (Details coming soon)


No surprise or new ideas. Improvements on YOLOv2:

  • Prediction Across Scales. YOLOv3 predicts objects at 3 different scales, each with 3 clusters. On the COCO dataset the 9 clusters are: $(10 \times 13)$, $(16 \times 30)$, $(33 \times 23)$, $(30 \times 61)$, $(62 \times 45)$, $(59 \times 119)$, $(116 \times 90)$, $(156 \times 198)$, $(373 \times 326)$. See config for network structure.
  • Feature Extractor. YOLOv3 uses Darknet53, which is a hybrid approach between Darknet19 and a residual network.
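To make the three-scale layout concrete, the sketch below splits the 9 COCO clusters listed above across three output grids. It rests on several assumptions not stated in these notes: the strides of 32, 16 and 8, the rule that the largest anchors go to the coarsest grid, and the 80 COCO classes used in the example. The function name `yolo_v3_heads` is made up for illustration.

```python
# The 9 COCO anchor clusters (width, height) listed in the text
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

# Assumed downsampling factors of the three detection scales
STRIDES = (32, 16, 8)


def yolo_v3_heads(input_size, num_classes, anchors=ANCHORS, strides=STRIDES):
    """Describe each detection scale: grid size, its anchors, channel count."""
    per_scale = len(anchors) // len(strides)
    # Assumption: the largest anchors are assigned to the coarsest grid
    ordered = sorted(anchors, key=lambda a: a[0] * a[1], reverse=True)
    heads = []
    for i, stride in enumerate(strides):
        heads.append({
            "grid": input_size // stride,
            "anchors": ordered[i * per_scale:(i + 1) * per_scale],
            # per anchor: 4 box parameters + 1 objectness + class scores
            "channels": per_scale * (4 + 1 + num_classes),
        })
    return heads


for head in yolo_v3_heads(416, num_classes=80):
    print(head["grid"], head["channels"], head["anchors"])
```

Under these assumptions a $416 \times 416$ input yields grids of 13, 26 and 52 cells per side, each predicting 3 boxes per cell.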


  • Darknet: the nn framework, not the specific network like Darknet19
  • Training YOLO: a step by step guide on how to train your own yolo detection network


Darknet has a bunch of different implementations:

Training YOLO