Bird Classification

Classifying bird species from visual data: from hand-crafted algorithms to deep learning

The Challenge: Training neural networks for bird classification

Before the advent of neural networks, researchers hand-crafted complex algorithms to classify birds from visual data. These algorithms were often intricate and required extensive feature engineering. With the rise of deep learning, we can instead train neural networks to learn features from images automatically, making the classification process more efficient and accurate.

In this project, we aim to classify bird species from visual data, leveraging neural networks to automate the feature extraction process. We use a convolutional neural network (CNN) architecture to achieve this goal. For training data, we use the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset, which contains 11,788 images covering 200 bird species, roughly 60 images per species. The dataset is publicly available from Caltech.
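
The report does not include data-loading code; the following is a minimal sketch assuming the standard CUB_200_2011/images layout (one subfolder per species) and a simple 80/20 random split instead of the official train_test_split.txt. Paths and batch size are placeholders.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # Normalization statistics used by the ImageNet-pretrained backbones.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageFolder infers the 200 class labels from the subfolder names.
dataset = datasets.ImageFolder("CUB_200_2011/images", transform=transform)

# Simple 80/20 random split; the official train_test_split.txt could be
# used instead for comparability with published results.
train_size = int(0.8 * len(dataset))
train_set, val_set = torch.utils.data.random_split(
    dataset, [train_size, len(dataset) - train_size])

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)
```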

This report will delve into a series of experiments conducted on the CUB-200-2011 dataset, exploring the performance of various cutting-edge deep learning models, including ResNet, DenseNet, and the versatile YOLOv8. The analysis will also examine the impact of different optimization strategies, specifically Adam and AdamW. By scrutinizing the experimental results, this report aims to provide a comprehensive understanding of the efficacy of different architectural choices and training methodologies in tackling this intricate fine-grained classification problem.

AI Toolkits and Optimizers

Transfer learning: Feature Extraction vs Fine-tuning

Transfer learning is a powerful technique that allows us to leverage pre-trained models on large datasets to improve performance on specific tasks. In this project, we will explore two approaches: feature extraction and fine-tuning.

Feature Extraction: Feature extraction involves using a pre-trained model as a fixed feature extractor, where we freeze the weights of the pre-trained model and only train a new classifier on top of it. This approach is useful when we have limited labeled data for our specific task.

Fine-tuning: Fine-tuning, on the other hand, involves unfreezing some of the layers of the pre-trained model and training them along with the new classifier. This allows the model to adapt to the specific characteristics of our dataset, potentially leading to better performance.

The choice between feature extraction and fine-tuning has a direct and significant impact on both model performance and computational cost, as the experimental results on the CUB-200-2011 dataset show. For instance, ResNet50, a deeper model used as a frozen feature extractor, yields 58-60% accuracy, while a fine-tuned ResNet18, a shallower model, reaches a higher 66%.
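
The original training code is not shown in this report, but the two setups can be sketched in PyTorch roughly as follows. The 512-node hidden head mirrors the ResNet50 configuration in Table 1; the choice of pretrained weights and other details are assumptions.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 200  # CUB-200-2011 species

# Feature extraction: freeze the pretrained backbone, train only a new head.
feat_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in feat_model.parameters():
    param.requires_grad = False                 # backbone stays fixed
feat_model.fc = nn.Sequential(                  # one hidden layer of 512 nodes,
    nn.Linear(feat_model.fc.in_features, 512),  # as in the Table 1 configuration
    nn.ReLU(),
    nn.Linear(512, NUM_CLASSES),
)  # freshly created layers are trainable by default

# Fine-tuning: replace the head but leave every layer trainable.
ft_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
ft_model.fc = nn.Linear(ft_model.fc.in_features, NUM_CLASSES)
# All parameters keep requires_grad=True, so the optimizer updates them all.
```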

ResNet: Building Deeper with Residual Connections

ResNet, or Residual Network, is a deep learning architecture that revolutionized the training of very deep neural networks by introducing residual connections. These connections, also known as skip connections, allow information to bypass one or more layers, effectively addressing the vanishing gradient problem that often hampers deep networks.

The core building block of ResNet is the residual block, which consists of two or more convolutional layers and a shortcut connection that adds the input directly to the output of the block. Mathematically, this can be written as y = F(x, {W_i}) + x, where x is the input, F(x, {W_i}) is the residual function computed by the stacked layers, and y is the final output of the block.

This design enables the network to learn residual mappings instead of directly trying to learn unreferenced functions, making it easier to train deeper models and preserve information from earlier layers.
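
To make this concrete, here is a minimal PyTorch sketch of a basic residual block (the ResNet18/34 style), simplified to the stride-1, equal-channel case:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                        # shortcut carries x unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))     # F(x, {W_i})
        return self.relu(out + residual)    # y = F(x, {W_i}) + x
```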

Experimentation

Experiments show that fine-tuning a smaller model like ResNet18 (66% accuracy) works better than just using a deeper model like ResNet50 as a fixed feature extractor (58-60% accuracy). This means that letting the model adapt to the bird dataset is more important than just having a deeper network. For fine-grained tasks like bird classification, it's usually better to fine-tune—even with a smaller model—than to use a larger model without any adaptation.

DenseNet: Maximizing Feature Reuse and Efficiency

DenseNet, or Densely Connected Convolutional Network, builds on the idea behind residual connections but takes it a step further: within each dense block, every layer is connected to every subsequent layer in a feed-forward fashion. Rather than summing, the feature maps of all preceding layers are concatenated and passed as input to each layer, so layer l computes x_l = H_l([x_0, x_1, ..., x_{l-1}]). This dense connectivity pattern maximizes feature reuse and helps mitigate the vanishing gradient problem, since gradients can flow directly to earlier layers during backpropagation.
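
A minimal sketch of this connectivity pattern in PyTorch (growth rate and layer count are illustrative, not the actual DenseNet121/161 values):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i sees the original input plus all i earlier outputs.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate every preceding feature map along the channel axis:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}])
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```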

Experimentation

The experiments utilized DenseNet121 and DenseNet161, both configured for feature extraction: their pre-trained backbones were frozen, and only a new classification head was trained on the CUB-200-2011 dataset. Even used solely as a feature extractor, DenseNet demonstrates strong performance (DenseNet161 reaching 69.4%), notably surpassing ResNet50 in the same feature-extraction configuration (58-60%). This suggests that DenseNet's pervasive feature reuse yields a more discriminative and effective set of features even without fine-tuning the backbone.
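
The "No hidden" configuration in Table 1 corresponds to replacing DenseNet's classifier with a single 200-way linear layer. A sketch, assuming torchvision's pretrained weights:

```python
import torch.nn as nn
from torchvision import models

model = models.densenet161(weights=models.DenseNet161_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False  # backbone frozen: feature extraction only
# DenseNet exposes its head as `classifier` (ResNet uses `fc` instead).
model.classifier = nn.Linear(model.classifier.in_features, 200)
```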

YOLOv8: Real-time Object Detection and Classification

YOLO (You Only Look Once) is a leading family of real-time object detection algorithms. Its "single-shot" approach predicts bounding boxes and class probabilities in one pass, making it fast and efficient compared to older multi-stage methods.

YOLOv8, developed by Ultralytics, is not just for object detection—it can also handle image classification and segmentation. For classification, models like yolov8n-cls.pt are used. YOLOv8 comes in several sizes (nano, small, medium, large, extra-large), letting you balance accuracy and speed.

Key improvements in YOLOv8 include:

  • Anchor-Free Detection: Predicts object centers directly, improving generalization and simplifying training.
  • Mosaic Data Augmentation: Combines four images during training for richer context, but stops this augmentation in the last training epochs for better results.
  • C2f Module & Decoupled Head: Architectural tweaks for better performance and modularity.

The YOLOv8 architecture has a backbone (for feature extraction), a neck (for merging features), and a head (for the final task, like classification).

On the CUB-200-2011 bird dataset, YOLOv8 achieved 73.1-78.5% accuracy, higher than both ResNet and DenseNet. Its strong results come from its advanced architecture, training strategy (AdamW optimizer, 100 epochs), and data augmentation. Even though YOLOv8 was designed for detection, its backbone and training regime make it excellent for fine-grained classification too.
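
A hedged sketch of how such a run could look with the Ultralytics API, plugging in the hyperparameters from Table 1; the dataset path is a placeholder, and Ultralytics expects classification data as per-class folders under train/ and val/ (or test/) directories:

```python
from ultralytics import YOLO

model = YOLO("yolov8n-cls.pt")   # nano classification checkpoint
model.train(
    data="path/to/cub200",       # placeholder classification dataset root
    epochs=100,
    imgsz=224,
    optimizer="AdamW",
    lr0=0.000714,
    momentum=0.9,
)
metrics = model.val()            # reports top-1 / top-5 accuracy
```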

This shows a trend: modern deep learning models can be adapted for multiple vision tasks, making development and deployment more efficient.

Optimizers

Adam (Adaptive Moment Estimation) is a popular optimizer in deep learning. It adapts the learning rate for each parameter by keeping track of both the average of past gradients (momentum) and the average of their squares (variance). This helps models train faster and more reliably, especially when the data is noisy or the gradients are sparse.

AdamW is an improved version of Adam. Its main difference is how it handles weight decay, a technique used to prevent overfitting by discouraging large weights. In Adam, the weight-decay term is added to the gradient before the adaptive update, so the decay is rescaled by the per-parameter learning rates and behaves like an L2 penalty entangled with the optimization. AdamW decouples weight decay from the gradient step, applying it directly to the weights, which leads to better regularization and more stable training, especially for large or complex models.
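
In PyTorch the two optimizers differ only in where weight_decay enters the update; a minimal illustration (learning rate and decay values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for any network

# Adam: the decay term wd * theta is added to the gradient, then rescaled
# by the adaptive step, so it acts as an L2 penalty coupled to the update.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: decoupled decay, applied directly to the weights each step:
#   theta <- theta - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```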

In the experiments, ResNet and DenseNet used Adam, while YOLOv8 used AdamW. This likely helped YOLOv8 achieve higher accuracy, as AdamW is better at preventing overfitting and improving generalization. For challenging tasks or bigger models, AdamW is usually the better choice.

All Experiments

Performance Snapshot: CUB-200-2011 Results
The table below summarizes the accuracy of different deep learning models and configurations on the CUB-200-2011 bird dataset.

Table 1: CUB-200-2011 Bird Classification Performance Overview
| Model | Optimizer | Layers / Hidden | Accuracy (%) | Remarks |
|---|---|---|---|---|
| ResNet50 | Adam | 1 hidden (512 nodes) | 60 | Feature extractor, only final layers trained |
| ResNet50 | Adam | 1 hidden (2048 nodes) | 58 | Feature extractor, only final layers trained |
| ResNet18 | Adam | No hidden | 66 | Fine-tuned, all layers trainable |
| DenseNet121 | Adam | No hidden | 66.82 | Feature extractor, only final layers trained |
| DenseNet121 | Adam | No hidden | 67.25 | Feature extractor, 40 epochs |
| DenseNet161 | Adam | No hidden | 69.40 | Feature extractor, only final layers trained |
| YOLOv8n | AdamW | - | 73.1 | lr=0.000714, momentum=0.9, 100 epochs |
| YOLOv8s | AdamW | - | 77.4 | lr=0.000714, momentum=0.9, 100 epochs |
| YOLOv8m | AdamW | - | 78.5 | lr=0.000714, momentum=0.9, 100 epochs |

Key Insights:

  • YOLOv8 models (n, s, m) achieved the highest accuracy (73–78%), outperforming both ResNet and DenseNet.
  • Within YOLOv8, larger models (n < s < m) consistently performed better.
  • DenseNet161 (69.4%) outperformed all ResNet variants, even as a feature extractor.
  • Fine-tuning (ResNet18, 66%) was more effective than using deeper models as fixed feature extractors (ResNet50, 58–60%).
  • All top YOLOv8 models used the AdamW optimizer and were trained for 100 epochs.

Conclusion

Key Takeaways from the Experiments

The YOLOv8 models, especially the medium variant (YOLOv8m), stood out in these experiments, reaching up to 78.5% accuracy on the CUB-200-2011 bird dataset. This shows that YOLOv8, though designed for object detection, is also a powerful tool for fine-grained classification. As the YOLOv8 models get larger (from nano to medium), their accuracy improves, highlighting the value of bigger models when you have enough data and compute.

ResNet models showed that how you use them matters as much as which one you pick. ResNet50, used as a frozen feature extractor, reached 60% accuracy, and enlarging its hidden classification head from 512 to 2048 nodes actually dropped accuracy slightly (58%). On the other hand, fine-tuning a smaller ResNet18 (training all its layers) boosted accuracy to 66%. This shows that letting a model adapt to your specific dataset is often more effective than using a deeper, pre-trained network as-is.

DenseNet models (DenseNet121 and DenseNet161) also performed well as feature extractors, with DenseNet161 reaching 69.4% accuracy. DenseNet's architecture, which connects each layer to every subsequent layer, helps it extract richer features even when the backbone is frozen. However, DenseNet's high memory use can be a drawback in some situations.

The choice of optimizer made a difference too. While ResNet and DenseNet used Adam, all top-performing YOLOv8 models used AdamW, which helps prevent overfitting and improves generalization. This likely contributed to YOLOv8's strong results.

In summary: for fine-grained bird classification, modern models like YOLOv8, combined with advanced optimizers like AdamW and smart training strategies such as fine-tuning, deliver the best results. Letting your model adapt to your data and choosing the right tools and training strategy (longer training, fine-tuning when possible) can make all the difference!