Teaching AI to Drive: Mastering Car Racing with Reinforcement Learning

Imagine an AI agent, not just navigating a road, but pushing the limits on a race track, learning from every turn, every slip, and every victory. This isn't science fiction; it's the exciting frontier of reinforcement learning, and it's precisely what our Autonomous Car Racing project set out to achieve.

We've developed a sophisticated reinforcement learning agent, powered by the Proximal Policy Optimization (PPO) algorithm, to conquer the challenging CarRacing-v2 environment from Gymnasium. Using custom actor-critic neural networks built with PyTorch, our agent learns to master the art of high-speed, autonomous driving.

The Challenge: The Dynamic World of CarRacing-v2

The CarRacing-v2 environment is no simple joyride. It presents a continuous control task where the agent must learn to steer, accelerate, and brake effectively to navigate procedurally generated tracks. Every race is a new challenge, preventing the agent from simply memorizing a specific path. The dynamic nature of the tracks, combined with the need for precise, real-time control, makes it an ideal proving ground for advanced reinforcement learning algorithms.
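
To get a feel for what the agent sees and controls, here is a minimal Gymnasium snippet showing the environment's raw observation and action spaces:

```python
import gymnasium as gym

# CarRacing-v2: 96x96 RGB frames as observations, and a continuous
# 3-dimensional action [steering in [-1, 1], gas in [0, 1], brake in [0, 1]].
env = gym.make("CarRacing-v2", continuous=True)

obs, info = env.reset(seed=0)
print(env.observation_space.shape)  # (96, 96, 3)
print(env.action_space)             # Box([-1. 0. 0.], [1. 1. 1.], (3,), float32)
env.close()
```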

The Brain Behind the Wheel: Proximal Policy Optimization (PPO)

At the heart of our autonomous racer is the Proximal Policy Optimization (PPO) algorithm. PPO is a state-of-the-art reinforcement learning algorithm known for its stability and effectiveness in continuous control tasks. It works by balancing the need for exploration (trying new actions) with exploitation (using what it has learned) to maximize cumulative rewards.

Our PPO agent utilizes a custom actor-critic neural network:

  • The Actor network is responsible for deciding the car's actions (steering, acceleration, braking). It outputs parameters of a Beta distribution, allowing for smooth, continuous control over these actions.
  • The Critic network evaluates the current state, predicting the expected future reward. This helps the actor learn more efficiently by providing a value estimate of its actions.
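
A minimal PyTorch sketch of this actor-critic structure is shown below. The encoder layers, the feature dimension, and the assumption of four stacked 96×96 grayscale frames are illustrative choices rather than our exact architecture; the key idea is that the actor emits per-action α and β parameters of a Beta distribution (which naturally bounds actions to [0, 1]), while the critic emits a single state-value estimate.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class ActorCritic(nn.Module):
    """Illustrative actor-critic with Beta-distributed actions."""

    def __init__(self, feature_dim: int = 256, n_actions: int = 3):
        super().__init__()
        # Shared CNN encoder over stacked grayscale frames (4 x 96 x 96 assumed).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feature_dim), nn.ReLU(),
        )
        # Actor: alpha/beta parameters of a Beta distribution per action dimension.
        self.alpha_head = nn.Sequential(nn.Linear(feature_dim, n_actions), nn.Softplus())
        self.beta_head = nn.Sequential(nn.Linear(feature_dim, n_actions), nn.Softplus())
        # Critic: scalar state-value estimate.
        self.value_head = nn.Linear(feature_dim, 1)

    def forward(self, obs: torch.Tensor):
        features = self.encoder(obs)
        # Adding 1 keeps alpha, beta > 1, so the Beta distribution stays unimodal.
        alpha = self.alpha_head(features) + 1.0
        beta = self.beta_head(features) + 1.0
        return Beta(alpha, beta), self.value_head(features)
```

Samples from a Beta distribution live in [0, 1], so the steering dimension is typically rescaled to [-1, 1] before being sent to the environment.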

The agent's learning is driven by a carefully crafted loss function that combines a clipped PPO loss for policy updates with a value loss for accurate state-value predictions, ensuring robust and stable training.
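
As a rough, self-contained sketch of that objective (not our exact implementation; `clip_eps` and `value_coef` are hyperparameters chosen here purely for illustration):

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, clip_eps=0.2, value_coef=0.5):
    """Clipped PPO surrogate loss plus a value-function loss (illustrative)."""
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: take the pessimistic minimum of the
    # unclipped and clipped terms, then negate to obtain a loss to minimize.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: regress the critic's predictions toward the empirical returns.
    value_loss = F.mse_loss(values, returns)

    return policy_loss + value_coef * value_loss
```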

Key Features and Innovations

  • PPO Algorithm Implementation: A robust and efficient implementation of PPO for continuous control.
  • Customizable Convolutional Neural Networks (CNNs): The agent's "eyes" are CNNs, designed to extract spatial features from the raw input frames (stacked grayscale images). We explored various CNN architectures, including:
    • Conventional CNNs
    • CNNs with Leaky ReLU activation functions
    • CNNs incorporating Batch Normalization layers
    • Architectures utilizing Residual Blocks
    • Networks with Depthwise Separable Convolutions
    • ResNet variants, both with and without ImageNet pre-trained weights
  • Domain-Randomized Tracks: Training on randomly generated tracks ensures the agent learns generalizable driving skills rather than overfitting to specific track layouts.
  • Moving Average Evaluation: This technique provides a stable measure of the agent's performance over time, smoothing out fluctuations and giving a clearer picture of learning progress.
  • Modular Environment Wrapper: A custom wrapper handles preprocessing of observations (such as stacking grayscale frames) and allows for flexible reward adjustments, tailoring the learning signal to the task; a minimal version of this idea is sketched after this list.
  • Built-in Logging and Visualization: Comprehensive logging and plotting utilities help track training progress and analyze performance.
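
To give a concrete flavor of the wrapper mentioned above, here is a minimal sketch. The grayscale weights and the four-frame stack are standard choices, while the grass-penalty heuristic (the pixel patch coordinates and the penalty value) is purely an illustrative assumption, not our actual reward shaping:

```python
import gymnasium as gym
import numpy as np
from collections import deque

class CarRacingWrapper(gym.Wrapper):
    """Illustrative wrapper: grayscale conversion, frame stacking, reward shaping."""

    def __init__(self, env, n_frames: int = 4, grass_penalty: float = 0.05):
        super().__init__(env)
        self.n_frames = n_frames
        self.grass_penalty = grass_penalty  # hypothetical shaping term
        self.frames = deque(maxlen=n_frames)

    def _to_gray(self, rgb):
        # Standard luminance conversion, normalized to [0, 1].
        gray = np.dot(rgb[..., :3], [0.299, 0.587, 0.114])
        return (gray / 255.0).astype(np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        gray = self._to_gray(obs)
        for _ in range(self.n_frames):
            self.frames.append(gray)
        return np.stack(self.frames, axis=0), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.frames.append(self._to_gray(obs))
        # Example reward adjustment: penalize driving on grass, detected crudely
        # by a green-dominant pixel patch under the car (coordinates are guesses).
        patch = obs[66:78, 43:53].astype(np.float32)
        if patch[..., 1].mean() > patch[..., 0].mean() + 30:
            reward -= self.grass_penalty
        return np.stack(self.frames, axis=0), reward, terminated, truncated, info
```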

The Training Ground: High-Performance Computing

Training a reinforcement learning agent to master a complex environment like CarRacing-v2 requires significant computational power. Each training epoch took at least 10 seconds, even for our optimized models, and reaching a well-performing model required more than 1500 epochs, which translates to over 4 hours of continuous computation.

To meet these demands, we leveraged the UMD Zaratan HPC cluster, using a 1g.5gb MIG slice (roughly 1/7th of an Nvidia A100 GPU) for most of our models. This allowed us to iterate efficiently through different architectural designs and hyperparameters.

Here's a snapshot of the training efficiency for various models:

| Model | Cluster GPU | Epochs Trained / Day | Mean Time / Epoch (seconds) |
| --- | --- | --- | --- |
| CNN | Nvidia A100_1g.5gb | 6548 | 12.79 |
| CNN + LeakyReLU | Nvidia A100_1g.5gb | 8110 | 10.37 |
| CNN + BatchNorm | Nvidia A100_1g.5gb | 10372 | 8.28 |
| Residual Blocks | Nvidia A100_1g.5gb | 6014 | 13.5 |
| Depthwise Separable Convolutions | Nvidia A100_1g.5gb | 5607 | 14.2 |
| ResNet (no pretrained weights) | Nvidia A100 | 9910 | 6.77 |
| ResNet (pretrained) | Nvidia A100 | 11707 | 8.06 |

Race Day Results: Performance Analysis

After extensive training, our agents demonstrated impressive capabilities. Performance was evaluated on the maximum moving average score reached during training and on the maximum, minimum, average, and median scores over multiple test runs.

Training Performance (Maximum Moving Average Score):

  • CNN: 624.99 at epoch 5840
  • CNN with Leaky ReLU: 678.61 at epoch 2230
  • Residual Blocks: 569.75 at epoch 2430

The CNN with Leaky ReLU showed the highest training stability and reward, reaching its peak much faster than the conventional CNN.
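
For reference, the moving-average score reported here is just a windowed mean of per-episode rewards; a minimal version (the window size below is an illustrative choice, not necessarily ours) looks like this:

```python
import numpy as np

def moving_average(episode_rewards, window: int = 100):
    """Windowed mean of per-episode rewards (window size is illustrative)."""
    rewards = np.asarray(episode_rewards, dtype=np.float64)
    if len(rewards) < window:
        # Not enough episodes yet: fall back to a running mean.
        return rewards.cumsum() / (np.arange(len(rewards)) + 1)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```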

Testing Performance (Scores over multiple runs):

| Model | Maximum Score | Minimum Score | Average Score | Median Score |
| --- | --- | --- | --- | --- |
| CNN | 910.1 | -17.93 | 676 | 847.77 |
| CNN + LeakyReLU | 894.6 | -42.94 | 620.99 | 803.02 |
| Residual Blocks | 897.8 | -39.53 | 616.57 | 827.66 |

These results highlight the agent's ability to achieve high scores, with the conventional CNN showing the highest single maximum score. The median and average scores provide a more robust measure of consistent performance across different track layouts.

The Road Ahead: Future Enhancements

  • Recurrent Layers (e.g., LSTM): Integrating recurrent layers would allow the agent to incorporate temporal context into its decision-making, potentially leading to smoother and more intelligent driving.
  • Parallel Training: Implementing parallel training techniques could significantly reduce overall training time, allowing for even more extensive experimentation with architectures and hyperparameters.
  • Exploring More CNN Architectures: Further research into novel convolutional neural network designs could unlock even higher levels of performance and efficiency.

Conclusion

Our Autonomous Car Racing project demonstrates the power of Proximal Policy Optimization combined with custom neural network architectures to tackle complex continuous control problems. By leveraging high-performance computing and systematically evaluating various design choices, we've successfully trained an AI agent capable of navigating and mastering the challenging CarRacing-v2 environment. This work not only pushes the boundaries of virtual autonomous driving but also provides valuable insights into building robust and intelligent reinforcement learning systems.

Want to dive deeper into the code? Check out the project repository on GitHub.