Why has Autonomous Driving failed? Perspectives from Peru and insights from NeuroAI

December 11, 2023

Modern Autonomous Driving systems being developed in countries such as the United States of America (USA), the United Kingdom (UK) and China are currently being released on the streets to test their levels of autonomy. Perhaps not to the surprise of many computer vision scientists (and despite the outstanding progress in Computer Vision), these systems have not been operating at the level their developers initially hoped for in production, a.k.a. the Real World.

Since 2022, numerous self-driving car companies have gone bankrupt, and even well-established companies backed by billions of dollars such as Waymo, Tesla and Cruise are now reducing their workforces and continuing to struggle to sustain the autonomous vehicle vision. There have certainly been many breakthroughs in machine vision to date -- so why do these incidents continue to happen? And why do machines behave differently than humans behind the wheel?

We think that the perceptual failures of autonomous driving are occurring for two main reasons: 1) lack of perceptual understanding of out-of-distribution stimuli, and 2) lack of high-quality benchmarks for such out-of-distribution scenarios. In many cases, self-driving cars are being trained with real or synthetic data (see Parallel Domain) on near-perfect sunny or uncrowded conditions with perfectly working traffic lights, well-paved roads, law-abiding drivers and mindful pedestrians. However, when the chips are down and cars must navigate through foggy weather, a crowd in a parade, dogs running (or playing) in the street without a leash, or scooters along the side lanes -- perceptual inference in many of these systems starts to falter.

Our hope is that by testing systems on such adversarial conditions and providing such qualitative benchmarks, we will be able to make progress in the development of perceptual inference for self-driving cars in both developed & developing countries. If an autonomous vehicle can easily navigate & identify every object and obstacle from routing videos in Lima, Hanoi or Bombay, then making perceptual inference in cities like San Francisco, London and Beijing should be trivial. Moreover, we’d suggest that training computer vision systems of Autonomous Driving vehicles should be done outside of the city of focus/deployment. This is related in spirit to what machine learning calls Adversarial Training -- by training systems on adverse and heavily out-of-distribution stimuli, neural network representations have a higher chance of performing well when faced with both the expected & the unexpected, in addition to coming closer to learning brain-aligned representations, as research at Artificio (Berrios & Deza. SVRHM 2022) and MIT (Harrington & Deza. ICLR 2022), and Feather et al. (Nature Neuroscience 2023) suggests.

Given these two observations, in this blog post we propose two solutions that may help keep the Autonomous Driving dream alive: 1) Collect more out-of-distribution datasets from developing countries where these types of out-of-distribution driving patterns are more common (such as Lima, Peru) to benchmark (and train) the autonomous driving system’s computer vision on these datasets; and 2) Shift the training paradigm of such computer vision systems from big data to representational alignment (NeuroAI) -- because Scale is Not all you Need.

Starting Our Journey: A Summary of Our Efforts

Around December of 2022 (1 year ago), we decided to embark on multiple journeys across various cities in Latin America to collect videos that would later be used to test the performance of existing computer vision models, such as YOLO. Given the initial location of the working team, our journey began in Peru.

We chose to gather data from the cities of Lima, Cusco, and Cajamarca due to the relative ease of travel to Cusco and Cajamarca and the team's home base in Lima. Lima, in particular, with its rich vehicular and cultural diversity, presented a distinctive and compelling data collection opportunity. The expedition began in Cusco between December and January. Accordingly, the collected data includes footage of traveling markets and dense crowds in certain parts of the city. Weather conditions ranged from clear sunny days to episodes of rain and hail. The Lima leg of the journey demanded a different logistical approach given the city's remarkable range of vehicles, driving styles, and traffic density. In total, 22 out of Lima's 43 districts were covered, and the data from this city presents a formidable challenge for any computer vision model. Finally, the Cajamarca expedition captured both rural and urban settings, with a variety of vehicles under consistently sunny weather conditions.

As data collection progressed from December 2022 to February 2023, distinct behavioral patterns and variations among pedestrians and drivers, as well as a wide diversity of vehicles across different cities, became apparent. The collected data encompasses both rural and urban environments. According to TomTom Traffic, Lima ranks as the city with the worst traffic in South America. For this reason, data collection was carried out within a consistent time window (2 pm - 6 pm) to capture peak congestion. 6 pm marks the height of traffic in Lima, as it does in many cities worldwide (one hour after the end of the standard working day).

A graph showing a projection of 1441 sampled frames from the total of 291 videos collected. Each video recording lasted on average between 5 and 7 minutes.

Hardware Used for Data Collection

To begin data collection, we considered mounting a camera below the rearview mirror to capture the road ahead. This meant the camera needed to be lightweight and compact, pointing us toward dashcam-style devices. It is important to note that our initial objective was to assess how well existing models performed in routine traffic conditions, with plans to gradually incorporate additional cameras and sensors based on those findings. Copilotless's long-term goal is to equip a vehicle with multiple cameras and sensors — such as AEVA’s LIDAR — since different models operate on different types of input data. By gathering this data, we aim to test and benchmark a range of computer vision systems against rare objects and diverse street-view distributions across LATAM. We started with the Comma Three, but ultimately switched to dashcams instead as described below.

Built for permanent installation in a vehicle, the Comma Three was designed to run Openpilot. It incorporates 3 HDR cameras: 2 facing the road and one night vision camera aimed at the interior. Beyond cameras, the Comma Three includes connectivity and sensing capabilities such as LTE, Wi-Fi, IMU, high-precision GPS, and microphones. Our original plan was to use this device to stress-test Openpilot in congested, chaotic, and unpredictable environments in order to identify weaknesses and set benchmarks, but upon testing it we ran into the following problems: since the device works directly with Openpilot — which initiates its recognition only when it receives CAN signals from the vehicle — we were unable to run Openpilot because the cars available to us were not on the supported compatibility list, as stated on their website.

It was also possible to use the Comma Three in dashcam mode; however, this required specific configurations, including an external toolkit to record and store data locally without transmitting it. This introduced additional hardware dependencies (such as a PC) and led to testing cycles that took considerably longer than anticipated. Ultimately, we decided to seek out a dashcam offering comparable features to those used by the Comma Three. The cameras we selected are listed below, together with their technical specifications.

The Challenge: Out-of-Distribution Objects & Annotation

Once the data had been gathered, the logical next step was video annotation to enable benchmarking and evaluation of available open-source models. Labels are also required for supervised learning on new datasets, and we are actively training our own self-supervised models on this data to generate robust embeddings for out-of-distribution autonomous driving redundancy checks. During the annotation process, certain difficulties emerged — primarily concerning object identity — that were not easy to resolve without some form of collective agreement, as they could be interpreted differently depending on the observer. These challenges are described below:

Categories of Road Obstacles

A greater variety of obstacles tends to appear on rural roads due to less frequent inspections by authorities. From our observations, we have distinguished two categories: obstacles that require active avoidance (Avoiding Obstacles) and those that are a permanent or semi-permanent feature of the road surface (Non-Avoiding Obstacles), such as improvised speed bumps, gutters, etc. A more detailed description follows:

Powered and Unpowered Vehicles

Across Lima, Cusco and Cajamarca, we encountered a broad range of vehicle types that demanded a more refined classification scheme. Beyond conventional vehicles such as buses, cars and motorcycles, we also came across non-standard vehicles such as tuk-tuks, ice cream carts and motorcycles with front baskets. Some were engine-powered while others were not. This distinction matters for downstream models that need to be aware of what is present on the road (particularly with respect to speed awareness), because motorized and non-motorized vehicles differ fundamentally in the actions and states they can exhibit in the environment.
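
One way to make the motorized versus non-motorized distinction explicit for downstream models is to attach it to each label in the annotation schema. The sketch below is only an illustration: the class names, the `VehicleLabel` type and the `motorized` flag are hypothetical, and this is not the schema used for the Copilotless-PER dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleLabel:
    """A hypothetical annotation label for a vehicle class."""
    name: str
    motorized: bool  # True if the vehicle is engine-powered


# Illustrative entries only (assumptions): whether a given cart or tuk-tuk is
# motorized would have to be decided per class, or per instance, by annotators.
TAXONOMY = {
    label.name: label
    for label in [
        VehicleLabel("car", motorized=True),
        VehicleLabel("bus", motorized=True),
        VehicleLabel("motorcycle", motorized=True),
        VehicleLabel("tuk_tuk", motorized=True),
        VehicleLabel("ice_cream_cart", motorized=False),
    ]
}


def is_motorized(class_name: str) -> bool:
    """Look up whether an annotated class is engine-powered."""
    return TAXONOMY[class_name].motorized
```

Keeping the flag on the label, rather than inferring it later from the class name, would let a speed-aware planner query it directly.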

Qualitative Benchmarking of Computer Vision Systems on the Copilotless-PER Dataset

Uncertain about how best to proceed with exhaustive annotation of all collected data, we shifted our focus to conducting a qualitative evaluation of several widely-used computer vision models that may serve as a reasonable approximation (and hopefully a lower bound) of current self-driving car perception capabilities.

We deployed YOLO-V8, EVA-01 and Detectron2 on the data collected from the cities of Lima, Cajamarca and Cusco and arrived at the following qualitative conclusions:

In the majority of cases, these computer vision systems reliably detect pedestrians with minimal risk of false negatives. We are currently working toward fully annotating our dataset to enable a more precise quantification of both false negative and false positive rates in pedestrian detection and classification.
There seems, however, to be a high risk of false positives and confusion between cars, buses and trains. Perhaps this is not a “problem” because at the end of the day they are all moving obstacles, but in some cases we found bizarre examples such as a wall being misclassified as a train. Many of these mistakes can be seen in the video of the last section.
There is notable disagreement across models regarding detections and classifications, which suggests that an ensemble approach could benefit current AV perception pipelines.
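
The disagreement noted in the last point can be measured before any ground-truth labels exist, by matching the boxes that two detectors produce on the same frame. The sketch below is a minimal illustration, assuming each detection is a hypothetical `(label, x1, y1, x2, y2)` tuple and an IoU threshold of 0.5. It is not the evaluation code used for this post, and the boxes are made up.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_disagreements(dets_a, dets_b, iou_threshold=0.5):
    """Return (label_a, label_b) pairs for overlapping boxes labeled differently.

    Every box from model A is compared with its best-overlapping box from
    model B; boxes with no sufficiently overlapping partner are skipped.
    """
    conflicts = []
    for label_a, *box_a in dets_a:
        best = max(dets_b, key=lambda d: iou(box_a, d[1:]), default=None)
        if best is None or iou(box_a, best[1:]) < iou_threshold:
            continue  # a missed detection, not a label conflict
        if best[0] != label_a:
            conflicts.append((label_a, best[0]))
    return conflicts


# Made-up detections from two hypothetical models on the same frame.
model_a = [("bus", 10, 10, 110, 60), ("person", 200, 40, 220, 100)]
model_b = [("train", 12, 8, 108, 62), ("person", 201, 41, 221, 99)]
conflicts = label_disagreements(model_a, model_b)
print(conflicts)  # [('bus', 'train')]
```

Counting such conflicts per frame gives a label-free signal of where an ensemble, or a human annotator, would be most useful.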

NeuroAI: Brain-Alignment & Similarity Search as a Safety Net for Autonomous Driving

As a tentative solution to the problem of object classification, we propose a similarity search mechanism that finds similar-looking objects without labeling the input stimuli, by extracting geometrical properties from the images. This is also known as using a visual embedding for image retrieval and search, and is popular in many e-commerce search engine platforms where users want to buy “a similar object” (Google Lens being the most well-known visual search feature at web scale) or find “similar paintings” (see our very own Seezlo Engine, which was not trained on Art and yet works exceptionally well at retrieving similar art pieces over a local dataset).

Back to Self-Driving Cars: how and why should modern perceptual systems in autonomous vehicles then use this type of technology? We propose a retrieval-based approach where vehicles may query any random object from their visual pipeline and find the most similar object that has previously been classified to potentially resolve this conflict.
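
The retrieval step described above can be sketched as a nearest-neighbour lookup over embedding vectors. This is a minimal illustration, assuming embeddings have already been computed by some visual model and stored alongside their labels. The function names, the toy 3-dimensional vectors and the similarity threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_label(query, gallery, min_similarity=0.8):
    """Return (label, similarity) of the closest gallery item.

    `gallery` is a list of (label, embedding) pairs for objects that were
    previously classified. The label is None when even the best match is
    below `min_similarity`, i.e. nothing is similar enough to trust.
    """
    label, embedding = max(
        gallery, key=lambda item: cosine_similarity(query, item[1])
    )
    similarity = cosine_similarity(query, embedding)
    if similarity < min_similarity:
        return None, similarity
    return label, similarity


# Toy gallery with made-up 3-dimensional embeddings (assumption).
gallery = [
    ("tuk_tuk", [0.9, 0.1, 0.0]),
    ("bus", [0.0, 1.0, 0.0]),
    ("dog", [0.0, 0.0, 1.0]),
]
label, score = retrieve_label([0.8, 0.2, 0.1], gallery)
```

Here the query lands closest to the `tuk_tuk` entry. Returning `None` below the threshold matters for a safety net: an object unlike anything seen before should be flagged as unknown rather than forced into the nearest class.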

Below we show and benchmark how several visual embeddings, ranging from OpenAI ResNet-CLIP & ViT-CLIP (trained on 400M images) to Copilotless Cortex (trained on 1M images), perform some of these tasks for several objects that look “tricky” to humans & machines. Notice mainly that models based on NeuroAI technology (which seek to optimize for brain-alignment rather than performance on object recognition, as done in deep learning) are more robust in this task and are also not data hungry.

To reinforce this further, a growing wave of NeuroAI models has been reshaping the landscape of perceptual alignment with humans, largely unnoticed by the broader Deep Learning community, precisely because these models demand neither millions of training images, nor a GPU farm, nor substantial capital investment. We anticipate that within 1-2 years (that is, by 2025), this miniaturization trend, driven primarily by neuroscience labs (and start-ups like Copilotless), will allow these approaches to reach parity with other major AI players.

Want to read more about NeuroAI and representational alignment -- a movement that is taking this year’s NeurIPS [2023] by storm? Check out this epic paper by Sucholutsky, Muttenthaler et al. 2023 and the works published at CCN & SVRHM (a NeurIPS workshop) from 2019 to 2022.

Analogous to many modern leaderboards for LLMs, the Brain-Score organization, based at MIT’s Quest for Intelligence and MIT's Center for Brains, Minds and Machines, runs one of the world’s largest benchmarking platforms for current computer vision models, where these models are correlated to actual primate neural data. Our foundational model, Cortex, was ranked world #1 last year for predicting activations in visual area V4, and is currently #8 globally out of 242 models. It cost us no more than $15,000 to train this model, and it is safe to say that the same applies to many other NeuroAI models in this table that come from academic labs and research groups where the scientific team is a mixture of Computational Neuroscientists and AI experts.
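
To give a rough sense of what correlating a model to neural data means, the sketch below computes a Pearson correlation between model-predicted and recorded responses for a single recording site. It is a simplified illustration with made-up numbers. The actual Brain-Score benchmarks involve additional steps, such as fitting a mapping from model features to neural responses and evaluating on held-out images, which are omitted here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up responses of one recording site to five images (assumption).
recorded = [0.2, 0.9, 0.4, 0.7, 0.1]
predicted = [0.3, 0.8, 0.5, 0.6, 0.2]
score = pearson_r(predicted, recorded)
```

A score near 1 means the model's predictions rise and fall with the recorded responses across images. In this toy case the two sequences track each other closely, so the score is high.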

But what *is* the advantage of Brain-Aligned models, and why should the autonomous driving community take notice? Recall Adversarial Images? Those subtle, optimized distortions designed to fool neural networks in perplexing ways? Brain-Aligned models are not immune to such stimuli — but when they are attacked, the resulting adversarial perturbations also fool humans, and no longer resemble random noise. This makes the adversarial image interpretable and provides a strong indicator of perceptual alignment with human vision. Some examples of how NeuroAI models (and our own Cortex) respond to such stimuli:

One can now imagine a future where self-driving cars have NeuroAI models as part of their perceptual inference engine because it allows them to make *reasonable* and *interpretable* errors. As legislation around self-driving cars progresses, and society's own complexities evolve, self-driving cars should be able to perceive and reason the same way humans do (because yes, even Multi-Modal LLMs make these mistakes, so neither multi-modality nor big data will be the final answer).
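
For readers who have not seen how the adversarial images discussed above are built, the sketch below applies a fast-gradient-sign style perturbation to a tiny logistic-regression classifier. It is a toy illustration under stated assumptions: the weights, the input and the step size `epsilon` are made up, and real attacks on vision models operate on image pixels with gradients obtained by backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(value):
    """Return -1, 0 or 1 depending on the sign of `value`."""
    return (value > 0) - (value < 0)


def fgsm_perturb(weights, bias, x, y, epsilon):
    """Shift each input feature by +/- epsilon in the direction that
    increases the cross-entropy loss for the true label y (0 or 1)."""
    p = predict(weights, bias, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Made-up model and input, for illustration only.
weights, bias = [2.0, -1.0, 0.5], 0.0
x = [0.3, 0.1, 0.2]  # classified as positive (probability above 0.5)
x_adv = fgsm_perturb(weights, bias, x, y=1, epsilon=0.2)
```

With these numbers, a shift of at most 0.2 per feature is enough to push the prediction below 0.5. For a standard deep network such a perturbation typically looks like noise to a human, which is the contrast with brain-aligned models drawn above.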

Final thoughts: Out-of-distribution data is a feature not a bug, and could potentially solve Autonomous Driving if coupled with NeuroAI models.

To the best of our knowledge, self-driving car companies do not currently employ these techniques or this scientific methodology, and the root cause of accidents may well be over-fitting to a specific city by training (heavily) and testing on the same distribution, without sufficiently challenging the network during training, causing it to fail when presented with unexpected imagery at test time. Many self-driving car companies also appear to believe that accumulating ever-larger pools of independent and identically distributed (i.i.d.) data will resolve the issue. As argued above, that premise may prove incorrect, and it risks steering the Self-Driving car market toward a serious crash unless the industry fundamentally reassesses its approach.

A comparable line of thinking was once pursued by Perceptive Automata, a Series B start-up founded in 2015 that has since shut down — because self-driving car companies did not appear to recognize this as a genuine problem until now, when C-suite executives are stepping down from their positions in response to the mounting difficulties autonomous vehicles face in the marketplace, and the growing body of evidence that these systems still fail to comprehend their environment when pushed to extremes (which is precisely when accidents happen).

The lesson here, with an ironic note of warning from Perceptive Automata to the broader self-driving car industry, is that established companies ought to consider building partnerships with startups and academic labs rather than dismissing them, working together toward the future to tackle this difficult problem. If there is any way we can contribute, we welcome collaboration with academic and industrial partners in this effort.

We are looking forward to releasing this dataset, along with benchmarking results and annotations, for scientific and commercial purposes soon. Stay tuned & write to us if you would like to learn more. And don’t forget to follow us on X for more updates.

Sincerely,

The Copilotless Team
Lima, Peru. December 2023.

Core Contributors:

Lead Research Scientists: Dunant Cusipuma & David Ortega

Supporting Research Scientist: Victor Flores

Research Director: Arturo Deza

If you have found this post relevant for your research please cite as:

@misc{Ortega_Cusipuma_Flores_Deza_2023,
title={Why has Autonomous Driving failed? Perspectives from Peru and insights from NeuroAI},
url={https://www.artificio.org/blog/why-has-autonomous-driving-failed-perspectives-from-peru-and-insights-from-neuroai},
journal={Online Blog Series},
publisher={Artificio Blog},
author={Ortega, David and Cusipuma, Dunant and Flores, Victor and Deza, Arturo},
year={2023}}