Links

https://arxiv.org/pdf/2402.01105v2

Content

image-20240830160226497

2 Large Language Models in AD

2.1 Overview

More and more researchers have started to apply the following LLM capabilities to address challenges in AD:

  • Reasoning
  • Understanding
  • In-context learning

2.2 Applications in AD

2.2.1 Reasoning and Planning
  • GPT-Driver [6]: not only recommends vehicle actions but also elucidates the rationale behind these suggestions, significantly enhancing the transparency and explainability of autonomous driving decisions.
  • Driving with LLMs [7]: fuses an object-level vector modality with the LLM, enhancing the explainability of autonomous driving decisions.
  • “Receive, Reason, and React” [8]: instructs LLM agents to assess lane occupancy and evaluate the safety of potential actions, thereby fostering a deeper comprehension of dynamic driving scenarios.

These methods not only leverage LLMs’ inherent ability to understand complex scenarios but also employ their reasoning capabilities to simulate human-like decision-making processes. Through the integration of detailed environmental descriptions and strategic prompts, LLMs contribute significantly to the planning and reasoning aspects of AD, offering insights and decisions that mirror human judgment and expertise.

image-20240830160316094

2.2.2 Prediction
  • Integrating pre-trained language encoders into trajectory prediction models for autonomous driving [9]: an early exploration of language models' power to make trajectory predictions. The scene representation is converted into text prompts, a BERT model generates the text encoding, and this encoding is fused with the image encoding to decode trajectory predictions. Their evaluation shows significant improvement over baselines that use only image encoding or only text encoding. A minimal fusion sketch follows below.
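
To make the fusion idea concrete, here is a minimal sketch in the spirit of [9]: a pre-trained BERT model encodes the textual scene description, the text embedding is concatenated with an image feature vector, and a small decoder regresses future waypoints. The module layout, dimensions, and MLP decoder are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextImageTrajectoryPredictor(nn.Module):
    """Illustrative fusion of a BERT text encoding with an image encoding.

    The image backbone is abstracted as a pre-computed feature vector; the
    dimensions and the MLP decoder are assumptions for illustration only.
    """

    def __init__(self, image_feat_dim=512, horizon=12):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = self.text_encoder.config.hidden_size + image_feat_dim
        # Decode the fused feature into (x, y) waypoints over the horizon.
        self.decoder = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, horizon * 2)
        )
        self.horizon = horizon

    def forward(self, scene_text, image_feats):
        tokens = self.tokenizer(scene_text, return_tensors="pt",
                                padding=True, truncation=True)
        text_feats = self.text_encoder(**tokens).pooler_output   # (B, 768)
        fused = torch.cat([text_feats, image_feats], dim=-1)     # (B, 768 + image_feat_dim)
        return self.decoder(fused).view(-1, self.horizon, 2)     # (B, horizon, 2)

# Example: one scene description plus a dummy pre-computed image feature vector.
model = TextImageTrajectoryPredictor()
waypoints = model(["ego vehicle approaching a signalized intersection, lead car braking"],
                  torch.randn(1, 512))
print(waypoints.shape)  # torch.Size([1, 12, 2])
```
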
2.2.3 User Interface and Personalization
2.2.4 Simulation and Testing
  • ADEPT: The ADEPT system [11] uses GPT to extract key information from NHTSA accident reports with a QA approach and can generate diverse scene code used for simulation and testing.
  • TARGET: The TARGET [12] system uses GPT to translate traffic rules from natural language into a domain-specific language, which is then used to generate testing scenarios.
  • LCTGen: LCTGen [13] uses an LLM as a powerful interpreter that translates a user's text query into structured specifications of map lanes and vehicle locations for traffic simulation scenarios.

2.3 Methods and Techniques

2.3.1 Prompt Engineering

Prompt engineering uses carefully designed input prompts and questions to guide the large language model toward the desired answers.

  • Driving with LLMs [7] includes driving rules covering aspects like traffic light transitions and the left- or right-hand driving side.
  • [15] proposes a common-sense module that stores rules and instructions for human driving, for example avoiding collisions and maintaining safe distances.
  • LanguageMPC [16] adopts a top-down decision-making system: given different situations, the vehicle has different possible actions. The LLM agent is also instructed to identify the important agents in the scenario and to output attention, weight, and bias matrices used to select among pre-defined actions.

Some papers also introduce memory modules that store past driving scenarios. At inference time, relevant examples are retrieved and added to the prompt as context, so the LLM agent can better leverage its few-shot learning capabilities and reflect on the most relevant scenarios.

  • DiLu [17] proposes a memory module that stores text descriptions of driving scenarios in a vector database; the system retrieves the top-k scenarios for few-shot learning (a minimal retrieval sketch follows this list).
  • [15] uses a two-stage retrieval process: the first stage uses k-nearest-neighbor search to retrieve relevant past examples from the database, and the second stage asks the LLM to rank these examples.
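
A minimal sketch of such a retrieval-based memory, assuming an arbitrary text-embedding function and a plain in-memory store rather than any particular vector database; the prompt format is likewise illustrative.

```python
import numpy as np

class DrivingMemory:
    """Toy memory module: store scenario descriptions, retrieve the top-k most similar.

    `embed` stands in for any text-embedding model; here it is a hypothetical
    callable that maps a string to a 1-D numpy vector.
    """

    def __init__(self, embed):
        self.embed = embed
        self.entries, self.vectors = [], []

    def add(self, scenario_text, decision):
        self.entries.append((scenario_text, decision))
        self.vectors.append(self.embed(scenario_text))

    def retrieve(self, query, k=3):
        q = self.embed(query)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.entries[i] for i in top]

def build_prompt(memory, current_scene):
    """Assemble a few-shot prompt from retrieved past scenarios."""
    shots = "\n".join(f"Scenario: {s}\nDecision: {d}"
                      for s, d in memory.retrieve(current_scene))
    return f"{shots}\n\nScenario: {current_scene}\nDecision:"
```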

Agents/Tools

Other papers build more complex systems that manage prompt generation and trigger function calls to other modules or sub-systems to obtain the information required for decision-making.

  • [15] creates libraries and function API calls to interact with the perception, prediction, and mapping systems so that the LLM can fully leverage all available information.
  • LanguageMPC [16] uses LangChain to create the tools and interfaces the LLM needs to obtain relevant vehicles, possible situations, and available actions (a library-agnostic sketch of this tool-calling pattern follows below).
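
The tool-calling pattern behind these systems can be sketched in a library-agnostic way: the LLM is told which tools exist, emits a tool name plus arguments, and the system executes the matching function and feeds the result back. The tool names and the JSON calling convention below are assumptions for illustration, not the actual APIs used in [15] or [16].

```python
import json

# Hypothetical tools exposing perception / prediction / map information.
def get_nearby_vehicles(radius_m):
    return [{"id": 3, "distance_m": 12.4, "relative_speed_mps": -1.8}]

def get_lane_occupancy(lane_id):
    return {"lane_id": lane_id, "occupied": False}

TOOLS = {"get_nearby_vehicles": get_nearby_vehicles,
         "get_lane_occupancy": get_lane_occupancy}

def dispatch(llm_tool_call):
    """Execute a tool call emitted by the LLM as JSON: {"tool": ..., "args": {...}}."""
    call = json.loads(llm_tool_call)
    return TOOLS[call["tool"]](**call["args"])

# Simulated LLM output requesting perception information for the prompt context.
print(dispatch('{"tool": "get_nearby_vehicles", "args": {"radius_m": 30.0}}'))
```
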
2.3.2 Fine-tuning vs. In-context Learning

Most papers focus on in-context learning, and only a few utilize fine-tuning. Researchers report mixed results on which approach is better:

  • [15] compared both approaches and found that few-shot learning is slightly more effective.
  • GPT-Driver [6] reaches a different conclusion: OpenAI fine-tuning performs significantly better than few-shot learning.
  • [7] also compared training from scratch against fine-tuning and found that a pre-trained LLaMA model with LoRA-based fine-tuning performs better than training from scratch (a minimal LoRA setup sketch follows below).
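
A minimal sketch of a LoRA setup with Hugging Face `peft`, in the spirit of [7]'s LLaMA + LoRA configuration; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, and the driving-specific dataset and training loop are omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections; only these weights are trained.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of the full model

# From here, train as usual on (driving prompt, expected answer) pairs,
# e.g. with transformers.Trainer or a custom loop.
```
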
2.3.3 Reinforcement Learning and Human Feedback
  • DiLu [17] proposes a reflection module, which stores good and bad driving examples together with human corrections to further enhance its reasoning capabilities. In this way, the LLM can learn to reason about which actions are safe or unsafe and continuously reflect on a large amount of past driving experience.
  • SurrealDriver [18] interviewed 24 drivers and used their descriptions of driving behavior as chain-of-thought prompts to develop a ‘coach agent’ module, which instructs the LLM to adopt a human-like driving style.

2.4 Limitations and Future Directions

2.4.1 Hallucination and Harmfulness
  • According to the evaluation results in [6], the LLM-based autonomous driving model has a 0.44% collision rate, higher than other methods.
  • [7] proposes a method to reduce hallucination: the model is asked questions for which there is not enough information to make a decision, and the LLM is instructed to answer “I don’t know”.
  • More human-in-the-loop training and alignment (such as RLHF [20] and DPO [21]) can reduce hallucinations and harmful driving decisions; a sketch of the DPO objective follows below.
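
For reference, the DPO [21] objective compares the policy's and a frozen reference model's log-probabilities on preferred versus rejected responses. A minimal sketch of the loss, assuming per-sequence log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding the summed log-prob of
    the chosen / rejected response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between preferred and rejected answers.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy check with random log-probabilities.
batch = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*batch))
```
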
2.4.2 Latency and Efficiency
2.4.3 Dependency on Perception System
2.4.4 Sim to Real Gap

2.5 Summary

image-20240830160340419

3 Vision Foundation Model

  • DINO [28] uses a vision-transformer architecture and is trained in a self-supervised manner, predicting global image features from local image patches.
  • DINOV2 [29] scales the training with one billion parameters and a diversely curated dataset of 1.2 billion images and achieves state-of-the-art results in multiple tasks.
  • The Segment Anything Model (SAM) [30] is a foundation model for image segmentation. The model is trained with different types of prompts (points, boxes, or text) to generate segmentation masks. Trained on a dataset with over a billion segmentation masks, the model shows zero-shot transfer capability, segmenting new objects given an appropriate prompt.
  • Diffusion model [31] is a generative foundation model widely used for image generation.
  • The Stable Diffusion [33] model uses a VAE [35] to encode images into a latent representation and a UNet [36] for denoising in the latent space, with the VAE decoder mapping latent variables back to pixel-level images. It also has an optional text encoder and applies a cross-attention mechanism to generate images conditioned on prompts (text descriptions or other images).
  • The DALL-E [37] model was trained with billions of image–text pairs to generate high-fidelity images and creative art following human instructions. DALL-E 2 [38], an extension of DALL-E [37], integrates a CLIP encoder with a diffusion decoder to handle both image generation and editing tasks.
  • Building on this, DALL-E 3 focuses on enhancing prompt adherence and caption quality: it first trains a robust image captioner capable of generating detailed and accurate image descriptions, and then uses this captioner to produce refined, detailed captions for training.

There is growing interest in the application of vision foundation models to autonomous driving, mainly for 3D perception and video generation tasks.

image-20240830160357105

3.1 Perception

  • SAM3D [39] applies SAM (the Segment Anything Model) to 3D object detection in autonomous driving. Lidar point clouds are projected onto BEV (bird's-eye-view) images, and 32x32 mesh grids generate point prompts to detect masks for foreground objects. It leverages the SAM model's zero-shot transfer capability to generate segmentation masks and 2D boxes, then uses the vertical attributes of the lidar points inside the 2D boxes to generate 3D boxes. However, evaluation on the Waymo Open Dataset shows that the average-precision metrics are still far from existing state-of-the-art 3D object detection models. The authors observe that the SAM foundation model cannot handle sparse and noisy points very well and often produces false negatives for distant objects.
  • SAM is also applied to domain adaptation for 3D segmentation tasks, leveraging the SAM feature space, which contains richer semantic information and stronger generalization capability. [40] proposes SAM-guided feature alignment, learning a unified representation of 3D point cloud features across different domains. It uses the SAM feature extractor to generate the camera image's feature embedding and projects the 3D point clouds onto the camera images to obtain SAM features.
  • SAM and Grounding DINO [41] are combined into a unified segmentation and tracking framework that leverages temporal consistency between video frames [42]. Grounding DINO is an open-set object detector that takes text descriptions of objects as input and outputs the corresponding bounding boxes. Given text prompts of object classes related to autonomous driving, it detects objects in video frames and generates bounding boxes for vehicles and pedestrians. The SAM model then takes these boxes as prompts and generates segmentation masks for the detected objects (a minimal sketch of this handoff follows the list).
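
A minimal sketch of the detector-to-SAM handoff described in [42], using the public `segment_anything` API; the checkpoint path is an assumption, and the Grounding DINO call is mocked as a pre-computed bounding box.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load SAM (the checkpoint path is an assumption; download the weights separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Suppose an open-set detector (e.g. Grounding DINO prompted with "vehicle")
# already produced this xyxy box for the current camera frame.
image = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder RGB frame
vehicle_box = np.array([400, 300, 700, 520])

predictor.set_image(image)
masks, scores, _ = predictor.predict(box=vehicle_box, multimask_output=False)
print(masks.shape, scores)  # (1, 720, 1280) boolean mask and its confidence score
```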

3.2 Video Generation and World Model

  • GAIA-1 [43] is developed by Wayve to generate realistic driving videos. The world model uses camera images, text descriptions, and vehicle control signals as input tokens and predicts the next frame. The paper uses the pre-trained DINO [28] model's embeddings and a cosine-similarity loss to distill more semantic knowledge into the image token embeddings. A video diffusion model [44] decodes high-fidelity driving scenes from the predicted image tokens. The diffusion model is trained on two separate tasks, image generation and video generation: the image generation task helps the decoder produce high-quality images, while the video generation task uses temporal attention to generate temporally consistent video frames. The generated videos follow high-level real-world constraints and show realistic scene dynamics, such as object locations, interactions, traffic rules, and road structures. The videos also show diversity and creativity, with realistic possible outcomes conditioned on different text descriptions and the ego vehicle's actions.
  • DriveDreamer [45] also uses a world model and a diffusion model to generate videos for autonomous driving. In addition to images, text descriptions, and vehicle actions, the model uses more structural traffic information as input, such as the HDMap and object 3D boxes, so that it can better understand the higher-level structural constraints of traffic scenes. Training has two stages: the first stage is video generation with the diffusion model conditioned on structured traffic information, built on a pre-trained Stable Diffusion model [33] with its parameters frozen; in the second stage, the model is trained on both future video prediction and action prediction tasks to better learn future prediction and the interactions between objects.
  • [46] builds a point cloud-based world model that achieves SOTA performance on point cloud forecasting tasks. It proposes a VQVAE-like [47] tokenizer to represent 3D point clouds as latent BEV tokens and uses discrete diffusion to forecast future point clouds given the past BEV tokens and the ego vehicle's action tokens. A schematic tokenize-then-predict sketch follows this list.
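
The tokenize-then-predict recipe shared by GAIA-1 [43] and [46] can be sketched schematically: observations and actions become discrete tokens, and a transformer predicts the tokens of the next frame. The vocabulary sizes, sequence layout, and tiny transformer below are illustrative assumptions, not either paper's architecture.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Schematic world model: predict next-frame observation tokens from past tokens."""

    def __init__(self, vocab_size=1024, n_action_tokens=16, d_model=256):
        super().__init__()
        self.obs_embed = nn.Embedding(vocab_size, d_model)
        self.act_embed = nn.Embedding(n_action_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, obs_tokens, act_tokens):
        # Concatenate past observation tokens and action tokens into one sequence.
        x = torch.cat([self.obs_embed(obs_tokens), self.act_embed(act_tokens)], dim=1)
        h = self.backbone(x)
        # Read out a distribution over the next frame's observation tokens
        # from the last obs-length positions (schematic, not an exact recipe).
        return self.head(h[:, -obs_tokens.shape[1]:, :])

model = TinyWorldModel()
past_obs = torch.randint(0, 1024, (1, 64))   # e.g. an 8x8 grid of VQ codes for one frame
actions = torch.randint(0, 16, (1, 4))       # discretized steering/speed tokens
logits = model(past_obs, actions)
print(logits.shape)                          # torch.Size([1, 64, 1024])
```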

3.3 Limitations and Future Directions

Current state-of-the-art foundation models like SAM do not yet have good enough zero-shot transfer ability for 3D autonomous driving perception tasks such as object detection and segmentation. Autonomous driving perception relies on multiple cameras, lidars, and sensor fusion to obtain the most accurate object detection results, which is quite different from image datasets randomly collected from the web. The scale of current public datasets for autonomous driving perception is still not large enough to train a foundation model or to cover all possible long-tail scenarios. Despite these limitations, existing 2D vision foundation models can serve as useful feature extractors for knowledge distillation, helping models better incorporate semantic information. In video generation and forecasting, we have already seen promising progress leveraging existing diffusion models for video generation and point cloud forecasting, which can be further applied to creating high-fidelity scenarios for autonomous driving simulation and testing.

4 Multi-modal Foundation Models

  • One of the most well-known multi-modal foundation models is CLIP [48]. The model is pre-trained with a contrastive method: the inputs are noisy image–text pairs, and the model is trained to predict whether a given image and text form a correct pair by maximizing the cosine similarity between the image encoder's and text encoder's embeddings (a minimal sketch of this objective follows the list).
  • Multi-modal foundation models like LLaVA [49], LISA [50], and CogVLM [51] can be used as general-purpose visual AI agents, demonstrating strong performance in vision tasks such as object segmentation, detection, localization, and spatial reasoning.
  • Video-LLaMA [52] can further perceive video and audio data, which may help autonomous vehicles better understand the world from temporal image and audio sequences.
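
A minimal sketch of CLIP's contrastive objective: image and text embeddings are L2-normalized, a batch-wise similarity matrix is formed, and a symmetric cross-entropy pulls matched pairs together. The encoders themselves are abstracted away as pre-computed feature tensors.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(len(img))          # the i-th image matches the i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch of 8 pre-computed 512-d embeddings from each encoder.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```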

By transferring general knowledge from large-scale pre-training datasets to autonomous driving, multi-modal foundation models can be used for object detection, visual understanding, and spatial reasoning, enabling more powerful applications in autonomous driving.

4.1 Visual Understanding and Reasoning

With the help of multi-modal foundation models, we can generate explanations and expose the model's reasoning process to better investigate issues.

  • To further improve the perception system, HiLM-D [54] utilizes multi-modal foundation models for ROLISP (Risk Object Localization and Intention and Suggestion Prediction). It uses natural language to identify risky objects from camera images and provide suggestions on the ego vehicle's actions. To overcome the drawback of missing small objects, it proposes a pipeline with both high-resolution and low-resolution branches.
  • Talk2BEV [56] proposes an innovative bird's-eye view (BEV) representation of the scene that fuses visual and semantic information. The pipeline first generates the BEV map from image and lidar data, then uses general-purpose visual-language foundation models to add detailed text descriptions of cropped object images. A JSON text representation of the BEV map is then passed to a general-purpose LLM to perform visual QA covering spatial and visual reasoning tasks (a hypothetical example of such a JSON scene record follows the list). The results show a good understanding of detailed instance attributes and the higher-level intent of objects, as well as the ability to provide free-form advice on the ego vehicle's actions.
  • LiDAR-LLM [57] combines point cloud data with the advanced reasoning abilities of large language models to interpret real-world 3D environments, achieving excellent performance on 3D captioning, grounding, and QA tasks. The model employs a three-stage training strategy and a View-Aware Transformer (VAT) to align 3D data with text embeddings, enhancing spatial comprehension. Their examples show the model can understand traffic scenes and provide suggestions for autonomous driving planning tasks.
  • [58] focuses on the explainability of a vehicle's actions using a visual QA approach. They collected driving videos in simulated environments across 5 action categories (such as going straight and turning left) and used manually labeled explanations of actions to train the model. The model was able to explain driving decisions based on road geometry and the clearance of obstacles. They find it promising to apply state-of-the-art multi-modal foundation models to generate structured explanations of vehicle actions.
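
To make the Talk2BEV-style pipeline concrete, the language-enhanced BEV map can be serialized as JSON object records that a general-purpose LLM then reasons over. The field names and values below are hypothetical, not Talk2BEV's actual schema.

```python
import json

# Hypothetical language-enhanced BEV map: one record per detected object.
bev_objects = [
    {"id": 1, "category": "vehicle", "bev_centroid_m": [12.3, -2.1],
     "bev_area_m2": 8.4, "caption": "white SUV slowing down ahead of ego"},
    {"id": 2, "category": "pedestrian", "bev_centroid_m": [5.0, 4.2],
     "bev_area_m2": 0.5, "caption": "pedestrian waiting at the crosswalk"},
]

question = "Is it safe to keep the current speed through the crosswalk?"
prompt = (
    "You are a driving assistant. Scene objects (bird's-eye view):\n"
    f"{json.dumps(bev_objects, indent=2)}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)  # pass this prompt to a general-purpose LLM for visual QA
```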

4.2 Unified Perception and Planning

  • [59] performed an early exploration of GPT-4V(ision) [5] applied to perception and planning tasks and evaluated its capabilities in several scenarios. It shows that GPT-4V can understand weather, traffic signs, and traffic lights and can identify traffic participants in the scene. It can also provide more detailed semantic descriptions of these objects, such as vehicle rear lights, intents like U-turns, and detailed vehicle types (e.g. cement mixer truck, trailer, and SUV). It also shows the foundation model's potential for understanding point cloud data: GPT-4V can identify vehicles from point cloud contours projected onto BEV images. They also evaluated the model's performance on planning tasks: given a traffic scenario, GPT-4V is asked to describe its observations and its decision on the vehicle's action. The results show good interaction with other traffic participants and compliance with traffic rules and common sense, e.g. following the lead car at a safe distance, yielding to cyclists at a crosswalk, and remaining stopped until the light turns green. It can even handle some long-tail scenarios very well, such as a gated parking lot.
  • Instruction tuning is used to better adapt general-purpose multi-modal foundation models to autonomous driving tasks. DriveGPT4 [60] created an instruction-following dataset in which ChatGPT, YOLOv8 [61], and ground-truth vehicle control signals from the BDD-X dataset [62] are used to generate questions and answers about common object detections, spatial relations, traffic light signals, and the ego vehicle's actions (a hypothetical sample format is sketched after this list). Following LLaVA, it used the pre-trained CLIP [48] encoder and LLM weights and fine-tuned the model on this instruction-following dataset designed specifically for autonomous driving. The result is an end-to-end interpretable autonomous driving system with a good understanding of the surrounding environment that makes decisions on vehicle actions with justifications and lower-level control commands.
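
An instruction-following sample for this kind of fine-tuning might pair a video clip reference and a question with an answer and ground-truth control signals; the layout below is a hypothetical illustration, not the released DriveGPT4 dataset format.

```python
# Hypothetical instruction-tuning sample for a driving-focused multi-modal model.
sample = {
    "video": "clips/0421.mp4",                      # multi-frame input (hypothetical path)
    "question": "What is the ego vehicle doing and why?",
    "answer": "The vehicle is slowing down because the traffic light ahead turned red.",
    "control": {"speed_mps": 3.2, "turn_angle_deg": 0.0},  # ground-truth signals, e.g. BDD-X style
}
print(sample["question"], "->", sample["answer"])
```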

4.3 Limitations and Future Directions

  • Multi-modal foundation models show the spatial and visual reasoning capability required by autonomous driving tasks. Compared to traditional object detection and classification models trained on closed-set datasets, their visual reasoning capability and free-form text descriptions provide much richer semantic information, which can solve many long-tail detection problems, such as classifying special vehicles and understanding hand signals from police officers and traffic controllers. Multi-modal foundation models have good generalization capability and can handle some challenging long-tail scenarios very well using common sense, like stopping at a gate with controlled access. By further leveraging their reasoning capability for planning tasks, vision-language models can be used for unified perception-planning and end-to-end autonomous driving.
  • There are still limitations of multi-modal foundation models in autonomous driving. [59] shows that the GPT-4V model still suffers from hallucination and generates unclear responses or false answers in several examples. The model is also unable to utilize multi-view cameras and lidar data for accurate 3D object detection and localization, because its pre-training dataset contains only 2D images from the web. More domain-specific fine-tuning or pre-training is required for multi-modal foundation models to better understand point cloud data and sensor fusion and to achieve performance comparable to state-of-the-art perception systems.

5 Conclusion and Future Directions

We have summarized and categorized recent papers applying foundation models to autonomous driving and built a new taxonomy based on modality and function in autonomous driving. We discuss in detail the methods and techniques for adapting foundation models to autonomous driving, e.g. in-context learning, fine-tuning, reinforcement learning, and visual instruction tuning. We also analyze the limitations of foundation models in autonomous driving, e.g. hallucination, latency and efficiency, and the domain gap in datasets, and thereby propose the following research directions:

  • Domain-specific pre-training or fine-tuning on autonomous driving datasets
  • Reinforcement learning and human-in-the-loop alignment to improve safety and reduce hallucinations
  • Adaptation of 2D foundation models to 3D, e.g. language-guided sensor fusion, fine-tuning, or few-shot learning on 3D datasets
  • Latency and memory optimization, model compression, and knowledge distillation for deploying foundation models on vehicles

We also note that data is one of the biggest obstacles to the future development of foundation models in autonomous driving. The existing open-sourced datasets [63] for autonomous driving, at the scale of about 1,000 hours, are far smaller than the pre-training datasets used for state-of-the-art LLMs. The web datasets used for existing foundation models do not cover all the modalities required by autonomous driving, such as lidar and surround cameras, and the web data domain is also quite different from real driving scenes.

We propose the longer-term future road map in Figure 6.

image-20240830160407229

In the first stage, we can collect a large-scale 2D dataset that covers the data distribution, diversity, and complexity of real-world driving scenes for pre-training or fine-tuning; most vehicles can be equipped with front cameras to collect data in different cities and at various times of day. In the second stage, we can use smaller but higher-quality 3D datasets with lidar to improve the foundation model's 3D perception and reasoning; for example, existing state-of-the-art 3D object detection models can serve as teachers to fine-tune the foundation model. Finally, we can leverage human driving examples or annotations for planning and reasoning alignment, reaching the ultimate safety goal of autonomous driving.

References

[6] GPT-Driver: Learning to Drive with GPT

https://arxiv.org/abs/2310.01415

https://arxiv.org/pdf/2310.01415

https://github.com/PointsCoder/GPT-Driver

image-20240830160414768

image-20240830160438897

❤️ [7] Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

https://arxiv.org/abs/2310.01957

https://arxiv.org/pdf/2310.01957

https://github.com/wayveai/Driving-with-LLMs

image-20240830160455427

image-20240830160513137

image-20240830160522058

[8] Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles

https://arxiv.org/pdf/2310.08034

image-20240830160533183

image-20240830160541529

[9] Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving

image-20240830160549979

image-20240830160556413

[10] Human-Centric Autonomous Systems With LLMs for User Command Reasoning

https://arxiv.org/abs/2311.08206

https://arxiv.org/pdf/2311.08206

image-20240830160614568

[11] Adept: A testing platform for simulated autonomous driving

https://chentaolue.github.io/pub-papers/ase22-adept.pdf

image-20240830160625590

❤️ [12] TARGET: Automated Scenario Generation from Traffic Rules for Testing Autonomous Vehicles

https://arxiv.org/pdf/2305.06018

image-20240830160634450

image-20240830160641208

image-20240830160649840

image-20240830160656333

❤️ [13] Language Conditioned Traffic Generation

https://arxiv.org/abs/2307.07947

https://arxiv.org/pdf/2307.07947

https://ariostgx.github.io/lctgen/

image-20240830160717382

image-20240830160726676

❤️ [15] A Language Agent for Autonomous Driving

https://arxiv.org/abs/2311.10813

https://arxiv.org/pdf/2311.10813

image-20240830160736008

image-20240830160744391

image-20240830160756262

image-20240830160804472

❤️ [16] LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

https://arxiv.org/abs/2310.03026

https://arxiv.org/pdf/2310.03026

image-20240830160820610

image-20240830160828999

image-20240830160840155

image-20240830160849506

[17] DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

https://arxiv.org/abs/2309.16292

https://pjlab-adg.github.io/DiLu/

image-20240830160900832

image-20240830160908280

image-20240830160927615

image-20240830160936331

❤️ [18] SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers’ Driving-thinking Data

https://arxiv.org/abs/2309.13193

Our data consists of 24 driver interview videos, with a duration ranging from 1.5 to 2 hours. We transcribed the audio recordings into written documents and organized the participants’ descriptions of their driving decision processes for each scenario encountered during the experiments. Each participant’s data was processed by two to three trained coders, and a coding consistency check was performed.

D11 (expert): ”No matter right or left, I must look at the direction that I turn to first because that’s the road that I will take. However, I also look in the opposite direction. Basically, I look twice. The first time is to look at both sides; the second time is to confirm. Then I take the turns.”

D06 (expert): ”Look at the left rearview mirror first, mainly about the speed of the back car. If the speed is slow, I can step on gases and go directly. If the speed is fast, I can pause and wait. I can go after they pass by.”

image-20240830160948685

image-20240830161000142

[19] Incorporating Voice Instructions in Model-Based Reinforcement Learning for Self-Driving Cars

https://arxiv.org/pdf/2206.10249

[39] SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model

https://arxiv.org/abs/2306.02245

https://arxiv.org/pdf/2306.02245

https://github.com/DYZhang09/SAM3D

image-20240830161011226

image-20240830161020095

[40] Learning to Adapt SAM for Segmenting Cross-domain Point Clouds

https://arxiv.org/abs/2310.08820

https://arxiv.org/pdf/2310.08820

image-20240830161028458

image-20240830161037395

[41] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

https://arxiv.org/abs/2303.05499

https://github.com/IDEA-Research/GroundingDINO

image-20240830161049236

image-20240830161058347

[42] Segment and Track Anything

https://arxiv.org/abs/2305.06558

https://arxiv.org/pdf/2305.06558

https://github.com/z-x-yang/Segment-and-Track-Anything

This report presents a framework called Segment And Track Anything (SAMTrack) that allows users to precisely and effectively segment and track any object in a video. Additionally, SAM-Track employs multimodal interaction methods that enable users to select multiple objects in videos for tracking, corresponding to their specific requirements. These interaction methods comprise click, stroke, and text, each possessing unique benefits and capable of being employed in combination. As a result, SAM-Track can be used across an array of fields, ranging from drone technology, autonomous driving, medical imaging, augmented reality, to biological analysis.

image-20240830161105651

image-20240830161113516

[43] GAIA-1: A Generative World Model for Autonomous Driving

https://arxiv.org/abs/2309.17080

https://arxiv.org/pdf/2309.17080

https://wayve.ai/thinking/introducing-gaia1/

Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle’s actions as the world evolves.

To address this challenge, we introduce GAIA-1 (‘Generative AI for Autonomy’), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1’s learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.

image-20240830161124751

image-20240830161133511

[45] DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

https://arxiv.org/abs/2309.09777

https://drivedreamer2.github.io/

World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos, and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios.

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

image-20240830161141714

World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which builds upon the framework of DriveDreamer and incorporates a Large Language Model (LLM) to generate user-defined driving videos. Specifically, an LLM interface is initially incorporated to convert a user’s query into agent trajectories. Subsequently, a HDMap, adhering to traffic regulations, is generated based on the trajectories. Ultimately, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos. DriveDreamer-2 is the first world model to generate customized driving videos, it can generate uncommon driving videos (e.g., vehicles abruptly cut in) in a user-friendly manner. Besides, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, video generation quality of DriveDreamer-2 surpasses other state-of-the-art methods, showcasing FID and FVD scores of 11.2 and 55.7, representing relative improvements of 30% and 50%.

image-20240830161154018

image-20240830161207909

image-20240830161222894

image-20240830161238009

image-20240830161256635

[46] Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

https://arxiv.org/abs/2311.01017

https://arxiv.org/pdf/2311.01017

https://waabi.ai/copilot-4d/

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.

image-20240830161307240

image-20240830161316132

image-20240830161328209

image-20240830161340006

[54] HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

https://arxiv.org/abs/2309.05186

https://arxiv.org/pdf/2309.05186

Autonomous driving systems generally employ separate models for different tasks resulting in intricate designs. For the first time, we leverage singular multimodal large language models (MLLMs) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the necessity for task-specific architectures.

image-20240830161349978

image-20240830161402661

image-20240830161417120

❤️ [56] Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving (data, prompt)

https://arxiv.org/abs/2310.02251

https://arxiv.org/pdf/2310.02251

https://github.com/llmbev/talk2bev

Talk2BEV is a large vision-language model (LVLM) interface for bird’s-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving scenarios have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to cater to a variety of autonomous driving tasks encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely on both the ability to interpret free-form natural language queries, and in grounding these queries to the visual context embedded into the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving scenarios, we develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.

image-20240830161431563

image-20240830161448860

image-20240830161456991

image-20240830161510751

[57] LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

https://arxiv.org/abs/2312.14074

https://arxiv.org/pdf/2312.14074

In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D LiDAR-text pairing data, we introduce a three-stage training strategy and generate relevant datasets, progressively aligning the 3D modality with the language embedding space of LLM. Furthermore, we design a View-Aware Transformer (VAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM’s spatial orientation comprehension of visual features. Our experiments show that LiDAR-LLM possesses favorable capabilities to comprehend various instructions regarding 3D scenes and engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the 3D captioning task and achieves a 63.1% classification accuracy and a 14.3% BEV mIoU on the 3D grounding task.

image-20240830161519088

image-20240830161527697

image-20240830161539348

[58] Explaining autonomous driving actions with visual question answering

https://github.com/Shahin-01/VQA-AD

https://arxiv.org/abs/2307.10408

https://arxiv.org/pdf/2307.10408

The end-to-end learning ability of self-driving vehicles has achieved significant milestones over the last decade owing to rapid advances in deep learning and computer vision algorithms. However, as autonomous driving technology is a safety-critical application of artificial intelligence (AI), road accidents and established regulatory principles necessitate the need for the explainability of intelligent action choices for self-driving vehicles. To facilitate interpretability of decision-making in autonomous driving, we present a Visual Question Answering (VQA) framework, which explains driving actions with question-answering-based causal reasoning. To do so, we first collect driving videos in a simulation environment using reinforcement learning (RL) and extract consecutive frames from this log data uniformly for five selected action categories. Further, we manually annotate the extracted frames using question-answer pairs as justifications for the actions chosen in each scenario. Finally, we evaluate the correctness of the VQA-predicted answers for actions on unseen driving scenes. The empirical results suggest that the VQA mechanism can provide support to interpret real-time decisions of autonomous vehicles and help enhance overall driving safety.

image-20240830161553775

image-20240830161600697

image-20240830161607925

image-20240830161620969

image-20240830161634763

[59] On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

https://arxiv.org/abs/2311.05332

https://github.com/PJLab-ADG/GPT4V-AD-Exploration

The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model’s abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.

Here’s a glimpse into some of the fascinating results from our report:

  • Weather Understanding: This image showcases GPT-4V’s capability to understand different weather conditions, a critical factor in autonomous driving.

image-20240830161658289

  • Corner Cases: An illustration of how GPT-4V handles complex and unusual traffic scenarios, which are often challenging for autonomous systems.

image-20240830161737176

  • Serving as a Driving Agent: A demonstration of GPT-4V showcasing its capabilities as a driving agent, making real-world decisions in various driving scenarios.

image-20240830161806083

[60] DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

https://arxiv.org/abs/2310.01412

Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, the fine-tuning of domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V. The code and dataset will be publicly available.

image-20240830161820525

image-20240830161830830

image-20240830161843702

image-20240830161858547

[63] Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

https://arxiv.org/abs/2312.03408

https://arxiv.org/pdf/2312.03408

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

https://arxiv.org/pdf/2401.14159

We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything. Additionally, incorporating Stable-Diffusion allows for controllable image editing, while the integration of OSX facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on SegInW (Segmentation in the wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.

image-20240830161911020

image-20240830161925552

image-20240830161942150

Dino

https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/

https://medium.com/aimonks/clip-vs-dinov2-in-image-similarity-6fa5aa7ed8c6

https://encord.com/blog/grounding-dino-sam-vs-mask-rcnn-comparison/

https://encord.com/blog/dinov2-self-supervised-learning-explained/

https://medium.com/@AIBites/dino-v2-learning-robust-visual-features-without-supervision-model-explained-6f641e051a0

https://medium.com/@sumiteshn/computer-vision-models-comparison-84363ccc9a97

https://roboflow.com/train/dinov2-and-yolov8

https://blog.roboflow.com/grounding-dino-zero-shot-object-detection/

Further Reading

https://yuhuang-63908.medium.com/autonomous-driving-with-large-scale-foundation-models-c3106f12dac0

https://github.com/PJLab-ADG/awesome-knowledge-driven-AD

Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/

[202405] Prospective Role of Foundation Models in Advancing Autonomous Vehicles

[202311] A Survey on Multimodal Large Language Models for Autonomous Driving

[202308] LLM4Drive: A Survey of Large Language Models for Autonomous Driving

[202401] Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving