CLIP Fine-Tuning Methods
Idea: add a linear classifier on top of a frozen CLIP model and train only that classifier (a minimal sketch follows the reference links below).
- https://medium.com/aimonks/a-guide-to-fine-tuning-clip-models-with-custom-data-6c7c0d1416fb
- vit-finetuned-on-fashion-dataset: https://www.kaggle.com/code/sunny7712/vit-finetuned-on-fashion-dataset
- Tutorial To Leverage Open AI's CLIP Model For Fashion Industry: https://www.labellerr.com/blog/fine-tuning-clip-on-custom-dataset/
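A minimal linear-probe sketch of this idea using the Hugging Face transformers CLIP API; the checkpoint name, number of classes, and the training-step wrapper are illustrative placeholders.

```python
import torch
from torch import nn
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP image encoder + a trainable linear classifier on top.
# Checkpoint and number of classes are illustrative placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad = False  # only the head below is trained

num_classes = 10
head = nn.Linear(clip.config.projection_dim, num_classes).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: list of PIL images, labels: LongTensor of class ids."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)         # (B, projection_dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # CLIP-style L2 normalization
    logits = head(feats)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```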
CLIP in Autonomous Driving
CarLLaVA: Vision language models for camera-only closed-loop driving
A Survey for Foundation Models in Autonomous Driving
DriveCLIP: Zero-shot transfer for distracted driving activity understanding using CLIP
CLIP + classifier: linear probing.
Prompt tip: use the full sentence "driver is [action] while driving a car" rather than a bare action word; a zero-shot sketch of this prompting idea follows.
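A small zero-shot sketch of this prompting trick using the Hugging Face transformers CLIP API; the action labels, checkpoint, and image path are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot distracted-driver classification with sentence prompts instead of bare labels.
actions = ["texting on a phone", "drinking", "talking to a passenger", "driving safely"]
prompts = [f"driver is {a} while driving a car" for a in actions]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("driver_frame.jpg")  # hypothetical input frame
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for action, p in zip(actions, probs.tolist()):
    print(f"{action}: {p:.3f}")
```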
TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip
claude:https://claude.ai/chat/2f1ea22a-9d51-4e38-a1b3-f6645776ef84
kimi:https://kimi.moonshot.cn/chat/cqpjmamc2kumginlb9c0
How does TSIC-CLIP make use of the original CLIP model (per the paper)?
TSIC-CLIP builds directly on the pre-trained CLIP model and adds a few targeted modifications to adapt it to traffic-scene image captioning. The main points are:
Pre-trained CLIP model: TSIC-CLIP uses a pre-trained CLIP model as part of its feature extraction. Trained with a contrastive objective, CLIP maps images and text into a shared embedding space, which makes it strong at joint image-text representation learning.
Frozen pre-trained parameters: the CLIP parameters are frozen, i.e., they are not updated during fine-tuning. This preserves the feature-extraction ability CLIP gained from large-scale pre-training while reducing training cost.
Adapter layer: to adapt CLIP to traffic-scene captioning, TSIC-CLIP adds an adapter layer on top of CLIP and fine-tunes it on the BUUISE-Image traffic-scene dataset. The adapter learns traffic-scene-specific features and strengthens the model's ability to produce tailored descriptions (a rough sketch follows after this list).
Feature extraction: TSIC-CLIP uses CLIP to extract global image features and fuses them with local features (extracted by a pre-trained Faster R-CNN) and text features. This multimodal fusion provides richer semantic information and yields more accurate captions.
Text retrieval module: TSIC-CLIP also uses a CLIP-based text retrieval module to obtain text features related to the descriptions of image sub-regions. These text features are fed into the captioning model together with the image features, improving the accuracy and relevance of the generated captions.
Global Weighted Attention Pooling (WGA-Pooling): to reduce parameters and improve efficiency, TSIC-CLIP replaces the standard multi-head self-attention in the Transformer with a WGA-Pooling layer, which better captures global context and fuses it with local features.
Through these choices, TSIC-CLIP inherits CLIP's strengths while tailoring it to the specific needs of traffic-scene image captioning.
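A rough sketch of the adapter idea described above: a small trainable bottleneck module with a residual mix applied to frozen CLIP features. This mirrors the common CLIP-Adapter pattern and is not the exact module from the TSIC-CLIP paper; dimensions and the mixing ratio are illustrative.

```python
import torch
from torch import nn

class CLIPAdapter(nn.Module):
    """Bottleneck adapter applied to frozen CLIP features (illustrative, not the paper's exact design)."""
    def __init__(self, dim: int = 512, reduction: int = 4, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # residual mixing ratio between adapted and original features
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        adapted = self.net(clip_feats)
        return self.alpha * adapted + (1.0 - self.alpha) * clip_feats

# Usage: only the adapter is optimized; CLIP itself stays frozen.
adapter = CLIPAdapter(dim=512)
feats = torch.randn(8, 512)   # stand-in for frozen CLIP image features
fused = adapter(feats)        # task-adapted features for the downstream captioning head
```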
According to the paper, the BUUISE-Image dataset was built specifically for traffic-scene image captioning. Key facts about its construction and characteristics:
Data sources:
Images were collected from Beijing, Tianjin, Vietnam, and other locations.
The data was filtered and cleaned.
Scale:
More than 10,000 images.
Each image is paired with five human-written captions.
Annotation characteristics:
Emphasis on caption accuracy and diversity.
The multiple captions per image cover different aspects such as objects, attributes, relations, and scenes.
Metadata:
Rich metadata is provided, such as capture time, location, and tags.
This additional information can support broader image-understanding tasks.
Compared with existing datasets, BUUISE-Image offers the following advantages:
Specialization: designed specifically for traffic scenes, so it is better suited to traffic-related captioning tasks.
Diversity: covers traffic scenes from multiple cities, increasing variety and representativeness.
Rich annotation: multiple captions per image give more comprehensive descriptions.
Metadata support: the additional metadata enables deeper research and applications.
Moderate scale: over 10,000 images provides enough training data without being unwieldy.
Strong focus: compared with general-purpose datasets such as MS-COCO, BUUISE-Image concentrates on traffic scenes and likely contains more traffic-related details and features.
Practical value: the traffic-scene focus makes the dataset highly relevant to intelligent transportation and autonomous driving.
Overall, BUUISE-Image fills the gap of a dedicated traffic-scene captioning dataset and provides a valuable resource for related research and applications.
Generalized Predictive Model for Autonomous Driving
Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
Constructing Autonomous Driving Datasets
References:
- https://www.perplexity.ai/search/how-to-fine-tune-the-openai-cl-4RPJZ8rpTq2LCTR1Fgs2EQ
- https://github.com/Thinklab-SJTU/Awesome-LLM4AD?tab=readme-ov-file#datasets
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, Mykel Kochenderfer
Publisher: Honda Research Institute, Stanford University
Publish Date: 2023.09.10
Summary:
A multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance.
Introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset.
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li. ECCV 2024
Dataset: DriveLM
Publish Date: 2023.08
Summary:
Construct dataset based on the nuScenes dataset.
Perception questions require the model to recognize objects in the scene.
Prediction questions ask the model to predict the future status of important objects in the scene.
Planning questions prompt the model to give reasonable planning actions and avoid dangerous ones.
WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models
Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha. CVPR 2023
Publisher: Carnegie Mellon University, Symbiosis International University
Dataset: WEDGE
Publish Date: 2023.05.12
Summary:
- A multi-weather autonomous driving dataset built from generative vision-language models.
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
Publisher: Fudan University
Dataset: NuScenes-QA
Summary:
NuScenes-QA provides 459,941 question-answer pairs based on the 34,149 visual scenes, with 376,604 questions from 28,130 scenes used for training, and 83,337 questions from 6,019 scenes used for testing, respectively.
The multi-view images and point clouds are first processed by the feature extraction backbone to obtain BEV features.
DRAMA: Joint Risk Localization and Captioning in Driving
Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, Jiachen Li
Publisher:
Datasets: DRAMA
Summary:
- Introduce a novel dataset DRAMA that provides linguistic descriptions (with the focus on reasons) of driving risks associated with important objects and that can be used to evaluate a range of visual captioning capabilities in driving scenarios.
Language Prompt for Autonomous Driving
Datasets: NuPrompt (not open)
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Datasets: official; data collected using an RL expert in a simulator.
Textual Explanations for Self-Driving Vehicles
Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata ECCV 2018.
Publisher: University of California, Berkeley, Saarland Informatics Campus, University of Amsterdam
Grounding Human-To-Vehicle Advice for Self-Driving Vehicles
Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, John Canny CVPR 2019
Publisher: UC Berkeley, Honda Research Institute USA, Inc.
DRAMA: Joint Risk Localization and Captioning in Driving
https://arxiv.org/abs/2209.10767
Due to the lack of standard benchmarks, we collected a large-scale dataset, DRAMA (Driving Risk Assessment Mechanism with A captioning module), which consists of 17,785 interactive driving scenarios collected in Tokyo, Japan. Our DRAMA dataset accommodates video- and object-level questions on driving risks with associated important objects to achieve the goal of visual captioning as a free-form language description utilizing closed and open-ended responses for multi-level questions, which can be used to evaluate a range of visual captioning capabilities in driving scenarios
https://usa.honda-ri.com/drama
The DRAMA Dataset is captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo.
Contains different annotations: Video-level Q/A, Object-level Q/A, Risk object bounding box, Free-form caption, and separate labels for ego-car intention, scene classifier and suggestions to the driver
The free-form descriptions of reasoning include 992 unique words with total occurrences of 306,708 times


Access request pending.
Language Prompt for Autonomous Driving
https://arxiv.org/abs/2309.04379
To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack
https://github.com/wudongming97/Prompt4Driving
https://github.com/wudongming97/Prompt4Driving/blob/master/docs/data.md
NuPrompt Dataset


Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
https://arxiv.org/abs/2310.01957
We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high quality control commands collected with RL agent and question answer pairs generated by teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data.
https://github.com/wayveai/Driving-with-LLMs/tree/main?tab=readme-ov-file
https://github.com/wayveai/Driving-with-LLMs/tree/paper_code
VQA Dataset

https://github.com/wayveai/Driving-with-LLMs/blob/main/data/
❤️ Textual Explanations for Self-Driving Vehicles
https://arxiv.org/abs/1807.11546
We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at this https URL.
https://github.com/JinkyuKimUCB/explainable-deep-driving?tab=readme-ov-file
https://github.com/JinkyuKimUCB/BDD-X-dataset
We focus on generating textual descriptions and explanations, such as the pair: “Vehicle slows down” (description) and “Because it is approaching an intersection and the light is red” (explanation)
Our dataset is composed of over 77 hours of driving within 6,970 videos. The videos are taken in diverse driving conditions, e.g. day/night, highway/city/countryside, summer/winter etc. On average 40 seconds long, each video contains around 3-4 actions, e.g. speeding up, slowing down, turning right etc., all of which are annotated with a description and an explanation. Our dataset contains over 26K activities in over 8.4M frames. We introduce a training, a validation and a test set, containing 5,597, 717 and 656 videos, respectively.
https://drive.google.com/file/d/1Na-jxLK7VxOc6wcT5lnLQytfFi6rqhTF/view
BDD-X Dataset
https://drive.google.com/file/d/1Na-jxLK7VxOc6wcT5lnLQytfFi6rqhTF/view
The annotation file is a wide table: one row per video (Input.Video), followed by up to 15 annotated segments, each with Answer.Nstart, Answer.Nend, Answer.Naction, and Answer.Njustification columns (unused slots are left empty). Example rows:
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d501fd-a9ffc960.mov
  - 0-11: The car accelerates / because the light has turned green.
  - 12-19: The car is moving at a steady speed / because traffic is clear.
  - 20-22: The car slows slightly / because it's turning into the right lane.
  - 23-36: The car stops / because it turns to the right.
  - 37-40: The car accelerates / because traffic is clear.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/01b0505f-5f564e84.mov
  - 0-16: The car is stopped. / The car is at an intersection with a red light.
  - 16-20: The car is accelerating through the intersection. / The light at the intersection has changed to green.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d501fd-fd237e38.mov
  - 0-20: The car accelerates slowly to a maintained speed / because the light has turned green and traffic is flowing smoothly.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d54ae6-26a3446e.mov
  - 0-6: The car is driving forward / as traffic flows freely.
  - 7-9: The car merges into the lane to its left / to get around a slower car in front of it.
  - 10-22: The car drives at a normal speed / as traffic moves freely.
  - 23-25: The car slows and veers slightly right / due to a car in the neighboring lane entering the car's lane.
  - 26-40: The car drives forward / as there are no nearby cars in its lane.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/01b4e4b9-e21fe0a3.mov
  - 0-12: The car is stopped / because the light is red.
  - 13-18: The car is accelerating / as the light turned green.
  - 19-27: The car is slowing / because traffic in front of it is stopped.
  - 28-38: The car is driving forward slowly / due to slow traffic in front of it.
  - 39-40: The car is merging into the left lane / to make a left turn.
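A possible way to reshape this wide annotation table into one row per annotated segment with pandas; the CSV file name is a placeholder and the column names are taken from the header above.

```python
import pandas as pd

# Reshape the wide BDD-X annotation table (Input.Video plus Answer.Nstart, Answer.Nend,
# Answer.Naction, Answer.Njustification for N = 1..15) into one row per annotated segment.
# Assumes a CSV export with exactly the columns shown above; the file name is a placeholder.
df = pd.read_csv("BDD-X-Annotations.csv")

records = []
for _, row in df.iterrows():
    for n in range(1, 16):
        action = row.get(f"Answer.{n}action")
        if pd.isna(action) or str(action).strip() == "":
            continue  # unused segment slots are left empty
        records.append({
            "video": row["Input.Video"],
            "start": row[f"Answer.{n}start"],
            "end": row[f"Answer.{n}end"],
            "action": action,
            "justification": row[f"Answer.{n}justification"],
        })

segments = pd.DataFrame(records)
print(segments.head())
```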
Grounding Human-to-Vehicle Advice for Self-driving Vehicles
https://arxiv.org/abs/1911.06978
we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice giving, where we train an end-to-end vehicle controller that accepts advice.
We evaluate our model on a novel advisable driving dataset with manually annotated human-to-vehicle advice called Honda Research Institute-Advice Dataset (HAD). We show that taking advice improves the performance of the end-to-end network, while the network cues on a variety of visual features that are provided by advice. The dataset is available at this https URL.
The Honda Research Institute-Advice Dataset (HAD) consists of driving data paired with natural language advice. The driving videos are drawn from the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area. HAD has the following features:
- 30 hours of driving video data with natural language advice
- Videos with CAN-bus signal data
- Advice of two kinds: (1) goal-oriented advice (top-down signal), which influences the vehicle in a navigation task, and (2) stimulus-driven advice (bottom-up signal), which conveys visual stimuli that the user expects the vehicle controller to actively attend to
https://usa.honda-ri.com/dataset-request-form?dataset=had
Access request pending.
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
https://arxiv.org/abs/2309.06597
this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various close and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios.
Not publicly released.
DriveLM: Driving with Graph Visual Question Answering
https://github.com/OpenDriveLab/DriveLM
We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
🏁 DriveLM serves as a main track in the [CVPR 2024 Autonomous Driving Challenge](https://opendrivelab.com/challenge2024/#driving_with_language). Everything you need for the challenge is HERE, including the baseline, test data, submission format, and evaluation pipeline!
https://github.com/OpenDriveLab/DriveLM?tab=readme-ov-file#drivelmdata
https://github.com/OpenDriveLab/DriveLM/blob/main/docs/data_details.md/#features
Features of the DriveLM-Data
🛣 Completeness in functionality (covering Perception, Prediction, and Planning QA pairs).
🔜 Reasoning for future events that have not yet happened.
Many “What If”-style questions: imagine the future by language.
♻ Task-driven decomposition.
One scene-level description into many frame-level trajectories & planning QA pairs.
How about the annotation process?
The annotation process is different for DriveLM-nuScenes and DriveLM-CARLA.
For DriveLM-nuScenes, we divide the annotation process into three steps:
1️⃣ Keyframe selection. Given all frames in one clip, the annotator selects the keyframes that need annotation. The criterion is that those frames should involve changes in ego-vehicle movement status (lane changes, sudden stops, start after a stop, etc.).
2️⃣ Key objects selection. Given keyframes, the annotator needs to pick up key objects in the six surrounding images. The criterion is that those objects should be able to affect the action of the ego vehicle (traffic signals, pedestrians crossing the road, other vehicles that move in the direction of the ego vehicle, etc.).
3️⃣ Question and answer annotation. Given those key objects, we automatically generate questions regarding single or multiple objects about perception, prediction, and planning. More details can be found in our data.
For DriveLM-CARLA, we employ an automated annotation approach:
We collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert. We set up a series of routes in urban, residential, and rural areas and execute the expert on these routes. During this process, we collect the necessary sensor data, generate relevant QAs based on privileged information about objects and the scene, and organize the logical relationships to connect this series of QAs into a graph.
❤️ DriveLM-nuScenes QA Dataset
https://github.com/OpenDriveLab/DriveLM/blob/main/docs/data_prep_nus.md
We kindly ask you to fill out the form before downloading. To get started, download the nuScenes subset image data and the DriveLM-nuScenes QA json files below. For v1.1 data, please visit the DriveLM/challenge folder.
Both the nuScenes subset images and the DriveLM-nuScenes version-1.0 QA files are distributed via Google Drive, Baidu Netdisk, and HuggingFace (links in the repo documentation).
You can also download the full nuScenes dataset HERE to enable video input.
https://huggingface.co/spaces/AGC2024/driving-with-language-official
https://github.com/OpenDriveLab/DriveLM/tree/main/challenge#drivelm
https://github.com/OpenDriveLab/DriveLM/tree/main?tab=readme-ov-file#drivelmdata
{
"f0f120e4d4b0441da90ec53b16ee169d": {
"scene_description": "The ego vehicle proceeds through the intersection, continuing along the current roadway.",
"key_frames": {
"4a0798f849ca477ab18009c3a20b7df2": {
"key_object_infos": {
"<c1,CAM_BACK,1088.3,497.5>": {
"Category": "Vehicle",
"Status": "Moving",
"Visual_description": "Brown SUV.",
"2d_bbox": [
966.6,
403.3,
1224.1,
591.7
]
},
"<c2,CAM_BACK,864.2,468.3>": {
"Category": "Vehicle",
"Status": "Moving",
"Visual_description": "Black sedan.",
"2d_bbox": [
816.7,
431.6,
917.2,
505.0
]
},
"<c3,CAM_FRONT,1043.2,82.2>": {
"Category": "Traffic element",
"Status": null,
"Visual_description": "Green light.",
"2d_bbox": [
676.4,
0.0,
1452.6,
171.5
]
}
},
"QA": {
"perception": [
{
"Q": "What are objects to the front right of the ego car?",
"A": "There are many barriers and one construction vehicle to the front right of the ego car.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What are objects to the front left of the ego car?",
"A": "There is one truck and one barrier to the front left of the ego car.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the status of the truck that is to the front left of the ego car?",
"A": "One truck is moving.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"prediction": [
{
"Q": "Is <c1,CAM_BACK,1088.3,497.5> a traffic sign or a road barrier?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Which object is most likely to be occluded by <c1,CAM_BACK,1088.3,497.5>? Would this object affect the ego vehicle? Based on this object, what action of the ego vehicle is dangerous?",
"A": "None, no, none.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Would <c1,CAM_BACK,1088.3,497.5> be in the moving direction of the ego vehicle?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the future state of <c1,CAM_BACK,1088.3,497.5>?",
"A": "Turn left.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Is <c2,CAM_BACK,864.2,468.3> a traffic sign or a road barrier?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"planning": [
{
"Q": "Is <c1,CAM_BACK,1088.3,497.5> an object that the ego vehicle should consider in the current scene?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What actions could the ego vehicle take based on <c1,CAM_BACK,1088.3,497.5>? Why take this action and what's the probability?",
"A": "The action is to keep going at the same speed. The reason is to follow the traffic rules, which has a high probability.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the probability of colliding with <c1,CAM_BACK,1088.3,497.5> after the ego vehicle goes straight and keeps the same speed?",
"A": "Low.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"behavior": [
{
"Q": "Predict the behavior of the ego vehicle.",
"A": "The ego vehicle is going straight. The ego vehicle is driving fast.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
}
]
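A small sketch for flattening QA annotations in the format excerpted above (scene token, key_frames, QA categories) into one record per question; the json file name is a placeholder.

```python
import json

# Flatten DriveLM-nuScenes-style QA annotations
# (scene -> key_frames -> QA -> category -> list of {Q, A, ...}) into one record per question.
with open("drivelm_nus_train.json") as f:  # placeholder name for a local copy of the QA json
    data = json.load(f)

records = []
for scene_token, scene in data.items():
    for frame_token, frame in scene["key_frames"].items():
        for category, qa_list in frame["QA"].items():  # perception / prediction / planning / behavior
            for qa in qa_list:
                records.append({
                    "scene": scene_token,
                    "frame": frame_token,
                    "category": category,
                    "question": qa["Q"],
                    "answer": qa["A"],
                })

print(len(records), "QA pairs;", records[0])
```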

WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models
https://browse.arxiv.org/abs/2305.07528
To aid adversarial robustness in perception, we introduce WEDGE (WEather images by DALL-E GEneration): a synthetic dataset generated with a vision-language generative model via prompting. WEDGE consists of 3360 images in 16 extreme weather conditions manually annotated with 16513 bounding boxes, supporting research in the tasks of weather classification and 2D object detection. We have analyzed WEDGE from research standpoints, verifying its effectiveness for extreme-weather autonomous perception.
The repository for this work and dataset is available at this https URL.
WEDGE Dataset
https://github.com/Infernolia/WEDGE
https://github.com/Infernolia/WEDGE/tree/main/Dataset/WEDGE
https://drive.google.com/file/d/1gmnoZWw9Oh-A60HE_qhXlwsNj2n1-iZ3/view


NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
https://arxiv.org/abs/2305.14836
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues.
To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates.
Codes and dataset are available at this https URL
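A toy illustration of this template-based generation idea (not the authors' actual pipeline): an "exist"-style template is filled from a scene's object annotations, in the spirit of the samples below.

```python
# Toy illustration of template-based QA generation from detection annotations
# (not the NuScenes-QA authors' code; the object list and template are made up
# in the spirit of the question samples shown below).
scene_objects = [
    {"category": "car", "status": "moving"},
    {"category": "pedestrian", "status": "moving"},
    {"category": "traffic cone", "status": None},
]

def exist_question(status: str, category: str) -> dict:
    present = any(o["category"] == category and o["status"] == status for o in scene_objects)
    return {
        "question": f"Are any {status} {category}s visible?",
        "answer": "yes" if present else "no",
        "template_type": "exist",
        "num_hop": 0,
    }

print(exist_question("moving", "bicycle"))     # answer: 'no'
print(exist_question("moving", "pedestrian"))  # answer: 'yes'
```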
NuScenes-QA Dataset: from nuScenes
https://github.com/qiantianwen/NuScenes-QA

https://drive.google.com/drive/folders/1jIkICT23wZWZYPrWCa0x-ubjpClSzOuU
{
"info": {
"split": "val",
"version": "1.0",
"date": "2023-02-27"
},
"questions": [
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are any moving bicycles visible?",
"answer": "no",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are any moving pedestrians visible?",
"answer": "yes",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are there any traffic cones to the back of me?",
"answer": "no",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are there any moving cars to the back of me?",
"answer": "yes",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "There is a parked thing to the front of me; what is it?",
"answer": "car",
"num_hop": 1,
"template_type": "object"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "There is a pedestrian to the back of me; what is its status?",
"answer": "moving",
"num_hop": 1,
"template_type": "status"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "What number of traffic cones are there?",
"answer": "3",
"num_hop": 0,
"template_type": "count"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "How many cars are to the back right of me?",
"answer": "10",
"num_hop": 1,
"template_type": "count"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are there any things?",
"answer": "yes",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are any standing pedestrians visible?",
"answer": "no",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are there any traffic cones to the front of me?",
"answer": "yes",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "b10f0cd792b64d16a1a5e8349b20504c",
"question": "What number of moving cars are there?",
"answer": "5",
"num_hop": 0,
"template_type": "count"
},

Embodied Understanding of Driving Scenarios
https://github.com/OpenDriveLab/ELM

Pre-training data. We collect driving videos from YouTube, nuScenes, Waymo, and Ego4D. Here we provide a sample of 🔗 YouTube video list we used. For privacy considerations, we are temporarily keeping the complete data labels private.
Fine-tuning data. The full set of question and answer pairs for the benchmark can be obtained through this 🔗 data link. You may need to download the corresponding image data from the official nuScenes and Ego4D channels. For a quick verification of the pipeline, we recommend downloading the subset dataset of DriveLM and organizing the data in line with the format. Please make sure to soft link the nuScenes and ego4d datasets under the data/xx folder. You may need to run tools/video_clip_processor.py to pre-process the data first. Besides, we provide some scripts used during auto-labeling; you may use these as a reference if you want to customize data.
MAPLM-QA
https://llvm-ad.github.io/challenge/
❗ The MAPLM-QA v1.5 dataset has been released at this link.
The challenge track is based on the MAPLM-QA benchmark, a subset of the MAPLM dataset designed for visual question answering in traffic scene understanding. Participants will develop innovative methods to accurately answer multi-choice questions about complex traffic scenes using high-resolution panoramic images and 2.5D bird’s-eye view representations. Top-performing teams will be recognized with certificates and honorariums.
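A minimal scoring sketch for these multi-choice questions, based on the v1.5 sample format shown further below (frame_id / conversations / options / answer); the file name, the predict() stub, and the list-of-frames assumption are placeholders.

```python
import json

# Score multi-choice predictions against MAPLM-QA-style annotations
# (format as in the val sample below); file name and the predict() stub are placeholders.
def predict(frame_id: str, question: str, options: list[str]) -> int:
    return 0  # stub: replace with a real model call returning an option index

with open("maplm_val.json") as f:  # hypothetical local export of the val split
    frames = json.load(f)          # assumed to be a list of frame records; adjust if keyed by frame_id

correct = total = 0
for frame in frames:
    for conv in frame["conversations"]:
        if not conv["options"]:    # skip open-ended questions (DES / CAP / OBJ), which have no options
            continue
        pred = predict(frame["frame_id"], conv["question"], conv["options"])
        correct += int(pred == conv["answer"])
        total += 1

print(f"multi-choice accuracy: {correct / max(total, 1):.3f}")
```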
MAPLM-QA Dataset

https://huggingface.co/datasets/LLVM-AD/maplm_v1_5/tree/main/data/val
{
"frame_id": "FR10018",
"conversations": [
{
"question": "What type of road scene is depicted in the images?",
"options": [
"Normal city road.",
"Construction road.",
"Undeveloped road.",
"Road mark repainting.",
"Roundabout.",
"None of the above."
],
"answer": 0,
"question_type": "SCN"
},
{
"question": "What is the quality of the point cloud data in the current road area of this image?",
"options": [
"Very clear.",
"Not clear, road markings are worn.",
"Not clear, road markings are occluded by vehicles.",
"Not clear, road markings are worn and occluded by vehicles."
],
"answer": 0,
"question_type": "QLT"
},
{
"question": "How many lanes are present on the current road?",
"options": [
"2",
"3",
"4",
"5",
"6"
],
"answer": 2,
"question_type": "LAN"
},
{
"question": "Describe the lane attributes of the current road.",
"options": [],
"answer": "There are 4 lanes in the image, with lane attributes from left to right being: motorway | motorway | motorway | motorway.",
"question_type": "DES"
},
{
"question": "Are there any road crossings, intersections, or lane change zones on the main road?",
"options": [
"No.",
"Yes, there is one crossroad in the image.",
"Yes, there is one T-junction in the image.",
"Yes, there is a small road intersection in the image.",
"Yes, there is one lane change zone in the image.",
"Yes, there is one irregular intersection in the image."
],
"answer": 0,
"question_type": "INT"
},
{
"question": "Provide a detailed description of the current driving scene.",
"options": [],
"answer": "The scene depicts a multi-lane road with vehicles traveling in both directions. The road is straight with clear lane markings, and there are trees and streetlights along the sides.",
"question_type": "CAP"
},
{
"question": "Identify any unusual or unfamiliar objects present in the image.",
"options": [],
"answer": "There are no unusual or unfamiliar objects present in the image that require special attention.",
"question_type": "OBJ"
}
],
"images": [
"data/val/images/FR10018/photo_forward.jpg",
"data/val/images/FR10018/photo_lef_back.jpg",
"data/val/images/FR10018/photo_rig_back.jpg",
"data/val/images/FR10018/point_cloud_bev.jpg"
]
},

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

In several pioneering datasets and benchmark papers, the authors have explored language-guided visual understanding tasks in driving scenes. These datasets can be split into two types: (1) additional texts added to the existing nuScenes [6] dataset, such as Talk2Car [13], NuScenes-QA [47], NuScenes-MQA [27], DriveLM [8], NuPrompt [68], and NuInstruct [16]; (2) independently collected datasets such as Rank2Tell [50], BDD-X [28], SUTD-TrafficQA [71], DRAMA [39], and LingoQA [43]. However, limited by data scale and quality, current datasets cannot serve as useful benchmarks to evaluate multi-modal LLMs for driving scenes. Besides, the newest techniques such as GPT-4V [44] in the general domain have already been trained on plenty of open-source traffic and driving scene datasets, so the vision-language datasets annotated on nuScenes [6] cannot serve as reliable benchmarks to validate existing models. Thus, we need new out-of-domain large-scale datasets and benchmarks that contain more corner cases of various traffic and driving scenarios and related HD map annotations.
2.3. LLMs for Autonomous Driving
LLMs have shown remarkable potential in complicated scenarios such as driving scene understanding and decision-making [10, 26, 38, 41]. Recent advancements focus on building visual-language models to generate driving policies, such as DiLu [64], DriveGPT4 [72], GPT-Driver [40], HiLM-D [15], DriveMLM [60], and DriveVLM [57]. Talk2BEV [14] and LiDAR-LLM [75] also explored the connection between LLMs/VLMs and bird's-eye view (BEV) and LiDAR point clouds in autonomous driving contexts. Besides, LLMs can also enhance the interaction between passengers and vehicles, improving the personalization and responsiveness of autonomous driving experiences [9, 20]. An equally crucial area of research is the development of language-guided closed-loop autonomous driving systems, which leverage multi-modal sensor data from simulators, as demonstrated by LimSim++ [19] and LMDrive [52]. Additionally, RAG-Driver [81] introduces a novel retrieval-augmented in-context learning approach, significantly enhancing the zero-shot generalization capabilities of driving LLMs. From industry, Wayve proposed the first open-loop driving commentator, LINGO-1 [62].


[2019] Talk2Car: Taking control of your self-driving car
https://arxiv.org/abs/1909.10838
We consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars.
Talk2Car Dataset: from nuScenes
https://macchina-ai.cs.kuleuven.be/
https://macchina-ai.cs.kuleuven.be/downloads.html
https://github.com/talk2car/Talk2Car
Talk2Car: Taking Control of Your Self-Driving Car
The Talk2Car dataset finds itself at the intersection of various research domains, promoting the development of cross-disciplinary solutions for improving the state-of-the-art in grounding natural language into visual space. The annotations were gathered with the following aspects in mind:
Free-form high quality natural language commands, that stimulate the development of solutions that can operate in the wild.
A realistic task setting. Specifically, we consider an autonomous driving setting, where a passenger can control the actions of an Autonomous Vehicle by giving commands in natural language.
The Talk2Car dataset was built on top of the nuScenes dataset to include an extensive suite of sensor modalities, i.e. semantic maps, GPS, LIDAR, RADAR and 360-degree RGB images annotated with 3D bounding boxes. Such a variety of input modalities sets the object referral task on the Talk2Car dataset apart from related challenges, where additional sensor modalities are generally missing.

https://github.com/talk2car/Talk2Car/tree/master/data
{
  'scene_token': 'f92422ed4b4e427194a4958ccf15709a',  # nuScenes scene token
  'sample_token': 'c32d636e44604d77a1734386b3fe4a0d',  # nuScenes sample token
  'translation': [-13.49250542687401, 0.43033061594724364, 59.28095610405408],  # Translation
  'size': [0.81, 0.73, 1.959],  # Size
  'rotation': ['-0.38666213835670615', '-0.38076281276237284', '-0.5922192111910205', '0.5956412318459762'],  # Rotation
  'command': 'turn left to pick up the pedestrian at the corner',  # Command
  'obj_name': 'human.pedestrian.adult',  # Class name of the referred object
  'box_token': '0183ed8a474f411f8a3394eb78df7838',  # nuScenes box token
  'command_token': '4175173f5f60d19ecfc3712e960a1103',  # A unique command identifier
  '2d_box': [200, 300, 50, 50]  # The 2D bounding box of the referred object in the frontal view, in [x, y, w, h] format
}
Instruction-style data.
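A tiny helper for the sample above: convert the referred object's [x, y, w, h] 2D box into corner format (purely illustrative).

```python
def xywh_to_xyxy(box):
    """Convert a Talk2Car-style [x, y, w, h] 2D box to [x1, y1, x2, y2] corners."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

print(xywh_to_xyxy([200, 300, 50, 50]))  # [200, 300, 250, 350]
```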
[2023.12] NuScenes-MQA: Integrated evaluation of captions and QA for autonomous driving datasets using markup annotations
https://arxiv.org/abs/2312.06352
Visual Question Answering (VQA) is one of the most important tasks in autonomous driving, which requires accurate recognition and complex situation evaluations. However, datasets annotated in a QA format, which guarantees precise language generation and scene recognition from driving scenes, have not been established yet. In this work, we introduce Markup-QA, a novel dataset annotation technique in which QAs are enclosed within markups. This approach facilitates the simultaneous evaluation of a model’s capabilities in sentence generation and VQA. Moreover, using this annotation methodology, we designed the NuScenes-MQA dataset. This dataset empowers the development of vision language models, especially for autonomous driving tasks, by focusing on both descriptive capabilities and precise QA. The dataset is available at this https URL.
https://arxiv.org/pdf/2312.06352
https://arxiv.org/html/2312.06352v1
There is even a slide deck (PPT).
NuScenes-MQA Dataset: from nuScenes
https://github.com/turingmotors/NuScenes-MQA
https://drive.google.com/drive/u/0/folders/1PQy0qhTtbdueIVlVnn4jC6xvANZUynRZ


File contents:
| sample_token | question | answer | question_type |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Highlight what’s in | In the | important_object_count_and_direction |
| 2ec05dce9faf45d39cdf8188471e9ea7 | In terms of objects, how many are in the | Scanning through the ego car’s | important_object_count_and_direction |
| 2ec05dce9faf45d39cdf8188471e9ea7 | In the | object_presence_confirmation | |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Identify the count of | In the | object_presence_confirmation |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Point to the object in | In the scope of the | important_object_count_and_direction |

