CLIP Fine-Tuning Methods
Idea: add a linear classifier on top of a frozen CLIP model and train only that classifier (a minimal sketch follows the reference links below).
- https://medium.com/aimonks/a-guide-to-fine-tuning-clip-models-with-custom-data-6c7c0d1416fb
- vit-finetuned-on-fashion-dataset: https://www.kaggle.com/code/sunny7712/vit-finetuned-on-fashion-dataset
- Tutorial To Leverage Open AI's CLIP Model For Fashion Industry: https://www.labellerr.com/blog/fine-tuning-clip-on-custom-dataset/
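A minimal linear-probe sketch of this idea using the Hugging Face transformers CLIP API; the checkpoint name, number of classes, and the training-step wrapper are illustrative placeholders.

```python
import torch
from torch import nn
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP image encoder + a trainable linear classifier on top.
# Checkpoint and number of classes are illustrative placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad = False  # only the head below is trained

num_classes = 10
head = nn.Linear(clip.config.projection_dim, num_classes).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: list of PIL images, labels: LongTensor of class ids."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)         # (B, projection_dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # CLIP-style L2 normalization
    logits = head(feats)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```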
CLIP in Autonomous Driving
CarLLaVA: Vision language models for camera-only closed-loop driving
A Survey for Foundation Models in Autonomous Driving
DriveCLIP: Zero-shot transfer for distracted driving activity understanding using CLIP
CLIP + classifier: linear probing.
Prompt tip: use the full sentence "driver is [action] while driving a car" rather than a bare action word; a zero-shot sketch of this prompting idea follows.
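A small zero-shot sketch of this prompting trick using the Hugging Face transformers CLIP API; the action labels, checkpoint, and image path are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot distracted-driver classification with sentence prompts instead of bare labels.
actions = ["texting on a phone", "drinking", "talking to a passenger", "driving safely"]
prompts = [f"driver is {a} while driving a car" for a in actions]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("driver_frame.jpg")  # hypothetical input frame
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for action, p in zip(actions, probs.tolist()):
    print(f"{action}: {p:.3f}")
```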
TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip
claude:https://claude.ai/chat/2f1ea22a-9d51-4e38-a1b3-f6645776ef84
kimi:https://kimi.moonshot.cn/chat/cqpjmamc2kumginlb9c0
How does TSIC-CLIP make use of the original CLIP model (per the paper)?
TSIC-CLIP builds directly on the pre-trained CLIP model and adds a few targeted modifications to adapt it to traffic-scene image captioning. The main points are:
Pre-trained CLIP model: TSIC-CLIP uses a pre-trained CLIP model as part of its feature extraction. Trained with a contrastive objective, CLIP maps images and text into a shared embedding space, which makes it strong at joint image-text representation learning.
Frozen pre-trained parameters: the CLIP parameters are frozen, i.e., they are not updated during fine-tuning. This preserves the feature-extraction ability CLIP gained from large-scale pre-training while reducing training cost.
Adapter layer: to adapt CLIP to traffic-scene captioning, TSIC-CLIP adds an adapter layer on top of CLIP and fine-tunes it on the BUUISE-Image traffic-scene dataset. The adapter learns traffic-scene-specific features and strengthens the model's ability to produce tailored descriptions (a rough sketch follows after this list).
Feature extraction: TSIC-CLIP uses CLIP to extract global image features and fuses them with local features (extracted by a pre-trained Faster R-CNN) and text features. This multimodal fusion provides richer semantic information and yields more accurate captions.
Text retrieval module: TSIC-CLIP also uses a CLIP-based text retrieval module to obtain text features related to the descriptions of image sub-regions. These text features are fed into the captioning model together with the image features, improving the accuracy and relevance of the generated captions.
Global Weighted Attention Pooling (WGA-Pooling): to reduce parameters and improve efficiency, TSIC-CLIP replaces the standard multi-head self-attention in the Transformer with a WGA-Pooling layer, which better captures global context and fuses it with local features.
Through these choices, TSIC-CLIP inherits CLIP's strengths while tailoring it to the specific needs of traffic-scene image captioning.
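A rough sketch of the adapter idea described above: a small trainable bottleneck module with a residual mix applied to frozen CLIP features. This mirrors the common CLIP-Adapter pattern and is not the exact module from the TSIC-CLIP paper; dimensions and the mixing ratio are illustrative.

```python
import torch
from torch import nn

class CLIPAdapter(nn.Module):
    """Bottleneck adapter applied to frozen CLIP features (illustrative, not the paper's exact design)."""
    def __init__(self, dim: int = 512, reduction: int = 4, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # residual mixing ratio between adapted and original features
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        adapted = self.net(clip_feats)
        return self.alpha * adapted + (1.0 - self.alpha) * clip_feats

# Usage: only the adapter is optimized; CLIP itself stays frozen.
adapter = CLIPAdapter(dim=512)
feats = torch.randn(8, 512)   # stand-in for frozen CLIP image features
fused = adapter(feats)        # task-adapted features for the downstream captioning head
```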
According to the paper, the BUUISE-Image dataset was built specifically for traffic-scene image captioning. Key facts about its construction and characteristics:
Data sources:
Images were collected from Beijing, Tianjin, Vietnam, and other locations.
The data was filtered and cleaned.
Scale:
More than 10,000 images.
Each image is paired with five human-written captions.
Annotation characteristics:
Emphasis on caption accuracy and diversity.
The multiple captions per image cover different aspects such as objects, attributes, relations, and scenes.
Metadata:
Rich metadata is provided, such as capture time, location, and tags.
This additional information can support broader image-understanding tasks.
Compared with existing datasets, BUUISE-Image offers the following advantages:
Specialization: designed specifically for traffic scenes, so it is better suited to traffic-related captioning tasks.
Diversity: covers traffic scenes from multiple cities, increasing variety and representativeness.
Rich annotation: multiple captions per image give more comprehensive descriptions.
Metadata support: the additional metadata enables deeper research and applications.
Moderate scale: over 10,000 images provides enough training data without being unwieldy.
Strong focus: compared with general-purpose datasets such as MS-COCO, BUUISE-Image concentrates on traffic scenes and likely contains more traffic-related details and features.
Practical value: the traffic-scene focus makes the dataset highly relevant to intelligent transportation and autonomous driving.
Overall, BUUISE-Image fills the gap of a dedicated traffic-scene captioning dataset and provides a valuable resource for related research and applications.
Generalized Predictive Model for Autonomous Driving
Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
Constructing Autonomous Driving Datasets
References:
- https://www.perplexity.ai/search/how-to-fine-tune-the-openai-cl-4RPJZ8rpTq2LCTR1Fgs2EQ
- https://github.com/Thinklab-SJTU/Awesome-LLM4AD?tab=readme-ov-file#datasets
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, Mykel Kochenderfer
Publisher: Honda Research Institute, Stanford University
Publish Date: 2023.09.10
Summary:
A multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance.
Introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset.
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li. ECCV 2024
Dataset: DriveLM
Publish Date: 2023.08
Summary:
Construct dataset based on the nuScenes dataset.
Perception questions require the model to recognize objects in the scene.
Prediction questions ask the model to predict the future status of important objects in the scene.
Planning questions prompt the model to give reasonable planning actions and avoid dangerous ones.
WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models
Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha. CVPR 2023
Publisher: Carnegie Mellon University, Symbiosis International University
Dataset: WEDGE
Publish Date: 2023.05.12
Summary:
- A multi-weather autonomous driving dataset built from generative vision-language models.
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
Publisher: Fudan University
Dataset: NuScenes-QA
Summary:
NuScenes-QA provides 459,941 question-answer pairs based on the 34,149 visual scenes, with 376,604 questions from 28,130 scenes used for training, and 83,337 questions from 6,019 scenes used for testing, respectively.
The multi-view images and point clouds are first processed by the feature extraction backbone to obtain BEV features.
DRAMA: Joint Risk Localization and Captioning in Driving
Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, Jiachen Li
Publisher:
Datasets: DRAMA
Summary:
- Introduce a novel dataset DRAMA that provides linguistic descriptions (with the focus on reasons) of driving risks associated with important objects and that can be used to evaluate a range of visual captioning capabilities in driving scenarios.
Language Prompt for Autonomous Driving
Datasets: NuPrompt (not open)
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Datasets: official; data collected using an RL expert in a simulator.
Textual Explanations for Self-Driving Vehicles
Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata ECCV 2018.
Publisher: University of California, Berkeley, Saarland Informatics Campus, University of Amsterdam
Grounding Human-To-Vehicle Advice for Self-Driving Vehicles
Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, John Canny CVPR 2019
Publisher: UC Berkeley, Honda Research Institute USA, Inc.
DRAMA: Joint Risk Localization and Captioning in Driving
https://arxiv.org/abs/2209.10767
Due to the lack of standard benchmarks, we collected a large-scale dataset, DRAMA (Driving Risk Assessment Mechanism with A captioning module), which consists of 17,785 interactive driving scenarios collected in Tokyo, Japan. Our DRAMA dataset accommodates video- and object-level questions on driving risks with associated important objects to achieve the goal of visual captioning as a free-form language description utilizing closed and open-ended responses for multi-level questions, which can be used to evaluate a range of visual captioning capabilities in driving scenarios
https://usa.honda-ri.com/drama
The DRAMA Dataset is captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo.
Contains different annotations: Video-level Q/A, Object-level Q/A, Risk object bounding box, Free-form caption, and separate labels for ego-car intention, scene classifier and suggestions to the driver
The free-form descriptions of reasoning include 992 unique words with total occurrences of 306,708 times


Access request pending.
Language Prompt for Autonomous Driving
https://arxiv.org/abs/2309.04379
To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack
https://github.com/wudongming97/Prompt4Driving
https://github.com/wudongming97/Prompt4Driving/blob/master/docs/data.md
NuPrompt Dataset


Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
https://arxiv.org/abs/2310.01957
We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high quality control commands collected with RL agent and question answer pairs generated by teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data.
https://github.com/wayveai/Driving-with-LLMs/tree/main?tab=readme-ov-file
https://github.com/wayveai/Driving-with-LLMs/tree/paper_code
VQA Dataset

https://github.com/wayveai/Driving-with-LLMs/blob/main/data/
❤️ Textual Explanations for Self-Driving Vehicles
https://arxiv.org/abs/1807.11546
We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at this https URL.
https://github.com/JinkyuKimUCB/explainable-deep-driving?tab=readme-ov-file
https://github.com/JinkyuKimUCB/BDD-X-dataset
We focus on generating textual descriptions and explanations, such as the pair: “Vehicle slows down” (description) and “Because it is approaching an intersection and the light is red” (explanation)
Our dataset is composed of over 77 hours of driving within 6,970 videos. The videos are taken in diverse driving conditions, e.g. day/night, highway/city/countryside, summer/winter etc. On average 40 seconds long, each video contains around 3-4 actions, e.g. speeding up, slowing down, turning right etc., all of which are annotated with a description and an explanation. Our dataset contains over 26K activities in over 8.4M frames. We introduce a training, a validation and a test set, containing 5,597, 717 and 656 videos, respectively.
https://drive.google.com/file/d/1Na-jxLK7VxOc6wcT5lnLQytfFi6rqhTF/view
BDD-X Dataset
https://drive.google.com/file/d/1Na-jxLK7VxOc6wcT5lnLQytfFi6rqhTF/view
The annotation file is a wide table: one row per video (Input.Video), followed by up to 15 annotated segments, each with Answer.Nstart, Answer.Nend, Answer.Naction, and Answer.Njustification columns (unused slots are left empty). Example rows:
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d501fd-a9ffc960.mov
  - 0-11: The car accelerates / because the light has turned green.
  - 12-19: The car is moving at a steady speed / because traffic is clear.
  - 20-22: The car slows slightly / because it's turning into the right lane.
  - 23-36: The car stops / because it turns to the right.
  - 37-40: The car accelerates / because traffic is clear.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/01b0505f-5f564e84.mov
  - 0-16: The car is stopped. / The car is at an intersection with a red light.
  - 16-20: The car is accelerating through the intersection. / The light at the intersection has changed to green.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d501fd-fd237e38.mov
  - 0-20: The car accelerates slowly to a maintained speed / because the light has turned green and traffic is flowing smoothly.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/06d54ae6-26a3446e.mov
  - 0-6: The car is driving forward / as traffic flows freely.
  - 7-9: The car merges into the lane to its left / to get around a slower car in front of it.
  - 10-22: The car drives at a normal speed / as traffic moves freely.
  - 23-25: The car slows and veers slightly right / due to a car in the neighboring lane entering the car's lane.
  - 26-40: The car drives forward / as there are no nearby cars in its lane.
- https://s3-us-west-2.amazonaws.com/sq8geewpqu/samples-1k/01b4e4b9-e21fe0a3.mov
  - 0-12: The car is stopped / because the light is red.
  - 13-18: The car is accelerating / as the light turned green.
  - 19-27: The car is slowing / because traffic in front of it is stopped.
  - 28-38: The car is driving forward slowly / due to slow traffic in front of it.
  - 39-40: The car is merging into the left lane / to make a left turn.
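A possible way to reshape this wide annotation table into one row per annotated segment with pandas; the CSV file name is a placeholder and the column names are taken from the header above.

```python
import pandas as pd

# Reshape the wide BDD-X annotation table (Input.Video plus Answer.Nstart, Answer.Nend,
# Answer.Naction, Answer.Njustification for N = 1..15) into one row per annotated segment.
# Assumes a CSV export with exactly the columns shown above; the file name is a placeholder.
df = pd.read_csv("BDD-X-Annotations.csv")

records = []
for _, row in df.iterrows():
    for n in range(1, 16):
        action = row.get(f"Answer.{n}action")
        if pd.isna(action) or str(action).strip() == "":
            continue  # unused segment slots are left empty
        records.append({
            "video": row["Input.Video"],
            "start": row[f"Answer.{n}start"],
            "end": row[f"Answer.{n}end"],
            "action": action,
            "justification": row[f"Answer.{n}justification"],
        })

segments = pd.DataFrame(records)
print(segments.head())
```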
Grounding Human-to-Vehicle Advice for Self-driving Vehicles
https://arxiv.org/abs/1911.06978
we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice giving, where we train an end-to-end vehicle controller that accepts advice.
We evaluate our model on a novel advisable driving dataset with manually annotated human-to-vehicle advice called Honda Research Institute-Advice Dataset (HAD). We show that taking advice improves the performance of the end-to-end network, while the network cues on a variety of visual features that are provided by advice. The dataset is available at this https URL.
The Honda Research Institute-Advice Dataset (HAD) consists of driving data paired with natural language advice. The driving videos are drawn from the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area. HAD has the following features:
- 30 hours of driving video data with natural language advice
- Videos with CAN-bus signal data
- Advice of two kinds: (1) goal-oriented advice (top-down signal), which influences the vehicle in a navigation task, and (2) stimulus-driven advice (bottom-up signal), which conveys visual stimuli that the user expects the vehicle controller to actively attend to
https://usa.honda-ri.com/dataset-request-form?dataset=had
Access request pending.
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
https://arxiv.org/abs/2309.06597
this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various close and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios.
Not publicly released.
DriveLM: Driving with Graph Visual Question Answering
https://github.com/OpenDriveLab/DriveLM
We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
🏁 DriveLM serves as a main track in the [CVPR 2024 Autonomous Driving Challenge](https://opendrivelab.com/challenge2024/#driving_with_language). Everything you need for the challenge is HERE, including the baseline, test data, submission format, and evaluation pipeline!
https://github.com/OpenDriveLab/DriveLM?tab=readme-ov-file#drivelmdata
https://github.com/OpenDriveLab/DriveLM/blob/main/docs/data_details.md/#features
Features of the DriveLM-Data
🛣 Completeness in functionality (covering Perception, Prediction, and Planning QA pairs).
🔜 Reasoning for future events that have not yet happened.
Many “What If”-style questions: imagine the future by language.
♻ Task-driven decomposition.
One scene-level description into many frame-level trajectories & planning QA pairs.
How about the annotation process?
The annotation process is different for DriveLM-nuScenes and DriveLM-CARLA.
For DriveLM-nuScenes, we divide the annotation process into three steps:
1️⃣ Keyframe selection. Given all frames in one clip, the annotator selects the keyframes that need annotation. The criterion is that those frames should involve changes in ego-vehicle movement status (lane changes, sudden stops, start after a stop, etc.).
2️⃣ Key objects selection. Given keyframes, the annotator needs to pick up key objects in the six surrounding images. The criterion is that those objects should be able to affect the action of the ego vehicle (traffic signals, pedestrians crossing the road, other vehicles that move in the direction of the ego vehicle, etc.).
3️⃣ Question and answer annotation. Given those key objects, we automatically generate questions regarding single or multiple objects about perception, prediction, and planning. More details can be found in our data.
For DriveLM-CARLA, we employ an automated annotation approach:
We collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert. We set up a series of routes in urban, residential, and rural areas and execute the expert on these routes. During this process, we collect the necessary sensor data, generate relevant QAs based on privileged information about objects and the scene, and organize the logical relationships to connect this series of QAs into a graph.
❤️ DriveLM-nuScenes QA Dataset
https://github.com/OpenDriveLab/DriveLM/blob/main/docs/data_prep_nus.md
We kindly ask you to fill out the form before downloading. To get started, download the nuScenes subset image data and the DriveLM-nuScenes QA json files below. For v1.1 data, please visit the DriveLM/challenge folder.
Both the nuScenes subset images and the DriveLM-nuScenes version-1.0 QA files are distributed via Google Drive, Baidu Netdisk, and HuggingFace (links in the repo documentation).
You can also download the full nuScenes dataset HERE to enable video input.
https://huggingface.co/spaces/AGC2024/driving-with-language-official
https://github.com/OpenDriveLab/DriveLM/tree/main/challenge#drivelm
https://github.com/OpenDriveLab/DriveLM/tree/main?tab=readme-ov-file#drivelmdata
{
"f0f120e4d4b0441da90ec53b16ee169d": {
"scene_description": "The ego vehicle proceeds through the intersection, continuing along the current roadway.",
"key_frames": {
"4a0798f849ca477ab18009c3a20b7df2": {
"key_object_infos": {
"<c1,CAM_BACK,1088.3,497.5>": {
"Category": "Vehicle",
"Status": "Moving",
"Visual_description": "Brown SUV.",
"2d_bbox": [
966.6,
403.3,
1224.1,
591.7
]
},
"<c2,CAM_BACK,864.2,468.3>": {
"Category": "Vehicle",
"Status": "Moving",
"Visual_description": "Black sedan.",
"2d_bbox": [
816.7,
431.6,
917.2,
505.0
]
},
"<c3,CAM_FRONT,1043.2,82.2>": {
"Category": "Traffic element",
"Status": null,
"Visual_description": "Green light.",
"2d_bbox": [
676.4,
0.0,
1452.6,
171.5
]
}
},
"QA": {
"perception": [
{
"Q": "What are objects to the front right of the ego car?",
"A": "There are many barriers and one construction vehicle to the front right of the ego car.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What are objects to the front left of the ego car?",
"A": "There is one truck and one barrier to the front left of the ego car.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the status of the truck that is to the front left of the ego car?",
"A": "One truck is moving.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"prediction": [
{
"Q": "Is <c1,CAM_BACK,1088.3,497.5> a traffic sign or a road barrier?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Which object is most likely to be occluded by <c1,CAM_BACK,1088.3,497.5>? Would this object affect the ego vehicle? Based on this object, what action of the ego vehicle is dangerous?",
"A": "None, no, none.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Would <c1,CAM_BACK,1088.3,497.5> be in the moving direction of the ego vehicle?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the future state of <c1,CAM_BACK,1088.3,497.5>?",
"A": "Turn left.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "Is <c2,CAM_BACK,864.2,468.3> a traffic sign or a road barrier?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"planning": [
{
"Q": "Is <c1,CAM_BACK,1088.3,497.5> an object that the ego vehicle should consider in the current scene?",
"A": "No.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What actions could the ego vehicle take based on <c1,CAM_BACK,1088.3,497.5>? Why take this action and what's the probability?",
"A": "The action is to keep going at the same speed. The reason is to follow the traffic rules, which has a high probability.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
{
"Q": "What is the probability of colliding with <c1,CAM_BACK,1088.3,497.5> after the ego vehicle goes straight and keeps the same speed?",
"A": "Low.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
},
...
"behavior": [
{
"Q": "Predict the behavior of the ego vehicle.",
"A": "The ego vehicle is going straight. The ego vehicle is driving fast.",
"C": null,
"con_up": null,
"con_down": null,
"cluster": null,
"layer": null
}
]
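A small sketch for flattening QA annotations in the format excerpted above (scene token, key_frames, QA categories) into one record per question; the json file name is a placeholder.

```python
import json

# Flatten DriveLM-nuScenes-style QA annotations
# (scene -> key_frames -> QA -> category -> list of {Q, A, ...}) into one record per question.
with open("drivelm_nus_train.json") as f:  # placeholder name for a local copy of the QA json
    data = json.load(f)

records = []
for scene_token, scene in data.items():
    for frame_token, frame in scene["key_frames"].items():
        for category, qa_list in frame["QA"].items():  # perception / prediction / planning / behavior
            for qa in qa_list:
                records.append({
                    "scene": scene_token,
                    "frame": frame_token,
                    "category": category,
                    "question": qa["Q"],
                    "answer": qa["A"],
                })

print(len(records), "QA pairs;", records[0])
```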

WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models
https://browse.arxiv.org/abs/2305.07528
To aid adversarial robustness in perception, we introduce WEDGE (WEather images by DALL-E GEneration): a synthetic dataset generated with a vision-language generative model via prompting. WEDGE consists of 3360 images in 16 extreme weather conditions manually annotated with 16513 bounding boxes, supporting research in the tasks of weather classification and 2D object detection. We have analyzed WEDGE from research standpoints, verifying its effectiveness for extreme-weather autonomous perception.
The repository for this work and dataset is available at this https URL.
WEDGE Dataset
https://github.com/Infernolia/WEDGE
https://github.com/Infernolia/WEDGE/tree/main/Dataset/WEDGE
https://drive.google.com/file/d/1gmnoZWw9Oh-A60HE_qhXlwsNj2n1-iZ3/view


NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
https://arxiv.org/abs/2305.14836
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues.
To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates.
Codes and dataset are available at this https URL
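A toy illustration of this template-based generation idea (not the authors' actual pipeline): an "exist"-style template is filled from a scene's object annotations, in the spirit of the samples below.

```python
# Toy illustration of template-based QA generation from detection annotations
# (not the NuScenes-QA authors' code; the object list and template are made up
# in the spirit of the question samples shown below).
scene_objects = [
    {"category": "car", "status": "moving"},
    {"category": "pedestrian", "status": "moving"},
    {"category": "traffic cone", "status": None},
]

def exist_question(status: str, category: str) -> dict:
    present = any(o["category"] == category and o["status"] == status for o in scene_objects)
    return {
        "question": f"Are any {status} {category}s visible?",
        "answer": "yes" if present else "no",
        "template_type": "exist",
        "num_hop": 0,
    }

print(exist_question("moving", "bicycle"))     # answer: 'no'
print(exist_question("moving", "pedestrian"))  # answer: 'yes'
```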
NuScenes-QA Dataset: from nuScenes
https://github.com/qiantianwen/NuScenes-QA

https://drive.google.com/drive/folders/1jIkICT23wZWZYPrWCa0x-ubjpClSzOuU
{
"info": {
"split": "val",
"version": "1.0",
"date": "2023-02-27"
},
"questions": [
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are any moving bicycles visible?",
"answer": "no",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are any moving pedestrians visible?",
"answer": "yes",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are there any traffic cones to the back of me?",
"answer": "no",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "Are there any moving cars to the back of me?",
"answer": "yes",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "fd8420396768425eabec9bdddf7e64b6",
"question": "There is a parked thing to the front of me; what is it?",
"answer": "car",
"num_hop": 1,
"template_type": "object"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "There is a pedestrian to the back of me; what is its status?",
"answer": "moving",
"num_hop": 1,
"template_type": "status"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "What number of traffic cones are there?",
"answer": "3",
"num_hop": 0,
"template_type": "count"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "How many cars are to the back right of me?",
"answer": "10",
"num_hop": 1,
"template_type": "count"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are there any things?",
"answer": "yes",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are any standing pedestrians visible?",
"answer": "no",
"num_hop": 0,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "6eb8a3ff0abf4f3a9380a48f2a0b87ef",
"question": "Are there any traffic cones to the front of me?",
"answer": "yes",
"num_hop": 1,
"template_type": "exist"
},
{
"split": "val",
"sample_token": "b10f0cd792b64d16a1a5e8349b20504c",
"question": "What number of moving cars are there?",
"answer": "5",
"num_hop": 0,
"template_type": "count"
},

Embodied Understanding of Driving Scenarios
https://github.com/OpenDriveLab/ELM

Pre-training data. We collect driving videos from YouTube, nuScenes, Waymo, and Ego4D. Here we provide a sample of 🔗 YouTube video list we used. For privacy considerations, we are temporarily keeping the complete data labels private.
Fine-tuning data. The full set of question and answer pairs for the benchmark can be obtained through this 🔗 data link. You may need to download the corresponding image data from the official nuScenes and Ego4D channels. For a quick verification of the pipeline, we recommend downloading the subset dataset of DriveLM and organizing the data in line with the format. Please make sure to soft link the nuScenes and ego4d datasets under the data/xx folder. You may need to run tools/video_clip_processor.py to pre-process the data first. Besides, we provide some scripts used during auto-labeling; you may use these as a reference if you want to customize data.
MAPLM-QA
https://llvm-ad.github.io/challenge/
❗ The MAPLM-QA v1.5 dataset has been released at this link.
The challenge track is based on the MAPLM-QA benchmark, a subset of the MAPLM dataset designed for visual question answering in traffic scene understanding. Participants will develop innovative methods to accurately answer multi-choice questions about complex traffic scenes using high-resolution panoramic images and 2.5D bird’s-eye view representations. Top-performing teams will be recognized with certificates and honorariums.
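A minimal scoring sketch for these multi-choice questions, based on the v1.5 sample format shown further below (frame_id / conversations / options / answer); the file name, the predict() stub, and the list-of-frames assumption are placeholders.

```python
import json

# Score multi-choice predictions against MAPLM-QA-style annotations
# (format as in the val sample below); file name and the predict() stub are placeholders.
def predict(frame_id: str, question: str, options: list[str]) -> int:
    return 0  # stub: replace with a real model call returning an option index

with open("maplm_val.json") as f:  # hypothetical local export of the val split
    frames = json.load(f)          # assumed to be a list of frame records; adjust if keyed by frame_id

correct = total = 0
for frame in frames:
    for conv in frame["conversations"]:
        if not conv["options"]:    # skip open-ended questions (DES / CAP / OBJ), which have no options
            continue
        pred = predict(frame["frame_id"], conv["question"], conv["options"])
        correct += int(pred == conv["answer"])
        total += 1

print(f"multi-choice accuracy: {correct / max(total, 1):.3f}")
```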
MAPLM-QA Dataset

https://huggingface.co/datasets/LLVM-AD/maplm_v1_5/tree/main/data/val
{
"frame_id": "FR10018",
"conversations": [
{
"question": "What type of road scene is depicted in the images?",
"options": [
"Normal city road.",
"Construction road.",
"Undeveloped road.",
"Road mark repainting.",
"Roundabout.",
"None of the above."
],
"answer": 0,
"question_type": "SCN"
},
{
"question": "What is the quality of the point cloud data in the current road area of this image?",
"options": [
"Very clear.",
"Not clear, road markings are worn.",
"Not clear, road markings are occluded by vehicles.",
"Not clear, road markings are worn and occluded by vehicles."
],
"answer": 0,
"question_type": "QLT"
},
{
"question": "How many lanes are present on the current road?",
"options": [
"2",
"3",
"4",
"5",
"6"
],
"answer": 2,
"question_type": "LAN"
},
{
"question": "Describe the lane attributes of the current road.",
"options": [],
"answer": "There are 4 lanes in the image, with lane attributes from left to right being: motorway | motorway | motorway | motorway.",
"question_type": "DES"
},
{
"question": "Are there any road crossings, intersections, or lane change zones on the main road?",
"options": [
"No.",
"Yes, there is one crossroad in the image.",
"Yes, there is one T-junction in the image.",
"Yes, there is a small road intersection in the image.",
"Yes, there is one lane change zone in the image.",
"Yes, there is one irregular intersection in the image."
],
"answer": 0,
"question_type": "INT"
},
{
"question": "Provide a detailed description of the current driving scene.",
"options": [],
"answer": "The scene depicts a multi-lane road with vehicles traveling in both directions. The road is straight with clear lane markings, and there are trees and streetlights along the sides.",
"question_type": "CAP"
},
{
"question": "Identify any unusual or unfamiliar objects present in the image.",
"options": [],
"answer": "There are no unusual or unfamiliar objects present in the image that require special attention.",
"question_type": "OBJ"
}
],
"images": [
"data/val/images/FR10018/photo_forward.jpg",
"data/val/images/FR10018/photo_lef_back.jpg",
"data/val/images/FR10018/photo_rig_back.jpg",
"data/val/images/FR10018/point_cloud_bev.jpg"
]
},

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

In several pioneering datasets and benchmark papers, the authors have explored language-guided visual understanding tasks in driving scenes. These datasets can be split into two types: (1) additional texts added to the existing nuScenes [6] dataset, such as Talk2Car [13], NuScenes-QA [47], NuScenes-MQA [27], DriveLM [8], NuPrompt [68], and NuInstruct [16]; (2) independently collected datasets such as Rank2Tell [50], BDD-X [28], SUTD-TrafficQA [71], DRAMA [39], and LingoQA [43]. However, limited by data scale and quality, current datasets cannot serve as useful benchmarks to evaluate multi-modal LLMs for driving scenes. Besides, the newest techniques such as GPT-4V [44] in the general domain have already been trained on plenty of open-source traffic and driving scene datasets, so the vision-language datasets annotated on nuScenes [6] cannot serve as reliable benchmarks to validate existing models. Thus, we need new out-of-domain large-scale datasets and benchmarks that contain more corner cases of various traffic and driving scenarios and related HD map annotations.
2.3. LLMs for Autonomous Driving
LLMs have shown remarkable potential in complicated scenarios such as driving scene understanding and decision-making [10, 26, 38, 41]. Recent advancements focus on building visual-language models to generate driving policies, such as DiLu [64], DriveGPT4 [72], GPT-Driver [40], HiLM-D [15], DriveMLM [60], and DriveVLM [57]. Talk2BEV [14] and LiDAR-LLM [75] also explored the connection between LLMs/VLMs and bird's-eye view (BEV) and LiDAR point clouds in autonomous driving contexts. Besides, LLMs can also enhance the interaction between passengers and vehicles, improving the personalization and responsiveness of autonomous driving experiences [9, 20]. An equally crucial area of research is the development of language-guided closed-loop autonomous driving systems, which leverage multi-modal sensor data from simulators, as demonstrated by LimSim++ [19] and LMDrive [52]. Additionally, RAG-Driver [81] introduces a novel retrieval-augmented in-context learning approach, significantly enhancing the zero-shot generalization capabilities of driving LLMs. From industry, Wayve proposed the first open-loop driving commentator, LINGO-1 [62].


[2019] Talk2Car: Taking control of your self-driving car
https://arxiv.org/abs/1909.10838
We consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars.
Talk2Car Dataset: from nuScenes
https://macchina-ai.cs.kuleuven.be/
https://macchina-ai.cs.kuleuven.be/downloads.html
https://github.com/talk2car/Talk2Car
Talk2Car: Taking Control of Your Self-Driving Car
The Talk2Car dataset finds itself at the intersection of various research domains, promoting the development of cross-disciplinary solutions for improving the state-of-the-art in grounding natural language into visual space. The annotations were gathered with the following aspects in mind:
Free-form high quality natural language commands, that stimulate the development of solutions that can operate in the wild.
A realistic task setting. Specifically, we consider an autonomous driving setting, where a passenger can control the actions of an Autonomous Vehicle by giving commands in natural language.
The Talk2Car dataset was built on top of the nuScenes dataset to include an extensive suite of sensor modalities, i.e. semantic maps, GPS, LIDAR, RADAR and 360-degree RGB images annotated with 3D bounding boxes. Such a variety of input modalities sets the object referral task on the Talk2Car dataset apart from related challenges, where additional sensor modalities are generally missing.

https://github.com/talk2car/Talk2Car/tree/master/data
{
  'scene_token': 'f92422ed4b4e427194a4958ccf15709a',  # nuScenes scene token
  'sample_token': 'c32d636e44604d77a1734386b3fe4a0d',  # nuScenes sample token
  'translation': [-13.49250542687401, 0.43033061594724364, 59.28095610405408],  # Translation
  'size': [0.81, 0.73, 1.959],  # Size
  'rotation': ['-0.38666213835670615', '-0.38076281276237284', '-0.5922192111910205', '0.5956412318459762'],  # Rotation
  'command': 'turn left to pick up the pedestrian at the corner',  # Command
  'obj_name': 'human.pedestrian.adult',  # Class name of the referred object
  'box_token': '0183ed8a474f411f8a3394eb78df7838',  # nuScenes box token
  'command_token': '4175173f5f60d19ecfc3712e960a1103',  # A unique command identifier
  '2d_box': [200, 300, 50, 50]  # The 2D bounding box of the referred object in the frontal view, in [x, y, w, h] format
}
Instruction-style data.
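A tiny helper for the sample above: convert the referred object's [x, y, w, h] 2D box into corner format (purely illustrative).

```python
def xywh_to_xyxy(box):
    """Convert a Talk2Car-style [x, y, w, h] 2D box to [x1, y1, x2, y2] corners."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

print(xywh_to_xyxy([200, 300, 50, 50]))  # [200, 300, 250, 350]
```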
[2023.12] NuScenes-MQA: Integrated evaluation of captions and QA for autonomous driving datasets using markup annotations
https://arxiv.org/abs/2312.06352
Visual Question Answering (VQA) is one of the most important tasks in autonomous driving, which requires accurate recognition and complex situation evaluations. However, datasets annotated in a QA format, which guarantees precise language generation and scene recognition from driving scenes, have not been established yet. In this work, we introduce Markup-QA, a novel dataset annotation technique in which QAs are enclosed within markups. This approach facilitates the simultaneous evaluation of a model’s capabilities in sentence generation and VQA. Moreover, using this annotation methodology, we designed the NuScenes-MQA dataset. This dataset empowers the development of vision language models, especially for autonomous driving tasks, by focusing on both descriptive capabilities and precise QA. The dataset is available at this https URL.
https://arxiv.org/pdf/2312.06352
https://arxiv.org/html/2312.06352v1
There is even a slide deck (PPT).
NuScenes-MQA Dataset: from nuScenes
https://github.com/turingmotors/NuScenes-MQA
https://drive.google.com/drive/u/0/folders/1PQy0qhTtbdueIVlVnn4jC6xvANZUynRZ


File contents:
| sample_token | question | answer | question_type |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Highlight what’s in | In the | important_object_count_and_direction |
| 2ec05dce9faf45d39cdf8188471e9ea7 | In terms of objects, how many are in the | Scanning through the ego car’s | important_object_count_and_direction |
| 2ec05dce9faf45d39cdf8188471e9ea7 | In the | object_presence_confirmation | |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Identify the count of | In the | object_presence_confirmation |
| 2ec05dce9faf45d39cdf8188471e9ea7 | Point to the object in | In the scope of the | important_object_count_and_direction |

