In autonomous driving, we aim to achieve both (1) fast online planning with a modularized differentiable end-to-end (e2e) architecture, and (2) a human-like reasoning process with a vision-language model (VLM). Our key idea is to distill the reasoning process and commonsense knowledge of a VLM into the e2e driving model.

Abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior, but their long inference times make them impractical to deploy. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with VLM-produced text features that explain the driving reasoning process. We demonstrate the effectiveness of our method on the nuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in \( \ell_{2} \) distance, while maintaining high inference speed.


Approach

We align language features acquired from the VLM reasoning module with driving features acquired from the e2e driving model. Specifically, we do so for each of the perception, prediction, and planning submodules to effectively embed the VLM reasoning process into the e2e model.
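As a high-level sketch of how this alignment enters training, the supervised e2e driving loss can be augmented with one alignment term per submodule, using the cosine-similarity alignment described below (the weight \( \lambda \) and the notation are our illustrative assumptions):

\[
\mathcal{L} = \mathcal{L}_{\text{e2e}} + \lambda \sum_{m \in \{\text{perception},\, \text{prediction},\, \text{planning}\}} \left( 1 - \cos\!\left( \phi_m\!\left( f_m^{\text{drive}} \right),\; f_m^{\text{lang}} \right) \right),
\]

where \( f_m^{\text{drive}} \) and \( f_m^{\text{lang}} \) are the driving and language features of submodule \( m \), and \( \phi_m \) is the learnable projector that maps driving features into the language latent space.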

VLM Reasoning Module

To acquire language features, we use the ground-truth trajectory to query a VLM for the reasoning process behind the driving behavior. Following the order of perception, prediction, and planning, we ask about the surrounding agents and objects in spatial order, the other agents' expected behavior, and the ego vehicle's behavior. After obtaining textual responses from the VLM, we map them to the latent space with a language encoder.
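Below is a minimal sketch of this querying-and-encoding step. The prompt wording, the query_vlm helper, and the SentenceTransformer language encoder are illustrative assumptions, not the exact prompts and models used by VERDI.

# Sketch of the VLM reasoning module (prompts and models are illustrative assumptions).
from sentence_transformers import SentenceTransformer

def query_vlm(images, prompt):
    """Placeholder for a VLM call that returns a free-form textual explanation.
    Swap in the multimodal model of your choice."""
    raise NotImplementedError

def vlm_reasoning_features(images, gt_trajectory, text_encoder):
    # Query in the order of the driving stack: perception -> prediction -> planning.
    prompts = {
        "perception": "List the agents and objects around the ego vehicle in spatial order.",
        "prediction": "Describe how each surrounding agent is likely to behave next.",
        "planning": f"Given the ego trajectory {gt_trajectory}, explain the ego vehicle's behavior.",
    }
    responses = {stage: query_vlm(images, p) for stage, p in prompts.items()}
    # Map the textual responses to the shared latent space with a language encoder.
    return {stage: text_encoder.encode(text, convert_to_tensor=True)
            for stage, text in responses.items()}

# Example usage (illustrative encoder choice):
#   encoder = SentenceTransformer("all-MiniLM-L6-v2")
#   lang_feats = vlm_reasoning_features(images, gt_traj, encoder)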

VLM Reasoning Diagram

Aligning with the e2e Model

To obtain driving features, we introduce a learnable Progressive Feature Projector (PFP) to each of the perception, prediction, and planning submodules. The PFPs map driving features to the same latent space as the language features, and align them with a cosine similarity loss.
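The sketch below shows one way to realize a PFP and the cosine-similarity alignment loss in PyTorch. The MLP architecture, feature dimensions, and per-submodule bookkeeping are assumptions for illustration rather than the exact design.

import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFeatureProjector(nn.Module):
    """Maps a submodule's driving features into the language latent space.
    A small MLP is assumed here; the actual projector design may differ."""
    def __init__(self, drive_dim, lang_dim, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(drive_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, lang_dim),
        )

    def forward(self, drive_feat):
        return self.net(drive_feat)

def alignment_loss(drive_feat, lang_feat, pfp):
    """1 - cosine similarity between projected driving features and language features."""
    projected = pfp(drive_feat)
    return 1.0 - F.cosine_similarity(projected, lang_feat, dim=-1).mean()

# One PFP per submodule; the total alignment loss sums over the three stages
# and is added to the e2e driving loss during training.
# pfps = {s: ProgressiveFeatureProjector(drive_dim=256, lang_dim=384)
#         for s in ("perception", "prediction", "planning")}
# loss_align = sum(alignment_loss(drive_feats[s], lang_feats[s], pfps[s]) for s in pfps)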

Experiments

We show superior performance against our direct baseline VAD, a supervised e2e model without embedded reasoning. The video shows the multi-view camera observations on the left and the bird's-eye view (BEV) on the right. Each example shows our successful performance on the perception, prediction, and planning modules. We also show the VLM text response for each test case to demonstrate that VERDI has successfully distilled the VLM's reasoning and commonsense knowledge.

Method | Requires VLM @ Inference | FPS \( \uparrow \) | \( \ell_2 \) (1s) \( \downarrow \) | \( \ell_2 \) (2s) \( \downarrow \) | \( \ell_2 \) (3s) \( \downarrow \) | \( \ell_2 \) (avg.) \( \downarrow \) | Ego Status
DriveVLM | \( \checkmark \) | 2.43 | 0.18 | 0.34 | 0.68 | 0.40 | \( \checkmark \)
OpenEMMA | \( \checkmark \) | NA | 1.45 | 3.21 | 3.76 | 2.81 | \( \checkmark \)
OmniDrive | \( \checkmark \) | 0.44 | 1.15 | 1.96 | 2.84 | 1.98 | -
UniAD | - | 1.8 | 0.48 | 0.96 | 1.05 | 0.83 | -
VAD-Base | - | 4.5 | 0.41 | 0.70 | 1.05 | 0.72 | -
VERDI | - | 4.5 | 0.36 | 0.62 | 0.96 | 0.65 | -

Our method performs 10% better than the direct baseline VAD. Methods are compared on: (1) whether a VLM is required at inference; (2) inference speed (FPS); (3) trajectory accuracy, measured as the \( \ell_{2} \) distance to the expert trajectory at the 1s, 2s, and 3s horizons; and (4) whether the precise historical ego-vehicle state is used in planning. Among methods without privileged access to the ego status (including location), VERDI achieves the best performance across all metrics.

BibTeX

@misc{feng2025verdivlmembeddedreasoningautonomous,
        title={VERDI: VLM-Embedded Reasoning for Autonomous Driving}, 
        author={Bowen Feng and Zhiting Mei and Baiang Li and Julian Ost and Roger Girgis and Anirudha Majumdar and Felix Heide},
        year={2025},
        eprint={2505.15925},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2505.15925}, 
  }

The website design was adapted from Nerfies.