Version: Next

Evaluation Details

Our evaluation of VQA primarily focuses on the correctness of the answers. Since rule-based methods are not well-suited for evaluating natural language, Bench2Drive-VL introduces an additional large language model as an evaluator to assess the natural language answers.

The simplest evaluation method is to let the evaluator directly score the response. However, this is too subjective and may introduce uncertainty. Therefore, for some questions, Bench2Drive-VL defines strict scoring rules to reduce instability in evaluation.

Question 19: Important Objects and Their Order in the Scene

In this question, the VLM is required to identify important objects in the scene and rank them in order of importance. In the DriveCommenter pipeline:

Important objects are first identified.
Each object is assigned attributes:
- If the object plays a role in the current context, its is_role is set to true.
- If the object is likely to collide with the ego vehicle, is_dangerous is set to true.
The objects are then sorted using the following priority:
- is_role objects > is_dangerous objects > normal objects.
- Within the same priority, nearer objects come first.

The evaluation model considers both object correctness and order correctness. The procedure is as follows:

A script extracts all objects involved in the ground truth answer.
The VLM’s answer is passed to the evaluator to extract mentioned objects.
To encourage consistency, the GT object list is also provided to the evaluator for reference.

Position Weight $\mathbf{p}_i$

Let the ground truth contain $n$ objects: $\mathbf{obj}_i, i \in [0, n)$ . The smaller the index, the higher the priority. Define ORDER_RATIO (typically 5). Then:

\mathbf{p}_i = \frac{N \cdot \texttt{ORDER\_RATIO}}{\texttt{ORDER\_RATIO} - 1} - i

Base Weight $\mathbf{b}_i$

Defined as:

\mathbf{b}_i = \begin{cases} 3, & \text{if } \text{is\_role}(i) = \text{True} \text{ or } \text{is\_dangerous}(i) = \text{True} \\ 1, & \text{otherwise} \end{cases}

Final Weight $\mathbf{w}_i$

For any object in the ground truth:

\mathbf{w}_i = \mathbf{b}_i \cdot \mathbf{p}_i

Objects not in the ground truth are assigned a penalty weight, equal to the minimum position weight divided by EXTRA_RATIO (typically 2).

NDCG

Let $k$ be the number of overlapping objects between VLM and GT.

Discounted Cumulative Gain (DCG):

\mathbf{DCG} = \sum_{i=1}^{k} \frac{\mathbf{w}_{r_i}}{\log_2(i+1)}

where $r_i$ is the index of the $i$ -th predicted object in the GT list.

Ideal DCG (IDCG):

\mathbf{IDCG} = \sum_{i=1}^{k} \frac{\mathbf{w}_{i}^{\text{ideal}}}{\log_2(i+1)}

Then:

\mathbf{NDCG} = \frac{\mathbf{DCG}}{\mathbf{IDCG}}

Weighted F1 Score

Let:

TP = true positives
FP = false positives
FN = false negatives

Then:

\begin{align} \mathbf{P} &= \frac{\sum_{\text{TP}} \mathbf{w}_i}{\sum_{\text{TP}} \mathbf{w}_i + \sum_{\text{FP}} \mathbf{w}_i} \\ \mathbf{R} &= \frac{\sum_{\text{TP}} \mathbf{w}_i}{\sum_{\text{TP}} \mathbf{w}_i + \sum_{\text{FN}} \mathbf{w}_i} \\ \mathbf{F1} &= \frac{2 \cdot \mathbf{P} \cdot \mathbf{R}}{\mathbf{P} + \mathbf{R}} \end{align}

Final Score

\mathbf{score}_{19} = \mathbf{F1} \cdot \mathbf{NDCG}

Questions Involving Multiple Objects

This category includes Question 27, Question 24, Question 25, Question 28, Question 29, Question 46, and Question 47. These questions share a common requirement: the VLM must identify all objects in the current scene that meet a specified condition and then analyze each identified object individually. The final score for this category consists of two components: the F1-score and an object-level accuracy metric, denoted as the object-score.

Weights

Similar to the previous type, the function first builds a reference list of target objects based on the ground truth annotations. It then uses the evaluation model to extract the list of objects actually mentioned in the VLM’s response.

For each object that appears in the ground truth, its base weight $\textbf{b}_i$ is assigned according to the same rule as Question 19. For objects not present in the ground truth, the weight varies depending on the nature of the question. In cases where mentioning additional objects is not considered a major error (e.g., Q24: “Identify important objects and describe their motion states”), these objects are assigned a relatively small weight, defaulting to $0.25$ . In contrast, for cases where extra objects are misleading (e.g., “Identify objects that will collide with the ego vehicle”), a higher penalty weight is assigned, typically $1.5$ .

Object Scores

Next, for each object that appears in both the ground truth and the VLM’s response, the evaluation model is called again to score only the portion of the answer that pertains to that specific object. This yields an individual score $\mathbf{s}_i$ for each object. The overall object-score is then computed as:

\mathbf{object}\text{-score} = \frac{\sum_{i} \mathbf{s}_i \cdot \mathbf{b_i}}{\sum_{i} \mathbf{b_i}}

The final score for this category is calculated as:

\mathbf{score}_{\text{listed}} = \mathbf{F1} \cdot \mathbf{object}\text{-score}

Question 43: Natural Language Description of the Ego Vehicle’s Intended Action; Question 50: Current Control Key Values

To evaluate the correctness of the responses to these questions, the evaluation model first parses the VLM’s natural language description to extract the corresponding control key values. These predicted key values are then compared with the ground truth key values to compute the F1-score.

Additionally, a speed penalty is introduced, which multiplies the F1-score to obtain the final score. The speed penalty is defined as follows:

speed_penalty = {
    'KEEP': {'ACCELERATE': 1.2, 'DECELERATE': 1.1, 'STOP': 1.0},
    'ACCELERATE': {'KEEP': 1.0, 'DECELERATE': 1.0, 'STOP': 1.0},
    'DECELERATE': {'KEEP': 1.0, 'ACCELERATE': 0.8, 'STOP': 1.0},
    'STOP': {'KEEP': 0.5, 'ACCELERATE': 0.25, 'DECELERATE': 1.0}
}

This means that, for example, predicting ACCELERATE when the correct answer is STOP incurs the heaviest penalty, while predicting ACCELERATE instead of KEEP may actually be favorable for traffic flow and thus reduces the penalty.

The final score is then given by:

\mathbf{score}_{\text{action}} = \mathbf{F1} \cdot \text{speed\_penalty}

Extra Guiding Prompts

For certain questions, Bench2Drive-VL provides guiding prompts to help steer the evaluation model's attention. These are given in the form of tailored prompts.

Question 15: About the current speed limit

Considering the importance of traffic regulation compliance and driving safety, this question imposes a higher penalty for overestimating the speed limit compared to underestimating it.

Question 8: About braking

In this case, it is possible that the ground truth requires the vehicle to stop, but the current vehicle speed is already zero. In such situations, the VLM may answer with “No braking is needed because the vehicle is already stopped.” This is also considered a valid response. The evaluation system will provide the evaluation model with an additional prompt to allow for multiple acceptable answers.

Question 19: Important Objects and Their Order in the Scene​

Position Weight pi\mathbf{p}_ipi​​

Base Weight bi\mathbf{b}_ibi​​

Final Weight wi\mathbf{w}_iwi​​

NDCG​

Weighted F1 Score​

Final Score​

Questions Involving Multiple Objects​

Weights​

Object Scores​

Question 43: Natural Language Description of the Ego Vehicle’s Intended Action; Question 50: Current Control Key Values​

Extra Guiding Prompts​

Question 15: About the current speed limit​

Question 8: About braking​