Evaluation Details
Our evaluation of VQA primarily focuses on the correctness of the answers. Since rule-based methods are not well-suited for evaluating natural language, Bench2Drive-VL introduces an additional large language model as an evaluator to assess the natural language answers.
The simplest evaluation method is to let the evaluator score the response directly. However, this is highly subjective and introduces variance. Therefore, for some questions, Bench2Drive-VL defines strict scoring rules to reduce instability in evaluation.
Question 19: Important Objects and Their Order in the Scene
In this question, the VLM is required to identify important objects in the scene and rank them in order of importance. In the DriveCommenter pipeline:
- Important objects are first identified.
- Each object is assigned attributes:
  - If the object plays a role in the current context, its `is_role` is set to `true`.
  - If the object is likely to collide with the ego vehicle, its `is_dangerous` is set to `true`.
- The objects are then sorted using the following priority: `is_role` objects > `is_dangerous` objects > normal objects. Within the same priority, nearer objects come first (see the sketch below).
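A minimal sketch of this ranking rule (the `SceneObject` fields and `ego_distance` attribute are illustrative, not the DriveCommenter API):

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    is_role: bool        # plays a role in the current scenario
    is_dangerous: bool   # likely to collide with the ego vehicle
    ego_distance: float  # meters from the ego vehicle

def importance_rank(objects: list[SceneObject]) -> list[SceneObject]:
    """Sort objects: is_role first, then is_dangerous, then the rest;
    within each tier, nearer objects come first."""
    def key(obj: SceneObject) -> tuple[int, float]:
        tier = 0 if obj.is_role else (1 if obj.is_dangerous else 2)
        return (tier, obj.ego_distance)
    return sorted(objects, key=key)
```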
The evaluation model considers both object correctness and order correctness. The procedure is as follows:
- A script extracts all objects involved in the ground truth answer.
- The VLM’s answer is passed to the evaluator to extract mentioned objects.
- To encourage consistency, the GT object list is also provided to the evaluator for reference.
Position Weight
Let the ground truth contain $n$ objects $o_1, o_2, \dots, o_n$, where a smaller index means higher priority. Define `ORDER_RATIO` (typically 5). Each index $i$ is then mapped to a position weight $w_{\text{pos}}(i)$ that decreases as $i$ grows, with `ORDER_RATIO` controlling the spread between the highest- and lowest-priority positions.
Base Weight
Each object also carries a base weight $w_{\text{base}}(o_i)$ determined by its category, following the same priority as the ordering rule (`is_role` > `is_dangerous` > normal objects).
Final Weight
For any object $o_i$ in the ground truth, the final weight combines the two factors:

$$w(o_i) = w_{\text{base}}(o_i) \cdot w_{\text{pos}}(i).$$
Objects not in the ground truth are assigned a penalty weight, equal to the minimum position weight divided by `EXTRA_RATIO` (typically 2):

$$w_{\text{extra}} = \frac{\min_{1 \le i \le n} w_{\text{pos}}(i)}{\text{EXTRA\_RATIO}}.$$
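The exact numerical form of the weights is implementation-defined; the sketch below is one consistent instantiation, with the linear position-weight decay and the product form of the final weight taken as assumptions:

```python
ORDER_RATIO = 5  # controls the spread between first and last position weight
EXTRA_RATIO = 2  # divisor for the penalty weight of objects outside the GT

def position_weight(i: int, n: int) -> float:
    """Weight of the i-th GT object (1-indexed); earlier objects weigh more.
    The linear decay from ORDER_RATIO down to 1 is an assumed form."""
    if n == 1:
        return float(ORDER_RATIO)
    return 1.0 + (ORDER_RATIO - 1.0) * (n - i) / (n - 1)

def final_weight(i: int, n: int, base: float = 1.0) -> float:
    # Final weight = base weight x position weight (assumed product form).
    return base * position_weight(i, n)

def penalty_weight(n: int) -> float:
    # Penalty weight for objects absent from the GT:
    # minimum position weight divided by EXTRA_RATIO.
    return min(position_weight(i, n) for i in range(1, n + 1)) / EXTRA_RATIO
```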
NDCG
Let $m$ be the number of overlapping objects between the VLM's answer and the ground truth, listed in the order the VLM mentions them.
- Discounted Cumulative Gain (DCG):

$$\mathrm{DCG} = \sum_{k=1}^{m} \frac{w(o_{r_k})}{\log_2(k+1)},$$

where $r_k$ is the index of the $k$-th predicted object in the GT list.
- Ideal DCG (IDCG): the DCG obtained when the same $m$ objects are listed in ground-truth order, i.e. with $r_1 \le r_2 \le \dots \le r_m$.

Then:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}.$$
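A minimal sketch of this computation, assuming `weights[i]` holds the final weight of the $i$-th GT object:

```python
import math

def ndcg(pred: list[str], gt: list[str], weights: list[float]) -> float:
    """NDCG over the objects mentioned by the VLM that also appear in the GT.
    Extra (non-GT) predictions are ignored here; they are punished through
    the weighted F1 score instead."""
    ranks = [gt.index(o) for o in pred if o in gt]  # r_k: GT index of k-th prediction
    dcg = sum(weights[r] / math.log2(k + 2) for k, r in enumerate(ranks))
    # Ideal ordering: the same objects listed in ground-truth order.
    idcg = sum(weights[r] / math.log2(k + 2) for k, r in enumerate(sorted(ranks)))
    return dcg / idcg if idcg > 0 else 0.0
```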
Weighted F1 Score
Let:
- TP = the total weight of predicted objects that appear in the ground truth (true positives)
- FP = the total penalty weight of predicted objects absent from the ground truth (false positives)
- FN = the total weight of ground-truth objects the VLM fails to mention (false negatives)

Then:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
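A sketch under the same assumptions, with per-object final weights `w` and a flat `penalty` weight for extra objects:

```python
def weighted_f1(pred: list[str], gt: list[str],
                w: dict[str, float], penalty: float) -> float:
    """Weighted F1: matched GT weight counts as TP, missed GT weight as FN,
    and each extra prediction contributes the penalty weight as FP."""
    tp = sum(w[o] for o in pred if o in gt)
    fn = sum(w[o] for o in gt if o not in pred)
    fp = penalty * sum(1 for o in pred if o not in gt)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```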
Final Score
The final score for Question 19 combines the two metrics, so the VLM is rewarded both for mentioning the correct objects (weighted F1) and for ranking them in the right order (NDCG).
Questions Involving Multiple Objects
This category includes Questions 24, 25, 27, 28, 29, 46, and 47. These questions share a common requirement: the VLM must identify all objects in the current scene that meet a specified condition and then analyze each identified object individually. The final score for this category consists of two components: the F1-score and an object-level accuracy metric, denoted as the object-score.
Weights
Similar to the previous type, the function first builds a reference list of target objects based on the ground truth annotations. It then uses the evaluation model to extract the list of objects actually mentioned in the VLM’s response.
For each object that appears in the ground truth, its base weight is assigned according to the same rule as in Question 19. For objects not present in the ground truth, the weight depends on the nature of the question. Where mentioning additional objects is not considered a major error (e.g., Q24: “Identify important objects and describe their motion states”), such objects receive a relatively small default weight. In contrast, where extra objects are misleading (e.g., “Identify objects that will collide with the ego vehicle”), a higher penalty weight is assigned.
Object Scores
Next, for each object that appears in both the ground truth and the VLM's response, the evaluation model is called again to score only the portion of the answer that pertains to that specific object, yielding an individual score $s(o)$ per object. The overall object-score is then the weight-normalized average:

$$\text{object-score} = \frac{\sum_{o \in M} w(o)\, s(o)}{\sum_{o \in M} w(o)},$$

where $M$ is the set of matched objects.
The final score for this category combines the F1-score with the object-score, rewarding both correct object identification and correct per-object analysis.
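A sketch of this step; both the weight-normalized mean and the multiplicative combination with the F1 score are assumptions:

```python
def category_score(per_object: dict[str, float], w: dict[str, float],
                   f1: float) -> float:
    """per_object maps each matched object to its evaluator score s(o);
    w maps objects to their final weights."""
    total_w = sum(w[o] for o in per_object)
    obj_score = (sum(w[o] * s for o, s in per_object.items()) / total_w
                 if total_w else 0.0)
    # Combine object identification (F1) with per-object analysis quality.
    return f1 * obj_score
```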
Question 43: Natural Language Description of the Ego Vehicle’s Intended Action; Question 50: Current Control Key Values
To evaluate the correctness of the responses to these questions, the evaluation model first parses the VLM’s natural language description to extract the corresponding control key values. These predicted key values are then compared with the ground truth key values to compute the F1-score.
Additionally, a speed penalty is introduced, which multiplies the F1-score to obtain the final score. The speed penalty is defined as follows:
```python
# speed_penalty[ground_truth_action][predicted_action]
speed_penalty = {
    'KEEP':       {'ACCELERATE': 1.2, 'DECELERATE': 1.1, 'STOP': 1.0},
    'ACCELERATE': {'KEEP': 1.0, 'DECELERATE': 1.0, 'STOP': 1.0},
    'DECELERATE': {'KEEP': 1.0, 'ACCELERATE': 0.8, 'STOP': 1.0},
    'STOP':       {'KEEP': 0.5, 'ACCELERATE': 0.25, 'DECELERATE': 1.0},
}
```
This means that, for example, predicting `ACCELERATE` when the correct answer is `STOP` incurs the heaviest penalty (multiplier 0.25), while predicting `ACCELERATE` instead of `KEEP` may actually benefit traffic flow and is therefore rewarded with a multiplier above 1.
The final score is then given by:

$$\text{Score} = F_1 \times \text{speed\_penalty}[\text{GT}][\text{pred}].$$
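Putting the pieces together, a minimal sketch using the `speed_penalty` table above (the key names and the exact set of control keys are illustrative; identical GT and predicted actions are assumed to carry multiplier 1.0):

```python
def control_score(pred_keys: dict[str, str], gt_keys: dict[str, str]) -> float:
    """F1 over control key-value pairs, scaled by the speed penalty."""
    pred_items, gt_items = set(pred_keys.items()), set(gt_keys.items())
    tp = len(pred_items & gt_items)
    fp = len(pred_items - gt_items)
    fn = len(gt_items - pred_items)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # Matching actions have no table entry; assume multiplier 1.0.
    mult = speed_penalty.get(gt_keys['speed'], {}).get(pred_keys['speed'], 1.0)
    return f1 * mult

# e.g. control_score({'speed': 'ACCELERATE', 'direction': 'STRAIGHT'},
#                    {'speed': 'STOP',       'direction': 'STRAIGHT'})
# -> F1 = 0.5, multiplier 0.25, final score 0.125
```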
Extra Guiding Prompts
For certain questions, Bench2Drive-VL supplies tailored guiding prompts that steer the evaluation model's attention toward question-specific criteria.
Question 15: About the current speed limit
Considering the importance of traffic regulation compliance and driving safety, this question imposes a higher penalty for overestimating the speed limit compared to underestimating it.
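As an illustration only (the actual prompt text in Bench2Drive-VL may differ), such a guiding prompt could read:

```python
SPEED_LIMIT_GUIDE = (
    "When judging the stated speed limit, penalize overestimates more "
    "heavily than underestimates: a limit higher than the ground truth "
    "risks a traffic violation, while a lower one is merely conservative."
)
```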
Question 8: About braking
In this case, the ground truth may require the vehicle to stop while the current vehicle speed is already zero. The VLM may then answer, “No braking is needed because the vehicle is already stopped,” which is also considered a valid response. The evaluation system therefore provides the evaluation model with an additional prompt allowing multiple acceptable answers.