Version: Next

Graph Configurations

Take this config as an example:

"CHAIN": {
    "NODE": [19, 15, 7, 24, 13, 47, 8, 43, 50],
    "EDGE": {
        "19": [24, 13, 8],
        "15": [7, 8],
        "7": [8],
        "24": [13, 47],
        "13": [47, 8, 43],
        "47": [8],
        "8": [43],
        "43": [50],
        "50": []
    },
    "INHERIT": {
        "19": [43, 7],
        "15": [7]
    },
    "USE_GT": [24]
}

In this config file, every number is a question ID (qid).

Fields

  • NODE (type: list): List of all questions used in inference.

  • EDGE (type: dict[str, list]): A dictionary representing all edges in the graph, where each key is a predecessor node and its value is the list of successor nodes.

  • INHERIT (type: dict[str, list]): A dictionary where each key is a question that uses context from previous-frame questions, and each value is the list of those inherited questions.

  • USE_GT (type: list): List of questions that are answered directly using ground truth, without involving the VLM.
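
The relationships between these fields can be checked mechanically. The following is a minimal sketch, not part of the project's code: the function name `validate_chain` is hypothetical, and the snippet assumes the `"CHAIN"` entry sits inside an enclosing JSON object. Note that the keys of EDGE and INHERIT are strings while NODE and USE_GT hold integers, so keys are converted before comparison.

```python
import json

# The example config, wrapped in braces so that it parses as a JSON document
# (the enclosing object is an assumption of this sketch).
CONFIG_TEXT = """
{
  "CHAIN": {
    "NODE": [19, 15, 7, 24, 13, 47, 8, 43, 50],
    "EDGE": {
      "19": [24, 13, 8], "15": [7, 8], "7": [8], "24": [13, 47],
      "13": [47, 8, 43], "47": [8], "8": [43], "43": [50], "50": []
    },
    "INHERIT": {"19": [43, 7], "15": [7]},
    "USE_GT": [24]
  }
}
"""


def validate_chain(chain):
    """Return a list of problems found; an empty list means the fields agree."""
    nodes = set(chain["NODE"])
    problems = []
    # Every edge endpoint should be a declared node.
    for src, dsts in chain["EDGE"].items():
        if int(src) not in nodes:
            problems.append(f"EDGE key {src} is not in NODE")
        for dst in dsts:
            if dst not in nodes:
                problems.append(f"EDGE {src} -> {dst}: {dst} is not in NODE")
    # Every question that inherits context should be a declared node.
    for qid in chain.get("INHERIT", {}):
        if int(qid) not in nodes:
            problems.append(f"INHERIT key {qid} is not in NODE")
    # Every ground-truth question should be a declared node.
    for qid in chain.get("USE_GT", []):
        if qid not in nodes:
            problems.append(f"USE_GT entry {qid} is not in NODE")
    return problems


chain = json.loads(CONFIG_TEXT)["CHAIN"]
print(validate_chain(chain))  # [] for the example config
```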

Inference procedure

Before reasoning begins, the program performs a topological sort on all nodes, then proceeds with node-by-node inference. Note that for some object-wise questions, a single node may correspond to multiple questions for different objects. For each question,

  1. All immediate predecessor nodes and the VLM answers to those nodes will be included in the context of the current question's prompt.

  2. If the current question ID is a key in the INHERIT field, the program will look up all qids listed in the corresponding value list from the previous frame and also include them and their answers in the context.

  3. If any context question is listed in the USE_GT field, its ground-truth answer is placed in the context instead of a VLM answer.
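
The three steps above can be sketched as a single context-building function. This is an illustrative sketch under stated assumptions rather than the project's implementation: `predecessors`, `build_context`, and the answer dictionaries are hypothetical names, the relative placement of current-frame and previous-frame entries is chosen arbitrarily, and ground truth is substituted only for current-frame context questions.

```python
# EDGE, INHERIT and USE_GT copied from the example config.
CHAIN = {
    "EDGE": {
        "19": [24, 13, 8], "15": [7, 8], "7": [8], "24": [13, 47],
        "13": [47, 8, 43], "47": [8], "8": [43], "43": [50], "50": [],
    },
    "INHERIT": {"19": [43, 7], "15": [7]},
    "USE_GT": [24],
}


def predecessors(edge):
    """Invert EDGE: map each qid to the list of its immediate predecessors."""
    preds = {}
    for src, dsts in edge.items():
        for dst in dsts:
            preds.setdefault(dst, []).append(int(src))
    return preds


def build_context(qid, chain, vlm_answers, gt_answers, prev_frame_answers):
    """Collect (frame, qid, answer) triples forming the context of `qid`."""
    use_gt = set(chain.get("USE_GT", []))
    context = []
    # Steps 1 and 3: immediate predecessors from the current frame, with
    # ground truth replacing the VLM answer for questions listed in USE_GT.
    for pred in predecessors(chain["EDGE"]).get(qid, []):
        answer = gt_answers[pred] if pred in use_gt else vlm_answers[pred]
        context.append(("current", pred, answer))
    # Step 2: questions inherited from the previous frame (INHERIT keys are strings).
    for inherited in chain.get("INHERIT", {}).get(str(qid), []):
        context.append(("previous", inherited, prev_frame_answers[inherited]))
    return context


# Placeholder answers, for illustration only.
vlm = {19: "vlm answer to 19"}
gt = {24: "ground truth for 24"}
prev = {43: "previous-frame answer to 43", 7: "previous-frame answer to 7"}
print(build_context(13, CHAIN, vlm, gt, prev))
print(build_context(19, CHAIN, vlm, gt, prev))
```

Sorting the collected entries into the final context order is deliberately left out of this sketch.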

Context order

The order of questions in the context respects the same topological order used during inference.
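
One way to apply this rule is to sort the context questions by their position in the inference order. The snippet below is a hedged sketch: `order_context` is a hypothetical helper, and it only covers questions from a single frame, since the text does not say how inherited previous-frame questions are placed relative to current-frame ones.

```python
def order_context(context_qids, inference_order):
    """Sort context qids by their position in the topological inference order."""
    position = {qid: index for index, qid in enumerate(inference_order)}
    return sorted(context_qids, key=position.__getitem__)


# One valid topological order for the example config.
inference_order = [19, 24, 13, 47, 15, 7, 8, 43, 50]

# Immediate predecessors of question 8 according to EDGE.
print(order_context({7, 13, 15, 19, 47}, inference_order))  # [19, 13, 47, 15, 7]
```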

Example

For the example given above, a valid single-frame inference sequence is as follows:

  1. Infer question 19 using the question–answer pairs of question 43 and question 7 from the previous frame as context.
  2. Question 24 is listed in USE_GT, so VLM inference is skipped and its answer is taken from ground truth (GT).
  3. Infer question 13 using the question–answer pairs of question 19 and question 24 from the current frame as context, where the answer to question 24 is taken from GT.
  4. Infer question 47 using the question–answer pairs of question 24 and question 13 from the current frame as context, where the answer to question 24 is taken from GT.
  5. Infer question 15 using the question–answer pair of question 7 from the previous frame as context.
  6. Infer question 7 using the question–answer pair of question 15 from the current frame as context.
  7. Infer question 8 using the question–answer pairs of questions 19, 13, 47, 15, and 7 from the current frame as context.
  8. Infer question 43 using the question–answer pairs of questions 13 and 8 from the current frame as context.
  9. Infer question 50 using the question–answer pair of question 43 from the current frame as context.
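
The sequence above is only one of several valid topological orders. As a rough sketch, the check below confirms that it respects every edge in EDGE, and also produces an order with Python's standard `graphlib` module (Python 3.9+). The helper names `topological_order` and `is_valid_order` are hypothetical, and the order returned by `graphlib` may differ from the sequence listed here while still being valid.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

NODE = [19, 15, 7, 24, 13, 47, 8, 43, 50]
EDGE = {
    "19": [24, 13, 8], "15": [7, 8], "7": [8], "24": [13, 47],
    "13": [47, 8, 43], "47": [8], "8": [43], "43": [50], "50": [],
}


def topological_order(nodes, edge):
    """Return one valid inference order; raises graphlib.CycleError on a cycle."""
    sorter = TopologicalSorter()
    for qid in nodes:
        sorter.add(qid)  # register nodes, including those without edges
    for src, dsts in edge.items():
        for dst in dsts:
            sorter.add(dst, int(src))  # src must be inferred before dst
    return list(sorter.static_order())


def is_valid_order(order, nodes, edge):
    """Check that `order` contains every node once and respects every edge."""
    if sorted(order) != sorted(nodes):
        return False
    position = {qid: index for index, qid in enumerate(order)}
    return all(
        position[int(src)] < position[dst]
        for src, dsts in edge.items()
        for dst in dsts
    )


DOC_SEQUENCE = [19, 24, 13, 47, 15, 7, 8, 43, 50]  # the sequence listed above
print(is_valid_order(DOC_SEQUENCE, NODE, EDGE))  # True
print(is_valid_order(topological_order(NODE, EDGE), NODE, EDGE))  # True
```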

Corresponding graph