Add SGI-Bench-1.0 #1358

vaporhug · Dec 15, 2025

No description provided.

mzr1996 · Dec 23, 2025

vlmeval/dataset/SGI_Bench_1_0/experimental_reasoning.py

+import re
+from typing import Any, Dict, List
+from datasets import load_dataset
+from vlmeval import *


Avoid multiple "from xxx import *" statements.
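To illustrate the concern, here is a minimal sketch of how two wildcard imports can clash. It uses the standard-library modules math and cmath purely as stand-ins (they are not part of this PR), and wildcard_names is a hypothetical helper that approximates what "from module import *" would bring in.

```python
import cmath
import math
from types import ModuleType
from typing import Set


def wildcard_names(module: ModuleType) -> Set[str]:
    """Approximate the names that `from module import *` would bind."""
    names = getattr(module, "__all__", None)
    if names is None:
        # Without __all__, a wildcard import takes every public name.
        names = [n for n in dir(module) if not n.startswith("_")]
    return set(names)


# Names exported by both modules: the second wildcard import would
# silently replace the first one's binding for each of these.
clashes = wildcard_names(math) & wildcard_names(cmath)
print(sorted(clashes))
```

Importing only the names a module actually uses keeps this kind of shadowing visible at the import site.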

mzr1996 · Dec 23, 2025

vlmeval/dataset/SGI_Bench_1_0/experimental_reasoning.py

+        return assistant_response
+
+
+judge = VLM('o4-mini')


Use judge_kwargs and build_judge to construct the judge model.
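A rough sketch of the pattern being asked for: build the judge inside evaluate from judge_kwargs instead of at import time. StubJudge and stub_build_judge are hypothetical stand-ins so the example runs without the toolkit; the real build_judge may take different arguments and return a different object.

```python
from typing import Any, Callable


class StubJudge:
    """Hypothetical stand-in for whatever a judge factory returns."""

    def __init__(self, model: str, **kwargs: Any) -> None:
        self.model = model
        self.kwargs = kwargs

    def generate(self, prompt: str) -> str:
        # A real judge would call a model API here.
        return f"[{self.model}] verdict for: {prompt}"


def stub_build_judge(**judge_kwargs: Any) -> StubJudge:
    """Stand-in for the toolkit's build_judge (assumed interface)."""
    return StubJudge(**judge_kwargs)


def evaluate(eval_file: str,
             build: Callable[..., Any] = stub_build_judge,
             **judge_kwargs: Any) -> str:
    # Use a default judge only when the caller did not choose one.
    judge_kwargs.setdefault("model", "o4-mini")
    # The judge is created per evaluation call, not as a module global.
    judge = build(**judge_kwargs)
    return judge.generate(f"score {eval_file}")
```

With this shape the judge model can be switched from the command line or config without editing the dataset file.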

mzr1996 · Dec 23, 2025

vlmeval/dataset/SGI_Bench_1_0/experimental_reasoning.py

+
+            try:
+                images = [decode_base64_to_image(b64) for b64 in row['step_images'] ]
+                llm_judge = judge(images, judge_prompt).strip()


It would be better to use track_progress_rich to run the API judge in parallel.
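The idea can be sketched with the standard library. This example deliberately swaps in concurrent.futures.ThreadPoolExecutor rather than track_progress_rich, so it runs without the toolkit; judge_one is a hypothetical placeholder for the per-row API call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def judge_one(task: Tuple[int, str]) -> Tuple[int, str]:
    """Placeholder for one judge API call (hypothetical)."""
    idx, prompt = task
    # A real implementation would send the prompt and images to the judge.
    return idx, prompt.strip().upper()


def judge_all(prompts: List[str], nproc: int = 4) -> List[str]:
    """Run judge_one over all prompts with a pool of worker threads."""
    tasks = list(enumerate(prompts))
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        # Each result carries its index, so original order can be restored.
        results = dict(pool.map(judge_one, tasks))
    return [results[i] for i in range(len(prompts))]
```

Because judge calls are network-bound, running them in a pool usually shortens evaluation considerably compared with a sequential loop over rows.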

mzr1996 · Dec 23, 2025

vlmeval/dataset/SGI_Bench_1_0/dry_experiment.py

+    def evaluate(self, eval_file, **judge_kwargs):
+        save_dir_last = 'sgi_code_logs'
+        global save_dir
+        work_dir = judge_kwargs.get('work_dir','./outputs')


Do not define the work dir in the dataset code. Use the eval_file path to get the evaluation work dir.

mzr1996 · Dec 23, 2025

vlmeval/dataset/SGI_Bench_1_0/dry_experiment.py

+            os.makedirs(os.path.join(tmp_data_dir, "0200"), exist_ok=True)
+            os.makedirs(os.path.join(tmp_data_dir, "0236"), exist_ok=True)
+
+            download_file("https://raw.githubusercontent.com/InternScience/SGI-Bench/main/evaluation/task_3_dry_experiment/data/SGI_DryExperiment_0206/t10k-images-idx3-ubyte.gz", tmp_data_dir+"/0206")


Download the extra resource files to a subdir of LMUDataRoot() instead of the evaluation work dir.

vaporhug added 3 commits December 15, 2025 18:15

Add SGI-Bench-1.0 (deep research , dry experiment , wet experiment , experimental reasoning)

a5ad992

Add readme for SGI-Bench-1.0

d05fa50

add llm as judge for deep research benchmark of SGI-Bench-1.0 and update readme of SGI-Bench-1.0

e95e4cf

mzr1996 requested changes Dec 23, 2025
