SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring

Chuming Shen1 Wei Wei1 Xiaoye Qu1 Yu Cheng2
1Huazhong University of Science and Technology 2The Chinese University of Hong Kong

Figure 1. Comparison of reasoning patterns. R1-like free-form reasoning (left) exhibits diffused attention, while SATORI's Glance-Focus-Think paradigm (right) guides the model's focus to task-relevant regions.

Abstract

DeepSeek-R1 has demonstrated powerful textual reasoning capabilities through reinforcement learning (RL). Recent multimodal studies often apply RL directly to generate R1-like free-form reasoning for multimodal reasoning tasks. Unlike textual tasks, however, multimodal tasks inherently demand comprehensive visual understanding to address complex challenges effectively. As a result, free-form reasoning faces two critical limitations in this setting: (1) extended reasoning chains diffuse visual focus away from task-relevant regions, degrading answer accuracy; (2) unverifiable intermediate steps may substantially increase policy-gradient variance and computational overhead.


To this end, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which explicitly structures the multimodal reasoning process through a Glance-Focus-Think paradigm, converting free-form inference into verifiable reasoning. Specifically, SATORI first generates a global image caption, then shifts visual attention to task-relevant regions via key bounding boxes, and finally applies RL over verifiable reasoning patterns to yield accurate and interpretable answers. Furthermore, we introduce VQA-Verify, a 12k-sample dataset with answer-aligned captions and bounding boxes to facilitate the three-stage training. Experiments demonstrate that SATORI achieves consistent performance improvements across ten multimodal reasoning benchmarks, with up to 15.7% accuracy improvement over R1-like baselines.

Insight I: Quantifying Visual Focus

A key insight of our work is that extended free-form reasoning often leads to "visual attention deficiency," where the model loses focus on the image as the text sequence lengthens.

To quantify this, we measure Region Attention Density (RAD), which captures how concentrated the model's attention weights are within task-relevant ground-truth regions. As shown in the figure, SATORI maintains significantly higher focus on relevant regions than free-form reasoning.
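As a concrete (if simplified) illustration, the sketch below computes a RAD-style score as the share of attention mass that falls inside ground-truth boxes. The function name, the dense (H, W) attention grid, and the box format are our own assumptions, not the paper's reference implementation.

```python
import numpy as np

def region_attention_density(attn_map, gt_boxes):
    """Share of total image attention that falls inside ground-truth regions.

    attn_map : (H, W) array of attention weights over image patches.
    gt_boxes : iterable of (x1, y1, x2, y2) boxes in the same grid coordinates.
    """
    mask = np.zeros(attn_map.shape, dtype=bool)
    for x1, y1, x2, y2 in gt_boxes:
        mask[y1:y2, x1:x2] = True          # mark the task-relevant region
    total = attn_map.sum()
    return float(attn_map[mask].sum() / total) if total > 0 else 0.0

# Toy example: attention concentrated on the top-left 2x2 ground-truth box.
attn = np.array([[0.2, 0.2, 0.0, 0.0],
                 [0.2, 0.2, 0.0, 0.0],
                 [0.0, 0.0, 0.1, 0.0],
                 [0.0, 0.0, 0.0, 0.1]])
print(region_attention_density(attn, [(0, 0, 2, 2)]))  # 0.8
```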


Figure 2. RAD and accuracy distributions. The analysis reveals a positive correlation between attention density (RAD) and answer accuracy.

Insight II: Gradient Variance Reduction

Beyond visual focus, verifiable intermediate steps significantly stabilize training. In standard free-form reasoning, rewards are sparse and the thinking process is unstructured, leading to high variance in policy gradients.

By introducing verifiable intermediate rewards (Caption and BBox), SATORI provides a smoother optimization landscape. This results in a ~27% reduction in gradient variance and significantly faster convergence during RL training compared to baselines.
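For completeness, one way to make such a variance claim measurable is to collect the flattened gradient over several rollout mini-batches and compute the element-wise variance across them. The sketch below is a generic measurement, not the paper's instrumentation.

```python
import torch

def gradient_variance(model, losses):
    """Mean element-wise variance of the policy gradient across mini-batch losses."""
    grads = []
    for loss in losses:                      # one scalar loss per rollout mini-batch
        model.zero_grad()
        loss.backward(retain_graph=True)
        flat = torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None])
        grads.append(flat.clone())
    stacked = torch.stack(grads)             # (num_batches, num_params)
    return stacked.var(dim=0).mean().item()  # averaged over parameters
```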


Figure 6. SATORI exhibits lower gradient variance (left) and faster convergence (right) compared to free-form reasoning.

Methodology: SATORI

SATORI replaces the black-box free-form reasoning of standard MLLMs with a structured, verifiable pipeline. By mandating explicit grounding before reasoning, we reduce hallucination and improve convergence.


Figure 3. Overview of SATORI. The model follows a three-stage process: capturing global information (Glance), analyzing critical regions (Focus), and performing verifiable reasoning (Think), each stage supervised by a specific reward.
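For the intermediate stages to receive rewards, the model's response has to be emitted in a machine-checkable structure. Below is a hypothetical parsing sketch; the tag names (`<caption>`, `<bbox>`, `<think>`, `<answer>`) and the JSON box format are illustrative assumptions rather than the paper's exact prompt template.

```python
import json
import re

STAGE_TAGS = ("caption", "bbox", "think", "answer")   # hypothetical stage tags

def parse_satori_output(text):
    """Extract each Glance-Focus-Think stage from a tagged response."""
    out = {}
    for tag in STAGE_TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        out[tag] = m.group(1).strip() if m else None
    if out["bbox"]:
        out["bbox"] = json.loads(out["bbox"])          # e.g. [[x1, y1, x2, y2], ...]
    return out

sample = ("<caption>A bus stopped beside a red octagonal sign.</caption>"
          "<bbox>[[312, 88, 370, 150]]</bbox>"
          "<think>The sign next to the bus is octagonal and red, so it is a stop sign.</think>"
          "<answer>stop sign</answer>")
print(parse_satori_output(sample)["answer"])           # "stop sign"
```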

Verifiable Reward Functions

We define explicit rewards for caption quality and spatial grounding accuracy.

$$ \mathcal{R}_{\text{caption}} = \tfrac{1}{2}\left(\text{BLEU-4}_{\text{smooth}} + \text{ROUGE-L}_{\text{F1}}\right) $$
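Taken at face value, the caption reward averages a smoothed sentence-level BLEU-4 score and a ROUGE-L F1 score against the reference caption. A minimal sketch, using NLTK for BLEU and a hand-rolled LCS for ROUGE-L (the tokenization and smoothing method are our assumptions):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(pred, ref):
    """ROUGE-L F1 from the longest common subsequence of two token lists."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pred[i] == ref[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return 2 * p * r / (p + r)

def caption_reward(pred, ref):
    pred_tok, ref_tok = pred.lower().split(), ref.lower().split()
    bleu4 = sentence_bleu([ref_tok], pred_tok,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    return 0.5 * (bleu4 + rouge_l_f1(pred_tok, ref_tok))

print(caption_reward("a red bus parked near a stop sign",
                     "a red bus next to a stop sign"))
```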

$$ \mathcal{R}_{bbox} = \text{Union IoU}(\mathcal{P}, \mathcal{G}) $$

Union IoU Algorithm
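The bbox reward scores the union of all predicted boxes against the union of all ground-truth boxes, rather than matching boxes one-to-one. A simple rasterized approximation of that idea (our own sketch; the paper's Union IoU algorithm may compute exact rectangle unions):

```python
import numpy as np

def union_iou(pred_boxes, gt_boxes, img_h, img_w):
    """IoU between the union of predicted boxes and the union of ground-truth boxes.

    Boxes are (x1, y1, x2, y2) in pixel coordinates; rasterizing onto the image
    grid trades exactness for simplicity.
    """
    def union_mask(boxes):
        mask = np.zeros((img_h, img_w), dtype=bool)
        for x1, y1, x2, y2 in boxes:
            mask[int(y1):int(y2), int(x1):int(x2)] = True
        return mask

    pred, gt = union_mask(pred_boxes), union_mask(gt_boxes)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

# One predicted box partially covering two ground-truth regions.
print(union_iou([(10, 10, 60, 60)], [(20, 20, 70, 70), (0, 0, 15, 15)],
                img_h=100, img_w=100))
```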

VQA-Verify Dataset

To support explicit supervision, we introduce VQA-Verify, the first multimodal VQA dataset with both bounding box and caption annotations for reasoning tasks. It spans 17 benchmark datasets across three hierarchical categories.

  • Size: 12,000 Annotated Samples
  • Annotations: (Image, Question, Answer, Caption, BBox)
  • Categories: Perception, Reasoning, Multilingual

Figure 4. The VQA-Verify dataset composition, covering 17 benchmarks across 3 categories.
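To make the annotation tuple listed above concrete, a hypothetical VQA-Verify record might look like the following; the field names and values are illustrative, not the released schema.

```python
# Hypothetical sample with an answer-aligned caption and task-relevant boxes.
sample = {
    "image": "images/000123.jpg",
    "question": "What color is the sign next to the bus?",
    "answer": "red",
    "caption": "A bus stopped at an intersection beside a red stop sign.",
    "bbox": [[312, 88, 370, 150]],   # (x1, y1, x2, y2) of the key region
    "category": "Perception",        # one of: Perception, Reasoning, Multilingual
}
```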

Experimental Results

SATORI achieves state-of-the-art performance among comparable models, significantly outperforming free-form reasoning baselines.

| Method | MathVista | Math-V | MathVerse | Olympiad | WeMath | MMStar | MMBench | MMMU |
|---|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | | |
| GPT-4o | 63.8 | 30.3 | 39.4 | 35.0 | 68.8 | 65.1 | 84.3 | 70.7 |
| Claude-3.5 Sonnet | 61.8 | 38.0 | - | - | - | 65.1 | 81.7 | 66.4 |
| **Open-Source General Models (2-3B)** | | | | | | | | |
| Qwen2.5-VL-3B | 61.2 | 21.2 | 47.6 | 10.3 | 22.1 | 56.3 | 60.8 | 51.2 |
| InternVL3-2B | 57.6 | 21.7 | 25.3 | 9.6 | 22.4 | 61.1 | 78.0 | 48.7 |
| **Open-Source Reasoning Models (2-3B)** | | | | | | | | |
| R1-VL-2B | 52.1 | 17.1 | 26.2 | - | - | 49.8 | - | - |
| Aquila-VL-2B | 59.0 | 18.4 | 26.2 | - | - | 54.9 | 75.2 | 46.9 |
| InternVL2.5-2B-MPO | 53.4 | - | - | - | - | 54.9 | 70.7 | 44.6 |
| VLAA-Thinker-3B | 61.0 | 24.4 | 36.4 | - | 23.2 | - | - | - |
| **Our Model (3B)** | | | | | | | | |
| SATORI-3B w/o thinking | 60.9 | 21.7 | 32.2 | 10.9 | 25.6 | 55.9 | 76.5 | 54.7 |
| SATORI-3B | 67.4 | 26.1 | 39.8 | 13.5 | 30.1 | 56.7 | 76.9 | 56.9 |
| **Open-Source General Models (7-11B)** | | | | | | | | |
| InternVL2.5-8B | 64.4 | 19.7 | 39.5 | 12.3 | 53.5 | 63.2 | 82.5 | 56.2 |
| InternVL3-8B | 71.6 | 29.3 | 39.8 | - | 37.1 | 68.7 | 82.1 | 62.2 |
| Qwen2.5-VL-7B | 68.2 | 25.4 | 47.9 | 20.2 | 62.1 | 64.1 | 82.2 | 58.0 |
| **Open-Source Reasoning Models (7-11B)** | | | | | | | | |
| Adora-7B | 73.5 | 23.0 | 50.1 | 20.1 | 64.2 | - | - | - |
| InternVL2.5-8B-MPO | 68.9 | 21.5 | 35.5 | 7.8 | 53.5 | 62.5 | 76.5 | - |
| R1-Onevision-7B | 64.1 | 23.5 | 47.1 | 17.3 | 61.8 | - | - | - |
| OpenVLThinker-7B | 70.2 | 25.3 | 47.9 | 20.1 | 64.3 | - | - | - |
| MM-Eureka-7B | 73.0 | 26.9 | 50.3 | 20.1 | 66.1 | - | - | - |
| VL-Rethinker-7B | 73.7 | 30.1 | 54.6 | - | - | - | - | 56.7 |
| MMR1-7B | 72.0 | 31.8 | 55.4 | - | - | - | - | - |
| **Our Model (7B)** | | | | | | | | |
| SATORI-7B w/o thinking | 71.3 | 30.2 | 49.2 | 20.4 | 64.1 | 69.7 | 82.0 | 60.6 |
| SATORI-7B | 76.2 | 32.7 | 56.9 | 23.7 | 65.2 | 69.5 | 82.9 | 63.6 |

Training Stability Analysis

A major benefit of SATORI is the reduction of policy-gradient variance. By introducing verifiable intermediate rewards (Caption and BBox), we provide a smoother optimization landscape than the sparse rewards of free-form reasoning, as shown in Figure 6 above.


Citation

@article{shen2025satori,
  title={SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring},
  author={Shen, Chuming and Wei, Wei and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint},
  year={2025}
}