Figure 1. Comparison of reasoning patterns. R1-like free-form reasoning (left) exhibits diffused attention, while SATORI's Glance-Focus-Think paradigm (right) guides the model's focus to task-relevant regions.
Abstract
DeepSeek-R1 has demonstrated powerful textual reasoning capabilities through reinforcement learning (RL). Recent multimodal studies often apply RL directly to generate R1-like free-form reasoning for multimodal reasoning tasks. Unlike textual tasks, however, multimodal tasks inherently demand comprehensive visual understanding to address complex problems. As a result, free-form reasoning faces two critical limitations in this setting: (1) extended reasoning chains diffuse visual focus away from task-relevant regions, degrading answer accuracy; and (2) unverifiable intermediate steps may substantially increase policy-gradient variance and computational overhead.
To this end, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which explicitly structures the multimodal reasoning process through a Glance-Focus-Think paradigm, converting free-form inference into verifiable reasoning. Specifically, SATORI first generates a global image caption, then shifts visual attention to task-relevant regions via key bounding boxes, and finally applies RL over these verifiable reasoning steps to produce accurate and interpretable answers. We further introduce VQA-Verify, a 12k-sample dataset with answer-aligned captions and bounding boxes to support this three-stage training. Experiments on ten multimodal reasoning benchmarks show consistent gains, with up to 15.7% higher accuracy than R1-like baselines.
Insight I: Quantifying Visual Focus
A key insight of our work is that extended free-form reasoning often leads to "visual attention deficiency," where the model loses focus on the image as the text sequence lengthens.
To quantify this, we measure Region Attention Density (RAD), which calculates the concentration of attention weights within task-relevant ground-truth regions. As shown in the figure, SATORI maintains significantly higher focus on relevant regions compared to free-form reasoning.
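For concreteness, below is a minimal sketch of how a RAD-style score could be computed from an attention map over image patches and a ground-truth region mask. The function name and normalization are our assumptions for illustration; the paper's exact definition may differ in detail.

```python
import numpy as np

def region_attention_density(attn_map: np.ndarray, region_mask: np.ndarray) -> float:
    """Share of total image-attention mass that falls inside the
    task-relevant ground-truth region.

    attn_map    -- (H, W) non-negative attention weights over image patches.
    region_mask -- (H, W) boolean mask marking the relevant region(s).
    """
    total = attn_map.sum()
    if total == 0:
        return 0.0
    return float(attn_map[region_mask].sum() / total)

# Example: 80% of the attention mass lies inside the region -> RAD = 0.8
# region_attention_density(np.array([[0.4, 0.1], [0.1, 0.4]]),
#                          np.array([[True, False], [False, True]]))
```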
Figure 2. RAD and accuracy distributions. The analysis reveals a positive correlation between attention density (RAD) and answer accuracy.
Insight II: Gradient Variance Reduction
Beyond visual focus, verifiable intermediate steps significantly stabilize training. In standard free-form reasoning, rewards are sparse and the thinking process is unstructured, leading to high variance in policy gradients.
By introducing verifiable intermediate rewards (Caption and BBox), SATORI provides a smoother optimization landscape. This results in a ~27% reduction in gradient variance and significantly faster convergence during RL training compared to baselines.
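As an illustration of how such a reduction could be measured, here is a small sketch that estimates gradient variance across a window of mini-batch updates. It assumes a PyTorch-style training loop; the paper's exact measurement protocol is not reproduced here.

```python
import torch

def gradient_variance(per_step_grads: list) -> float:
    """Average per-parameter variance of flattened gradient vectors
    collected over several consecutive mini-batch updates."""
    stacked = torch.stack([g.flatten() for g in per_step_grads])  # (steps, num_params)
    return stacked.var(dim=0, unbiased=True).mean().item()

# During RL training, append the concatenated gradient after each update, e.g.:
#   per_step_grads.append(
#       torch.cat([p.grad.flatten() for p in model.parameters()]).detach().clone()
#   )
```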
Figure 6. (a) Gradient variance and (b) gradient norm during RL training. SATORI exhibits lower gradient variance and faster convergence than free-form reasoning.
Methodology: SATORI
SATORI replaces the black-box free-form reasoning of standard MLLMs with a structured, verifiable pipeline. By mandating explicit grounding before reasoning, we reduce hallucination and improve convergence.
Figure 3. Overview of SATORI. The model follows a three-stage process: capturing global information (Glance), analyzing critical regions (Focus), and verifiable reasoning (Think), supervised by specific rewards.
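To make the three stages concrete, the sketch below shows how a Glance-Focus-Think trace could be parsed into the fields that the rewards operate on. The tag names are illustrative assumptions, not the released prompt format.

```python
import re

def parse_trace(text: str) -> dict:
    """Extract the Glance (caption), Focus (bbox), Think, and final answer
    segments from a tagged model output. Tag names are assumed for this sketch."""
    def grab(tag: str):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return {
        "caption": grab("caption"),  # Glance: global image description
        "bbox": grab("bbox"),        # Focus: key region coordinates
        "think": grab("think"),      # Think: short, verifiable reasoning
        "answer": grab("answer"),    # Final answer checked against ground truth
    }
```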
Verifiable Reward Functions
We define explicit rewards for caption quality and spatial grounding accuracy. The grounding reward scores the predicted key bounding boxes $\mathcal{P}$ against the ground-truth boxes $\mathcal{G}$ using a Union IoU:

$$ \mathcal{R}_{\text{bbox}} = \mathrm{UnionIoU}(\mathcal{P}, \mathcal{G}) $$
Union IoU Algorithm
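As a rough companion to the algorithm figure above, here is a minimal rasterized sketch of a Union-IoU computation between a set of predicted boxes and the ground-truth boxes. The grid resolution and the (x1, y1, x2, y2) box format are our assumptions.

```python
import numpy as np

def union_iou(pred_boxes, gt_boxes, size=1000):
    """IoU between the union of all predicted boxes and the union of all
    ground-truth boxes, approximated on a size x size grid.
    Boxes are (x1, y1, x2, y2) in the same coordinate frame."""
    def rasterize(boxes):
        mask = np.zeros((size, size), dtype=bool)
        for x1, y1, x2, y2 in boxes:
            mask[int(y1):int(y2), int(x1):int(x2)] = True
        return mask
    pred, gt = rasterize(pred_boxes), rasterize(gt_boxes)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)
```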
VQA-Verify Dataset
To support explicit supervision, we introduce VQA-Verify, the first multimodal VQA dataset with both bounding box and caption annotations for reasoning tasks. It spans 17 benchmark datasets across three hierarchical categories.
- Size: 12,000 Annotated Samples
- Annotations: (Image, Question, Answer, Caption, BBox)
- Categories: Perception, Reasoning, Multilingual
Figure 4. The VQA-Verify dataset composition, covering 17 benchmarks across 3 categories.
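For reference, a hedged sketch of what a single VQA-Verify record might look like, based on the annotation tuple listed above. The field names and example values are assumptions, not the released schema.

```python
# Illustrative layout of one VQA-Verify sample; actual field names and
# formats in the released dataset may differ.
sample = {
    "image": "images/000123.jpg",              # input image path (hypothetical)
    "question": "What is the price of the item circled on the receipt?",
    "answer": "7",
    "caption": "A receipt listing several grocery items with prices.",  # answer-aligned caption
    "bbox": [[120, 45, 380, 210]],             # key region(s), (x1, y1, x2, y2)
    "category": "Reasoning",                   # Perception / Reasoning / Multilingual
}
```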
Experimental Results
SATORI achieves state-of-the-art performance among models of comparable scale, significantly outperforming free-form reasoning baselines.
| Method | MathVista | Math-V | MathVerse | Olympiad | WeMath | MMStar | MMBench | MMMU |
|---|---|---|---|---|---|---|---|---|
| Closed-Source Model | ||||||||
| GPT-4o | 63.8 | 30.3 | 39.4 | 35.0 | 68.8 | 65.1 | 84.3 | 70.7 |
| Claude-3.5 Sonnet | 61.8 | 38.0 | - | - | - | 65.1 | 81.7 | 66.4 |
| Open-Source General Model (2-3B) | ||||||||
| Qwen2.5-VL-3B | 61.2 | 21.2 | 47.6 | 10.3 | 22.1 | 56.3 | 60.8 | 51.2 |
| InternVL3-2B | 57.6 | 21.7 | 25.3 | 9.6 | 22.4 | 61.1 | 78.0 | 48.7 |
| Open-Source Reasoning Model (2-3B) | ||||||||
| R1-VL-2B | 52.1 | 17.1 | 26.2 | - | - | 49.8 | - | - |
| Aquila-VL-2B | 59.0 | 18.4 | 26.2 | - | - | 54.9 | 75.2 | 46.9 |
| InternVL2.5-2B-MPO | 53.4 | - | - | - | - | 54.9 | 70.7 | 44.6 |
| VLAA-Thinker-3B | 61.0 | 24.4 | 36.4 | - | 23.2 | - | - | - |
| Our Model (3B) | ||||||||
| SATORI-3B w/o thinking | 60.9 | 21.7 | 32.2 | 10.9 | 25.6 | 55.9 | 76.5 | 54.7 |
| SATORI-3B | 67.4 | 26.1 | 39.8 | 13.5 | 30.1 | 56.7 | 76.9 | 56.9 |
| Open-Source General Model (7-11B) | ||||||||
| InternVL2.5-8B | 64.4 | 19.7 | 39.5 | 12.3 | 53.5 | 63.2 | 82.5 | 56.2 |
| InternVL3-8B | 71.6 | 29.3 | 39.8 | - | 37.1 | 68.7 | 82.1 | 62.2 |
| Qwen2.5-VL-7B | 68.2 | 25.4 | 47.9 | 20.2 | 62.1 | 64.1 | 82.2 | 58.0 |
| Open-Source Reasoning Model (7-11B) | ||||||||
| Adora-7B | 73.5 | 23.0 | 50.1 | 20.1 | 64.2 | - | - | - |
| InternVL2.5-8B-MPO | 68.9 | 21.5 | 35.5 | 7.8 | 53.5 | 62.5 | 76.5 | - |
| R1-Onevision-7B | 64.1 | 23.5 | 47.1 | 17.3 | 61.8 | - | - | - |
| OpenVLThinker-7B | 70.2 | 25.3 | 47.9 | 20.1 | 64.3 | - | - | - |
| MM-Eureka-7B | 73.0 | 26.9 | 50.3 | 20.1 | 66.1 | - | - | - |
| VL-Rethinker-7B | 73.7 | 30.1 | 54.6 | - | - | - | - | 56.7 |
| MMR1-7B | 72.0 | 31.8 | 55.4 | - | - | - | - | - |
| Our Model (7B) | ||||||||
| SATORI-7B w/o thinking | 71.3 | 30.2 | 49.2 | 20.4 | 64.1 | 69.7 | 82.0 | 60.6 |
| SATORI-7B | 76.2 | 32.7 | 56.9 | 23.7 | 65.2 | 69.5 | 82.9 | 63.6 |
Training Stability Analysis
A major benefit of SATORI is the reduction of policy-gradient variance. By introducing verifiable intermediate rewards (Caption and BBox), we provide a smoother optimization landscape compared to the sparse rewards of free-form reasoning.
Figure 6. (a) Gradient-variance comparison and (b) gradient norm during RL training. SATORI exhibits lower gradient variance and faster convergence than free-form reasoning.
Citation
@article{shen2025satori,
title={SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring},
author={Shen, Chuming and Wei, Wei and Qu, Xiaoye and Cheng, Yu},
journal={arXiv preprint},
year={2025}
}