Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Phillip Y. Lee¹, Jihyeon Je², Chanho Park¹, Mikaela Angelina Uy³, Leonidas Guibas², Minhyuk Sung¹

¹KAIST, ²Stanford University, ³NVIDIA

ICCV 2025

💬 Introduction

This repository contains the official implementation of Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (ICCV 2025).

We propose a framework that enables Vision-Language Models to perform spatial reasoning from arbitrary perspectives.

🔧 Get Started

We have tested the code with Python 3.10, CUDA 12.4, and PyTorch 2.4.1. Run the following commands to set up the environment.

# create conda env
conda create -n apc_vlm python=3.10 -y
conda activate apc_vlm
# install torch
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 -c pytorch -y 
# install vision module dependencies & download checkpoints
bash setup/setup_vision_modules.sh
# install other dependencies
pip install -r setup/requirements.txt
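
After installation, it can help to confirm that the installed packages match the tested versions listed above. The following is a minimal sketch, not part of the repository: the helper name `check_versions` and the `TESTED` table are hypothetical, and the script only reads package metadata through the standard library, so it also runs when a package is missing.

```python
from importlib import metadata

# Versions taken from the install command above; adjust if your setup differs.
TESTED = {"torch": "2.4.1", "torchvision": "0.19.1", "torchaudio": "2.4.1"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Compare only the public version, ignoring local tags such as "+cu124".
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, ok)
    return report


if __name__ == "__main__":
    for name, (have, ok) in check_versions(TESTED).items():
        print(f"{name}: installed={have} expected={TESTED[name]} ok={ok}")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the environment differs from the tested configuration.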

✏️ How to Run

We provide an easy-to-use notebook, run_APC.ipynb, for quickly testing our APC framework.

Alternatively, you can run inference directly with run_APC.py. For example:

python run_APC.py \
    --config apc/configs/qwenvl2_5_7b_instruct.yaml \
    --device_vlm cuda:0 \
    --device_vision cuda:0 \
    --image_path demo/sample_image_man.jpg \
    --prompt "If I stand at the person’s position facing where it is facing, is the table on the left or on the right of me?" \
    --save_dir outputs/demo/man_table \
    --visualize_trace \
    --return_conv_history
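
To run several image and prompt pairs in sequence, the command above can be wrapped in a small script. This is a hedged sketch rather than a utility shipped with the repository: `build_apc_command` and `run_batch` are hypothetical helper names, the flags are copied from the example command, and the script assumes it is launched from the repository root with both models on the same device.

```python
import subprocess
import sys


def build_apc_command(image_path, prompt, save_dir,
                      config="apc/configs/qwenvl2_5_7b_instruct.yaml",
                      device="cuda:0"):
    """Build the argument list for one run_APC.py call (flags from the README)."""
    return [
        sys.executable, "run_APC.py",
        "--config", config,
        "--device_vlm", device,
        "--device_vision", device,
        "--image_path", image_path,
        "--prompt", prompt,
        "--save_dir", save_dir,
        "--visualize_trace",
        "--return_conv_history",
    ]


def run_batch(jobs, dry_run=True):
    """Build one command per (image_path, prompt, save_dir) job.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [build_apc_command(*job) for job in jobs]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    jobs = [
        ("demo/sample_image_man.jpg",
         "If I stand at the person's position facing where it is facing, "
         "is the table on the left or on the right of me?",
         "outputs/demo/man_table"),
    ]
    for cmd in run_batch(jobs, dry_run=True):
        print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain apostrophes or question marks. Set `dry_run=False` to actually launch the runs.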

An example of the saved conversation history from APC is as follows:

🙌 Acknowledgements

Our implementation is built upon amazing projects including Qwen2.5-VL, Grounding DINO, Depth Pro, SAM, Orient Anything, Omni3D, Ovmono3D, and trimesh. We greatly thank all authors and contributors for open-sourcing their code and model checkpoints.

🔖 Citation

If you find our work useful, please consider citing:

@inproceedings{lee2025perspective,
  title={Perspective-aware reasoning in vision-language models via mental imagery simulation},
  author={Lee, Phillip Y and Je, Jihyeon and Park, Chanho and Uy, Mikaela Angelina and Guibas, Leonidas and Sung, Minhyuk},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
