Phillip Y. Lee¹, Jihyeon Je², Chanho Park¹, Mikaela Angelina Uy³, Leonidas Guibas², Minhyuk Sung¹
¹KAIST, ²Stanford University, ³NVIDIA
ICCV 2025
This repository contains the official implementation of Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (ICCV 2025).
We propose a framework that enables Vision-Language Models to perform spatial reasoning from arbitrary perspectives.
We have tested the code with Python 3.10, CUDA 12.4, and PyTorch 2.4.1. Run the commands below to set up the environment.
# create conda env
conda create -n apc_vlm python=3.10 -y
conda activate apc_vlm
# install torch
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 -c pytorch -y
# install vision module dependencies & download checkpoints
bash setup/setup_vision_modules.sh
# install other dependencies
pip install -r setup/requirements.txt
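After installation, it can help to confirm that the environment matches the tested versions. The following is a minimal sketch, not part of the repository: the helper names `matches` and `check_environment` are hypothetical, and it assumes the conda-installed packages expose their versions through `importlib.metadata` under the names `torch`, `torchvision`, and `torchaudio`.

```python
import sys
from importlib import metadata

# Versions taken from the setup commands above.
EXPECTED = {"torch": "2.4.1", "torchvision": "0.19.1", "torchaudio": "2.4.1"}


def matches(installed, expected):
    """Return True if the versions agree, ignoring a local suffix such as '+cu124'."""
    return installed.split("+", 1)[0] == expected


def check_environment():
    """Hypothetical helper: map each component to True if it matches the tested version."""
    report = {"python": sys.version_info[:2] == (3, 10)}
    for name, expected in EXPECTED.items():
        try:
            report[name] = matches(metadata.version(name), expected)
        except metadata.PackageNotFoundError:
            # Package is not installed (or not visible to importlib.metadata).
            report[name] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'differs from the tested version'}")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the setup differs from the configuration we tested.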
We provide an easy-to-use notebook, run_APC.ipynb, for quickly testing our APC framework.
Alternatively, you can run inference directly with run_APC.py. For example:
python run_APC.py \
--config apc/configs/qwenvl2_5_7b_instruct.yaml \
--device_vlm cuda:0 \
--device_vision cuda:0 \
--image_path demo/sample_image_man.jpg \
--prompt "If I stand at the person’s position facing where it is facing, is the table on the left or on the right of me?" \
--save_dir outputs/demo/man_table \
--visualize_trace \
--return_conv_history
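To run several image and prompt pairs, the command above can be assembled programmatically. The sketch below is an illustration rather than part of the repository: `build_apc_command` is a hypothetical helper, and it uses only the flags shown in the example command. It prints each command as a dry run; replacing the `print` with `subprocess.run(cmd, check=True)` would execute it.

```python
import shlex


def build_apc_command(image_path, prompt, save_dir,
                      config="apc/configs/qwenvl2_5_7b_instruct.yaml",
                      device_vlm="cuda:0", device_vision="cuda:0"):
    """Hypothetical helper: return the argument list for one run_APC.py call."""
    return [
        "python", "run_APC.py",
        "--config", config,
        "--device_vlm", device_vlm,
        "--device_vision", device_vision,
        "--image_path", str(image_path),
        "--prompt", prompt,
        "--save_dir", str(save_dir),
        "--visualize_trace",
        "--return_conv_history",
    ]


if __name__ == "__main__":
    # (image, prompt, output directory) triples; the first mirrors the demo above.
    jobs = [
        ("demo/sample_image_man.jpg",
         "If I stand at the person's position facing where it is facing, "
         "is the table on the left or on the right of me?",
         "outputs/demo/man_table"),
    ]
    for image_path, prompt, save_dir in jobs:
        cmd = build_apc_command(image_path, prompt, save_dir)
        # Dry run: print a shell-quoted version of the command.
        print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when a prompt contains quotes or apostrophes.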
An example of the saved conversation history from APC is as follows:
Our implementation is built upon amazing projects including Qwen2.5-VL, Grounding DINO, Depth Pro, SAM, Orient Anything, Omni3D, Ovmono3D, and trimesh. We greatly thank all authors and contributors for open-sourcing their code and model checkpoints.
If you find our work useful, please consider citing:
@inproceedings{lee2025perspective,
title={Perspective-aware reasoning in vision-language models via mental imagery simulation},
author={Lee, Phillip Y and Je, Jihyeon and Park, Chanho and Uy, Mikaela Angelina and Guibas, Leonidas and Sung, Minhyuk},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}