
APC Logo

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

Teaser

arXiv

Phillip Y. Lee¹, Jihyeon Je², Chanho Park¹, Mikaela Angelina Uy³, Leonidas Guibas², Minhyuk Sung¹

¹KAIST, ²Stanford University, ³NVIDIA

ICCV 2025

💬 Introduction

This repository contains the official implementation of Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (ICCV 2025).

We propose a framework that enables Vision-Language Models to perform spatial reasoning in arbitrary perspectives.

🔧 Get Started

We have tested with Python 3.10, CUDA 12.4, and PyTorch 2.4.1. Follow the scripts below to set up the environment.

```bash
# create conda env
conda create -n apc_vlm python=3.10 -y
conda activate apc_vlm
# install torch
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 -c pytorch -y
# install vision module dependencies & download checkpoints
bash setup/setup_vision_modules.sh
# install other dependencies
pip install -r setup/requirements.txt
```
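After installation, you can quickly confirm that the key packages resolved correctly. The snippet below is a minimal sanity-check sketch (a hypothetical helper, not part of this repository):

```python
import importlib.util

def check_env():
    """Return availability flags for the core dependencies installed above."""
    mods = ["torch", "torchvision", "torchaudio"]
    return {m: importlib.util.find_spec(m) is not None for m in mods}

if __name__ == "__main__":
    for name, ok in check_env().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any module is reported missing, re-run the corresponding install step above.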

✏️ How to Run

We provide an easy-to-use notebook, `run_APC.ipynb`, for quickly testing our APC framework.

Alternatively, you can run inference directly with `run_APC.py`. For example:

```bash
python run_APC.py \
    --config apc/configs/qwenvl2_5_7b_instruct.yaml \
    --device_vlm cuda:0 \
    --device_vision cuda:0 \
    --image_path demo/sample_image_man.jpg \
    --prompt "If I stand at the person's position facing where it is facing, is the table on the left or on the right of me?" \
    --save_dir outputs/demo/man_table \
    --visualize_trace \
    --return_conv_history
```
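To run the same inference over many images or prompts, it can be convenient to assemble the command programmatically. The helper below is a hypothetical sketch (not part of the repository) that mirrors the flags in the example above:

```python
def build_apc_command(image_path, prompt, save_dir,
                      config="apc/configs/qwenvl2_5_7b_instruct.yaml",
                      device="cuda:0"):
    """Assemble the run_APC.py argument list using the flags shown above."""
    return [
        "python", "run_APC.py",
        "--config", config,
        "--device_vlm", device,
        "--device_vision", device,
        "--image_path", image_path,
        "--prompt", prompt,
        "--save_dir", save_dir,
        "--visualize_trace",
        "--return_conv_history",
    ]
```

Each command can then be launched with `subprocess.run(build_apc_command(...), check=True)` from the repository root.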

An example of the saved conversation history from APC is as follows:

Inference example

🙌 Acknowledgements

Our implementation is built upon amazing projects including Qwen2.5-VL, Grounding DINO, Depth Pro, SAM, Orient Anything, Omni3D, Ovmono3D, and trimesh. We sincerely thank all authors and contributors for open-sourcing their code and model checkpoints.

🔖 Citation

If you find our work useful, please consider citing:

```bibtex
@inproceedings{lee2025perspective,
  title={Perspective-aware reasoning in vision-language models via mental imagery simulation},
  author={Lee, Phillip Y and Je, Jihyeon and Park, Chanho and Uy, Mikaela Angelina and Guibas, Leonidas and Sung, Minhyuk},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
```
