Welcome to the official PyTorch implementation of our series of works on Cross Modal Generalization (CMG) and Multimodal Unified Representations.
Paper: Open-set Cross Modal Generalization via Multimodal Unified Representation
Conference: ICCV 2025
Code: 📂 ICCV25-OSCMG
Paper: Enhancing Multimodal Unified Representations for Cross Modal Generalization
Conference: ACL 2025 (Findings)
Code: 📂 ACL25-FCID&TOC
Paper: Achieving Cross Modal Generalization with Multimodal Unified Representation
Conference: NeurIPS 2023
Code: Current directory (root of this repo)
Due to a version conflict between bert_embedding's pinned NumPy dependency and other libraries, installing directly from requirements.txt may cause issues. For more details, you can refer to this issue.
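If you hit this conflict, one possible workaround (a sketch based on general pip behavior, not a step verified for this repo) is to install bert_embedding without its pinned dependencies and then pick a NumPy version that the remaining packages accept:

# Hedged workaround sketch: bert_embedding pins a specific (old) NumPy, so install it
# without dependencies and choose a NumPy version compatible with the rest of the stack.
pip install bert_embedding --no-deps
pip install "numpy<1.20"  # example pin only; adjust to what your other packages require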
git clone https://github.com/haihuangcode/CMG
cd CMG
# You don't have to install every library in requirements.txt; install them as needed.
# Python 3.7 is recommended, as some of the libraries used do not support newer Python versions.
conda create -n your_env_name python=3.7
pip install -r requirements.txt
# Before you begin pretraining, please make sure to modify the file paths under `args.dataset_name == 'vggsound_AVT'` in `pretrain.py` to your own paths.
# Additionally, update the `file_path` and `self.label2prompt = pd.read_csv('')` paths in `dataset/VGGSOUND_dataset.py`.
# The model save path is located under `--model_save_path` in `configs/opts.py`.
# Please also remember to modify the paths related to downstream tasks and the corresponding dataset paths to your own paths.
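As a convenience, you can grep for the identifiers mentioned above to locate every hard-coded path that needs editing (a sketch only; it assumes you run it from the directory containing pretrain.py, and the relative locations may differ in your checkout):

# Hedged helper: list the lines with hard-coded paths to edit before pretraining.
grep -n "vggsound_AVT" pretrain.py
grep -n "file_path\|label2prompt" dataset/VGGSOUND_dataset.py
grep -n "model_save_path" configs/opts.py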
cd CMG/code/src
./pretrain.sh

cd CMG/code/src
./ave.sh

cd CMG/code/src
./avvp.sh

cd CMG/code/src
./ave_avvp.sh

cd CMG/code/src
./ucf_vggsound.sh

cd CMG/code/AVSBench_downstream/avs_scripts/avs_s4
./train.sh
./test.sh

If you find this work useful, please consider citing it.
@article{huang2025open,
  title={Open-set Cross Modal Generalization via Multimodal Unified Representation},
  author={Huang, Hai and Xia, Yan and Wang, Shulei and Wang, Hanting and Fang, Minghui and Ji, Shengpeng and Zhou, Sashuai and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2507.14935},
  year={2025}
}

@article{huang2024enhancing,
  title={Enhancing Multimodal Unified Representations for Cross Modal Generalization},
  author={Huang, Hai and Xia, Yan and Ji, Shengpeng and Wang, Shulei and Wang, Hanting and Fang, Minghui and Zhu, Jieming and Dong, Zhenhua and Zhou, Sashuai and Zhao, Zhou},
  journal={arXiv preprint arXiv:2403.05168},
  year={2024}
}

@article{xia2024achieving,
  title={Achieving Cross Modal Generalization with Multimodal Unified Representation},
  author={Xia, Yan and Huang, Hai and Zhu, Jieming and Zhao, Zhou},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
You can download the data from Baidu Netdisk, Google Drive, or Huggingface.
data (pwd: 1234)
patch (pwd: 1234)
This is a patch for the earlier data errors. Please download the complete data from the links above, then replace the csv files in data/vggsound40k/data with the ones from the patch, specifically vggsound-avel40k.csv and video_name_vggsound40k_checked.csv. The unsatisfactory model training results reported in #13 were caused by the incomplete csv files uploaded earlier, which contained only 20k data entries. I apologize for not noticing this earlier /(ㄒoㄒ)/~~
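A minimal sketch of applying the patch (the source directory below is a hypothetical placeholder; point it at wherever you extracted the patch download):

# Hedged sketch: overwrite the incomplete csv files with the patched ones.
cp /path/to/patch/vggsound-avel40k.csv data/vggsound40k/data/
cp /path/to/patch/video_name_vggsound40k_checked.csv data/vggsound40k/data/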
CMG
├── checkpoint
├── cnt.pkl
├── code
├── data
├── figs
├── paper
├── README.md
└── requirements.txt
- For the video and audio feature extraction method, please refer to AVE; the text modality is generated from each label as a description-focused sentence of approximately 10 words.
- There is no validation set for the pre-training process. In this paper, each pretrained checkpoint is evaluated on the AVE downstream task, and the checkpoint that performs best there is used for the remaining downstream tasks, so AVE effectively serves as the validation set; the best pretrained model appears within the first 5 epochs.
- Pretraining can be performed on a single GPU, such as a 4090 or an A100, and the experimental results in the paper were obtained this way. Multi-GPU parallel training yielded worse model performance, possibly due to issues between the mutual information minimization design in DCID and PyTorch (an early experimental observation that was not re-verified after the code was finalized, since single-GPU pretraining was sufficient). A minimal single-GPU launch is sketched below.
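The sketch assumes pretrain.sh respects the standard CUDA_VISIBLE_DEVICES variable, which is typical for PyTorch scripts but was not verified here:

# Pin pretraining to a single GPU (GPU 0 in this example) before launching.
cd CMG/code/src
CUDA_VISIBLE_DEVICES=0 ./pretrain.sh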


