Ayush Zenith · Arnold Zumbrun · Neel Raut · Jing Lin
The performance of machine learning models depends heavily on the quality of training data. The scarcity of large, well-annotated datasets poses a significant challenge for building robust models. Synthetic data, generated via simulations and generative models, offers a promising solution by increasing dataset diversity and improving model performance, reliability, and resilience. However, evaluating the quality of synthetic data requires an effective metric.
The Synthetic Dataset Quality Metric (SDQM) is introduced to assess data quality for object detection tasks without requiring full model training. SDQM enables efficient generation and selection of synthetic datasets, addressing key challenges in resource-constrained environments. In our experiments, SDQM showed a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, outperforming previous metrics, which achieved only moderate or weak correlations. SDQM also provides actionable insights for improving dataset quality, reducing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data.
If you find this code or work useful in your own research, please consider citing it as follows:
```bibtex
@misc{zenith2025sdqmsyntheticdataquality,
  title={SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation},
  author={Ayush Zenith and Arnold Zumbrun and Neel Raut and Jing Lin},
  year={2025},
  eprint={2510.06596},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.06596},
}
```
This repository provides code to calculate each of SDQM's submetrics:
- MAUVE
- Frontier Integral
- V-Info
- α-Precision
- β-Recall
- Authenticity
- Cluster Metric
- Dataset Separability
- Spatial Distribution Difference
- Label Overlap
- Pixel Intensity Match
- Bounding Box Match
and the integrated super metric, SDQM.
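As a flavor of what these submetrics capture, here is a minimal, self-contained sketch of the real-versus-synthetic separability idea: train a simple probe to distinguish real from synthetic feature vectors, and read a low AUC as "hard to tell apart." This illustrates the concept only; the repository's Dataset Separability submetric is computed by its own code.

```python
# Illustrative sketch only: the repository computes Dataset Separability with its
# own code; this just demonstrates the underlying idea with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_auc(real_feats, synth_feats):
    """Cross-validated AUC of a linear probe separating real from synthetic features.

    AUC near 0.5: the two sets are hard to tell apart (small domain gap).
    AUC near 1.0: trivially separable (large domain gap).
    """
    X = np.vstack([real_feats, synth_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(synth_feats))])
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean())

# Random vectors stand in for image embeddings here:
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 64))
synth = rng.normal(0.3, 1.0, size=(200, 64))  # mild distribution shift
print(f"separability AUC ~ {separability_auc(real, synth):.3f}")
```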
Check out the DeepWiki for more info: https://deepwiki.com/ayushzenith/SDQM
First, install the required packages:
```bash
cd SDQM
pip install -r requirements.txt
```

Then, install the customized ultralytics package:

```bash
cd dataset_interpretability/v_info/ultralytics
pip install -e .
```

The DIMO dataset is available at https://pderoovere.github.io/dimo/.
To convert the DIMO dataset to YOLO format, use the `conversion/dimo_convert_to_yolo.py` script.
The RarePlanes dataset is available at https://www.iqt.org/library/the-rareplanes-dataset.
To convert the RarePlanes dataset to YOLO format, use the `conversion/rareplanes/run.py` script.
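Both conversion scripts target the standard YOLO label format: one line per object containing a class index and a center/size bounding box normalized to [0, 1]. The box arithmetic, shown here independently of the repository's conversion code:

```python
# Convert an absolute-pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.
# YOLO format: "<class> <x_center> <y_center> <width> <height>", normalized to [0, 1].
def to_yolo_line(cls, box, img_w, img_h):
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2.0 / img_w   # normalized box center x
    y_c = (y_min + y_max) / 2.0 / img_h   # normalized box center y
    w = (x_max - x_min) / img_w           # normalized box width
    h = (y_max - y_min) / img_h           # normalized box height
    return f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x50 box with its top-left corner at (100, 100) in a 640x480 image:
print(to_yolo_line(0, (100, 100, 200, 150), 640, 480))
# -> "0 0.234375 0.260417 0.156250 0.104167"
```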
To replicate the experiments conducted in the paper, use the `replicate_experiments.py` script.
The script takes paths to the real and synthetic versions of the WASABI, RarePlanes, and DIMO datasets. It splits the datasets as necessary, performs evolutionary selection based on metric values, calculates SDQM, and runs regression on the resulting SDQM values.
Example usage:
```bash
python3 replicate_experiments.py \
    --wasabi_real_yolo_dir data/wasabi/real \
    --wasabi_synthetic_yolo_dir data/wasabi/synthetic \
    --rareplanes_real_yolo_dir data/rareplanes/real \
    --rareplanes_synthetic_yolo_dir data/rareplanes/synthetic \
    --dimo_real_yolo_dir data/dimo/real \
    --dimo_synthetic_yolo_dir data/dimo/synthetic \
    --output_dir data/experiment
```

This repository contains three main scripts: `sdqm.py`, `dataset_selection/select_datasets.py`, and `regression.py`.
The `sdqm.py` script calculates SDQM given a real and synthetic dataset pair.
The `dataset_selection/select_datasets.py` script selects the desired real and synthetic dataset pairs from a set of real and synthetic datasets. It takes the following arguments (an example invocation follows the list):

- `--num_datasets` (int, default: `1`): Number of real dataset splits to create.
- `--input_yolo_dir` (str, default: `"data/yolo"`): Path to the real dataset YOLO directory.
- `--synthetic_yolo_dir` (str, default: `"data/yolo"`): Path to the synthetic dataset YOLO directory.
- `--output_dir` (str, default: `"data"`): Path to output the selected datasets.
- `--model_text` (str, default: `"Vehicle"`): Text prompt to use for the Grounding DINO embedding model.
- `--scene_function` (str, choices: `wasabi_scene`, `rareplanes_real_scene`, `rareplanes_synthetic_scene`, `dimo_scene`): Scene function to use for splitting datasets.
- `--synthetic_scene_function` (str, choices: `wasabi_scene`, `rareplanes_real_scene`, `rareplanes_synthetic_scene`, `dimo_scene`): Scene function to use for synthetic dataset splitting.
- `--input_splits` (str, nargs: `+`, default: `["train", "val", "test"]`): Splits from the input datasets to use.
- `--train_split` (float): Train split proportion.
- `--val_split` (float): Validation split proportion.
- `--test_split` (float): Test split proportion.
- `--imgsz` (int, default: `640`): Image size for YOLO training.
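For instance, an invocation composed from the flags above (the paths, split proportions, and `--model_text` prompt are placeholders, not values from the paper):

```bash
python3 dataset_selection/select_datasets.py \
    --num_datasets 3 \
    --input_yolo_dir data/rareplanes/real \
    --synthetic_yolo_dir data/rareplanes/synthetic \
    --output_dir data/selected \
    --model_text "Plane" \
    --scene_function rareplanes_real_scene \
    --synthetic_scene_function rareplanes_synthetic_scene \
    --train_split 0.7 --val_split 0.15 --test_split 0.15 \
    --imgsz 640
```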
The `regression.py` script performs regression on SDQM data points from a CSV file; see the sketch after the argument list for a picture of what it computes.
Example usage:
```bash
python3 regression.py \
    --input_path val.csv train.csv \
    --output_path regression_output \
    --all_methods --shuffle_split --last
```

The script takes the following arguments:
- `--input_path` (str, nargs: `+`, required): Path(s) to the input CSV file(s). Example: `--input_path file1.csv file2.csv`
- `--val_input_path` (str, nargs: `+`): Path(s) to the validation CSV file(s).
- `--output_path` (str): Path to the output directory.
- `--start_column` (int): Column to start regression from.
- `--y_column` (int): Column to use as the target variable.
- `--last` (`store_true` flag): Use the last column as the target variable.
- `--load_results` (`store_true` flag): Load `results.json` and calculate the Pearson coefficient with new data.
- `--standardize` (`store_true` flag): Standardize the input data.
- `--pca` (int, default: `None`): Number of principal components to keep.
- `--method` (str, default: `linear`, choices: `linear`, `ridge`, `lasso`, `decision_tree`, `random_forest`, `xgboost`, `svr`): Regression method to use.
- `--all_methods` (`store_true` flag): Run all regression methods.
- `--shuffle_split` (`store_true` flag): Shuffle and split data into train and validation sets.
- `--test_size` (float, default: `0.2`): Proportion of data to use as the test set.
- `--k_folds` (int, default: `None`): Number of folds for k-fold cross-validation.
- `--feature_removal_test` (`store_true` flag): Remove each feature one by one and measure the effect on correlation coefficients.
- `--correlation_threshold` (float, default: `None`): Remove features whose absolute Pearson correlation with the target is below this threshold.
- `--sequential_test` (int, default: `None`): Perform a sequential feature selection test.
- `--scaler` (str, default: `None`): Path to a saved `StandardScaler` object.
- `--separately_scale` (`store_true` flag): Standardize each CSV file separately.
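For orientation, the core step the script performs (fit a regressor on the submetric columns, then report the Pearson correlation between predicted and observed mAP) can be sketched as follows. This is an illustration, not the repository's exact code; the assumed CSV layout (submetric columns first, mAP last) matches the `--last` flag described above, and `train.csv` is a hypothetical input file.

```python
# Illustrative sketch of the regression + Pearson step (not the repo's exact code).
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                 # hypothetical input file
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values  # submetrics -> mAP target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r, p = pearsonr(model.predict(X_val), y_val)  # predicted vs. observed mAP
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```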
We thank the AFRL Internship Program for supporting the work of Ayush Zenith, Arnold Zumbrun, and Neel Raut. This material is based upon work supported by the Air Force Research Laboratory under agreement number FA8750-20-3-1004. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the U.S. Air Force. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
Approved for Public Release; Distribution Unlimited: AFRL/PA Case No. AFRL-2025-4672.