Ayush Zenith · Arnold Zumbrun · Neel Raut · Jing Lin
The performance of machine learning models depends heavily on the quality of training data. The scarcity of large, well-annotated datasets poses a significant challenge for building robust models. Synthetic data, generated via simulations and generative models, offers a promising solution by increasing dataset diversity and improving model performance, reliability, and resilience. However, evaluating the quality of synthetic data requires an effective metric.
The Synthetic Dataset Quality Metric (SDQM) is introduced to assess data quality for object detection tasks without requiring full model training. SDQM enables efficient generation and selection of synthetic datasets, addressing key challenges in resource-constrained environments. In our experiments, SDQM showed a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, outperforming previous metrics, which achieved only moderate or weak correlations. SDQM also provides actionable insights for improving dataset quality, reducing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data.
If you find this code or work useful in your own research, please consider citing it as follows:
```bibtex
@misc{zenith2025sdqmsyntheticdataquality,
  title={SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation},
  author={Ayush Zenith and Arnold Zumbrun and Neel Raut and Jing Lin},
  year={2025},
  eprint={2510.06596},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.06596},
}
```
This repository provides code to calculate each of SDQM's submetrics:
- MAUVE
- Frontier Integral
- V-Info
- α-Precision
- β-Recall
- Authenticity
- Cluster Metric
- Dataset Separability
- Spatial Distribution Difference
- Label Overlap
- Pixel Intensity Match
- Bounding Box Match
and the integrated super metric, SDQM.
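As a flavor of what these submetrics capture, here is a minimal, self-contained sketch of the real-versus-synthetic separability idea: train a simple probe to distinguish real from synthetic feature vectors, and read a low AUC as "hard to tell apart." This illustrates the concept only; the repository's Dataset Separability submetric is computed by its own code.

```python
# Illustrative sketch only: the repository computes Dataset Separability with its
# own code; this just demonstrates the underlying idea with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_auc(real_feats, synth_feats):
    """Cross-validated AUC of a linear probe separating real from synthetic features.

    AUC near 0.5: the two sets are hard to tell apart (small domain gap).
    AUC near 1.0: trivially separable (large domain gap).
    """
    X = np.vstack([real_feats, synth_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(synth_feats))])
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean())

# Random vectors stand in for image embeddings here:
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 64))
synth = rng.normal(0.3, 1.0, size=(200, 64))  # mild distribution shift
print(f"separability AUC ~ {separability_auc(real, synth):.3f}")
```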
Check out the DeepWiki for more info: https://deepwiki.com/ayushzenith/SDQM
First, install the required packages:
```bash
cd SDQM
pip install -r requirements.txt
```

Then, install the customized ultralytics package:

```bash
cd dataset_interpretability/v_info/ultralytics
pip install -e .
```

The DIMO dataset is available at https://pderoovere.github.io/dimo/.
To convert the DIMO dataset to YOLO format, use the `conversion/dimo_convert_to_yolo.py` script.
The RarePlanes dataset is available at https://www.iqt.org/library/the-rareplanes-dataset.
To convert the RarePlanes dataset to YOLO format, use the `conversion/rareplanes/run.py` script.
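Both conversion scripts target the standard YOLO label format: one line per object containing a class index and a center/size bounding box normalized to [0, 1]. The box arithmetic, shown here independently of the repository's conversion code:

```python
# Convert an absolute-pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.
# YOLO format: "<class> <x_center> <y_center> <width> <height>", normalized to [0, 1].
def to_yolo_line(cls, box, img_w, img_h):
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2.0 / img_w   # normalized box center x
    y_c = (y_min + y_max) / 2.0 / img_h   # normalized box center y
    w = (x_max - x_min) / img_w           # normalized box width
    h = (y_max - y_min) / img_h           # normalized box height
    return f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x50 box with its top-left corner at (100, 100) in a 640x480 image:
print(to_yolo_line(0, (100, 100, 200, 150), 640, 480))
# -> "0 0.234375 0.260417 0.156250 0.104167"
```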
To replicate the experiments conducted in the paper, use the `replicate_experiments.py` script.
The script takes paths to the real and synthetic versions of the WASABI, RarePlanes, and DIMO datasets. It splits the datasets as necessary, performs evolutionary selection based on metric values, calculates SDQM, and runs regression on the resulting SDQM values.
Example usage:
```bash
python3 replicate_experiments.py \
    --wasabi_real_yolo_dir data/wasabi/real \
    --wasabi_synthetic_yolo_dir data/wasabi/synthetic \
    --rareplanes_real_yolo_dir data/rareplanes/real \
    --rareplanes_synthetic_yolo_dir data/rareplanes/synthetic \
    --dimo_real_yolo_dir data/dimo/real \
    --dimo_synthetic_yolo_dir data/dimo/synthetic \
    --output_dir data/experiment
```

This repository contains three main scripts: `sdqm.py`, `dataset_selection/select_datasets.py`, and `regression.py`.
The `sdqm.py` script calculates SDQM given a real and synthetic dataset pair.
The `dataset_selection/select_datasets.py` script selects the desired real and synthetic dataset pairs from a set of real and synthetic datasets. It takes the following arguments (an example invocation follows the list):

- `--num_datasets` (int, default: `1`): Number of real dataset splits to create.
- `--input_yolo_dir` (str, default: `"data/yolo"`): Path to the real dataset YOLO directory.
- `--synthetic_yolo_dir` (str, default: `"data/yolo"`): Path to the synthetic dataset YOLO directory.
- `--output_dir` (str, default: `"data"`): Path to output the selected datasets.
- `--model_text` (str, default: `"Vehicle"`): Text prompt to use for the Grounding DINO embedding model.
- `--scene_function` (str, choices: `wasabi_scene`, `rareplanes_real_scene`, `rareplanes_synthetic_scene`, `dimo_scene`): Scene function to use for splitting datasets.
- `--synthetic_scene_function` (str, choices: `wasabi_scene`, `rareplanes_real_scene`, `rareplanes_synthetic_scene`, `dimo_scene`): Scene function to use for synthetic dataset splitting.
- `--input_splits` (str, nargs: `+`, default: `["train", "val", "test"]`): Splits from the input datasets to use.
- `--train_split` (float): Train split proportion.
- `--val_split` (float): Validation split proportion.
- `--test_split` (float): Test split proportion.
- `--imgsz` (int, default: `640`): Image size for YOLO training.
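For instance, an invocation composed from the flags above (the paths, split proportions, and `--model_text` prompt are placeholders, not values from the paper):

```bash
python3 dataset_selection/select_datasets.py \
    --num_datasets 3 \
    --input_yolo_dir data/rareplanes/real \
    --synthetic_yolo_dir data/rareplanes/synthetic \
    --output_dir data/selected \
    --model_text "Plane" \
    --scene_function rareplanes_real_scene \
    --synthetic_scene_function rareplanes_synthetic_scene \
    --train_split 0.7 --val_split 0.15 --test_split 0.15 \
    --imgsz 640
```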
The `regression.py` script performs regression on SDQM data points from a CSV file; see the sketch after the argument list for a picture of what it computes.
Example usage:
```bash
python3 regression.py \
    --input_path val.csv train.csv \
    --output_path regression_output \
    --all_methods --shuffle_split --last
```

The script takes the following arguments:
- `--input_path` (str, nargs: `+`, required): Path(s) to the input CSV file(s). Example: `--input_path file1.csv file2.csv`
- `--val_input_path` (str, nargs: `+`): Path(s) to the validation CSV file(s).
- `--output_path` (str): Path to the output directory.
- `--start_column` (int): Column to start regression from.
- `--y_column` (int): Column to use as the target variable.
- `--last` (`store_true` flag): Use the last column as the target variable.
- `--load_results` (`store_true` flag): Load `results.json` and calculate the Pearson coefficient with new data.
- `--standardize` (`store_true` flag): Standardize the input data.
- `--pca` (int, default: `None`): Number of principal components to keep.
- `--method` (str, default: `linear`, choices: `linear`, `ridge`, `lasso`, `decision_tree`, `random_forest`, `xgboost`, `svr`): Regression method to use.
- `--all_methods` (`store_true` flag): Run all regression methods.
- `--shuffle_split` (`store_true` flag): Shuffle and split data into train and validation sets.
- `--test_size` (float, default: `0.2`): Proportion of data to use as the test set.
- `--k_folds` (int, default: `None`): Number of folds for k-fold cross-validation.
- `--feature_removal_test` (`store_true` flag): Remove each feature one by one and measure the effect on correlation coefficients.
- `--correlation_threshold` (float, default: `None`): Remove features whose absolute Pearson correlation with the target is below this threshold.
- `--sequential_test` (int, default: `None`): Perform a sequential feature selection test.
- `--scaler` (str, default: `None`): Path to a saved `StandardScaler` object.
- `--separately_scale` (`store_true` flag): Standardize each CSV file separately.
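For orientation, the core step the script performs (fit a regressor on the submetric columns, then report the Pearson correlation between predicted and observed mAP) can be sketched as follows. This is an illustration, not the repository's exact code; the assumed CSV layout (submetric columns first, mAP last) matches the `--last` flag described above, and `train.csv` is a hypothetical input file.

```python
# Illustrative sketch of the regression + Pearson step (not the repo's exact code).
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                 # hypothetical input file
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values  # submetrics -> mAP target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r, p = pearsonr(model.predict(X_val), y_val)  # predicted vs. observed mAP
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```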
We thank the AFRL Internship Program for supporting the work of Ayush Zenith, Arnold Zumbrun, and Neel Raut. This material is based upon work supported by the Air Force Research Laboratory under agreement number FA8750-20-3-1004. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the U.S. Air Force. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
Approved for Public Release; Distribution Unlimited: AFRL/PA Case No. AFRL-2025-4672.