Commit 38a1707

Add Synthetic Data Generation & Features to OSB Documentation (#11480) (#11586)
1 parent 18da337 commit 38a1707

15 files changed

+1431
-15
lines changed

_benchmark/features/index.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
---
layout: default
title: Additional features
nav_order: 30
has_children: true
has_toc: false
redirect_from:
  - /benchmark/features/
more_cards:
  - heading: "Synthetic data generation"
    description: "Create synthetic datasets using index mappings or custom Python logic for comprehensive benchmarking and testing."
    link: "/benchmark/features/synthetic-data-generation/"
---

# Additional features

In addition to general benchmarking, OpenSearch Benchmark provides several specialized features.

{% include cards.html cards=page.more_cards %}
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
---
layout: default
title: Generating data using custom logic
nav_order: 35
parent: Synthetic data generation
grand_parent: Additional features
---

# Generating data using custom logic

You can generate synthetic data using custom logic defined in a Python module. This approach gives you the most granular control over how OpenSearch Benchmark produces synthetic data and is especially useful when you know the distribution of your data and the relationships between fields.

## The generate_synthetic_document function

Every custom module provided to OpenSearch Benchmark must define the `generate_synthetic_document(providers, **custom_lists)` function. OpenSearch Benchmark uses this function to generate each synthetic document.

### Function parameters

| Parameter | Required/Optional | Description |
|---|---|---|
| `providers` | Required | A dictionary containing data generation tools. Available providers are `generic` (Mimesis [Generic provider](https://mimesis.name/master/api.html#generic-providers)) and `random` (Mimesis [Random class](https://mimesis.name/master/random_and_seed.html)). To add custom providers, see [Advanced configuration](#advanced-configuration). |
| `custom_lists` | Optional | Keyword arguments containing predefined lists of values that you can use in your data generation logic. These are defined in your YAML configuration file under `custom_lists` and allow you to separate data values from your Python code. For example, if you define `dog_names: [Buddy, Max, Luna]` in YAML, you can access it as `custom_lists['dog_names']` in your function. This makes it easy to modify data values without changing your Python code. |
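
As a rough illustration of how the `custom_lists` keyword arguments reach your function, the following sketch calls a toy `generate_synthetic_document` directly. The `config_lists` dictionary and the use of Python's standard `random.Random` in place of the Mimesis `random` provider are assumptions made for illustration. How OpenSearch Benchmark invokes the function internally is not shown here.

```python
import random


def generate_synthetic_document(providers, **custom_lists):
    rng = providers['random']
    # Each list defined under `custom_lists` arrives as a keyword argument
    return {'dog_name': rng.choice(custom_lists['dog_names'])}


# Hypothetical stand-ins: a dict shaped like the YAML `custom_lists` section
# and a standard-library Random instead of the Mimesis provider
config_lists = {'dog_names': ['Buddy', 'Max', 'Luna']}
providers = {'random': random.Random(42)}

# Unpacking the dict turns each key into a keyword argument
document = generate_synthetic_document(providers, **config_lists)
print(document)
```

Because the lists arrive as ordinary keyword arguments, a list that is missing from the configuration surfaces as a `KeyError` when the function looks it up.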

### Basic function template

```python
def generate_synthetic_document(providers, **custom_lists):
    # Access the available providers
    generic = providers['generic']
    random_provider = providers['random']

    # Generate a document using the providers
    document = {
        'name': generic.person.full_name(),
        'age': random_provider.randint(18, 80),
        'email': generic.person.email(),
        'timestamp': generic.datetime.datetime()
    }

    # Optionally, use custom lists if provided
    if 'categories' in custom_lists:
        document['category'] = random_provider.choice(custom_lists['categories'])

    return document
```
{% include copy.html %}

For more information about the available providers and their methods, see the [Mimesis documentation](https://mimesis.name/master/api.html).

## Python module example

The following example Python module demonstrates custom logic for generating documents about dog drivers for a fictional ride-sharing company, *Pawber*, which uses OpenSearch to store and search large volumes of ride-sharing data.

This example showcases several advanced concepts:
- **[Custom provider classes](#advanced-configuration)** (`NumericString`, `MultipleChoices`) that extend Mimesis functionality
- **[Custom lists](#advanced-configuration)** for data values like dog names, breeds, and treats (referenced as `custom_lists['dog_names']`)
- **Geographic clustering** logic for realistic location data
- **Complex document structures** with nested objects and relationships

Save this code to a file called `pawber.py` in your desired directory (for example, `~/pawber.py`):

```python
import logging
import random

from mimesis.providers.base import BaseProvider

logger = logging.getLogger(__name__)

# Each cluster is a center point plus a radius (in degrees) around it
GEOGRAPHIC_CLUSTERS = {
    'Manhattan': {
        'center': {'lat': 40.7831, 'lon': -73.9712},
        'radius': 0.05  # degrees
    },
    'Brooklyn': {
        'center': {'lat': 40.6782, 'lon': -73.9442},
        'radius': 0.05
    },
    'Austin': {
        'center': {'lat': 30.2672, 'lon': -97.7431},
        'radius': 0.1  # Increased radius to cover more of Austin
    }
}

def generate_location(cluster):
    """Generate a random location within a cluster"""
    center = GEOGRAPHIC_CLUSTERS[cluster]['center']
    radius = GEOGRAPHIC_CLUSTERS[cluster]['radius']
    lat = center['lat'] + random.uniform(-radius, radius)
    lon = center['lon'] + random.uniform(-radius, radius)
    return {'lat': lat, 'lon': lon}

class NumericString(BaseProvider):
    """Custom provider that returns a string of random digits"""
    class Meta:
        name = "numeric_string"

    @staticmethod
    def generate(length=5) -> str:
        return ''.join([str(random.randint(0, 9)) for _ in range(length)])

class MultipleChoices(BaseProvider):
    """Custom provider that picks several values (with replacement) from a list"""
    class Meta:
        name = "multiple_choices"

    @staticmethod
    def generate(choices, num_of_choices=5) -> list:
        logger.info("Choices: %s", choices)
        logger.info("Length: %s", num_of_choices)
        total_choices_available = len(choices) - 1

        return [choices[random.randint(0, total_choices_available)] for _ in range(num_of_choices)]

def generate_synthetic_document(providers, **custom_lists):
    generic = providers['generic']
    random_mimesis = providers['random']

    first_name = generic.person.first_name()
    last_name = generic.person.last_name()
    city = random.choice(list(GEOGRAPHIC_CLUSTERS.keys()))

    # Driver Document
    document = {
        "dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}",
        "dog_name": random_mimesis.choice(custom_lists['dog_names']),
        "dog_breed": random_mimesis.choice(custom_lists['dog_breeds']),
        "license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}",
        "favorite_treats": random_mimesis.choice(custom_lists['treats']),
        "preferred_tip": random_mimesis.choice(custom_lists['tips']),
        "vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']),
        "vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']),
        "vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']),
        "vehicle_year": random_mimesis.choice(custom_lists['vehicle_years']),
        "vehicle_color": random_mimesis.choice(custom_lists['vehicle_colors']),
        "license_plate": random_mimesis.choice(custom_lists['license_plates']),
        "current_location": generate_location(city),
        "status": random.choice(['available', 'busy', 'offline']),
        "current_ride": f"R{generic.numeric_string.generate(length=6)}",
        "account_status": random_mimesis.choice(custom_lists['account_status']),
        "join_date": generic.datetime.formatted_date(),
        "total_rides": generic.numeric.integer_number(start=1, end=200),
        "rating": generic.numeric.float_number(start=1.0, end=5.0, precision=2),
        "earnings": {
            "today": {
                "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2),
                "currency": "USD"
            },
            "this_week": {
                "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2),
                "currency": "USD"
            },
            "this_month": {
                "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2),
                "currency": "USD"
            }
        },
        "last_grooming_check": "2023-12-01",
        "owner": {
            "first_name": first_name,
            "last_name": last_name,
            "email": f"{first_name}{last_name}@gmail.com"
        },
        "special_skills": generic.multiple_choices.generate(custom_lists['skills'], num_of_choices=3),
        "bark_volume": generic.numeric.float_number(start=1.0, end=10.0, precision=2),
        "tail_wag_speed": generic.numeric.float_number(start=1.0, end=10.0, precision=1)
    }

    return document
```
{% include copy.html %}
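
The `generate_location` function in the module samples latitude and longitude independently, so points fall inside a square of side `2 * radius` around each center. If you prefer points inside a circle, one possible variant is sketched below. The function name `generate_location_in_circle` and the seeded `random.Random` are hypothetical choices for illustration, and the sketch treats degrees as flat coordinates, which ignores the fact that a degree of longitude covers less distance away from the equator.

```python
import math
import random


def generate_location_in_circle(center, radius, rng=random):
    """Return a point uniformly distributed inside a circle (units: degrees)."""
    # Taking the square root keeps the density uniform over the circle's area;
    # without it, points would bunch up near the center
    distance = radius * math.sqrt(rng.random())
    angle = rng.uniform(0, 2 * math.pi)
    return {
        'lat': center['lat'] + distance * math.sin(angle),
        'lon': center['lon'] + distance * math.cos(angle),
    }


rng = random.Random(7)  # seeded so that repeated runs give the same points
austin = {'lat': 30.2672, 'lon': -97.7431}
point = generate_location_in_circle(austin, 0.1, rng)
print(point)
```

Passing an explicit `rng` keeps the helper reproducible in tests while still defaulting to the global `random` module.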

## Generating data

To generate synthetic data using custom logic, use the `generate-data` subcommand and provide the custom Python module, index name, output path, and total amount of data to generate:

```shell
osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2
```
{% include copy.html %}

For a complete list of available parameters and their descriptions, see the [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/).

## Example output

The following is an example of the output displayed when generating 100 GB of data:

```
   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/


[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status]
[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard.
[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance:
ssh -i <PEM_FILEPATH> -N -L localhost:8787:localhost:8787 ec2-user@<DNS>

Total GB to generate: [1]
Average document size in bytes: [412]
Max file size in GB: [40]

100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s]

Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB.
✅ Visit the following path to view synthetically generated data: /home/ec2-user/

-----------------------------------
[INFO] ✅ SUCCESS (took 272 seconds)
-----------------------------------
```
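
To plan a run, it can help to estimate how many documents a given total size corresponds to by dividing the total number of bytes by the average document size. The following is a minimal sketch of that arithmetic. The function name `estimate_document_count` is hypothetical, and the sketch assumes that 1 GB equals 10^9 bytes, which may differ from how OpenSearch Benchmark measures size.

```python
def estimate_document_count(total_gb, avg_doc_bytes, bytes_per_gb=1000**3):
    """Estimate how many documents of avg_doc_bytes fit in total_gb of output."""
    # Integer division: a partially written document does not count
    return (total_gb * bytes_per_gb) // avg_doc_bytes


# Rough estimates for a 412-byte average document
for size_gb in (1, 2, 100):
    print(size_gb, "GB is roughly", estimate_document_count(size_gb, 412), "documents")
```

Treat the result as an order-of-magnitude check only, because real document sizes vary from one document to the next.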

## Advanced configuration

You can optionally create a YAML configuration file to store custom data and providers. The configuration file must define a `CustomGenerationValues` parameter.

The following parameters are available in `CustomGenerationValues`. Both parameters are optional.

| Parameter | Required/Optional | Description |
|---|---|---|
| `custom_lists` | Optional | Predefined arrays of values that you can reference in your Python module using `custom_lists['list_name']`. This allows you to separate data values from your code logic, making it easy to modify data values without changing your Python file. For example, `dog_names: [Buddy, Max, Luna]` becomes accessible as `custom_lists['dog_names']`. |
| `custom_providers` | Optional | Custom data generation classes that extend Mimesis functionality. These should be defined as classes in your Python module (like `NumericString` or `MultipleChoices` in the [example](#python-module-example)) and then listed in this parameter by name. This allows you to create specialized data generators beyond what Mimesis provides by default. |
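
The `MultipleChoices` provider in the example module draws each value independently, so the same value can appear more than once in a single document. If your data needs distinct values, the Python standard library supports both behaviors, as the following sketch shows. The helper names and the `skills` values are hypothetical and are not part of OpenSearch Benchmark or Mimesis.

```python
import random


def pick_with_replacement(choices, k, rng=random):
    # Independent draws: the same element may be returned more than once
    return rng.choices(choices, k=k)


def pick_without_replacement(choices, k, rng=random):
    # Draws distinct positions, so no element repeats; k is capped at len(choices)
    return rng.sample(choices, min(k, len(choices)))


skills = ['fetch', 'sit', 'roll_over', 'navigate']  # hypothetical values
rng = random.Random(3)  # seeded for reproducible output
print(pick_with_replacement(skills, 3, rng))
unique_skills = pick_without_replacement(skills, 3, rng)
print(unique_skills)
```

Either helper could be wrapped in a custom provider class in the same way as `MultipleChoices`.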
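
### Example configuration file

Save your configuration in a YAML file. The following example is abbreviated: it defines only three lists, while the `pawber.py` module also reads other lists, such as `license_plates`, `skills`, and `vehicle_types`. Each list that the module reads must be defined under `custom_lists` for the module to run:
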
```yml
CustomGenerationValues:
  # Generate data using a custom Python module
  custom_lists:
    # Custom lists to consolidate all values in this YAML file
    dog_names: [Hana, Youpie, Charlie, Lucy, Cooper, Luna, Rocky, Daisy, Buddy, Molly]
    dog_breeds: [Jindo, Labrador, German Shepherd, Golden Retriever, Bulldog, Poodle, Beagle, Rottweiler, Boxer, Dachshund, Chihuahua]
    treats: [cookies, pup_cup, jerky]
  custom_providers:
    # OSB's synthetic data generator uses Mimesis; custom providers are custom Python classes that add functionality to Mimesis
    - NumericString
    - MultipleChoices
```
{% include copy.html %}
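
Because a custom module fails with a `KeyError` when it looks up a list that the configuration does not define, it can be useful to compare the two before a long run. The following sketch scans Python source text for `custom_lists['name']` lookups and reports names missing from a set of configured lists. The helper names are hypothetical, and the regular expression only catches literal string keys, so treat it as a convenience check rather than a complete analysis.

```python
import re


def referenced_list_names(module_source):
    """Return the names used as custom_lists['name'] in Python source text."""
    pattern = r"custom_lists\[\s*['\"]([^'\"]+)['\"]\s*\]"
    return set(re.findall(pattern, module_source))


def missing_lists(module_source, configured_names):
    """Return names the module reads that the configuration does not define."""
    return sorted(referenced_list_names(module_source) - set(configured_names))


# A two-line stand-in for the contents of a custom module
source = "a = custom_lists['dog_names']\nb = custom_lists['skills']"
print(missing_lists(source, ['dog_names', 'dog_breeds', 'treats']))
```

To check a real module, you could read its text with `pathlib.Path(...).read_text()` and pass the keys of your parsed `custom_lists` section as `configured_names`.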

### Using the configuration

To use your configuration file, add the `--custom-config` parameter to the `generate-data` command:

```shell
osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2 --custom-config ~/Desktop/sdg-config.yml
```
{% include copy.html %}

## Related documentation

- [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/)
- [Generating data using index mappings]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/mapping-sdg/)
