
Commit 3c1fd87

Authored by srikumar003, AlessandroPomponio, danielelotito, and VassilisVassiliadis

feat(autoconf): introduce autoconf custom experiments (#255)

* Squashed 'plugins/custom_experiments/autoconf/' content from commit b6fcc73 (git-subtree-dir: plugins/custom_experiments/autoconf, git-subtree-split: b6fcc73b0b2b5b75192001b4f17e331a6df67fde)
* chore: remove pycache files. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: breaking ci and remove unsupported models. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: resolve ci issues. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: updating recommender test. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* docs: remove redundant examples from autoconf readme. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* docs: update to examples. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* Update plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/changelog.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* Update plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* Update plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* Update plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* Update plugins/custom_experiments/autoconf/autoconf/min_gpu_recommender.py. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* Update plugins/custom_experiments/autoconf/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* refactor: improve name of variable in get_model_prediction_and_metadata
* refactor: clarify variable naming in recommender.py
* docs: improve readability. Signed-off-by: Daniele Lotito <[email protected]>
* docs: improve readability. Signed-off-by: Daniele Lotito <[email protected]>
* refactor: apply suggestion. Signed-off-by: Daniele Lotito <[email protected]>
* refactor: improve variable name
* refactor(autoconf): tidy up the JobConfig pydantic model and remove the unused Config class. Signed-off-by: Vassilis Vassiliadis <[email protected]>
* refactor(autoconf): tidy up the recommend_min_gpu() method. The new code iterates the candidate number_gpus starting from the minimum value and stops the first time it predicts that the job would complete successfully. Signed-off-by: Vassilis Vassiliadis <[email protected]>
* refactor(autoconf): remove dead code and use log.debug() instead of print. Signed-off-by: Vassilis Vassiliadis <[email protected]>
* fix: add torch to dependencies. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* test: add autoconf test to tox. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: update plugins/custom_experiments/autoconf/autoconf/utils/config_mapper.py. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* docs: update README to address comments. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: address comments about exception handling. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* fix: breaking style check on pydantic models. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* build(autoconf): change the name of the package to ado-autoconf. Signed-off-by: Vassilis Vassiliadis <[email protected]>
* docs: fix training options. The current setting uses `medium_quality` + `optimize_for_deployment` (even if in the script the optimization happens with predictor.clone_for_deployment). You can use good quali Signed-off-by: Daniele Lotito <[email protected]>
* build(autoconf): specify the packages to include in ado-autoconf. Signed-off-by: Vassilis Vassiliadis <[email protected]>
* fix: update plugins/custom_experiments/autoconf/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* docs: update plugins/custom_experiments/autoconf/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* docs: update plugins/custom_experiments/autoconf/README.md. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* docs: fix folder naming structure. Signed-off-by: Daniele Lotito <[email protected]>
* fix: update plugins/custom_experiments/autoconf/autoconf/min_gpu_recommender.py. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* test: install autoconf before test. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* docs(autoconf): Updating the README to address review. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* test(autoconf): Update tox.ini. Co-authored-by: Alessandro Pomponio <[email protected]>. Signed-off-by: Srikumar Venugopal <[email protected]>
* docs(autoconf): update to paths
* build(autoconf): update experiment name. Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
* refactor: removed unused method. It was there because predictors from both sklearn and autogluon have this method and at the beginning I was planning to have it inherit from sklearn estimator. Signed-off-by: Daniele Lotito <[email protected]>
* refactor: improve readability with list comprehension and displace it in the only script that uses it. Signed-off-by: Daniele Lotito <[email protected]>
* refactor: use the logger. Signed-off-by: Daniele Lotito <[email protected]>

---------

Signed-off-by: SRIKUMAR VENUGOPAL <[email protected]>
Signed-off-by: Srikumar Venugopal <[email protected]>
Signed-off-by: Daniele Lotito <[email protected]>
Signed-off-by: Vassilis Vassiliadis <[email protected]>
Co-authored-by: Alessandro Pomponio <[email protected]>
Co-authored-by: Daniele-Lotito <[email protected]>
Co-authored-by: Vassilis Vassiliadis <[email protected]>
1 parent b327f16 commit 3c1fd87

File tree

39 files changed (+2095, -1 lines)

.secrets.baseline

Lines changed: 41 additions & 1 deletion

```diff
@@ -3,7 +3,7 @@
     "files": "requirements.txt|^.secrets.baseline$",
     "lines": null
   },
-  "generated_at": "2025-11-12T15:32:50Z",
+  "generated_at": "2025-11-27T09:22:43Z",
   "plugins_used": [
     {
       "name": "AWSKeyDetector"
@@ -319,6 +319,46 @@
         "verified_result": null
       }
     ],
+    "plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/v0-0-0_20251024_100825-refit-clone-opt/metadata.json": [
+      {
+        "hashed_secret": "5599fdad9234b54f60b83edbb7b93ebcdfd2ef39",
+        "is_secret": false,
+        "is_verified": false,
+        "line_number": 344,
+        "type": "Secret Keyword",
+        "verified_result": null
+      }
+    ],
+    "plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/v1-0-0_ag-20251024_100825-refit-clone-opt/metadata.json": [
+      {
+        "hashed_secret": "5599fdad9234b54f60b83edbb7b93ebcdfd2ef39",
+        "is_secret": false,
+        "is_verified": false,
+        "line_number": 334,
+        "type": "Secret Keyword",
+        "verified_result": null
+      }
+    ],
+    "plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/v1-1-0_ag-20251112_155927-refit-clone-opt/metadata.json": [
+      {
+        "hashed_secret": "2681dcc1148101e9fdf3e7945dbb61044224126e",
+        "is_secret": false,
+        "is_verified": false,
+        "line_number": 172,
+        "type": "Secret Keyword",
+        "verified_result": null
+      }
+    ],
+    "plugins/custom_experiments/autoconf/autoconf/AutoGluonModels/v2-0-0_ag-20251113_154241-refit-clone-opt/metadata.json": [
+      {
+        "hashed_secret": "2681dcc1148101e9fdf3e7945dbb61044224126e",
+        "is_secret": false,
+        "is_verified": false,
+        "line_number": 172,
+        "type": "Secret Keyword",
+        "verified_result": null
+      }
+    ],
     "tests/fixtures/core/samplestore.py": [
       {
         "hashed_secret": "5d07e1b80e448a213b392049888111e1779a52db",
```

plugins/custom_experiments/autoconf/README.md

Lines changed: 295 additions & 0 deletions

# AutoConf

This package contains ado custom experiments for use in the automated
configuration of workload resource requirements for GenAI workloads.

## min_gpu_recommender

**min_gpu_recommender** is a predictive model that recommends the minimum
number of GPUs per worker and the number of workers required to run a tuning
job without triggering a GPU Out Of Memory exception.

This model combines rule-based logic with an
[AutoGluon](https://auto.gluon.ai/stable/index.html) tabular classifier.

### Model Details

The model operates on the following features:

- `model_name`
- `method` (e.g., `lora`, `full`)
- `gpu_model`
- `tokens_per_sample`
- `batch_size`
- `is_valid`

and outputs three parameters:

- `can_recommend`, with value 0 or 1
- `workers`, an integer value
- `gpus`, an integer value
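
For illustration, a single input record and the corresponding output could
look like the following sketch. The values are hypothetical (taken from the
CLI example later in this README); the actual invocation APIs are shown under
Installation and Usage.

```python
# Hypothetical input record, using the feature names listed above.
features = {
    "model_name": "llama-7b",
    "method": "lora",
    "gpu_model": "NVIDIA-A100-80GB-PCIe",
    "tokens_per_sample": 8192,
    "batch_size": 16,
    "is_valid": True,  # illustrative value for the is_valid feature
}

# Hypothetical output: the recommender can make a suggestion, namely
# 1 worker with 2 GPUs per worker.
prediction = {"can_recommend": 1, "workers": 1, "gpus": 2}
```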

The min_gpu_recommender is exposed via an [`ado`](https://ibm.github.io/ado/)
[custom experiment](https://ibm.github.io/ado/actuators/creating-custom-experiments/).
This enables validation of the parameters provided at invocation against the
domain accepted by the recommender model. It ensures that, as expected, the
model returns `can_recommend==0` for configuration domain values (e.g., model
names) that were absent from its training set.

Note that the accepted domains of the models are updated with every model
version. See the [models README](autoconf/AutoGluonModels/README.md) for
information on the available model versions, and
[the changelog](autoconf/AutoGluonModels/changelog.md) for more details on
model updates.

### Installation and Usage

Install the package. For example, from the root of the ado repository, run:

```bash
pip install plugins/custom_experiments/autoconf
```

The min_gpu_recommender model can be invoked in multiple ways:
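
As a quick check that the install worked, the entry point used throughout the
examples below should be importable (a minimal probe, assuming the module
layout shown in this README):

```python
# Import the custom experiment used in the examples below; an ImportError
# here means the package did not install correctly.
from autoconf.min_gpu_recommender import min_gpu_recommender

print(min_gpu_recommender.__module__)
```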

#### 1. CLI

Via ado's `run_experiment` CLI command. Here's an example YAML file (which you
can find under [examples/simple.yaml](examples/simple.yaml)):

```yaml
entity:
  model_name: llama-7b
  method: lora
  gpu_model: NVIDIA-A100-80GB-PCIe
  tokens_per_sample: 8192
  batch_size: 16
  model_version: 1.1.0

experiments:
  - actuatorIdentifier: custom_experiments
    experimentIdentifier: min_gpu_recommender
```

To use it, from the root directory of the ado repository, run:

```bash
run_experiment plugins/custom_experiments/autoconf/examples/simple.yaml
```

After a few seconds you should see:

<!-- markdownlint-disable line-length -->

```bash
Point: {'model_name': 'llama-7b', 'method': 'lora', 'gpu_model': 'NVIDIA-A100-80GB-PCIe', 'tokens_per_sample': 8192, 'batch_size': 16, 'model_version': '1.1.0'}
2025-11-13 13:26:24,925 INFO worker.py:2003 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
/Users/username/projects/orchestrator/autoconf/.venv/lib/python3.12/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
{}
Validating entity ...
Executing: custom_experiments.min_gpu_recommender
(CustomExperiments pid=55466) Found 1 mismatches between original and current metadata:
(CustomExperiments pid=55466) INFO: AutoGluon Python micro version mismatch (original=3.12.7, current=3.12.11)
Result:
[request_id           2df09f
request_index         0
entity_index          0
result_index          0
batch_size            16
generatorid           unk
gpu_model             NVIDIA-A100-80GB-PCIe
method                lora
model_name            llama-7b
model_version         1.1.0
tokens_per_sample     8192
identifier            model_name.llama-7b-method.lora-gpu_model.NVID...
experiment_id         custom_experiments.min_gpu_recommender
valid                 True
can_recommend         1
gpus                  2
workers               1
```

<!-- markdownlint-enable line-length -->

The outputs of the experiment are the lines:

```bash
gpus             2
workers          1
can_recommend    1
```

It reports that the recommender can make a suggestion (`can_recommend=1`). The
suggestion comes in the form of the number of workers and GPUs per worker. In
the above example, you should use 1 worker with 2 GPUs.
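
As a downstream sketch, the recommendation maps directly onto a job resource
request. The snippet below is hypothetical (not part of this package or of
`ado`); it only illustrates how the three outputs would be consumed:

```python
# Hypothetical consumption of the recommender's three outputs; the
# resource-spec key names are illustrative, not an ado or autoconf API.
recommendation = {"can_recommend": 1, "workers": 1, "gpus": 2}

if recommendation["can_recommend"]:
    resources = {
        "num_workers": recommendation["workers"],
        "gpus_per_worker": recommendation["gpus"],
    }
    print(f"Requesting {resources}")
else:
    print("No feasible configuration known for this input.")
```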

#### 2. Example programmatic usage with validation

Calling the decorated `min_gpu_recommender` custom experiment directly:

<!-- markdownlint-disable line-length -->

```python
from autoconf.min_gpu_recommender import min_gpu_recommender

configuration = {
    "model_name": "llama-7b",
    "method": "lora",
    "gpu_model": "NVIDIA-A100-80GB-PCIe",
    "tokens_per_sample": 8192,
    "batch_size": 16,
    "model_version": "1.1.0",
}

measured_properties = min_gpu_recommender(**configuration)
print(measured_properties)
```

<!-- markdownlint-enable line-length -->

This will print text similar to:

<!-- markdownlint-disable line-length -->

```bash
Found 1 mismatches between original and current metadata:
WARNING: AutoGluon Python version mismatch (original=3.12, current=3.10)
[value-op-min_gpu_recommender-can_recommend:1, value-op-min_gpu_recommender-gpus:2, value-op-min_gpu_recommender-workers:1]
```

<!-- markdownlint-enable line-length -->

Note: This warning can be safely ignored for now.
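
To see the domain validation described in Model Details, you can query a model
name the recommender was not trained on. Under the behavior described above,
names inside the experiment's accepted domain but absent from the training set
should yield `can_recommend == 0`, while values outside the accepted domain
fail validation instead. The model name below is invented:

```python
from autoconf.min_gpu_recommender import min_gpu_recommender

# Hypothetical out-of-training-set query; per the Model Details section the
# expected result is can_recommend == 0 (no recommendation possible).
configuration = {
    "model_name": "my-private-model",  # invented, not in the training set
    "method": "lora",
    "gpu_model": "NVIDIA-A100-80GB-PCIe",
    "tokens_per_sample": 8192,
    "batch_size": 16,
    "model_version": "1.1.0",
}
print(min_gpu_recommender(**configuration))
```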

#### 3. Calling `min_gpu_recommender` custom experiment via `ado`

This will use Ray and the `custom_experiments` actuator, and return results in
`ado` format (a MeasurementRequest).

<!-- markdownlint-disable line-length -->

```python
from orchestrator.schema.reference import ExperimentReference
from orchestrator.schema.point import SpacePoint
from orchestrator.modules.actuators.registry import ActuatorRegistry
from orchestrator.utilities.run_experiment import local_execution_closure

configuration = {
    "model_name": "llama-7b",
    "method": "lora",
    "gpu_model": "NVIDIA-A100-80GB-PCIe",
    "tokens_per_sample": 8192,
    "batch_size": 16,
    "model_version": "1.1.0",
}

entity = SpacePoint.model_validate({"entity": configuration}).to_entity()
experiment = ActuatorRegistry().experimentForReference(
    ExperimentReference(
        actuatorIdentifier="custom_experiments",
        experimentIdentifier="min_gpu_recommender",
    )
)

request = local_execution_closure(registry=ActuatorRegistry())(
    reference=experiment.reference, entity=entity
)
print(request.measurements[0].series_representation(output_format="target"))
```

<!-- markdownlint-enable line-length -->

### Downstream Example: Parameter Sweep over a configuration space

This example demonstrates the use case where `ado` is used to obtain
predictions for points in a large configuration space. This avoids the time
and resource overheads of having to benchmark each point to determine: a)
whether the configuration represented by the point is feasible; and b) the
minimum number of GPUs required for it.

This example uses the space in
[examples/sweep/space.yaml](examples/sweep/space.yaml), which applies the
`min_gpu_recommender` experiment to 3960 configurations.

The space looks like this:

```yaml
experiments:
  - experimentIdentifier: min_gpu_recommender
    actuatorIdentifier: custom_experiments

entitySpace:
  - identifier: "model_name"
    propertyDomain:
      values:
        [
          "granite-3.1-2b",
          "granite-20b-v2",
          "granite-13b-v2",
          "granite-3-8b",
          "granite-3.1-3b-a800m-instruct",
          "granite-3.1-8b-instruct",
          "granite-34b-code-base",
          "granite-3b-code-base-128k",
          "granite-7b-base",
          "granite-8b-code-base",
          "granite-8b-japanese",
          "llama-13b",
          "llama-7b",
          "llama2-70b",
          "llama3-70b",
          "llama3-8b",
          "llama3.1-405b",
          "llama3.1-70b",
          "llama3.1-8b",
          "mistral-123b-v2",
          "mistral-7b-v0.1",
          "mixtral-8x7b-instruct-v0.1",
        ]
  - identifier: "tokens_per_sample"
    propertyDomain:
      values: [512, 1024, 2048, 4096, 8192]
  - identifier: "batch_size"
    propertyDomain:
      values: [1, 2, 4, 8, 32, 64]
  - identifier: "gpu_model"
    propertyDomain:
      values: ["NVIDIA-A100-SXM4-80GB", "NVIDIA-A100-80GB-PCIe", "L40S"]
  - identifier: method
    propertyDomain:
      values: ["full", "lora"]
  - identifier: model_version
    propertyDomain:
      values:
        - "1.1.0"
```
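
As a sanity check on the advertised size, the number of points in this space
is the product of the domain sizes above (a quick sketch, not part of the
package):

```python
# Domain sizes from the entitySpace above:
# 22 model names x 5 tokens_per_sample x 6 batch_size
# x 3 gpu_model x 2 method x 1 model_version
domain_sizes = [22, 5, 6, 3, 2, 1]

total = 1
for size in domain_sizes:
    total *= size

print(total)  # 3960
```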

To execute this, run:

<!-- markdownlint-disable line-length -->

```bash
ado create space -f examples/sweep/space.yaml
ado create operation -f examples/sweep/operation.yaml --use-latest space
# The above step will take a few minutes to sweep over the points.
# This command will generate a CSV file with the results:
ado show entities --use-latest space --output-format csv
open space-*.csv
```

<!-- markdownlint-enable line-length -->

Look for the `can_recommend`, `gpus`, and `workers` columns in the CSV file.
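
For slicing the results programmatically, a minimal pandas sketch (the
`space-*.csv` filename pattern and the exact column names are assumed from the
steps above; adjust to whatever `ado show entities` produced):

```python
import glob

import pandas as pd

# Load the CSV produced by `ado show entities` above.
df = pd.read_csv(glob.glob("space-*.csv")[0])

# Keep only the points the recommender considers feasible, then inspect
# the recommended number of workers and GPUs per worker.
feasible = df[df["can_recommend"] == 1]
print(feasible[["model_name", "batch_size", "workers", "gpus"]].head())
```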

Learn more about exploring spaces in the ado documentation, for example by
taking a
[RandomWalk on a space](https://ibm.github.io/ado/examples/random-walk/#exploring-the-discoveryspace).
