RFC: add (path) "prefix" option to GPU plugin for scalability testing #1104
Conversation
Can you give an example yaml of the deployment?
I don't really see much benefit in the prefix; I see complications and unnecessary logic.
You should primarily aim to fake the resource with its existing name, without breaking dependencies on the existing resource name.
The pod spec is just an example of the changes. When using it in a cluster that also runs the real GPU plugin, you would add a nodeSelector to the DaemonSet to constrain the faked GPU plugin to non-GPU nodes. With the example configMap, the nodeSelector for a faked GPU workload [1] would ask for nodes where the "gpu.intel.com/platform_fake_DG1.present" label is true, and real workloads would need to ask for a non-fake GPU label (see the sketch below). Eventually there would be the following PRs:
[1] If the fake workload is one from the "fakedev-exporter" project, it can fake a suitable GPU load for the (fake) GPU assigned to it, which allows testing further GPU-related k8s components.
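A minimal sketch of the scheduling side, assuming the label and resource names above; the non-GPU node label, pod name and image reference are hypothetical:

```yaml
# Fake GPU plugin DaemonSet fragment: constrain it to non-GPU nodes.
# "example.com/no-gpu" is a hypothetical, manually applied node label.
spec:
  template:
    spec:
      nodeSelector:
        example.com/no-gpu: "true"
---
# Fake GPU workload: ask for nodes advertising the fake platform label,
# and request the (fake) i915 resource provided there.
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-workload            # hypothetical name
spec:
  nodeSelector:
    gpu.intel.com/platform_fake_DG1.present: "true"
  containers:
  - name: workload
    image: fakedev-exporter:latest   # hypothetical image reference
    resources:
      limits:
        gpu.intel.com/i915: 1
```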
Devices can be faked for scalability testing when non-standard paths are used (GPU plugin code assumes container paths to match host paths, and the container runtime prevents creating fake files under the real paths). Note: If one wants to run both the normal GPU plugin and the faked one in the same cluster, all nodes providing fake "i915" resources should be labeled differently from the ones with the real GPU plugin + devices, so that real GPU workloads can be limited to the correct nodes with a suitable nodeSelector. Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
Based on an input JSON file
The fake devfs directory is mounted from the host so the OCI runtime can "mount" device files also into workloads requesting fake devices. This means those files can persist over the fake GPU plugin's lifetime, so files from earlier runs need to be removed, as they may not match. Also, the DaemonSet restarts failing init containers, so errors about directories generated on a previous generator run would prevent getting logs of the real error from the first generator run.
Represent fake GPU devices with null devices: https://www.kernel.org/doc/Documentation/admin-guide/devices.txt The real devfs check also needed changing, and removal warnings were simplified, as there's always just one entry.
Signed-off-by: Eero Tamminen <[email protected]>
With the latest devices release.
Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
Pushed fixes for the CI complaints (including some whitespace complaints about pre-existing GPU plugin code).
Overall I think this looks good. I'm still going to have a nitpicking round if I find some style issues or so but in the bigger picture this is excellent.
immutable: false
data:
  fakedev.json: |-
    {
Should the base object be an array? In case we ever want to define a heterogeneous fake system.
If you mean different GPUs on different nodes, one could just run multiple fake plugins, restricted to different nodes. Or the code could be changed to read multiple JSON files, and if a JSON file includes new include & exclude keys listing node names, that code would compare them against a specified env var value.
If you mean different GPUs on the same node, there could be a device offset key in the JSON, with the code changed to parse multiple config files.
Both of the above changes would be fully backwards compatible, i.e. they could be done later.
=> If needed, I'll add those features later on, but keep the code simple for now.
PS. I do not think changing the API later would be a problem either, as this is just a (scalability) test tool; a config API change only requires a matching generator version, and one gets an error if they do not match.
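Purely to illustrate the extension ideas above (none of these keys exist in the current config; every name here is hypothetical), the JSON could later grow roughly like this while staying backwards compatible:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fakedev-config                 # hypothetical name
data:
  # Hypothetical future keys: NodeInclude/NodeExclude would be compared against a
  # node-name env var, and DevOffset would let several config files share one node.
  fakedev.json: |-
    {
      "NodeInclude": ["worker-1", "worker-2"],
      "NodeExclude": [],
      "DevOffset": 0
    }
```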
I like that it's all deployable from k8s. No need to ssh into a node and run the generator there, etc. Overall, it looks good!
As suggested by Ukri. Signed-off-by: Eero Tamminen <[email protected]>
Give more detailed logging for the most likely failure, as MkNod() device node creation can fail as a normal user. Additional error checking done in the new dir removal helper function fixes Ukri's review comments. There's now an error if the to-be-removed fake sysfs has more content than expected (earlier such a check was done only for the fake devfs content). Signed-off-by: Eero Tamminen <[email protected]>
Noticed by Tuomas. Signed-off-by: Eero Tamminen <[email protected]>
With the above comments addressed, do you have any advice on where to put the generator code, container & example config? (It's probably best to leave just the platform item and remove the rest of the keys from the Capabilities map in its config.)
For the locations: the YAML would probably go to /deployments/fake-gpu-plugin (or similar).
I think YAML is best as a new kustomize dir for the GPU plugin (see the sketch below).
Including it in the gpu_nfdhook/ directory and container image would make things simpler => I'll do that and split the rest of the stuff into separate PRs once Mikko fixes the SGX CI builds. As it has no extra deps and is only 3.5MB when statically compiled, the image size increase and its impact on GPU plugin deployment could be fine I guess, although the generator is used only for testing. However:
[1] I.e. that functionality would either be included in the GPU plugin itself, or it would run as a GPU plugin sidecar to update the NFD feature file whenever GPUs change. So I think it still makes sense to run the generator in a separate init container, although it would be in the same image as the NFD hook.
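For reference, a rough sketch of how such a kustomize dir could look; the paths and file names here are placeholders, not the final layout:

```yaml
# deployments/gpu_plugin/overlays/fake_devices/kustomization.yaml   (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                          # the normal GPU plugin deployment
- fakedev-configmap.yaml              # JSON config for the fake device file generator
patchesStrategicMerge:
- add-fakedev-initcontainer.yaml      # adds the generator init container and the "-prefix" arg
```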
@tkatila Project container templating does not support building or installing multiple binaries to the same container image, so I need to put the generator into a separate directory / container after all. Maybe "gpu_fakedev"?
Sounds fine to me.
Closing now that the features discussed here have their own PRs.
This PR adds a `-prefix` option to the GPU plugin, which allows it to read faked devfs and sysfs GPU device files (please check the commit messages for details). The PR also includes a tool for generating such files, which I've tested with the (included) modified GPU plugin / pod spec in a k8s cluster. Those are included just for reference, and should be split into separate PR(s) before this is merged.
For now, they are under a "fakedev" subdir under gpu_plugin/. What would be the best place for the generator code, maybe `cmd/gpu_fake_devs/generator.go`?
A GPU plugin pod running with `-prefix` will have that generator as its first init container. For NFD labeling to work properly with faked devices, `gpu_nfdhook` must be run in the GPU plugin pod context, not copied to the host as an NFD hook, as is done by the modified GPU plugin pod spec (while `gpu_nfdhook` does not need changes for now, see [1]).
With these, one can fake an arbitrary number of devices, which can be used to do scalability testing both for the GPU plugin and for things depending on it (see the sketch after the links below).
[1] Links:
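To make the overall arrangement concrete, here is a rough sketch of the kind of DaemonSet wiring described above; the prefix value, host path, image tags and container names are assumptions rather than the exact spec from this PR, and unrelated mounts (e.g. the kubelet device-plugin socket) are omitted:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: intel-gpu-plugin-fake               # hypothetical name
spec:
  selector:
    matchLabels:
      app: intel-gpu-plugin-fake
  template:
    metadata:
      labels:
        app: intel-gpu-plugin-fake
    spec:
      initContainers:
      # The generator runs first and creates the fake devfs/sysfs trees under the
      # prefix, based on the JSON config from the ConfigMap.
      - name: fakedev-generator              # hypothetical name
        image: intel/intel-gpu-fakedev:devel # hypothetical image
        volumeMounts:
        - name: fakedev-config
          mountPath: /etc/fakedev
        - name: fake-root
          mountPath: /fake
      containers:
      - name: intel-gpu-plugin
        image: intel/intel-gpu-plugin:devel  # tag assumed
        args:
        - "-prefix=/fake"                    # assumed value format; points the plugin at the faked paths
        volumeMounts:
        - name: fake-root
          mountPath: /fake
      volumes:
      - name: fakedev-config
        configMap:
          name: fakedev-config               # hypothetical name
      - name: fake-root
        hostPath:
          # From the host, so the runtime can "mount" the fake device files
          # also into workloads requesting the fake resources.
          path: /tmp/fake-gpu                # hypothetical host path
          type: DirectoryOrCreate
```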