FLI-based Worker Manager #622

al-rigazzi · 2024-06-25T17:22:54Z

This PR adds a simple TorchWorker which performs inference. The worker has only been tested for direct inference (and the file test_torch_worker.py reflects that). The output transform is not yet implemented, but it is not needed for the time being.

codecov · 2024-06-25T23:53:52Z

Codecov Report

Attention: Patch coverage is 32.65306% with 66 lines in your changes missing coverage. Please review.

Please upload report for BASE (mli-feature@52abd32). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...im/_core/mli/infrastructure/worker/torch_worker.py	35.41%	31 Missing ⚠️
smartsim/_core/launcher/dragon/dragonBackend.py	7.69%	12 Missing ⚠️
...e/mli/infrastructure/storage/dragonfeaturestore.py	0.00%	9 Missing ⚠️
...tsim/_core/mli/infrastructure/environmentloader.py	0.00%	6 Missing ⚠️
smartsim/_core/mli/infrastructure/worker/worker.py	68.42%	6 Missing ⚠️
smartsim/_core/mli/message_handler.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             mli-feature     #622   +/-   ##
==============================================
  Coverage               ?   63.61%           
==============================================
  Files                  ?       97           
  Lines                  ?     6690           
  Branches               ?        0           
==============================================
  Hits                   ?     4256           
  Misses                 ?     2434           
  Partials               ?        0

Files with missing lines	Coverage Δ
smartsim/_core/mli/comm/channel/channel.py	`66.66% <ø> (ø)`
...m/_core/mli/infrastructure/storage/featurestore.py	`100.00% <100.00%> (ø)`
smartsim/_core/mli/message_handler.py	`75.82% <0.00%> (ø)`
...tsim/_core/mli/infrastructure/environmentloader.py	`0.00% <0.00%> (ø)`
smartsim/_core/mli/infrastructure/worker/worker.py	`80.00% <68.42%> (ø)`
...e/mli/infrastructure/storage/dragonfeaturestore.py	`0.00% <0.00%> (ø)`
smartsim/_core/launcher/dragon/dragonBackend.py	`2.32% <7.69%> (ø)`
...im/_core/mli/infrastructure/worker/torch_worker.py	`35.41% <35.41%> (ø)`

ex/high_throughput_inference/standalone_workermanager.py

AlyssaCote

This looks so, so good! Just a couple tiny comments. No hold ups from me, though!

doc/changelog.md

smartsim/_core/mli/infrastructure/control/workermanager.py

ex/high_throughput_inference/mli_driver.py

doc/changelog.md

ankona · 2024-07-01T17:55:57Z

smartsim/_core/mli/infrastructure/control/workermanager.py

+    import dragon
+    from dragon import fli
+except ImportError as exc:
+    if not "pytest" in sys.modules:


are you sure you want to look for pytest here? might be a copy/paste mistake

It was an attempt to avoid failing tests when dragon is not available. This will be fixed once #621 goes in.
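For reference, one way to keep an optional dependency from breaking imports without checking for pytest is to defer the failure to the point of use. The snippet below is only a sketch of that general pattern, not what #621 does: `optional_import`, `DRAGON_AVAILABLE`, and `require_dragon` are hypothetical names introduced here for illustration.

```python
import importlib
import types
from typing import Optional


def optional_import(name: str) -> Optional[types.ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# "dragon" may or may not be installed; either outcome is handled.
dragon = optional_import("dragon")
DRAGON_AVAILABLE = dragon is not None  # hypothetical flag for callers and tests


def require_dragon() -> types.ModuleType:
    """Fail at the point of use rather than at import time."""
    if dragon is None:
        raise RuntimeError("dragon is required for this code path")
    return dragon
```

With this shape, modules import cleanly on machines without dragon, and tests can check a flag such as `DRAGON_AVAILABLE` to skip the cases that need it.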

smartsim/_core/mli/infrastructure/control/workermanager.py

smartsim/_core/mli/infrastructure/worker/torch_worker.py

AlyssaCote · 2024-07-08T15:52:43Z

ex/high_throughput_inference/mock_app.py

+            response = MessageHandler.deserialize_response(resp)
+            self.measure_time("deserialize_response")


Quick question. Shouldn't we be deserializing the request here? We serialize the request but I don't see where it's deserialized.

Unless that's actually measured during app_receive, and if so never mind!

I think on the app side we only want to deserialize the response - is there something I'm missing?
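To make the split discussed above concrete, here is a toy round trip. It is a sketch under loud assumptions: `json` stands in for the real message serialization, and `StepTimer`, `worker_handle`, and `app_round_trip` are hypothetical names, not the mock app's or MessageHandler's actual API. The point is only that the app serializes the request and deserializes the response, while the request is deserialized on the worker side.

```python
import json
import time


class StepTimer:
    """Hypothetical helper: records the time elapsed since the previous step."""

    def __init__(self) -> None:
        self._last = time.perf_counter()
        self.timings = {}

    def measure_time(self, label: str) -> None:
        now = time.perf_counter()
        self.timings[label] = now - self._last
        self._last = now


def worker_handle(raw_request: bytes) -> bytes:
    """Worker side: deserialize the request, run a stand-in 'inference'."""
    request = json.loads(raw_request)
    result = [x * 2 for x in request["inputs"]]
    return json.dumps({"result": result}).encode()


def app_round_trip(inputs, timer: StepTimer):
    """App side: serialize the request, deserialize only the response."""
    raw_request = json.dumps({"inputs": inputs}).encode()
    timer.measure_time("serialize_request")

    raw_response = worker_handle(raw_request)  # stands in for send + receive
    timer.measure_time("send_and_receive")

    response = json.loads(raw_response)
    timer.measure_time("deserialize_response")
    return response["result"]
```

In this toy version the request deserialization cost is folded into the `send_and_receive` step as seen from the app, which matches the idea that the app itself never deserializes its own request.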

ankona · 2024-07-11T17:29:59Z

smartsim/_core/mli/infrastructure/control/workermanager.py

+        timings.append(time.perf_counter() - interm)
+        interm = time.perf_counter()
+
+        print(" ".join(str(time) for time in timings))


Consider adding a custom log level (e.g. "perf") so you can configure the output through the familiar logging interface and avoid prints.

That's an interesting suggestion, thanks
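A minimal sketch of what such a custom level could look like with the standard `logging` module. The numeric value 15 (between DEBUG and INFO), the logger name `perf_sketch`, and the helper names are assumptions made for illustration; none of them come from this PR.

```python
import io
import logging
import time

PERF = 15  # assumed value: above DEBUG (10), below INFO (20)
logging.addLevelName(PERF, "PERF")


def build_perf_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger that emits PERF records and above."""
    logger = logging.getLogger("perf_sketch")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.setLevel(PERF)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_timings(logger: logging.Logger, timings) -> None:
    """Replace print(" ".join(...)) with a record at the PERF level."""
    logger.log(PERF, " ".join(f"{t:.6f}" for t in timings))


buffer = io.StringIO()
perf_logger = build_perf_logger(buffer)

timings = []
interm = time.perf_counter()
sum(range(1000))  # stand-in for one pipeline stage
timings.append(time.perf_counter() - interm)
interm = time.perf_counter()

log_timings(perf_logger, timings)
perf_logger.debug("not emitted: DEBUG is below the PERF threshold")
```

Raising the logger's level to INFO or higher would then silence the timing output without touching the call sites.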

al-rigazzi added 8 commits June 25, 2024 12:21

Initial FLI-based implementation

e98e2fe

Add inference example stub

043f0e7

Lint, style, black magic

efc9e83

Merge branch 'mli-feature' of https://github.com/CrayLabs/SmartSim in…

35ec45e

…to fli-worker

Bring up to feature branch

ed3c42a

Update example

e5be26b

Change the changelog

a23010f

Make style

3c20f46

al-rigazzi added 9 commits June 26, 2024 09:51

Attempt to mitigate import dragon error

b9ed5ba

Import dragon optional

0de06f3

isort

d051385

Fix imports in dragon backend tests

e77b1cd

Style

a90888d

Fix type

b431221

Rename examples dir

23efebc

Remove old dir

09b9d24

Add tests for torch worker

56d8e50

al-rigazzi requested review from AlyssaCote, ankona and mellis13 June 26, 2024 23:50

mellis13 reviewed Jun 27, 2024

ex/high_throughput_inference/standalone_workermanager.py

al-rigazzi added 3 commits June 27, 2024 08:14

Switch to sender-supplied channels in app example

6cec83e

Add prototype client for mock app

3ad6d44

Update mock app

bd5f133

AlyssaCote approved these changes Jun 28, 2024

doc/changelog.md

smartsim/_core/mli/infrastructure/control/workermanager.py