Containerization of TM and TDS #2076
This pull request introduces 13 alerts when merging acc1894 into 68a055b - view on LGTM.com new alerts:
RPMs for these latest changes are at https://copr.fedorainfracloud.org/coprs/portante/pbench-test/.
Okay, now that we landed #2185, this is ready again for final review.
The `README.md` file states that there are targets for each of the supported distributions, but they were missing. This commit corrects that oversight. Further, we modify the pbench "repo" file targets to be less abstract and more tied to the goal of generating pbench repo files.
Add support for running the Tool Meister(s) and the Tool Data Sink as
containerized services.
We provide separate images for Tool Meister and Tool Data Sink
containers. Each container has an entry point command at
`tool-meister/tool-*-ep`, which reads environment variables provided by
the container specification and passes them as parameters to the
respective Tool Meister and Tool Data Sink commands.
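
As a rough illustration of what such an entry point does, here is a minimal sketch. The environment variable names (`REDIS_HOST`, `REDIS_PORT`, `PARAM_KEY`) and the positional argument order are assumptions made for the example, not taken from the actual `tool-*-ep` scripts.

```python
import os

# Hypothetical variable names; the real entry points may use different ones.
REQUIRED_VARS = ("REDIS_HOST", "REDIS_PORT", "PARAM_KEY")


def build_argv(command, env):
    """Turn container-provided environment variables into an argument list.

    Assumes the target command takes the values positionally, in the
    order listed in REQUIRED_VARS.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return [command] + [env[name] for name in REQUIRED_VARS]


def main():
    # Replace the entry-point process with the real command so that signals
    # sent to the container reach it directly.
    argv = build_argv("pbench-tool-meister", os.environ)
    os.execvp(argv[0], argv)
```

Using `exec` rather than a child process keeps the service as the container's main process, which is one common choice for entry points.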
We have modified the Tool Meister and Tool Data Sink code to
continuously try to connect to the specified Redis server to find their
parameter key.
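
The retry behavior can be sketched roughly as below. This is not the code from the commit: `connect_with_retry` and its parameters are hypothetical, and the connection attempt is passed in as a callable so the sketch stays independent of any particular Redis client.

```python
import time


def connect_with_retry(connect, retry_on=(ConnectionError,), retry_interval=1.0,
                       max_attempts=None, sleep=time.sleep):
    """Call `connect` until it succeeds, sleeping between failed attempts.

    With max_attempts=None the loop retries indefinitely, which matches the
    "continuously try" behavior described above.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect()
        except retry_on:
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(retry_interval)
```

With the redis-py client, `connect` would presumably create the client and call `ping()`, and `retry_on` would be `redis.exceptions.ConnectionError`, since that exception does not derive from the built-in `ConnectionError`.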
We have changed the Tool Data Sink to only daemonize itself when
requested.
The Tool Data Sink was changed to recognize the difference between the
internal (to the container) and external namespaces. The driver must
use a `.path` file in the `${pbench_run}` directory when working with
the Tool Data Sink so that they can pass external references to the Tool
Data Sink container via the APIs, allowing it to map them to internal
references in the container.
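
To make the external-to-internal mapping concrete, here is a small sketch. It assumes the `.path` file holds the external name of the `${pbench_run}` directory on a single line; the actual file format and the helper names are not confirmed by this commit message.

```python
from pathlib import Path


def read_external_root(internal_root):
    # Assumption: `.path` contains the external name of ${pbench_run}.
    return Path((Path(internal_root) / ".path").read_text().strip())


def map_to_internal(external_ref, internal_root):
    """Map an external path reference to its location inside the container."""
    external_root = read_external_root(internal_root)
    try:
        relative = Path(external_ref).relative_to(external_root)
    except ValueError:
        raise ValueError(f"{external_ref} is not under {external_root}") from None
    return Path(internal_root) / relative
```

Rejecting references outside the external root keeps a caller from addressing files that are not visible inside the container.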
We have endowed the `pbench-tool-meister-start` and `-stop` commands
with an additional parameter, `--redis-server=<host>:<port>`, which
allows the user to specify their own Redis server. This implies that
the user will also be handling the orchestration of the Tool Data Sink
and Tool Meister containers. Besides the `--redis-server` parameter, we
also accept an environment variable,
`PBENCH_REDIS_SERVER=<host>:<port>`.
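
A sketch of how the option and the environment variable might be resolved follows. It uses `argparse` for illustration only; the command's real option parsing may differ, and giving the command-line value precedence over the environment is an assumption.

```python
import argparse


def parse_redis_server(spec):
    """Split a `<host>:<port>` specification into (host, port)."""
    host, sep, port = spec.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected <host>:<port>, got {spec!r}")
    return host, int(port)


def resolve_redis_server(argv, env):
    """Return (host, port) for a user-supplied Redis server, or None.

    Assumption: --redis-server on the command line overrides
    PBENCH_REDIS_SERVER from the environment.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--redis-server", default=env.get("PBENCH_REDIS_SERVER"))
    args, _ = parser.parse_known_args(argv)
    if args.redis_server is None:
        return None  # no user-supplied server was given
    return parse_redis_server(args.redis_server)
```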
The `pbench-tool-meister-start` command will now time out waiting for
the TDS to show up. It waits at most one minute for the
Tool Data Sink to become ready for service. Prior to this change the
start-up process would wait indefinitely. Since we added a new
`ReturnCode` for this timeout case, we also fix up the other uses of
`ReturnCode` in the Tool Group loading failure case.
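
The bounded wait can be sketched as a deadline loop. The `ReturnCode` members and values shown here are hypothetical stand-ins, and `is_ready` is a placeholder for whatever readiness check the start command performs.

```python
import time
from enum import IntEnum


class ReturnCode(IntEnum):
    # Hypothetical names and values, for illustration only.
    SUCCESS = 0
    TDS_WAIT_TIMEOUT = 1


def wait_for_tds(is_ready, timeout=60.0, poll_interval=0.1,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return ReturnCode.SUCCESS
        if clock() >= deadline:
            return ReturnCode.TDS_WAIT_TIMEOUT
        sleep(poll_interval)
```

A monotonic clock is used so that adjustments to the system time cannot shorten or extend the wait.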
Prior to the creation of the Tool Meister sub-system, collection of the
configuration (or system) information was performed on a best-effort
basis. A benchmark script would not report "failure" if the
`pbench-collect-sysinfo` command failed. This commit changes both
`pbench-tool-meister-start` and `-stop` to behave as before. The code
for initializing and ending the persistent tools has been moved out from
underneath a conditional to be clear that its operation is not optional.
In addition to the above major changes we also:
* In `pbench-tool-meister-stop` we have moved the steps taken to
gracefully shut down the local Tool Data Sink, Tool Meister, and Redis
server to their own method to aid readability of the code
* We use the `error_log` and `warn_log` methods where appropriate to
restore / add output to the `pbench.log` file
* Correct how Tool Data Sink handles `EADDRINUSE`
We now capture exceptions raised in the WSGI thread so that we can
report the success or failure of that thread starting up. This also
removes the hacky timed wait that was there.
While we are at it, we also remove the hacky timed wait for the log
capture thread.
* Only run WSGI server if successfully created
* Add unit tests for new BenchmarkRunDir class, and the
DataSinkWsgiServer class
* Fix how OSError is forwarded from WSGI thread
* Rename constants `*_port` to `def_*_port`
* Move the `ToolGroup` class to its own module
This allows reuse by the other commands, letting us DRY out that
code a bit.
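
The WSGI thread changes listed above (capturing start-up exceptions, only serving when the server was created, and reporting `EADDRINUSE`) follow a common pattern, sketched here. `ServerThread` and `start_or_report` are hypothetical names, not the `DataSinkWsgiServer` implementation; the server factory is injected so the sketch does not depend on a particular WSGI server.

```python
import errno
import threading


class ServerThread:
    """Create and run a server in a thread, reporting start-up failures."""

    def __init__(self, make_server):
        self._make_server = make_server
        self._started = threading.Event()
        self._error = None
        self._server = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        try:
            self._server = self._make_server()
        except Exception as exc:
            # Record the failure for the caller; do not serve.
            self._error = exc
            return
        finally:
            # Signal the outcome either way, so no timed wait is needed.
            self._started.set()
        self._server.serve_forever()

    def start(self):
        self._thread.start()
        self._started.wait()
        if self._error is not None:
            raise self._error

    def stop(self):
        if self._server is not None:
            self._server.shutdown()
        self._thread.join()


def start_or_report(server_thread):
    """Start the thread, turning EADDRINUSE into a reportable result."""
    try:
        server_thread.start()
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return "address in use"
        raise
    return "started"
```

With the standard library, `make_server` could be a lambda around `wsgiref.simple_server.make_server(host, port, app)`, which raises `OSError` when the address cannot be bound.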
We also need to use PCP 5.2.2 for all image distros we build.
We provide a new `dcgm-exporter` which is Python3 based, along with updated visualizers and a container example for DCGM.
Had to rebase to pick up PR #2186.
webbnh
left a comment
@portante, merge this quick before someone merges something else!
There are four (5) commits to this PR that we'll maintain when we merge:
* `Makefile` to work as advertised in the `README.md`
* `firewalld` service files