Containerization of TM and TDS #2076

portante · 2021-01-26T19:18:10Z

Add support for running the Tool Meister(s) and the Tool Data Sink as containerized services.

We provide separate images for Tool Meister and Tool Data Sink containers. Each container has a entry point command at tool-meister/tool-*-ep, which reads environment variables provided by the container specification and passes them as parameters to the respective Tool Meister and Tool Data Sink commands.

We have modified the Tool Meister and Tool Data Sink code to continuously try to connect to the specified Redis server to find their parameter key.

We have changed the Tool Data Sink to only daemonize itself when requested.

The Tool Data Sink was changed to recognize the difference between the internal (to the container) and external namespaces. The driver must use a .path file in the ${pbench_run} directory when working with the Tool Data Sink so that they can pass external references to the Tool Data Sink container via the APIs, allowing it to map them to internal references in the container.

We have endowed the pbench-tool-meister-start and -stop commands with an additional parameter, --redis-server=<host>:<port>, which allows the user to specify their own Redis server. This implies that the user will also be handling the orchestration of the Tool Data Sink and Tool Meister containers. Besides the --redis-server parameter, we also accept an environment variable, PBENCH_REDIS_SERVER=<host>:<port>.

The pbench-tool-meister-start command will now timeout waiting for the TDS to show up. We now wait a maximum time of one minute for the Tool Data Sink to show up ready for service. Prior to this change the start-up process would wait indefinitely. Since we added a new ReturnCode for this timeout case, we also fix up the other uses of ReturnCode in the Tool Group loading failure case.

Prior to the creation of the Tool Meister sub-system, collection of the configuration (or system) information was performed on a best-effort basis. A benchmark script would not report "failure" if the pbench-collect-sysinfo command failed. This commit changes both pbench-tool-meister-start and -stop to behave as before. The code for initializing and ending the persistent tools has been moved out from underneath a conditional to be clear that its operation is not optional.

In addition to the above major changes we also:

In pbench-tool-meister-stop we have moved the steps taken to gracefully shutdown the local Tool Data Sink, Tool Meister, and Redis server to its own method to aid readability of the code
We use the error_log and warn_log methods were appropriate to restore / add output to the pbench.log file
Correct how Tool Data Sink handles EADDRINUSE

We now capture exceptions raise in the WSGI thread so that we can report the success or failure of that thread starting up. This also removes the hacky timed wait that was there.

While we are at, we also remove the hacky timed wait for the log capture thread.
Only run WSGI server if successfully created
Add unit tests for new BenchmarkRunDir class, and the DataSinkWsgiServer class
Fix how OSError is forwarded from WSGI thread
Rename constants *_port to def_*_port
Move the ToolGroup class to its own module

This allows reuse by the other commands as well, allowing us to DRY out that code a bit as well.

There are four (5) commits to this PR that we'll maintain when we merge:

Correction to the agent image building Makefile to work as advertised in the README.md
The combined (from 20+ separate commits) set of changes as described above
Use of PCP "bintray" repos when building the container images
Properly exposing PCP ports and corrections to the firewalld service files
Incorporation of @Maxusmusti work from PR dcgm-exporter update portante/pbench#15

lgtm-com · 2021-02-01T05:58:52Z

This pull request introduces 13 alerts when merging acc1894 into 68a055b - view on LGTM.com

new alerts:

9 for Unused local variable
2 for Unused import
2 for Variable defined multiple times

portante · 2021-02-01T08:16:40Z

RPMs for these latest changes at https://copr.fedorainfracloud.org/coprs/portante/pbench-test/, v0.71.0-130ga9672a901.

lib/pbench/agent/tool_data_sink.py

agent/util-scripts/pbench-tool-meister-start

portante · 2021-03-29T23:56:30Z

Rebased and resolved all code review comments. Unfortunately, I hit the bug solved by PR #2185, which delayed the verification of the tests (py3-agent and py3-server) tox tests (any pytest based test) would fail.

Let's get PR #2185 to land first, and then this one.

portante · 2021-03-30T18:27:01Z

Okay, now that we landed #2185, this is ready again for final review.

The `README.md` file states that there targets for each of the supported distributions, but they were missing. This commit corrects that over- sight. Further, we modify the pbench "repo" file targets to be less abstract and more tied to the goal of generating pbench repo files.

Add support for running the Tool Meister(s) and the Tool Data Sink as containerized services. We provide separate images for Tool Meister and Tool Data Sink containers. Each container has a entry point command at `tool-meister/tool-*-ep`, which reads environment variables provided by the container specification and passes them as parameters to the respective Tool Meister and Tool Data Sink commands. We have modified the Tool Meister and Tool Data Sink code to continuously try to connect to the specified Redis server to find their parameter key. We have changed the Tool Data Sink to only daemonize itself when requested. The Tool Data Sink was changed to recognize the difference between the internal (to the container) and external namespaces. The driver must use a `.path` file in the `${pbench_run}` directory when working with the Tool Data Sink so that they can pass external references to the Tool Data Sink container via the APIs, allowing it to map them to internal references in the container. We have endowed the `pbench-tool-meister-start` and `-stop` commands with an additional parameter, `--redis-server=<host>:<port>`, which allows the user to specify their own Redis server. This implies that the user will also be handling the orchestration of the Tool Data Sink and Tool Meister containers. Besides the `--redis-server` parameter, we also accept an environment variable, `PBENCH_REDIS_SERVER=<host>:<port>`. The `pbench-tool-meister-start` command will now timeout waiting for the TDS to show up. We now wait a maximum time of one minute for the Tool Data Sink to show up ready for service. Prior to this change the start-up process would wait indefinitely. Since we added a new `ReturnCode` for this timeout case, we also fix up the other uses of `ReturnCode` in the Tool Group loading failure case. Prior to the creation of the Tool Meister sub-system, collection of the configuration (or system) information was performed on a best-effort basis. A benchmark script would not report "failure" if the `pbench-collect-sysinfo` command failed. This commit changes both `pbench-tool-meister-start` and `-stop` to behave as before. The code for initializing and ending the persistent tools has been moved out from underneath a conditional to be clear that its operation is not optional. In addition to the above major changes we also: * In `pbench-tool-meister-stop` we have moved the steps taken to gracefully shutdown the local Tool Data Sink, Tool Meister, and Redis server to its own method to aid readability of the code * We use the `error_log` and `warn_log` methods were appropriate to restore / add output to the `pbench.log` file * Correct how Tool Data Sink handles `EADDRINUSE` We now capture exceptions raise in the WSGI thread so that we can report the success or failure of that thread starting up. This also removes the hacky timed wait that was there. While we are at, we also remove the hacky timed wait for the log capture thread. * Only run WSGI server if successfully created * Add unit tests for new BenchmarkRunDir class, and the DataSinkWsgiServer class * Fix how OSError is forwarded from WSGI thread * Rename constants `*_port` to `def_*_port` * Move the `ToolGroup` class to its own module This allows reuse by the other commands as well, allowing us to DRY out that code a bit as well.

We also need to use PCP 5.2.2 for all image distros we build.

We provide a new `dcgm-exporter` which is Python3 based, along with updated visualizers and a container example for DCGM.

portante · 2021-03-30T20:14:23Z

Had to rebase to pick up PR #2186.

webbnh

@portante, merge this quick before someone merges something else!

portante added this to the v0.71 milestone Jan 26, 2021

portante self-assigned this Jan 26, 2021

portante added Agent Code Infrastructure enhancement packaging Issues related to software packaging tools Of and related to the operation and behavior of various tools (iostat, sar, etc.) labels Jan 26, 2021

portante changed the base branch from master to main January 28, 2021 22:08

portante mentioned this pull request Jan 30, 2021

Merge coll sysinfo metadata log #2005

Merged

portante force-pushed the containerized-tds-tm branch from 08ce18f to acc1894 Compare February 1, 2021 05:49

portante force-pushed the containerized-tds-tm branch 2 times, most recently from a8640e4 to a9672a9 Compare February 1, 2021 08:15

portante force-pushed the containerized-tds-tm branch 3 times, most recently from f7a8d4b to f4f44cd Compare February 11, 2021 04:07

portante commented Feb 12, 2021

View reviewed changes

lib/pbench/agent/tool_data_sink.py Show resolved Hide resolved

portante commented Feb 12, 2021

View reviewed changes

lib/pbench/agent/tool_data_sink.py Outdated Show resolved Hide resolved

portante force-pushed the containerized-tds-tm branch 4 times, most recently from 246e751 to edcb4c7 Compare February 12, 2021 23:32