Skip to content

Conversation

@portante
Copy link
Member

@portante portante commented Jan 26, 2021

Add support for running the Tool Meister(s) and the Tool Data Sink as containerized services.

We provide separate images for Tool Meister and Tool Data Sink containers. Each container has a entry point command at tool-meister/tool-*-ep, which reads environment variables provided by the container specification and passes them as parameters to the respective Tool Meister and Tool Data Sink commands.

We have modified the Tool Meister and Tool Data Sink code to continuously try to connect to the specified Redis server to find their parameter key.

We have changed the Tool Data Sink to only daemonize itself when requested.

The Tool Data Sink was changed to recognize the difference between the internal (to the container) and external namespaces. The driver must use a .path file in the ${pbench_run} directory when working with the Tool Data Sink so that they can pass external references to the Tool Data Sink container via the APIs, allowing it to map them to internal references in the container.

We have endowed the pbench-tool-meister-start and -stop commands with an additional parameter, --redis-server=<host>:<port>, which allows the user to specify their own Redis server. This implies that the user will also be handling the orchestration of the Tool Data Sink and Tool Meister containers. Besides the --redis-server parameter, we also accept an environment variable, PBENCH_REDIS_SERVER=<host>:<port>.

The pbench-tool-meister-start command will now timeout waiting for the TDS to show up. We now wait a maximum time of one minute for the Tool Data Sink to show up ready for service. Prior to this change the start-up process would wait indefinitely. Since we added a new ReturnCode for this timeout case, we also fix up the other uses of ReturnCode in the Tool Group loading failure case.

Prior to the creation of the Tool Meister sub-system, collection of the configuration (or system) information was performed on a best-effort basis. A benchmark script would not report "failure" if the pbench-collect-sysinfo command failed. This commit changes both pbench-tool-meister-start and -stop to behave as before. The code for initializing and ending the persistent tools has been moved out from underneath a conditional to be clear that its operation is not optional.

In addition to the above major changes we also:

  • In pbench-tool-meister-stop we have moved the steps taken to gracefully shutdown the local Tool Data Sink, Tool Meister, and Redis server to its own method to aid readability of the code

  • We use the error_log and warn_log methods were appropriate to restore / add output to the pbench.log file

  • Correct how Tool Data Sink handles EADDRINUSE

    We now capture exceptions raise in the WSGI thread so that we can report the success or failure of that thread starting up. This also removes the hacky timed wait that was there.

    While we are at, we also remove the hacky timed wait for the log capture thread.

  • Only run WSGI server if successfully created

  • Add unit tests for new BenchmarkRunDir class, and the DataSinkWsgiServer class

  • Fix how OSError is forwarded from WSGI thread

  • Rename constants *_port to def_*_port

  • Move the ToolGroup class to its own module

    This allows reuse by the other commands as well, allowing us to DRY out that code a bit as well.


There are four (5) commits to this PR that we'll maintain when we merge:

  1. Correction to the agent image building Makefile to work as advertised in the README.md
  2. The combined (from 20+ separate commits) set of changes as described above
  3. Use of PCP "bintray" repos when building the container images
  4. Properly exposing PCP ports and corrections to the firewalld service files
  5. Incorporation of @Maxusmusti work from PR dcgm-exporter update portante/pbench#15

@portante portante added this to the v0.71 milestone Jan 26, 2021
@portante portante self-assigned this Jan 26, 2021
@portante portante added Agent Code Infrastructure enhancement packaging Issues related to software packaging tools Of and related to the operation and behavior of various tools (iostat, sar, etc.) labels Jan 26, 2021
@portante portante changed the base branch from master to main January 28, 2021 22:08
@portante portante force-pushed the containerized-tds-tm branch from 08ce18f to acc1894 Compare February 1, 2021 05:49
@lgtm-com
Copy link

lgtm-com bot commented Feb 1, 2021

This pull request introduces 13 alerts when merging acc1894 into 68a055b - view on LGTM.com

new alerts:

  • 9 for Unused local variable
  • 2 for Unused import
  • 2 for Variable defined multiple times

@portante portante force-pushed the containerized-tds-tm branch 2 times, most recently from a8640e4 to a9672a9 Compare February 1, 2021 08:15
@portante
Copy link
Member Author

portante commented Feb 1, 2021

RPMs for these latest changes at https://copr.fedorainfracloud.org/coprs/portante/pbench-test/, v0.71.0-130ga9672a901.

@portante portante force-pushed the containerized-tds-tm branch 3 times, most recently from f7a8d4b to f4f44cd Compare February 11, 2021 04:07
@portante portante force-pushed the containerized-tds-tm branch 4 times, most recently from 246e751 to edcb4c7 Compare February 12, 2021 23:32
@portante
Copy link
Member Author

Rebased and resolved all code review comments. Unfortunately, I hit the bug solved by PR #2185, which delayed the verification of the tests (py3-agent and py3-server) tox tests (any pytest based test) would fail.

Let's get PR #2185 to land first, and then this one.

@portante portante marked this pull request as draft March 29, 2021 23:56
@portante portante force-pushed the containerized-tds-tm branch from b78b571 to 78e85ee Compare March 30, 2021 18:24
@portante portante marked this pull request as ready for review March 30, 2021 18:24
@portante
Copy link
Member Author

Okay, now that we landed #2185, this is ready again for final review.

dbutenhof
dbutenhof previously approved these changes Mar 30, 2021
The `README.md` file states that there targets for each of the supported
distributions, but they were missing.  This commit corrects that over-
sight.

Further, we modify the pbench "repo" file targets to be less abstract
and more tied to the goal of generating pbench repo files.
Add support for running the Tool Meister(s) and the Tool Data Sink as
containerized services.

We provide separate images for Tool Meister and Tool Data Sink
containers. Each container has a entry point command at
`tool-meister/tool-*-ep`, which reads environment variables provided by
the container specification and passes them as parameters to the
respective Tool Meister and Tool Data Sink commands.

We have modified the Tool Meister and Tool Data Sink code to
continuously try to connect to the specified Redis server to find their
parameter key.

We have changed the Tool Data Sink to only daemonize itself when
requested.

The Tool Data Sink was changed to recognize the difference between the
internal (to the container) and external namespaces.  The driver must
use a `.path` file in the `${pbench_run}` directory when working with
the Tool Data Sink so that they can pass external references to the Tool
Data Sink container via the APIs, allowing it to map them to internal
references in the container.

We have endowed the `pbench-tool-meister-start` and `-stop` commands
with an additional parameter, `--redis-server=<host>:<port>`, which
allows the user to specify their own Redis server. This implies that
the user will also be handling the orchestration of the Tool Data Sink
and Tool Meister containers. Besides the `--redis-server` parameter, we
also accept an environment variable,
`PBENCH_REDIS_SERVER=<host>:<port>`.

The `pbench-tool-meister-start` command will now timeout waiting for
the TDS to show up. We now wait a maximum time of one minute for the
Tool Data Sink to show up ready for service.  Prior to this change the
start-up process would wait indefinitely.  Since we added a new
`ReturnCode` for this timeout case, we also fix up the other uses of
`ReturnCode` in the Tool Group loading failure case.

Prior to the creation of the Tool Meister sub-system, collection of the
configuration (or system) information was performed on a best-effort
basis.  A benchmark script would not report "failure" if the
`pbench-collect-sysinfo` command failed.  This commit changes both
`pbench-tool-meister-start` and `-stop` to behave as before.  The code
for initializing and ending the persistent tools has been moved out from
underneath a conditional to be clear that its operation is not optional.

In addition to the above major changes we also:

 * In `pbench-tool-meister-stop` we have moved the steps taken to
   gracefully shutdown the local Tool Data Sink, Tool Meister, and Redis
   server to its own method to aid readability of the code

 * We use the `error_log` and `warn_log` methods were appropriate to
   restore / add output to the `pbench.log` file

 * Correct how Tool Data Sink handles `EADDRINUSE`

   We now capture exceptions raise in the WSGI thread so that we can
   report the success or failure of that thread starting up.  This also
   removes the hacky timed wait that was there.

   While we are at, we also remove the hacky timed wait for the log
   capture thread.

 * Only run WSGI server if successfully created

 * Add unit tests for new BenchmarkRunDir class, and the
   DataSinkWsgiServer class

 * Fix how OSError is forwarded from WSGI thread

 * Rename constants `*_port` to `def_*_port`

 * Move the `ToolGroup` class to its own module

   This allows reuse by the other commands as well, allowing us to DRY
   out that code a bit as well.
We also need to use PCP 5.2.2 for all image distros we build.
We provide a new `dcgm-exporter` which is Python3 based, along with
updated visualizers and a container example for DCGM.
@portante
Copy link
Member Author

Had to rebase to pick up PR #2186.

Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@portante, merge this quick before someone merges something else!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Agent Code Infrastructure enhancement packaging Issues related to software packaging tools Of and related to the operation and behavior of various tools (iostat, sar, etc.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update DCGM Persistent Tool to Utilize dcgm-exporter

5 participants