@nbojanic

🔍 Description

This PR adds a diagnose subcommand to Trident (#285).

The command creates a tarball containing:

  • host status
  • whether trident is in a container
  • whether trident is in a VM
  • platform info (os release, kernel version, cpus, memory)
  • disk information (lsblk)
  • log files
  • metrics
  • datastores

🤔 Rationale

This PR aims to simplify troubleshooting by providing one command that aggregates relevant debugging info into a single tarball.

📝 Checks

📌 Follow-ups

TODO (before promoting from draft):

  • check on warnings in osutils crate
  • add an end to end test for trident diagnose

🗒️ Notes

@nbojanic
Author

@microsoft-github-policy-service agree company="Microsoft"

@frhuelsz
Contributor

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@nbojanic
Author

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Commenter does not have sufficient privileges for PR 336 in repo microsoft/trident

@bfjelds
Member

bfjelds commented Nov 17, 2025

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 17, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 18, 2025

i think it'd be great to invoke trident diagnose in the e2e ab-update tests here:

logrus.Infof("Trident service is in expected state")

something like:

// Assumes the enclosing helper returns (result, error), as the final return suggests.
_, err := utils.InvokeTrident(h.args.Env, client, h.args.EnvVars, "diagnose")
if err != nil {
    return nil, fmt.Errorf("failed to invoke Trident diagnose: %w", err)
}

logrus.Infof("Trident service is in expected state")
return nil, nil

then we'd get at least a little testing for it. it'd be even better to scp the diagnostics to the artifacts folder or even validate the contents of the diagnostics, but i'd be happy with something simple first :)
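
A minimal sketch of the "validate the contents" idea, assuming the bundle has already been downloaded and decompressed to a plain tar stream (Go's standard library has no zstd reader, so that step is left out). The helper names `missingEntries` and `buildTar` and the `trident-diagnostics/` entry prefix are hypothetical, not taken from the PR:

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
)

// missingEntries reads a tar stream and returns the required paths
// that never appear as an entry name.
func missingEntries(r io.Reader, required []string) ([]string, error) {
	seen := make(map[string]bool)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("reading tar entry: %w", err)
		}
		seen[hdr.Name] = true
	}
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

// buildTar creates an in-memory tar archive; it stands in for a real bundle.
func buildTar(files map[string]string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	archive, err := buildTar(map[string]string{
		"trident-diagnostics/report.json": "{}",
	})
	if err != nil {
		panic(err)
	}
	missing, err := missingEntries(archive, []string{
		"trident-diagnostics/report.json",
		"trident-diagnostics/logs/trident-full.log",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("missing:", missing)
}
```

A test helper could fail the run whenever the returned slice is non-empty, which keeps the check simple while still catching an empty or truncated bundle.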

@bfjelds
Member

bfjelds commented Nov 18, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Britel
Collaborator

Britel commented Nov 18, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 19, 2025

would be nice to have a docs/Explanation/Diagnostics.md file, maybe it references https://microsoft.github.io/trident/docs/How-To-Guides/View-Trident's-Background-Log to explain some of the collected diagnostics?

maybe suggest using trident diagnostics for troubleshooting in these docs:

also add note about diagnostics here: https://microsoft.github.io/trident/docs/Trident/How-Do-I-Interact-With-Trident

@nbojanic nbojanic changed the title feature: Add diagnostics command feature: Add diagnose command Nov 19, 2025
@nbojanic nbojanic force-pushed the user/nbojanic/diagnostics branch from c52d608 to 163a337 Compare November 20, 2025 03:05
@nbojanic
Author

@bfjelds added an e2e test and docs. Promoting from draft PR.

@nbojanic nbojanic marked this pull request as ready for review November 20, 2025 03:08
@nbojanic nbojanic requested a review from a team as a code owner November 20, 2025 03:08
Copilot AI review requested due to automatic review settings November 20, 2025 03:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a diagnose command to Trident that generates comprehensive diagnostic bundles for troubleshooting. The command collects system information, logs, metrics, and datastores into a compressed tarball (.tar.zst) that can be shared for support purposes.

Key Changes

  • New trident diagnose CLI command that creates compressed diagnostic bundles containing system info, logs, metrics, and datastore files
  • End-to-end test in Storm framework that validates the diagnostic bundle contents
  • Comprehensive documentation including a new How-To guide and updates to tutorials

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/trident/src/diagnostics.rs | Core diagnostics implementation including report collection, bundle creation, and support for historical logs |
| crates/trident/src/cli.rs | Added Diagnose command to CLI with output path parameter |
| crates/trident/src/main.rs | Integrated diagnose command into main execution flow |
| crates/trident/src/lib.rs | Added public diagnose method and made diagnostics module and TEMPORARY_DATASTORE_PATH public |
| crates/trident/src/logging/tracestream.rs | Made PLATFORM_INFO publicly accessible for diagnostics collection |
| crates/trident_api/src/error.rs | Added DiagnosticBundleGeneration error variant |
| tools/storm/utils/ssh/client/client.go | Added CopyRemoteFileToLocal function for downloading diagnostic bundles in tests |
| tools/storm/helpers/ab_update.go | Added checkDiagnostics test case that validates bundle contents and structure |
| docs/How-To-Guides/Diagnostics.md | Comprehensive guide explaining bundle generation, structure, and use cases |
| docs/Tutorials/Trident-Hello-World.md | Added troubleshooting section with diagnose command example |
| docs/Tutorials/Performing-an-AB-Update.md | Added troubleshooting section with diagnose command example |
| docs/Reference/Trident-CLI.md | Added complete CLI reference for diagnose command |
| docs/Trident/How-Do-I-Interact-With-Trident.md | Added diagnose command to the list of available CLI commands |

Comment on lines 43 to 56
```
trident-diagnostics/
├── report.json                        # Diagnostic report with metadata
└── logs/
    ├── trident-full.log               # Current Trident execution log
    ├── trident-metrics.jsonl          # Current Trident metrics
    ├── historical/                    # Logs from past servicing
    │   ├── trident-<servicing_state>-<timestamp>.log
    │   ├── trident-metrics-<servicing_state>-<timestamp>.log
    │   └── ...
    ├── datastore.sqlite               # Default datastore
    ├── datastore-tmp.sqlite           # Temporary datastore (if applicable)
    └── datastore-configured.sqlite    # Configured datastore (if applicable)
```
Copilot AI Nov 20, 2025

The bundle structure documentation shows datastore files as being inside the logs/ directory, but the implementation (in diagnostics.rs lines 276-292) places them at the root level of the bundle. The structure should be:

trident-diagnostics/
├── report.json
├── datastore.sqlite
├── datastore-tmp.sqlite
├── datastore-configured.sqlite
└── logs/
    ├── trident-full.log
    ├── trident-metrics.jsonl
    └── historical/
        └── ...
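
The mismatch described in this comment could be caught mechanically. The following is a rough sketch, assuming datastore files are recognizable by a `.sqlite` suffix and that entry names carry the `trident-diagnostics/` prefix; the helper `misplacedDatastores` is hypothetical and not part of the PR:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// misplacedDatastores returns the entries that look like datastore files
// (".sqlite" suffix) but do not sit directly under the bundle root.
func misplacedDatastores(root string, entries []string) []string {
	var bad []string
	for _, e := range entries {
		if !strings.HasSuffix(e, ".sqlite") {
			continue
		}
		if path.Dir(path.Clean(e)) != path.Clean(root) {
			bad = append(bad, e)
		}
	}
	return bad
}

func main() {
	// Layout as shown in the documentation under review.
	documented := []string{
		"trident-diagnostics/report.json",
		"trident-diagnostics/logs/trident-full.log",
		"trident-diagnostics/logs/datastore.sqlite",
	}
	// Layout as described for the implementation.
	implemented := []string{
		"trident-diagnostics/report.json",
		"trident-diagnostics/datastore.sqlite",
		"trident-diagnostics/logs/trident-full.log",
	}
	fmt.Println("documented layout, misplaced:", misplacedDatastores("trident-diagnostics", documented))
	fmt.Println("implemented layout, misplaced:", misplacedDatastores("trident-diagnostics", implemented))
}
```

Running the same check against the documented tree and against a real bundle listing would show which of the two disagrees with the intended layout.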

) -> Result<PathBuf, Error> {
    let mut collected_files = Vec::new();
    let file = osutils::files::create_file(output_path)?;
    let encoder = zstd::Encoder::new(file, 0)?;
Copilot AI Nov 20, 2025

[nitpick] Using compression level 0 for zstd means using the default compression level (which is typically level 3). Consider adding a comment to clarify this is intentional, or explicitly setting a specific level (e.g., 3 for default) for better code clarity. Alternatively, consider using a higher level like 10 for better compression at the cost of slightly slower compression time, since diagnostics bundles are typically created infrequently and smaller file sizes are beneficial for transmission.

Suggested change
let encoder = zstd::Encoder::new(file, 0)?;
// Use compression level 10 for zstd: higher compression for diagnostics bundles, which are created infrequently.
let encoder = zstd::Encoder::new(file, 10)?;
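
The clarity point in this nitpick can be illustrated without any zstd binding. In the sketch below, the constant and helper names are hypothetical; the only fact relied on is that the reference zstd library treats level 0 as a request for its default level, which is currently 3:

```go
package main

import "fmt"

// defaultLevel mirrors the default compression level of the reference
// zstd library (ZSTD_CLEVEL_DEFAULT).
const defaultLevel = 3

// bundleLevel is a hypothetical named constant for the diagnostics bundle,
// used instead of a bare literal so the intent is visible at the call site.
const bundleLevel = 0 // 0 = "use the library default"

// effectiveLevel resolves a requested level to the level zstd would
// actually apply: 0 maps to the default, anything else is used as given.
func effectiveLevel(requested int) int {
	if requested == 0 {
		return defaultLevel
	}
	return requested
}

func main() {
	fmt.Println("requested:", bundleLevel, "effective:", effectiveLevel(bundleLevel))
	fmt.Println("requested:", 10, "effective:", effectiveLevel(10))
}
```

Either a named constant or a short comment at the call site would address the readability concern; whether a higher level is worth the extra time is a separate decision.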

@alejandro-microsoft
Contributor

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Run the diagnose command to create a support bundle:

```bash
sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst
```

Contributor

Are there any requirements for the --output path?

Author

Just a path trident diagnose can write to (SELinux may prevent it from writing to certain locations).

```bash
ssh -i $HOME/.ssh/id_rsa tutorial-user@$TARGET_MACHINE_IP sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst
```

See [Generate a Diagnostics Bundle](../How-To-Guides/Diagnostics.md) for details.
Contributor

Would it make sense to actually add this note to the end of all tutorials and how-to guides? So that the customer knows how to request help immediately when something in the tutorial or guide doesn't work for them

Author

For now I've also added it to Trident Hello World tutorial, which is the other one I think makes sense. It's also in How Do I Interact With Trident, and CLI reference, as well as having a how-to guide. Maybe in the future we add a general troubleshooting guide?


2. Review the contents to ensure no sensitive data is included

3. Attach the bundle to your bug report or support request
Contributor

For the future, we should probably detail here on how exactly to share this diagnostics bundle with us. Do we have any clear guidelines right now? @frhuelsz

Copilot AI review requested due to automatic review settings January 2, 2026 18:34
@nbojanic nbojanic force-pushed the user/nbojanic/diagnostics branch from 905413f to 77134bd Compare January 2, 2026 18:34
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

6 participants