Conversation

serglom21

Summary

This PR transforms the GitHub Actions workflow tracing from individual job transactions to a single workflow transaction containing nested job and step spans, providing better visibility into workflow timing and structure.

Problem that I am attempting to solve

Before: Individual job transactions created a flat structure in Sentry, making it difficult to understand workflow timing and relationships.
After: Single workflow transaction with proper hierarchical spans showing:

  • Workflow-level timing and status
  • Individual job spans as children
  • Step spans as children of their respective jobs

Key Changes

Core Implementation:

  • WorkflowTracer: New class for creating single workflow transactions with nested spans
  • WorkflowJobCollector: Collects jobs from workflow runs and sends workflow-level traces
  • Enhanced WebAppHandler: Integrates job collection and workflow tracing
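
Roughly, the collector groups incoming workflow_job webhook payloads by run id and hands the full set to the tracer. The sketch below is only an illustration of that wiring under those assumptions, not the actual WorkflowJobCollector code:

```python
# Hypothetical sketch of how jobs could be grouped per workflow run;
# not the actual implementation in src/web_app_handler.py.
from collections import defaultdict


class WorkflowJobCollector:
    """Sketch only: groups workflow_job webhook payloads by run id."""

    def __init__(self, tracer) -> None:
        self.workflow_tracer = tracer
        self.workflow_jobs: dict[int, list[dict]] = defaultdict(list)

    def add_job(self, event: dict) -> None:
        # GitHub's workflow_job webhook nests the job under "workflow_job"
        # and links it to its run via "run_id".
        job = event["workflow_job"]
        self.workflow_jobs[job["run_id"]].append(job)
```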

Trace Structure:

workflow: Multi-Job Test (transaction)
├─ security-scan (job span)
│  ├─ Set up job (step span)
│  ├─ Checkout code (step span)
│  └─ Run security scan (step span)
├─ performance-tests (job span)
│  ├─ Set up job (step span)
│  ├─ Setup performance test environment (step span)
│  └─ Run performance tests (step span)
└─ backend-tests (job span)
   ├─ Set up job (step span)
   ├─ Setup Python (step span)
   └─ Run backend unit tests (step span)
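
For illustration, here is a minimal sketch of how a transaction payload with this hierarchy could be assembled. The function name build_trace_event and the job/step dictionary shapes are assumptions for the example, not the actual WorkflowTracer API:

```python
# Hypothetical sketch of a workflow -> job -> step span hierarchy in a
# Sentry-style transaction payload; not the WorkflowTracer implementation.
import uuid


def _span_id() -> str:
    return uuid.uuid4().hex[:16]


def build_trace_event(workflow_name: str, jobs: list[dict]) -> dict:
    trace_id = uuid.uuid4().hex
    root_span_id = _span_id()
    spans = []

    for job in jobs:
        job_span_id = _span_id()
        spans.append({
            "span_id": job_span_id,
            "parent_span_id": root_span_id,     # job spans hang off the workflow span
            "op": "job",
            "description": job["name"],
            "start_timestamp": job["started_at"],
            "timestamp": job["completed_at"],
        })
        for step in job.get("steps", []):
            spans.append({
                "span_id": _span_id(),
                "parent_span_id": job_span_id,  # step spans hang off their job span
                "op": "step",
                "description": step["name"],
                "start_timestamp": step["started_at"],
                "timestamp": step["completed_at"],
            })

    return {
        "type": "transaction",
        "transaction": f"workflow: {workflow_name}",
        "contexts": {"trace": {"trace_id": trace_id, "span_id": root_span_id, "op": "workflow"}},
        "spans": spans,
    }
```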

Files Changed

  • src/workflow_tracer.py - New WorkflowTracer implementation
  • src/web_app_handler.py - Enhanced with WorkflowJobCollector
  • src/enhanced_web_app_handler.py - Alternative handler implementation

Before

Screenshot 2025-10-01 at 7 13 42 AM

After

Screenshot 2025-10-01 at 7 14 45 AM

… structure

- Add WorkflowTracer class for creating single workflow transactions with nested job/step spans
- Implement WorkflowJobCollector to collect jobs and send workflow-level traces
- Fix Sentry payload structure to match expected format (remove event_id, fix timestamps, correct status mapping)
- Disable individual job transactions to prevent duplicate traces
- Add proper span hierarchy: workflow -> jobs -> steps
- Include trace_version tags for validation
- Add comprehensive logging and error handling
- Disable Flask automatic Sentry tracing to prevent interference
- Add documentation for local development and testing

This change transforms the tracing from individual job transactions to a single
workflow transaction containing all job and step spans, providing better
visibility into workflow timing and structure.
- Remove PR_DESCRIPTION.md (will be added manually to PR)
- Revert LOCAL_DEVELOPMENT.md changes
- Restore original Sentry Flask integration in main.py (no interference with WorkflowTracer)
- Restore original github_sdk.py send_trace method (GithubClient not used in current implementation)
These documentation files are not needed for the core feature implementation.
serglom21 requested a review from a team as a code owner · October 1, 2025 11:15
- Replace hardcoded job count (5) with dynamic thresholds
- Add job arrival time tracking for intelligent processing
- Implement timeout-based detection for small workflows
- Support workflows of any size (1+ jobs) with appropriate timing
- Add cleanup for arrival time tracking data
- Improve logging for better debugging

Smart thresholds:
- 10+ jobs: Process immediately when all complete
- 5-9 jobs: Process when all complete
- 3-4 jobs: Process when all complete
- 1-2 jobs: Process after 3s timeout or immediately if single job
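
A hedged sketch of how those thresholds could be applied; the function name, attribute names, and the 3-second constant below are assumptions drawn from the bullets above, not the exact code:

```python
# Hypothetical illustration of the threshold logic described above;
# names and structure are assumptions, not the actual implementation.
import time

SMALL_WORKFLOW_TIMEOUT_SECONDS = 3


def should_process_now(jobs: list[dict], first_arrival: float) -> bool:
    """Decide whether a workflow run's collected jobs should be processed now."""
    total = len(jobs)
    all_complete = all(job.get("status") == "completed" for job in jobs)

    if total == 1:
        return True  # single-job workflows are processed immediately
    if total >= 3:
        return all_complete  # 3+ jobs: wait until every job has completed
    # 1-2 jobs: fall back to a short timeout so small runs are never dropped
    return all_complete or (time.monotonic() - first_arrival) >= SMALL_WORKFLOW_TIMEOUT_SECONDS
```
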
Comment on lines -54 to -55
# Once the Sentry org has a .sentry repo we can remove the DSN from the deployment
dsn = fetch_dsn_for_github_org(org, token)

Potential bug: A hard-coded job count check, `if len(self.workflow_jobs[run_id]) >= 5`, prevents workflows with fewer than 5 jobs from ever being processed, leading to data loss.
  • Description: The workflow processing logic is only triggered for workflows with 5 or more jobs because of the hard-coded check `if len(self.workflow_jobs[run_id]) >= 5`. Workflows with fewer jobs never meet this condition, so their tracing data is permanently lost. The repository's own CI workflow has only 4 jobs and would be ignored by this logic. A nearby comment, "For testing, we'll wait for 5 jobs", suggests the check was intended for testing, but it affects production behavior. There is no fallback mechanism to process these smaller workflows.

  • Suggested fix: Remove the hard-coded job count check. Instead, use a more robust signal of workflow completion, such as the `_is_workflow_complete` method, which exists but is currently unused, or a notification that all jobs for a given run have been received (see the sketch below).
    severity: 0.8, confidence: 0.95
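
A hedged sketch of that suggested direction, assuming `_is_workflow_complete` accepts the run id and the collected jobs; the real signature and surrounding method may differ:

```python
# Hypothetical sketch of the suggested fix, as a method on the collector;
# assumes _is_workflow_complete exists (mentioned as present but unused).
def _on_job_completed(self, run_id: int) -> None:
    jobs = self.workflow_jobs.get(run_id, [])
    # Instead of the hard-coded `len(jobs) >= 5` gate, decide based on
    # whether the run reports all of its jobs as finished.
    if jobs and self._is_workflow_complete(run_id, jobs):
        self._process_workflow_immediately(run_id)
```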

Comment on lines +77 to +85
try:
    # Use the first job as the base for workflow metadata
    base_job = jobs[0]

    # Send workflow trace
    self.workflow_tracer.send_workflow_trace(base_job, jobs)

    # Clean up processed jobs
    del self.workflow_jobs[run_id]

Potential bug: An unhandled exception in the `_process_workflow_immediately` timer callback will cause the thread to fail silently, preventing cleanup and leading to a resource leak.
  • Description: The `_process_workflow_immediately` method, executed by a `threading.Timer`, lacks exception handling. If an exception occurs in this method or in methods it calls, such as `_send_workflow_trace` (e.g., due to network issues or malformed data, as seen in Sentry issue 6767327085), the timer thread terminates silently. The critical cleanup logic in the `finally` block then never runs: the workflow's data is not removed from `self.workflow_jobs` and `self.workflow_timers`, and the `run_id` is never added to `self.processed_workflows`. The result is a permanent resource leak and an inconsistent workflow state.

  • Suggested fix: Wrap the body of `_process_workflow_immediately` in a `try...except` block and log the exception in the `except` branch so failures are not silent. This lets the handler recover gracefully without leaking resources or corrupting state.
    severity: 0.7, confidence: 0.9

- Add try/except block to _process_workflow_immediately method
- Implement _cleanup_workflow_run helper for proper resource cleanup
- Ensure cleanup happens even when exceptions occur in timer threads
- Add comprehensive error logging with stack traces
- Prevent silent failures that could lead to resource leaks

This addresses the Seer bot feedback about unhandled exceptions in timer
callbacks that could leave workflow data in memory indefinitely; a sketch of
the pattern follows.
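
A hedged sketch of the error-handling pattern this commit describes; `_cleanup_workflow_run` mirrors the helper named above, but its exact body and the surrounding attributes are assumptions based on the review comment, not the literal patch:

```python
# Hypothetical sketch of the fix pattern, shown as collector methods;
# not the literal change applied in src/web_app_handler.py.
import logging

logger = logging.getLogger(__name__)


class WorkflowJobCollector:  # only the two methods touched by this commit
    def _process_workflow_immediately(self, run_id: int) -> None:
        try:
            jobs = self.workflow_jobs[run_id]
            base_job = jobs[0]
            self.workflow_tracer.send_workflow_trace(base_job, jobs)
        except Exception:
            # Log with a stack trace so timer-thread failures are never silent.
            logger.exception("Failed to send workflow trace for run %s", run_id)
        finally:
            # Always release per-run state, even when sending fails.
            self._cleanup_workflow_run(run_id)

    def _cleanup_workflow_run(self, run_id: int) -> None:
        self.workflow_jobs.pop(run_id, None)
        timer = self.workflow_timers.pop(run_id, None)
        if timer is not None:
            timer.cancel()
        self.processed_workflows.add(run_id)
```
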
bc-sentry requested a review from armenzg · October 1, 2025 21:57