Conversation

serglom21

Summary

This PR transforms the GitHub Actions workflow tracing from individual job transactions to a single workflow transaction containing nested job and step spans, providing better visibility into workflow timing and structure.

Problem that I am attempting to solve

Before: Individual job transactions created a flat structure in Sentry, making it difficult to understand workflow timing and relationships.
After: Single workflow transaction with proper hierarchical spans showing:

  • Workflow-level timing and status
  • Individual job spans as children
  • Step spans as children of their respective jobs

Key Changes

Core Implementation:

  • WorkflowTracer: New class for creating single workflow transactions with nested spans
  • WorkflowJobCollector: Collects jobs from workflow runs and sends workflow-level traces
  • Enhanced WebAppHandler: Integrates job collection and workflow tracing
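
Roughly, the collector groups incoming workflow_job webhook payloads by run id and hands the full set to the tracer. The sketch below is only an illustration of that wiring under those assumptions, not the actual WorkflowJobCollector code:

```python
# Hypothetical sketch of how jobs could be grouped per workflow run;
# not the actual implementation in src/web_app_handler.py.
from collections import defaultdict


class WorkflowJobCollector:
    """Sketch only: groups workflow_job webhook payloads by run id."""

    def __init__(self, tracer) -> None:
        self.workflow_tracer = tracer
        self.workflow_jobs: dict[int, list[dict]] = defaultdict(list)

    def add_job(self, event: dict) -> None:
        # GitHub's workflow_job webhook nests the job under "workflow_job"
        # and links it to its run via "run_id".
        job = event["workflow_job"]
        self.workflow_jobs[job["run_id"]].append(job)
```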

Trace Structure:

workflow: Multi-Job Test (transaction)
├─ security-scan (job span)
│  ├─ Set up job (step span)
│  ├─ Checkout code (step span)
│  └─ Run security scan (step span)
├─ performance-tests (job span)
│  ├─ Set up job (step span)
│  ├─ Setup performance test environment (step span)
│  └─ Run performance tests (step span)
└─ backend-tests (job span)
   ├─ Set up job (step span)
   ├─ Setup Python (step span)
   └─ Run backend unit tests (step span)
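
For illustration, here is a minimal sketch of how a transaction payload with this hierarchy could be assembled. The function name build_trace_event and the job/step dictionary shapes are assumptions for the example, not the actual WorkflowTracer API:

```python
# Hypothetical sketch of a workflow -> job -> step span hierarchy in a
# Sentry-style transaction payload; not the WorkflowTracer implementation.
import uuid


def _span_id() -> str:
    return uuid.uuid4().hex[:16]


def build_trace_event(workflow_name: str, jobs: list[dict]) -> dict:
    trace_id = uuid.uuid4().hex
    root_span_id = _span_id()
    spans = []

    for job in jobs:
        job_span_id = _span_id()
        spans.append({
            "span_id": job_span_id,
            "parent_span_id": root_span_id,     # job spans hang off the workflow span
            "op": "job",
            "description": job["name"],
            "start_timestamp": job["started_at"],
            "timestamp": job["completed_at"],
        })
        for step in job.get("steps", []):
            spans.append({
                "span_id": _span_id(),
                "parent_span_id": job_span_id,  # step spans hang off their job span
                "op": "step",
                "description": step["name"],
                "start_timestamp": step["started_at"],
                "timestamp": step["completed_at"],
            })

    return {
        "type": "transaction",
        "transaction": f"workflow: {workflow_name}",
        "contexts": {"trace": {"trace_id": trace_id, "span_id": root_span_id, "op": "workflow"}},
        "spans": spans,
    }
```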

Files Changed

  • src/workflow_tracer.py - New WorkflowTracer implementation
  • src/web_app_handler.py - Enhanced with WorkflowJobCollector
  • src/enhanced_web_app_handler.py - Alternative handler implementation

Before

Screenshot 2025-10-01 at 7 13 42 AM

After

Screenshot 2025-10-01 at 7 14 45 AM

… structure

- Add WorkflowTracer class for creating single workflow transactions with nested job/step spans
- Implement WorkflowJobCollector to collect jobs and send workflow-level traces
- Fix Sentry payload structure to match expected format (remove event_id, fix timestamps, correct status mapping)
- Disable individual job transactions to prevent duplicate traces
- Add proper span hierarchy: workflow -> jobs -> steps
- Include trace_version tags for validation
- Add comprehensive logging and error handling
- Disable Flask automatic Sentry tracing to prevent interference
- Add documentation for local development and testing

This change transforms the tracing from individual job transactions to a single
workflow transaction containing all job and step spans, providing better
visibility into workflow timing and structure.
- Remove PR_DESCRIPTION.md (will be added manually to PR)
- Revert LOCAL_DEVELOPMENT.md changes
- Restore original Sentry Flask integration in main.py (no interference with WorkflowTracer)
- Restore original github_sdk.py send_trace method (GithubClient not used in current implementation)
These documentation files are not needed for the core feature implementation.
serglom21 requested a review from a team as a code owner · October 1, 2025 11:15
- Replace hardcoded job count (5) with dynamic thresholds
- Add job arrival time tracking for intelligent processing
- Implement timeout-based detection for small workflows
- Support workflows of any size (1+ jobs) with appropriate timing
- Add cleanup for arrival time tracking data
- Improve logging for better debugging

Smart thresholds:
- 10+ jobs: Process immediately when all complete
- 5-9 jobs: Process when all complete
- 3-4 jobs: Process when all complete
- 1-2 jobs: Process after 3s timeout or immediately if single job
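
A hedged sketch of how those thresholds could be applied; the function name, attribute names, and the 3-second constant below are assumptions drawn from the bullets above, not the exact code:

```python
# Hypothetical illustration of the threshold logic described above;
# names and structure are assumptions, not the actual implementation.
import time

SMALL_WORKFLOW_TIMEOUT_SECONDS = 3


def should_process_now(jobs: list[dict], first_arrival: float) -> bool:
    """Decide whether a workflow run's collected jobs should be processed now."""
    total = len(jobs)
    all_complete = all(job.get("status") == "completed" for job in jobs)

    if total == 1:
        return True  # single-job workflows are processed immediately
    if total >= 3:
        return all_complete  # 3+ jobs: wait until every job has completed
    # 1-2 jobs: fall back to a short timeout so small runs are never dropped
    return all_complete or (time.monotonic() - first_arrival) >= SMALL_WORKFLOW_TIMEOUT_SECONDS
```
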
Comment on lines -54 to -55
# Once the Sentry org has a .sentry repo we can remove the DSN from the deployment
dsn = fetch_dsn_for_github_org(org, token)

Potential bug: A hard-coded job count check, `if len(self.workflow_jobs[run_id]) >= 5`, prevents workflows with fewer than 5 jobs from ever being processed, leading to data loss.
  • Description: The workflow processing logic is only triggered for workflows with 5 or more jobs because of the hard-coded check `if len(self.workflow_jobs[run_id]) >= 5`. Workflows with fewer jobs never meet this condition, so their tracing data is permanently lost. The repository's own CI workflow has only 4 jobs and would be ignored by this logic. A nearby comment, "For testing, we'll wait for 5 jobs", suggests the check was intended for testing, but it affects production behavior. There is no fallback mechanism to process these smaller workflows.

  • Suggested fix: Remove the hard-coded job count check. Instead, use a more robust signal of workflow completion, such as the `_is_workflow_complete` method, which exists but is currently unused, or a notification that all jobs for a given run have been received (see the sketch below).
    severity: 0.8, confidence: 0.95
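
A hedged sketch of that suggested direction, assuming `_is_workflow_complete` accepts the run id and the collected jobs; the real signature and surrounding method may differ:

```python
# Hypothetical sketch of the suggested fix, as a method on the collector;
# assumes _is_workflow_complete exists (mentioned as present but unused).
def _on_job_completed(self, run_id: int) -> None:
    jobs = self.workflow_jobs.get(run_id, [])
    # Instead of the hard-coded `len(jobs) >= 5` gate, decide based on
    # whether the run reports all of its jobs as finished.
    if jobs and self._is_workflow_complete(run_id, jobs):
        self._process_workflow_immediately(run_id)
```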

Comment on lines +77 to +85
try:
    # Use the first job as the base for workflow metadata
    base_job = jobs[0]

    # Send workflow trace
    self.workflow_tracer.send_workflow_trace(base_job, jobs)

    # Clean up processed jobs
    del self.workflow_jobs[run_id]

Potential bug: An unhandled exception in the `_process_workflow_immediately` timer callback will cause the thread to fail silently, preventing cleanup and leading to a resource leak.
  • Description: The `_process_workflow_immediately` method, executed by a `threading.Timer`, lacks exception handling. If an exception occurs in this method or in methods it calls, such as `_send_workflow_trace` (e.g., due to network issues or malformed data, as seen in Sentry issue 6767327085), the timer thread terminates silently. The critical cleanup logic in the `finally` block then never runs: the workflow's data is not removed from `self.workflow_jobs` and `self.workflow_timers`, and the `run_id` is never added to `self.processed_workflows`. The result is a permanent resource leak and an inconsistent workflow state.

  • Suggested fix: Wrap the body of `_process_workflow_immediately` in a `try...except` block and log the exception in the `except` branch so failures are not silent. This lets the handler recover gracefully without leaking resources or corrupting state.
    severity: 0.7, confidence: 0.9

- Add try/except block to _process_workflow_immediately method
- Implement _cleanup_workflow_run helper for proper resource cleanup
- Ensure cleanup happens even when exceptions occur in timer threads
- Add comprehensive error logging with stack traces
- Prevent silent failures that could lead to resource leaks

This addresses the Seer bot feedback about unhandled exceptions in timer
callbacks that could leave workflow data in memory indefinitely; a sketch of
the pattern follows.
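
A hedged sketch of the error-handling pattern this commit describes; `_cleanup_workflow_run` mirrors the helper named above, but its exact body and the surrounding attributes are assumptions based on the review comment, not the literal patch:

```python
# Hypothetical sketch of the fix pattern, shown as collector methods;
# not the literal change applied in src/web_app_handler.py.
import logging

logger = logging.getLogger(__name__)


class WorkflowJobCollector:  # only the two methods touched by this commit
    def _process_workflow_immediately(self, run_id: int) -> None:
        try:
            jobs = self.workflow_jobs[run_id]
            base_job = jobs[0]
            self.workflow_tracer.send_workflow_trace(base_job, jobs)
        except Exception:
            # Log with a stack trace so timer-thread failures are never silent.
            logger.exception("Failed to send workflow trace for run %s", run_id)
        finally:
            # Always release per-run state, even when sending fails.
            self._cleanup_workflow_run(run_id)

    def _cleanup_workflow_run(self, run_id: int) -> None:
        self.workflow_jobs.pop(run_id, None)
        timer = self.workflow_timers.pop(run_id, None)
        if timer is not None:
            timer.cancel()
        self.processed_workflows.add(run_id)
```
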
bc-sentry requested a review from armenzg · October 1, 2025 21:57