Feat/hierarchical workflow tracing #73
base: main
… structure
- Add WorkflowTracer class for creating single workflow transactions with nested job/step spans
- Implement WorkflowJobCollector to collect jobs and send workflow-level traces
- Fix Sentry payload structure to match the expected format (remove event_id, fix timestamps, correct status mapping)
- Disable individual job transactions to prevent duplicate traces
- Add proper span hierarchy: workflow -> jobs -> steps
- Include trace_version tags for validation
- Add comprehensive logging and error handling
- Disable Flask automatic Sentry tracing to prevent interference
- Add documentation for local development and testing

This change transforms the tracing from individual job transactions to a single workflow transaction containing all job and step spans, providing better visibility into workflow timing and structure.
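The workflow -> jobs -> steps hierarchy described above can be sketched as a single Sentry transaction event whose spans point at their parents via parent_span_id. This is a minimal sketch, not the PR's WorkflowTracer: the helper name build_workflow_transaction and the GitHub-conclusion-to-Sentry-status mapping are assumptions. It assumes job dicts shaped like the workflow_job webhook payload (name, started_at, completed_at, conclusion, steps), only builds the event dict, and does not send it.

```python
import uuid
from datetime import datetime

# Assumed mapping from GitHub Actions conclusions to Sentry span statuses
# (illustrative only; the PR's actual mapping is not shown).
STATUS_MAP = {
    "success": "ok",
    "failure": "internal_error",
    "cancelled": "cancelled",
    "timed_out": "deadline_exceeded",
}


def _ts(iso: str) -> float:
    """Convert a GitHub ISO 8601 timestamp like '2024-01-01T00:00:00Z' to epoch seconds."""
    return datetime.fromisoformat(iso.replace("Z", "+00:00")).timestamp()


def _span_id() -> str:
    # Sentry span IDs are 16 hex characters.
    return uuid.uuid4().hex[:16]


def _span(trace_id, parent_id, op, item):
    """Build one span dict for a job or step (hypothetical helper)."""
    return {
        "trace_id": trace_id,
        "span_id": _span_id(),
        "parent_span_id": parent_id,
        "op": op,
        "description": item["name"],
        "start_timestamp": _ts(item["started_at"]),
        "timestamp": _ts(item["completed_at"]),
        "status": STATUS_MAP.get(item.get("conclusion"), "unknown"),
    }


def build_workflow_transaction(workflow_name: str, jobs: list) -> dict:
    """Hypothetical helper: one transaction, job spans under the root, step spans under jobs."""
    trace_id = uuid.uuid4().hex  # 32 hex characters
    root_span_id = _span_id()
    spans = []
    for job in jobs:
        job_span = _span(trace_id, root_span_id, "job", job)  # job -> workflow
        spans.append(job_span)
        for step in job.get("steps", []):
            spans.append(_span(trace_id, job_span["span_id"], "step", step))  # step -> job
    # The workflow transaction covers the earliest job start to the latest job end.
    return {
        "type": "transaction",
        "transaction": workflow_name,
        "start_timestamp": min(_ts(j["started_at"]) for j in jobs),
        "timestamp": max(_ts(j["completed_at"]) for j in jobs),
        "contexts": {"trace": {"trace_id": trace_id, "span_id": root_span_id, "op": "workflow"}},
        "spans": spans,
    }
```

Because every job span shares the transaction's trace_id and uses the root span_id as its parent, Sentry can render one waterfall per workflow run instead of one transaction per job.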
- Remove PR_DESCRIPTION.md (will be added manually to PR)
- Revert LOCAL_DEVELOPMENT.md changes
- Restore original Sentry Flask integration in main.py (no interference with WorkflowTracer)
- Restore original github_sdk.py send_trace method (GithubClient not used in current implementation)
These documentation files are not needed for the core feature implementation.
- Replace hardcoded job count (5) with dynamic thresholds
- Add job arrival time tracking for intelligent processing
- Implement timeout-based detection for small workflows
- Support workflows of any size (1+ jobs) with appropriate timing
- Add cleanup for arrival time tracking data
- Improve logging for better debugging

Smart thresholds:
- 10+ jobs: Process immediately when all complete
- 5-9 jobs: Process when all complete
- 3-4 jobs: Process when all complete
- 1-2 jobs: Process after 3s timeout or immediately if single job
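The smart thresholds listed above reduce to a small decision function. This is one possible reading, not the PR's code: should_process and seconds_since_last_arrival are hypothetical names, and since the 3+ job tiers all say "process when all complete" they collapse into one branch here. The 3 s value is the timeout named in the commit message.

```python
SMALL_WORKFLOW_TIMEOUT_S = 3.0  # the 3 s timeout named in the commit message


def should_process(job_count: int, all_complete: bool, seconds_since_last_arrival: float) -> bool:
    """Hypothetical decision helper for when to emit the workflow trace."""
    if job_count == 0 or not all_complete:
        return False
    if job_count >= 3:
        # 3-4, 5-9 and 10+ jobs: process once every collected job has completed.
        return True
    if job_count == 1:
        # Single job: process immediately.
        return True
    # 2 jobs: wait out the timeout in case more jobs of the run are still arriving.
    return seconds_since_last_arrival >= SMALL_WORKFLOW_TIMEOUT_S
```

The timeout matters only for tiny workflows, where a missing late job is most likely to go unnoticed; larger runs rely on the completion check alone.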
# Once the Sentry org has a .sentry repo we can remove the DSN from the deployment
dsn = fetch_dsn_for_github_org(org, token)
Potential bug: A hard-coded job count check, if len(self.workflow_jobs[run_id]) >= 5, prevents workflows with fewer than 5 jobs from ever being processed, leading to data loss.

Description: The workflow processing logic is only triggered for workflows with 5 or more jobs due to a hard-coded check: if len(self.workflow_jobs[run_id]) >= 5. Workflows with fewer jobs will never meet this condition, and their tracing data will be permanently lost. The analysis notes that the repository's own CI workflow has only 4 jobs and would be ignored by this logic. A comment, "For testing, we'll wait for 5 jobs", suggests this was intended for testing, but it affects production functionality. There is no fallback mechanism to process these smaller workflows.
Suggested fix: Remove the hard-coded job count check. Instead, implement a more robust mechanism to determine workflow completion, either by using the _is_workflow_complete method, which exists but is currently unused, or by waiting for a signal that all jobs for a given run have been received.
severity: 0.8, confidence: 0.95
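The suggested fix, replacing the fixed count with a completion check, might look like the sketch below. It is an assumption-laden illustration, not the PR's WorkflowJobCollector: JobCollector and add_job are hypothetical names, and jobs are assumed to be workflow_job webhook payloads carrying id, run_id and status.

```python
from collections import defaultdict


class JobCollector:
    """Hypothetical collector keyed by run_id (not the PR's WorkflowJobCollector)."""

    def __init__(self):
        # run_id -> {job_id -> latest payload for that job}
        self.workflow_jobs = defaultdict(dict)

    def add_job(self, job: dict) -> bool:
        """Record or update a job payload; return True when the run looks complete."""
        # Keying by job id means repeated webhooks for the same job (queued,
        # in_progress, completed) update one entry instead of inflating the count.
        self.workflow_jobs[job["run_id"]][job["id"]] = job
        return self._is_workflow_complete(job["run_id"])

    def _is_workflow_complete(self, run_id: int) -> bool:
        # .get() avoids creating an empty entry for unknown runs.
        jobs = self.workflow_jobs.get(run_id, {})
        return bool(jobs) and all(j["status"] == "completed" for j in jobs.values())
```

One caveat of this design: a job that has not sent any webhook yet is invisible, so a run can look complete too early. That is why pairing the check with a short grace period or timeout is reasonable.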
try:
    # Use the first job as the base for workflow metadata
    base_job = jobs[0]

    # Send workflow trace
    self.workflow_tracer.send_workflow_trace(base_job, jobs)

    # Clean up processed jobs
    del self.workflow_jobs[run_id]
Potential bug: An unhandled exception in the _process_workflow_immediately timer callback will cause the thread to fail silently, preventing cleanup and leading to a resource leak.

Description: The _process_workflow_immediately method, executed by a threading.Timer, lacks exception handling. If an exception occurs within this method or in methods it calls, like _send_workflow_trace (e.g., due to network issues or malformed data, as seen in Sentry issue 6767327085), the timer thread will terminate silently. This prevents the execution of critical cleanup logic in the finally block, such as removing the workflow's data from self.workflow_jobs and self.workflow_timers, and adding the run_id to self.processed_workflows. This results in a permanent resource leak and leaves the workflow in an inconsistent state.
Suggested fix: Wrap the contents of the _process_workflow_immediately method in a try...except block. In the except block, log the exception to ensure failures are not silent. This will allow the program to handle errors gracefully without causing resource leaks or state corruption.
severity: 0.7, confidence: 0.9
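A minimal sketch of the suggested fix: the timer callback logs any exception with its stack trace and always runs cleanup in finally. WorkflowProcessor, schedule and send_trace are hypothetical stand-ins; only the attribute and method names (workflow_jobs, workflow_timers, processed_workflows, _process_workflow_immediately, _cleanup_workflow_run) mirror those mentioned in the review and the follow-up commit.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class WorkflowProcessor:
    """Hypothetical sketch; send_trace stands in for the workflow tracer."""

    def __init__(self, send_trace):
        self.send_trace = send_trace  # callable taking the list of jobs for one run
        self.workflow_jobs = {}
        self.workflow_timers = {}
        self.processed_workflows = set()
        self.lock = threading.Lock()

    def schedule(self, run_id, delay_s=3.0):
        timer = threading.Timer(delay_s, self._process_workflow_immediately, args=(run_id,))
        timer.daemon = True
        with self.lock:
            self.workflow_timers[run_id] = timer
        timer.start()
        return timer

    def _process_workflow_immediately(self, run_id):
        try:
            with self.lock:
                jobs = list(self.workflow_jobs.get(run_id, []))
            if jobs:
                self.send_trace(jobs)  # may raise, e.g. on network errors
        except Exception:
            # logger.exception records the stack trace, so the failure is not silent.
            logger.exception("Failed to process workflow run %s", run_id)
        finally:
            # Runs on success and on failure, so per-run state never leaks.
            self._cleanup_workflow_run(run_id)

    def _cleanup_workflow_run(self, run_id):
        with self.lock:
            self.workflow_jobs.pop(run_id, None)
            self.workflow_timers.pop(run_id, None)
            self.processed_workflows.add(run_id)
```

The lock is released before send_trace runs, so a slow network call does not block webhook handlers that add jobs for other runs.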
- Add try/except block to _process_workflow_immediately method
- Implement _cleanup_workflow_run helper for proper resource cleanup
- Ensure cleanup happens even when exceptions occur in timer threads
- Add comprehensive error logging with stack traces
- Prevent silent failures that could lead to resource leaks

This addresses the Seer bot feedback about unhandled exceptions in timer callbacks that could cause workflow data to remain in memory indefinitely.
Summary
This PR transforms the GitHub Actions workflow tracing from individual job transactions to a single workflow transaction containing nested job and step spans, providing better visibility into workflow timing and structure.
Problem that I am attempting to solve
Before: Individual job transactions created a flat structure in Sentry, making it difficult to understand workflow timing and relationships.
After: Single workflow transaction with proper hierarchical spans showing:
Key Changes
Core Implementation:
Trace Structure:
Files Changed
- src/workflow_tracer.py: New WorkflowTracer implementation
- src/web_app_handler.py: Enhanced with WorkflowJobCollector
- src/enhanced_web_app_handler.py: Alternative handler implementation

Before

After