[FEATURE] Allow hooks to retry model invocations #5

@zastrowm

Problem Statement

In sdk-typescript, we allow model retries; that was implemented as part of strands-agents/sdk-typescript#222.

Folks would like to retry on arbitrary exceptions (see strands-agents#370) and I think we should let them.

Proposed Solution

Allow AfterModelCall hook callbacks to set a field to retry model invocation. For now it should only be allowed if an exception is thrown.

This does not replace our existing retry strategy; it makes it more flexible.

Use Case

Retrying model calls on exceptions

Implementation Requirements

Based on repository analysis and clarification discussion, here are the detailed requirements:

Technical Approach

Hook System Enhancement:

  • Add writable retry_model field to AfterModelCallEvent (boolean, default False)
  • Field should only be checked when exception attribute is present (not on successful calls)
  • Implement _can_write() method to allow modification of retry_model field
  • Multiple hooks can set/unset the field; final value is read after all callbacks execute
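The bullets above could look roughly like the following. This is a self-contained sketch under stated assumptions, not the SDK's code: `HookEvent` is a stand-in base class that is assumed to reject attribute writes unless `_can_write()` allows them, and `should_retry` is a hypothetical helper that runs the callbacks and then reads the final value.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


@dataclass
class HookEvent:
    """Stand-in base class (assumption): fields are read-only after construction."""

    agent: Any

    def _can_write(self, name: str) -> bool:
        return False

    def __post_init__(self) -> None:
        # Enable the write guard only after the generated __init__ has set all fields.
        super().__setattr__("_frozen", True)

    def __setattr__(self, name: str, value: Any) -> None:
        if getattr(self, "_frozen", False) and not self._can_write(name):
            raise AttributeError(f"Property {name} is not writable")
        super().__setattr__(name, value)


@dataclass
class AfterModelCallEvent(HookEvent):
    exception: Optional[Exception] = None
    retry_model: bool = False  # proposed field; hooks may overwrite it

    def _can_write(self, name: str) -> bool:
        # Only the retry flag is writable; everything else stays read-only.
        return name == "retry_model"


def should_retry(
    event: AfterModelCallEvent,
    callbacks: Iterable[Callable[[AfterModelCallEvent], None]],
) -> bool:
    """Hypothetical helper: run every callback, then read the final flag."""
    for callback in callbacks:
        callback(event)
    # The flag is only honored when the model call actually failed.
    return event.exception is not None and event.retry_model
```

Because the flag is read once after all callbacks have run, a later hook can veto an earlier hook's retry request simply by setting `retry_model` back to `False`.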

Retry Logic:

  • Hooks determine their own retry parameters (count, delay, conditions)
  • Hook-initiated retries are independent from existing throttle retry logic
  • Existing throttle retry can be conceptually viewed as a built-in retry mechanism
  • No validation on exception types - hooks decide what to retry
  • No maximum retry limit enforced by framework (hooks manage their own limits)
  • Hooks should implement their own delay logic (no delay parameter on event)

Integration with Existing Code:

  • Modify _handle_model_execution() in src/strands/event_loop/event_loop.py
  • Check retry_model field after invoking AfterModelCallEvent callbacks
  • If retry_model=True and exception exists, continue retry loop
  • Existing throttle retry logic should remain unchanged
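One way the integration described above might be structured is sketched below. Everything here is a simplified stand-in rather than the real `_handle_model_execution()`: the event and exception classes are redefined locally, `call_model`, `hooks`, and the injectable `sleep` parameter are hypothetical, and checking the hook flag before the throttle logic is an assumed ordering. Doubling the delay between throttle retries is likewise an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, List, Optional

MAX_ATTEMPTS = 6
INITIAL_DELAY = 4
MAX_DELAY = 240


class ModelThrottledException(Exception):
    """Local stand-in for the SDK's throttling exception."""


@dataclass
class AfterModelCallEvent:
    """Local stand-in carrying only the fields this sketch needs."""

    exception: Optional[Exception] = None
    retry_model: bool = False


Hook = Callable[[AfterModelCallEvent], Awaitable[None]]


async def handle_model_execution(
    call_model: Callable[[], Awaitable[Any]],
    hooks: List[Hook],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    throttle_attempts = 0
    delay = INITIAL_DELAY
    while True:
        try:
            result = await call_model()
        except Exception as exc:
            event = AfterModelCallEvent(exception=exc)
            for hook in hooks:
                await hook(event)
            if event.retry_model:
                # Hook-initiated retry: no framework-side limit or delay.
                continue
            if isinstance(exc, ModelThrottledException):
                # Built-in throttle retry, kept separate from hook retries.
                throttle_attempts += 1
                if throttle_attempts < MAX_ATTEMPTS:
                    await sleep(delay)
                    delay = min(delay * 2, MAX_DELAY)
                    continue
            raise
        else:
            event = AfterModelCallEvent()
            for hook in hooks:
                await hook(event)
            # retry_model is deliberately ignored on success.
            return result
```

Since the framework enforces no limit on hook-initiated retries, a hook that always sets `retry_model = True` on failure would loop indefinitely in this sketch; bounding the retries is the hook's responsibility.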

Files to Modify

  • src/strands/hooks/events.py - Add retry_model field and _can_write() to AfterModelCallEvent
  • src/strands/event_loop/event_loop.py - Integrate hook-initiated retries into _handle_model_execution()
  • tests/strands/agent/hooks/test_agent_events.py - Add unit tests for retry functionality

Acceptance Criteria

  • AfterModelCallEvent has writable retry_model: bool = False field
  • retry_model is only checked when exception is present (not on successful calls)
  • Hook callbacks can set retry_model=True to retry the model call
  • Multiple hooks can modify the field; final value after all callbacks is used
  • Hook-initiated retries coexist with the existing throttle retry mechanism without changing its behavior
  • Unit tests demonstrate retry behavior with mock exceptions (e.g., ServiceUnavailableException)
  • Unit tests verify retry only happens when exception exists
  • Unit tests verify hooks control their own retry logic (count, delay, conditions)
  • Existing throttle retry tests continue to pass
  • No integration tests required for this feature

Scope

  • In Scope: Regular model calls via Agent.__call__() and Agent.stream_async()
  • Out of Scope: structured_output invocations (per existing AfterModelCallEvent behavior)

Example Usage

A hook provider might implement retry logic like this:

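import asyncio

from strands.hooks import HookProvider, HookRegistry
from strands.hooks.events import AfterModelCallEvent


class RetryOnServiceUnavailable(HookProvider):
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.retry_counts = {}  # Retries used so far, keyed per agent

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(AfterModelCallEvent, self.handle_retry)

    async def handle_retry(self, event: AfterModelCallEvent) -> None:
        # Each attempt is expected to fire a fresh event object, so id(event)
        # would not match earlier attempts. Keying on the agent is an
        # assumption that holds for one in-flight model call per agent.
        key = id(event.agent)

        if event.exception and "ServiceUnavailable" in str(event.exception):
            count = self.retry_counts.get(key, 0)

            if count < self.max_retries:
                self.retry_counts[key] = count + 1
                await asyncio.sleep(2 ** count)  # Exponential backoff: 1s, 2s, 4s
                event.retry_model = True
                return

        # Success, a different exception, or max retries reached: reset the
        # counter and let the result or exception propagate.
        self.retry_counts.pop(key, None)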

Related Issues

Additional Context

The existing retry mechanism only handles ModelThrottledException with exponential backoff (MAX_ATTEMPTS=6, INITIAL_DELAY=4s, MAX_DELAY=240s). This feature enables users to implement custom retry logic for any exception type via hooks, providing the flexibility requested in issue strands-agents#370 without hardcoding specific exception types into the framework.
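To make the existing throttle constants concrete, the sketch below computes the wait schedule they would produce. It assumes the delay doubles after each throttled attempt and that no wait follows the final attempt; `throttle_delays` is a hypothetical helper, not an SDK function.

```python
from typing import List

MAX_ATTEMPTS = 6
INITIAL_DELAY = 4  # seconds
MAX_DELAY = 240  # seconds


def throttle_delays(
    max_attempts: int = MAX_ATTEMPTS,
    initial: int = INITIAL_DELAY,
    cap: int = MAX_DELAY,
) -> List[int]:
    """Return the waits between attempts, assuming the delay doubles each time."""
    delays = []
    delay = initial
    for _ in range(max_attempts - 1):  # no wait after the last attempt
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays


print(throttle_delays())       # [4, 8, 16, 32, 64]
print(sum(throttle_delays()))  # 124
```

Under these assumptions a fully throttled call waits about two minutes in total before the exception propagates, and the 240-second cap is never reached with only six attempts.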

Labels: enhancement