Conversation

@hazai hazai commented Oct 6, 2025

This PR replaces previous (closed) PR #1404

The PR contains a new nemoguardrails/cache folder with an LFU cache implementation (and interface).
The cache can be configured for any model.
It supports:

  • configuration
  • stats tracking
  • logging
  • thread safety
  • (very minimal) normalization of the cache key
@Pouyanpi @tgasser-nv

Readme

Content Safety LLM Call Caching

Overview

The content safety checks in actions.py now use an LFU (Least Frequently Used) cache to improve performance by avoiding redundant LLM calls for identical safety checks.

Implementation Details

Cache Configuration

  • Per-model caches: Each model gets its own LFU cache instance
  • Default capacity: 50,000 entries per model
  • Eviction policy: LFU with LRU tiebreaker
  • Statistics tracking: Disabled by default (configurable)
  • Tracks timestamps: created_at and accessed_at for each entry
  • Cache creation: Automatic when a model is initialized with cache enabled
  • Supported model types: Any non-main and non-embeddings model type (typically content safety models)

Cached Functions

content_safety_check_input() - Caches safety checks for user inputs

Cache Key Components

The cache key is generated from:

  • The rendered prompt (normalized for whitespace)

Since temperature is fixed (1e-20) and stop/max_tokens are derived from the model configuration, they don't need to be part of the cache key.
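
A minimal sketch of how such a key can be derived, assuming whitespace collapsing plus hashing (the function name and exact normalization are illustrative, not the PR's actual code):

import hashlib

def make_cache_key(rendered_prompt: str) -> str:
    # Collapse runs of whitespace so trivially different renderings of
    # the same prompt map to the same key.
    normalized = " ".join(rendered_prompt.split())
    # Hash the normalized prompt to get a short, fixed-size key.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()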

How It Works

  1. Before the LLM Call:

    • Generate the cache key from the request parameters
    • Check whether the result already exists in the cache
    • If found, return the cached result (cache hit)
  2. On a Cache Miss:

    • Make the actual LLM call
    • Store the result in the cache for future use
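
In code terms, the flow around the LLM call looks roughly like the sketch below (illustrative only; cache.get/put are the public cache methods mentioned under Thread Safety, while make_cache_key and llm_call are placeholders):

async def cached_safety_check(cache, rendered_prompt, llm_call):
    key = make_cache_key(rendered_prompt)  # see the sketch above

    cached = cache.get(key)
    if cached is not None:
        # Cache hit: return the stored result without calling the LLM.
        return cached

    # Cache miss: make the actual LLM call and store the result.
    result = await llm_call(rendered_prompt)
    cache.put(key, result)
    return result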

Cache Management

The caching system automatically creates and manages separate caches for each model. Key features:

  • Automatic Creation: Caches are created when the model is initialized with cache configuration
  • Isolated Storage: Each model maintains its own cache, preventing cross-model interference
  • Default Settings: Each cache has 50,000 entry capacity (configurable)
  • Per-Model Configuration: Cache is configured per model in the YAML configuration
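
Conceptually, the framework keeps one cache object per model so entries never mix across models. A toy illustration (plain dicts stand in for the real LFU caches; the helper name is hypothetical):

# model name -> its own cache; entries from different models never mix
model_caches = {}

def get_cache_for_model(model_name, capacity=50_000):
    if model_name not in model_caches:
        # In the real implementation this would be an LFU cache with the
        # configured capacity; a plain dict is used here only to show the
        # per-model isolation.
        model_caches[model_name] = {}
    return model_caches[model_name]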

Statistics and Monitoring

The cache supports detailed statistics tracking and periodic logging for monitoring cache performance:

models:
  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    cache:
      enabled: true
      capacity_per_model: 10000
      store: memory  # Currently only 'memory' is supported
      stats:
        enabled: true      # Enable stats tracking
        log_interval: 60.0 # Log stats every minute

Statistics Features:

  1. Tracking Only: Set stats.enabled: true with no log_interval to track stats without logging
  2. Automatic Logging: Set both stats.enabled: true and log_interval for periodic logging

Statistics Tracked:

  • Hits: Number of cache hits (successful lookups)
  • Misses: Number of cache misses (failed lookups)
  • Hit Rate: Percentage of requests served from cache
  • Evictions: Number of items removed due to capacity
  • Puts: Number of new items added to cache
  • Updates: Number of existing items updated
  • Current Size: Number of items currently in cache

Log Format:

LFU Cache Statistics - Size: 2456/10000 | Hits: 15234 | Misses: 2456 | Hit Rate: 86.11% | Evictions: 0 | Puts: 2456 | Updates: 0
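
The hit rate in the log line above is simply hits divided by total lookups:

hits, misses = 15234, 2456
hit_rate = 100 * hits / (hits + misses)  # ≈ 86.1%, as reported in the log line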

Usage Examples:

The cache is managed internally by the NeMo Guardrails framework. When you configure a model with caching enabled, the framework automatically:

  1. Creates an LFU cache instance for that model
  2. Passes the cache to content safety actions via kwargs
  3. Tracks statistics if configured
  4. Logs statistics at the specified interval

Configuration Options:

  • stats.enabled: Enable/disable statistics tracking (default: false)
  • stats.log_interval: Seconds between automatic stats logs (None = no logging)

Notes:

  • Stats logging requires stats tracking to be enabled
  • Logs appear at INFO level in the nemoguardrails.cache.lfu logger
  • Stats are reset when cache is cleared or when reset_stats() is called
  • Each model maintains independent statistics

Example Configuration

# config.yml
models:
  - type: main
    engine: openai
    model: gpt-4

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    cache:
      enabled: true
      capacity_per_model: 50000
      store: memory
      stats:
        enabled: true
        log_interval: 300.0  # Log stats every 5 minutes

rails:
  input:
    flows:
      - content safety check input model="content_safety"

Example Usage

import asyncio

from nemoguardrails import RailsConfig, LLMRails

# The cache is automatically configured based on your YAML config
config = RailsConfig.from_path("./config.yml")
rails = LLMRails(config)


async def main():
    # Content safety checks will be cached automatically
    response = await rails.generate_async(
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print(response)


asyncio.run(main())

Thread Safety

The content safety caching system is thread-safe for single-node deployments:

  1. LFUCache Implementation:

    • Uses threading.RLock for all operations
    • All public methods (get, put, size, clear, etc.) are protected by locks
    • Supports atomic get_or_compute() operations that prevent duplicate computations
  2. LLMRails Model Initialization:

    • Thread-safe cache creation during model initialization
    • Ensures only one cache instance per model across all threads
  3. Key Features:

    • No Data Corruption: Concurrent operations maintain data integrity
    • No Race Conditions: Proper locking prevents race conditions
    • Atomic Operations: get_or_compute() ensures expensive computations happen only once
    • Minimal Lock Contention: Efficient locking patterns minimize performance impact
  4. Usage in Web Servers:

    • Safe for use in multi-threaded web servers (FastAPI, Flask, etc.)
    • Handles concurrent requests without issues
    • Each thread sees consistent cache state

Note: This implementation is designed for single-node deployments. For distributed systems, consider using external caching solutions like Redis.
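
A toy sketch of the get_or_compute idea, holding a lock across the lookup and the computation so a missing key is only computed once (the PR's actual LFUCache signature and locking strategy may differ):

import threading

class AtomicCacheSketch:
    # Illustration only, not the PR's LFUCache.
    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def get_or_compute(self, key, compute):
        with self._lock:
            if key in self._data:
                return self._data[key]
            # The lock is still held here, so two threads racing on the
            # same missing key cannot both run the expensive computation.
            value = compute()
            self._data[key] = value
            return value

cache = AtomicCacheSketch()
value = cache.get_or_compute("key", lambda: "expensive result")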

Benefits

  1. Performance: Eliminates redundant LLM calls for identical inputs
  2. Cost Savings: Reduces API calls to LLM services
  3. Consistency: Ensures identical inputs always produce identical outputs
  4. Smart Eviction: LFU policy keeps frequently checked content in cache
  5. Model Isolation: Each model has its own cache, preventing interference between different safety models
  6. Statistics Tracking: Monitor cache performance with hit rates, evictions, and more per model
  7. Timestamp Tracking: Track when entries were created and last accessed
  8. Efficiency: LFU eviction algorithm ensures the most useful entries remain in cache
  9. Thread Safety: Safe for concurrent access in multi-threaded environments

Example Usage Pattern

# First call - takes ~500ms (LLM API call)
result = await content_safety_check_input(
    llms=llms,
    llm_task_manager=task_manager,
    model_name="content_safety",
    context={"user_message": "Hello world"}
)

# Subsequent identical calls - takes ~1ms (cache hit)
result = await content_safety_check_input(
    llms=llms,
    llm_task_manager=task_manager,
    model_name="content_safety",
    context={"user_message": "Hello world"}
)

Logging

The implementation includes debug logging:

  • Cache creation: "Created cache for model '{model_name}' with capacity {capacity}"
  • Cache hits: "Content safety cache hit for model '{model_name}'"
  • Cache stores: "Content safety result cached for model '{model_name}'"

Enable debug logging to monitor cache behavior:

import logging
logging.getLogger("nemoguardrails.library.content_safety.actions").setLevel(logging.DEBUG)
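
To also see the periodic cache statistics described in the Notes above, enable the cache logger named there as well (adjust the logger name if the cache module lives under a different path in your version):

import logging
logging.getLogger("nemoguardrails.cache.lfu").setLevel(logging.INFO)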

@hazai hazai self-assigned this Oct 6, 2025

codecov-commenter commented Oct 6, 2025

Codecov Report

❌ Patch coverage is 91.30435% with 34 lines in your changes missing coverage. Please review.

Files with missing lines              | Patch %  | Lines
nemoguardrails/llm/cache/lfu.py       | 87.61%   | 28 Missing ⚠️
nemoguardrails/llm/cache/interface.py | 86.04%   | 6 Missing ⚠️


@hazai hazai changed the title Feature/dynamic caching cs mod 2 dynamic caching for models Oct 6, 2025

github-actions bot commented Oct 6, 2025

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1436


@Copilot Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Pouyanpi Pouyanpi requested a review from Copilot October 10, 2025 05:14

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.




@Pouyanpi Pouyanpi left a comment

Thank you @hazai, looks very good overall 👍🏻

Please have a look at the review comments. We need to make sure the tests catch those edge cases.

@Pouyanpi

@hazai for telemetry and logging, we can get llm_stats_var after llm_call:

    result = await llm_call(
        llm,
        check_input_prompt,
        stop=stop,
        llm_params={"temperature": 1e-20, "max_tokens": max_tokens},
    )
    print("llm_stats_var:", llm_stats_var.get())

So, similar to the result, it should be cached; the second time, when the result is read from the cache, we should set it:

llm_stats_var.set(value_from_cache)

LangChain, for example, returns the token usage metrics as-is but changes the duration to the cache read duration.

@Pouyanpi Pouyanpi force-pushed the feature/dynamic-caching-cs-mod-2 branch from 7d379ca to 71ec05e Compare October 16, 2025 11:52
hazai added 4 commits October 16, 2025 15:02
…d interface)

add tests for lfu cache

new content safety dynamic cache + integration

add stats logging

remove redundant test

thread safety support for content-safety caching

fixed failing tests

update documentation to reflect thread-safety support for cache

fixes following test failures on race conditions

fixes following test failures

remove a test

update cache interface

per model config without defaults
@hazai hazai force-pushed the feature/dynamic-caching-cs-mod-2 branch from 71ec05e to 026980d Compare October 16, 2025 12:05

@Pouyanpi Pouyanpi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hazai thank you very much for the hard work in getting this feature ready. We are good to merge this PR 🚀

@Pouyanpi Pouyanpi changed the title dynamic caching for models feat(cache): Add LFU caching system for models (currently applied to content safety checks) Oct 17, 2025
@Pouyanpi Pouyanpi merged commit a588d4c into develop Oct 17, 2025
7 checks passed
@Pouyanpi Pouyanpi deleted the feature/dynamic-caching-cs-mod-2 branch October 17, 2025 08:23