feat(cache): Add LFU caching system for models (currently applied to content safety checks) #1436
Conversation
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Thank you @hazai, looks very good overall 👍🏻
Please have a look at the review comments. We need to make sure the tests catch those edge cases.
@hazai for telemetry and logging, we can get:

```python
result = await llm_call(
    llm,
    check_input_prompt,
    stop=stop,
    llm_params={"temperature": 1e-20, "max_tokens": max_tokens},
)
print("llm_stats_var:", llm_stats_var.get())
```

So, similar to the result, the stats should be cached, and the second time, when the result is read from the cache, we should set them. LangChain, for example, returns the token usage metrics as-is but changes the duration to the cache read duration.
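A rough sketch of the behavior described here, assuming the cache entry stores a stats snapshot next to the result (the helper name, entry shape, and `total_duration` key are illustrative, not actual NeMo Guardrails internals):

```python
import time

def read_through_cache(cache, key, compute):
    """Return (result, stats); on a hit, replay stats with a fresh duration."""
    start = time.monotonic()
    entry = cache.get(key)
    if entry is not None:
        result, cached_stats = entry
        stats = dict(cached_stats)
        # Token usage metrics are returned as-is; only the duration is
        # replaced with the time spent reading from the cache.
        stats["total_duration"] = time.monotonic() - start
        return result, stats

    result, stats = compute()  # the actual llm_call path
    cache.put(key, (result, stats))
    return result, stats
```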
Force-pushed 7d379ca to 71ec05e
…d interface)
- add tests for lfu cache
- new content safety dynamic cache + integration
- add stats logging
- remove redundant test
- thread safety support for content-safety caching
- fixed failing tests
- update documentation to reflect thread-safety support for cache
- fixes following test failures on race conditions
- fixes following test failures
- remove a test
- update cache interface per model config without defaults
Force-pushed 71ec05e to 026980d
Signed-off-by: Pouyan <[email protected]>
@hazai thank you very much for the hard work in getting this feature ready. We are good to merge this PR 🚀
This PR replaces previous (closed) PR #1404
The PR contains a new nemoguardrails/cache folder with an LFU cache implementation (and interface).
The cache can be configured for any model.
It supports:
- configuration
- stats tracking
- logging
- thread-safety
- (a very minimal) normalization of the cache key

@Pouyanpi @tgasser-nv
Readme
Content Safety LLM Call Caching
Overview
The content safety checks in `actions.py` now use an LFU (Least Frequently Used) cache to improve performance by avoiding redundant LLM calls for identical safety checks.

Implementation Details
Cache Configuration
- Tracks `created_at` and `accessed_at` timestamps for each entry
- Covers the `main` and other non-`embeddings` model types (typically content safety models)

Cached Functions
- `content_safety_check_input()` - Caches safety checks for user inputs

Cache Key Components
The cache key is generated from:
Since temperature is fixed (1e-20) and stop/max_tokens are derived from the model configuration, they don't need to be part of the cache key.
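As a rough illustration of such a scheme, the key could hash the model name together with the (minimally normalized) check prompt; the helper below is hypothetical, not the project's actual implementation:

```python
import hashlib

def make_cache_key(model_name: str, prompt: str) -> str:
    # Very minimal normalization, in the spirit of the feature list above:
    # strip surrounding whitespace so trivially different renderings of
    # the same prompt share a key.
    normalized = prompt.strip()
    payload = f"{model_name}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```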
How It Works
Before LLM Call: the cache is checked for an existing result for the same model and prompt; on a hit, the cached result is returned without invoking the LLM.
After LLM Call: on a miss, the safety check runs against the LLM and the result is stored in the cache for subsequent identical checks.
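A sketch of this flow (`cached_content_safety_check` is a hypothetical helper; `llm_call` is used as in the review snippet above and `make_cache_key` as in the earlier sketch):

```python
async def cached_content_safety_check(cache, llm, model_name,
                                      check_input_prompt, stop, max_tokens):
    key = make_cache_key(model_name, check_input_prompt)

    # Before the LLM call: return the cached result on a hit.
    cached = cache.get(key)
    if cached is not None:
        return cached

    # Cache miss: run the actual safety check.
    result = await llm_call(
        llm,
        check_input_prompt,
        stop=stop,
        llm_params={"temperature": 1e-20, "max_tokens": max_tokens},
    )

    # After the LLM call: store the result for identical future checks.
    cache.put(key, result)
    return result
```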
Cache Management
The caching system automatically creates and manages separate caches for each model. Key features:
Statistics and Monitoring
The cache supports detailed statistics tracking and periodic logging for monitoring cache performance:
Statistics Features:
- `stats.enabled: true` with no `log_interval` to track stats without logging
- `stats.enabled: true` and `log_interval` for periodic logging

Statistics Tracked:
Log Format:
Usage Examples:
The cache is managed internally by the NeMo Guardrails framework. When you configure a model with caching enabled, the framework automatically creates and manages a dedicated cache for that model, with no application code changes required.
Configuration Options:
- `stats.enabled`: Enable/disable statistics tracking (default: false)
- `stats.log_interval`: Seconds between automatic stats logs (None = no logging)

Notes:
- Stats are logged via the `nemoguardrails.cache.lfu` logger
- Stats are reset when `reset_stats()` is called

Example Configuration
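A configuration along these lines would exercise the options above. Only `stats.enabled` and `stats.log_interval` are named in this document; the `cache` block shape, `capacity` key, and engine/model values are assumptions for illustration:

```yaml
models:
  - type: main
    engine: openai          # assumed engine/model values
    model: gpt-4o
    cache:                  # assumed block name and shape
      enabled: true
      capacity: 1000        # assumed key: max entries before LFU eviction
      stats:
        enabled: true       # stats.enabled from the options above
        log_interval: 60    # stats.log_interval: log stats every 60 s
```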
Example Usage
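Usage is unchanged from the caller's perspective, since the cache lives inside the framework. A minimal sketch using the standard NeMo Guardrails entry points (the `./config` directory is assumed to enable caching as sketched above):

```python
from nemoguardrails import LLMRails, RailsConfig

# Load a guardrails config whose content safety model has caching enabled.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]

# First call runs the content safety LLM check and caches the result.
response = rails.generate(messages=messages)

# An identical input now hits the cache and skips the redundant LLM call.
response = rails.generate(messages=messages)
```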
Thread Safety
The content safety caching system is thread-safe for single-node deployments:
LFUCache Implementation:
- Uses `threading.RLock` for all operations
- All methods (`get`, `put`, `size`, `clear`, etc.) are protected by locks
- Provides atomic `get_or_compute()` operations that prevent duplicate computations

LLMRails Model Initialization:
Key Features:
- `get_or_compute()` ensures expensive computations happen only once

Usage in Web Servers:
Note: This implementation is designed for single-node deployments. For distributed systems, consider using external caching solutions like Redis.
Benefits
Example Usage Pattern
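A sketch of the pattern around `get_or_compute()` (the import path, constructor signature, and `get_or_compute(key, fn)` call shape are assumptions based on the description above):

```python
from nemoguardrails.cache.lfu import LFUCache  # assumed import path

cache = LFUCache(capacity=1000)  # assumed constructor signature

def expensive_check(prompt: str) -> str:
    # Stand-in for the real LLM-backed content safety check.
    return f"safe: {prompt!r}"

prompt = "How do I reset my password?"
key = f"content_safety:{prompt}"

# Atomic read-through: concurrent callers with the same key trigger the
# computation exactly once; everyone else reuses the stored result.
result = cache.get_or_compute(key, lambda: expensive_check(prompt))
```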
Logging
The implementation includes debug logging:
"Created cache for model '{model_name}' with capacity {capacity}"
"Content safety cache hit for model '{model_name}'"
"Content safety result cached for model '{model_name}'"
Enable debug logging to monitor cache behavior:
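For example (standard library logging; the cache logger name is taken from the notes above):

```python
import logging

logging.basicConfig(level=logging.INFO)

# Turn on debug output for the LFU cache module; widen this to the
# "nemoguardrails" namespace if the content safety messages are emitted
# from a different module.
logging.getLogger("nemoguardrails.cache.lfu").setLevel(logging.DEBUG)
```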