The intelligent model capability engine. Production-ready Python library with dynamic model discovery, capability-based selection, real-time streaming, and Pydantic-native architecture.
from chuk_llm import quick_question
print(quick_question("What is 2+2?"))  # "2 + 2 equals 4."
Revolutionary Registry System:
- Dynamic Model Discovery - No more hardcoded model lists, automatic capability detection
- Intelligent Selection - Find models by capabilities, cost, and quality tier
- Smart Queries - find_best(requires_tools=True, quality_tier="cheap")
- Pydantic V2 Native - Type-safe models throughout, no dictionary goop
- Async-First Architecture - True async/await with sync wrappers for convenience
- Layered Capability Resolution - Heuristics → YAML cache → Provider APIs
- Zero-Config - Pull a new Ollama model, use it immediately
Latest Models (December 2025):
- Gemini 2.5/3 Pro - 1M token context, adaptive thinking, multimodal (gemini-2.5-flash, gemini-3-pro-preview)
- Mistral Large 3 - 675B MoE, 41B active, Apache 2.0 (mistral-large-2512, ministral-8b-2512, ministral-14b-2512)
- DeepSeek V3.2 - 671B MoE, ultra-efficient at $0.27/M tokens (deepseek-chat, deepseek-reasoner)
Performance:
- 52x faster imports - Lazy loading reduces import time from 735ms to 14ms
- 112x faster client creation - Automatic thread-safe caching
- <0.015% overhead - Negligible library overhead vs API latency
See REGISTRY_COMPLETE.md for architecture details.
- Intelligent: Dynamic registry selects models by capabilities, not names
- Auto-Discovery: Pull new models, use immediately - no configuration needed
- Lightning Fast: Massive performance improvements (see Performance)
- Clean Tools API: Function calling without complexity - tools are just parameters
- Type-Safe: Pydantic V2 models throughout, no dictionary goop
- Async-Native: True async/await with sync wrappers when needed
- Built-in Analytics: Automatic cost and usage tracking with session isolation
- Production-Ready: Thread-safe caching, connection pooling, negligible overhead
# Core functionality
pip install chuk_llm
# Or with extras
pip install chuk_llm[redis] # Persistent sessions
pip install chuk_llm[cli] # Enhanced CLI experience
pip install chuk_llm[all]    # Everything
# Simplest approach - auto-detects available providers
from chuk_llm import quick_question
answer = quick_question("Explain quantum computing in one sentence")
# Provider-specific (auto-generated functions!)
from chuk_llm import ask_openai_sync, ask_claude_sync, ask_ollama_llama3_2_sync
response = ask_openai_sync("Tell me a joke")
response = ask_claude_sync("Write a haiku")
response = ask_ollama_llama3_2_sync("Explain Python")  # Auto-discovered!
from chuk_llm import ask
# Gemini 3 Pro - Advanced reasoning with 1M context
response = await ask(
"Explain consciousness vs intelligence in AI",
provider="gemini",
model="gemini-3-pro-preview"
)
# Mistral Large 3 - 675B MoE, Apache 2.0
response = await ask(
"Write a Python function for binary search",
provider="mistral",
model="mistral-large-2512"
)
# Ministral 8B - Fast, efficient, cost-effective
response = await ask(
"Summarize this text",
provider="mistral",
model="ministral-8b-2512"
)
# DeepSeek V3.2 - Ultra-efficient at $0.27/M tokens
response = await ask(
"Solve this math problem step by step",
provider="deepseek",
model="deepseek-chat"
)
import asyncio
from chuk_llm import ask, stream
async def main():
    # Async call
    response = await ask("What's the capital of France?")

    # Real-time streaming
    async for chunk in stream("Write a story"):
        print(chunk, end="", flush=True)

asyncio.run(main())
from chuk_llm import ask
from chuk_llm.api.tools import tools_from_functions
def get_weather(location: str) -> dict:
    return {"temp": 22, "location": location, "condition": "sunny"}
# Tools are just a parameter!
toolkit = tools_from_functions(get_weather)
response = await ask(
"What's the weather in Paris?",
tools=toolkit.to_openai_format()
)
print(response)  # Returns dict with tool_calls when tools provided
# Quick commands with global aliases
chuk-llm ask_gpt "What is Python?"
chuk-llm ask_claude "Explain quantum computing"
# Auto-discovered Ollama models work instantly
chuk-llm ask_ollama_gemma3 "Hello world"
chuk-llm stream_ollama_mistral "Write a long story"
# llama.cpp with automatic model resolution
chuk-llm ask "What is Python?" --provider llamacpp --model qwen3
chuk-llm ask "Count to 5" --provider llamacpp --model llama3.2
# Discover new models
chuk-llm discover ollama
The registry is the intelligent core of chuk-llm. Instead of hardcoding model names, it dynamically discovers models and their capabilities, then selects the best one for your needs.
from chuk_llm.registry import get_registry
from chuk_llm import ask
# Get the registry (auto-discovers all available models)
registry = await get_registry()
# Find the best cheap model with tool support
model = await registry.find_best(
requires_tools=True,
quality_tier="cheap"
)
print(f"Selected: {model.spec.provider}:{model.spec.name}")
# Selected: groq:llama-3.3-70b-versatile
# Use the selected model with ask()
response = await ask(
"Summarize this document",
provider=model.spec.provider,
model=model.spec.name
)
# Find best model for vision with large context
model = await registry.find_best(
requires_vision=True,
min_context=128_000,
quality_tier="balanced"
)
# Returns: openai:gpt-4o-mini or gemini:gemini-2.0-flash-exp
# Custom queries with multiple requirements
from chuk_llm.registry import ModelQuery
results = await registry.query(ModelQuery(
requires_tools=True,
requires_vision=True,
min_context=100_000,
max_cost_per_1m_input=2.0,
quality_tier="balanced"
))
3-Tier Capability Resolution:
- Heuristic Resolver - Infers capabilities from model name patterns (e.g., "gpt-4" → likely supports tools); a simplified illustration follows this list
- YAML Cache - Tested capabilities stored in registry/capabilities/*.yaml for fast, reliable access
- Provider APIs - Queries provider APIs dynamically (Ollama /api/tags, Gemini models API, etc.)
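To make the heuristic tier concrete, here is a deliberately simplified, self-contained sketch of name-pattern matching. The patterns, function name, and capability keys are invented for illustration and are not chuk-llm's actual resolver:
import re

# Hypothetical name-pattern heuristics; chuk-llm's real resolver is richer and data-driven.
_PATTERNS = [
    (re.compile(r"gpt-4|gpt-5|o3"), {"tools": True, "vision": True}),
    (re.compile(r"llama-?3"), {"tools": True}),
    (re.compile(r"vision|-vl"), {"vision": True}),
]

def guess_capabilities(model_name: str) -> dict:
    """Infer likely capabilities from a model name (illustration only)."""
    caps = {"tools": False, "vision": False}
    for pattern, hints in _PATTERNS:
        if pattern.search(model_name.lower()):
            caps.update(hints)
    return caps

print(guess_capabilities("gpt-4o-mini"))    # {'tools': True, 'vision': True}
print(guess_capabilities("llama-3.3-70b"))  # {'tools': True, 'vision': False}
Cached YAML entries and live provider APIs then refine or override these guesses.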
Dynamic Discovery Sources:
- OpenAI /v1/models API
- Anthropic known models
- Google Gemini models API
- Ollama /api/tags (local models)
- llama.cpp /v1/models (local GGUF + Ollama bridge)
- DeepSeek /v1/models API
- Moonshot AI /v1/models API
- Groq, Mistral, Perplexity, and more
Provider APIs are cached on disk and refreshed periodically (or via chuk-llm discover), so new models appear without needing a chuk-llm release.
Benefits:
- ✅ No hardcoded model lists - Pull new Ollama models, use immediately
- ✅ Capability-based selection - Declare requirements, not model names (see the helper sketch below)
- ✅ Cost-aware - Find cheapest model that meets requirements
- ✅ Quality tiers - BEST, BALANCED, CHEAP classification
- ✅ Extensible - Add custom sources and resolvers via protocols
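For example, a small helper can turn "declare requirements, not model names" into a one-liner for callers. This sketch only uses get_registry, find_best, and ask as shown above; the behaviour when no model matches is an assumption here, so adjust it to the registry's actual semantics:
from chuk_llm import ask
from chuk_llm.registry import get_registry

async def ask_with_requirements(prompt: str, **requirements) -> str:
    """Pick a model by capability, then ask it (illustrative helper)."""
    registry = await get_registry()
    model = await registry.find_best(**requirements)
    if model is None:  # Assumption: find_best may return None when nothing matches
        raise RuntimeError(f"No model satisfies {requirements}")
    return await ask(prompt, provider=model.spec.provider, model=model.spec.name)

# Usage: cheapest tool-capable model, whichever provider happens to be available
# answer = await ask_with_requirements(
#     "Summarize this document", requires_tools=True, quality_tier="cheap"
# )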
Pull new Ollama models and use them immediately - no configuration needed:
# Terminal 1: Pull a new model
ollama pull llama3.2
ollama pull mistral-small:latest
# Terminal 2: Use immediately in Python
from chuk_llm import ask_ollama_llama3_2_sync, ask_ollama_mistral_small_latest_sync
response = ask_ollama_llama3_2_sync("Hello!")
# Or via CLI
chuk-llm ask_ollama_mistral_small_latest "Tell me a joke"
Run local GGUF models with advanced control via llama.cpp server. Reuse Ollama's downloaded models without re-downloading!
CLI Usage (✨ now fully supported!):
# Simple usage - model names automatically resolve to GGUF files
chuk-llm ask "What is Python?" --provider llamacpp --model qwen3
chuk-llm ask "Count to 5" --provider llamacpp --model llama3.2
# Streaming (default)
chuk-llm ask "Write a story" --provider llamacpp --model qwen3
# Non-streaming
chuk-llm ask "Quick question" --provider llamacpp --model qwen3 --no-streamPython API (Simple - Recommended):
from chuk_llm import ask
# Model names automatically resolve to Ollama's GGUF files!
response = await ask(
"What is Python?",
provider="llamacpp",
model="qwen3" # Auto-resolves to ~/.ollama/models/blobs/sha256-xxx
)
print(response)
# Streaming
from chuk_llm import stream
async for chunk in stream("Tell me a story", provider="llamacpp", model="llama3.2"):
print(chunk, end="", flush=True)Python API (Advanced - Full Control):
from chuk_llm.registry.resolvers.llamacpp_ollama import discover_ollama_models
from chuk_llm.llm.providers.llamacpp_client import LlamaCppLLMClient
from chuk_llm.core import Message, MessageRole
# Discover Ollama models (finds GGUF blobs in ~/.ollama/models/blobs/)
models = discover_ollama_models()
print(f"Found {len(models)} Ollama models") # e.g., "Found 48 Ollama models"
# Create client with auto-managed server
client = LlamaCppLLMClient(
model=str(models[0].gguf_path), # Reuse Ollama's GGUF!
ctx_size=8192,
n_gpu_layers=-1, # Use all GPU layers
)
messages = [Message(role=MessageRole.USER, content="Hello!")]
result = await client.create_completion(messages=messages)
print(result["response"])
# Cleanup
await client.stop_server()
Key Features:
- ✅ CLI Support - Full integration with chuk-llm CLI (model name resolution)
- ✅ Ollama Bridge - Automatically discovers and reuses Ollama's downloaded models (no re-download!)
- ✅ Auto-Resolution - Model names (qwen3, llama3.2) resolve to GGUF file paths automatically
- ✅ Process Management - Auto-managed server lifecycle (start/stop/health checks)
- ✅ OpenAI-Compatible - Uses standard OpenAI client (streaming, tools, etc.)
- ✅ High Performance - Benchmarks show llama.cpp is 1.53x faster than Ollama (311 vs 204 tok/s)
- ✅ Advanced Control - Custom sampling, grammars, GPU layers, context size
- ✅ Cross-Platform - Works on macOS, Linux, Windows
Performance Comparison (same GGUF file, qwen3:0.6b):
- llama.cpp: 311.4 tok/s
- Ollama: 204.2 tok/s
- llama.cpp is 1.53x faster!
See examples/providers/llamacpp_ollama_usage_examples.py and examples/providers/benchmark_ollama_vs_llamacpp.py for full examples.
Every call is automatically tracked for analytics:
from chuk_llm import ask_sync, get_session_stats
ask_sync("What's the capital of France?")
ask_sync("What's 2+2?")
stats = get_session_stats()
print(f"Total cost: ${stats['estimated_cost']:.6f}")
print(f"Total tokens: {stats['total_tokens']}")Build conversational AI with memory:
from chuk_llm import conversation
async with conversation() as chat:
    await chat.ask("My name is Alice")
    response = await chat.ask("What's my name?")
    # AI responds: "Your name is Alice"
Run multiple queries in parallel for massive speedups:
import asyncio
from chuk_llm import ask
# 3-7x faster than sequential!
responses = await asyncio.gather(
ask("What is AI?"),
ask("Capital of Japan?"),
ask("Meaning of life?")
)
All providers are dynamically discovered via the registry system - no hardcoded model lists!
| Provider | Discovery Method | Special Features | Status |
|---|---|---|---|
| OpenAI | /v1/models API | GPT-5 / GPT-5.1, o3-family reasoning, industry standard | Dynamic |
| Azure OpenAI | Deployment config | SOC2, HIPAA compliant, VNet, multi-region | Dynamic |
| Anthropic | Known models | Claude 3.5 Sonnet, advanced reasoning, 200K context | Static† |
| Google Gemini | Models API | Gemini 2.5/3 Pro, 1M token context, adaptive thinking, multimodal | Dynamic |
| Groq | /v1/models API | Llama 3.3, ultra-fast (our benchmarks: ~526 tok/s) | Dynamic |
| Ollama | /api/tags | Any local model, auto-discovery, offline, privacy | Dynamic |
| llama.cpp | /v1/models | Local GGUF models, Ollama bridge, advanced control | Dynamic |
| IBM watsonx | Known models | Granite 3.3, enterprise, on-prem, compliance | Static† |
| Perplexity | Known models | Sonar, real-time web search, citations | Static† |
| Mistral | Known models | Large 3 (675B MoE), Ministral 3 (3B/8B/14B), Apache 2.0 | Static† |
| DeepSeek | /v1/models API | DeepSeek V3.2 (671B MoE), ultra-efficient, $0.27/M tokens | Dynamic |
| Moonshot AI | /v1/models API | Kimi K2, 256K context, coding, Chinese language | Dynamic |
| OpenRouter | Known models | Access to 100+ models via single API | Static† |
† Static = discovered from curated model list + provider docs, not via a /models endpoint
Capabilities (auto-detected by registry):
- ✅ Streaming responses
- ✅ Function calling / tool use
- ✅ Vision / multimodal inputs
- ✅ JSON mode / structured outputs
- ✅ Async and sync interfaces
- ✅ Automatic client caching
- ✅ Session tracking
- ✅ Conversation management
# API Keys - Cloud Providers
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..." # For Gemini 2.5/3 models
export GROQ_API_KEY="..."
export DEEPSEEK_API_KEY="..." # For DeepSeek V3.2 (chat/reasoner)
export MOONSHOT_API_KEY="..."
export MISTRAL_API_KEY="..." # For Mistral Large 3 & Ministral 3
# Azure Configuration
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
# Local Servers
# (No API keys needed for Ollama or llama.cpp)
# Session Storage (optional)
export SESSION_PROVIDER=redis # Default: memory
export SESSION_REDIS_URL=redis://localhost:6379/0
# Performance Settings
export CHUK_LLM_CACHE_CLIENTS=1 # Enable client caching (default: 1)
export CHUK_LLM_AUTO_DISCOVER=true # Auto-discover new models (default: true)
from chuk_llm import configure, ask_sync
configure(
provider="azure_openai",
model="gpt-4o-mini",
temperature=0.7
)
# All subsequent calls use these settings
response = ask_sync("Hello!")Automatic client caching is enabled by default for maximum performance:
from chuk_llm.llm.client import get_client
# First call creates client (~12ms)
client1 = get_client("openai", model="gpt-4o")
# Subsequent calls return cached instance (~125µs)
client2 = get_client("openai", model="gpt-4o")
assert client1 is client2 # Same instance!
# Disable caching for specific call
client3 = get_client("openai", model="gpt-4o", use_cache=False)
# Monitor cache performance
from chuk_llm.client_registry import print_registry_stats
print_registry_stats()
# Cache statistics:
# - Total clients: 1
# - Cache hits: 1
# - Cache misses: 1
# - Hit rate: 50.0%
ChukLLM provides a clean, unified API for function calling. Recommended approach: use the Tools class for automatic execution.
from chuk_llm import Tools, tool
# Recommended: Class-based tools with auto-execution
class MyTools(Tools):
    @tool(description="Get weather for a city")
    def get_weather(self, location: str) -> dict:
        return {"temp": 22, "location": location, "condition": "sunny"}

    @tool  # Description auto-extracted from docstring
    def calculate(self, expr: str) -> float:
        """Evaluate a mathematical expression"""
        return eval(expr)

# Auto-executes tools and returns final response
tools = MyTools()
response = await tools.ask("What's the weather in Paris and what's 2+2?")
print(response)  # "The weather in Paris is 22°C and sunny. 2+2 equals 4."

# Sync version
response = tools.ask_sync("Calculate 15 * 4")
print(response)  # "15 * 4 equals 60"
Alternative: Direct API usage (for more control):
from chuk_llm import ask
from chuk_llm.api.tools import tools_from_functions
def get_weather(location: str) -> dict:
    """Get weather information for a location"""
    return {"temp": 22, "location": location}
# Create toolkit
toolkit = tools_from_functions(get_weather)
# Returns dict with tool_calls - you handle execution
response = await ask(
"What's the weather in Paris?",
tools=toolkit.to_openai_format()
)
print(response) # {"response": "...", "tool_calls": [...]}from chuk_llm import stream
# Streaming with tools
async for chunk in stream(
    "What's the weather in Tokyo?",
    tools=toolkit.to_openai_format(),
    return_tool_calls=True  # Include tool calls in stream
):
    if isinstance(chunk, dict):
        print(f"Tool call: {chunk['tool_calls']}")
    else:
        print(chunk, end="", flush=True)
Conversation Branching
async with conversation() as chat:
    await chat.ask("Planning a vacation")

    # Explore different options
    async with chat.branch() as japan_branch:
        await japan_branch.ask("Tell me about Japan")

    async with chat.branch() as italy_branch:
        await italy_branch.ask("Tell me about Italy")

    # Main conversation unaffected by branches
    await chat.ask("I'll go with Japan!")
Provider Comparison
from chuk_llm import compare_providers
results = compare_providers(
"Explain quantum computing",
["openai", "anthropic", "groq", "ollama"]
)
for provider, response in results.items():
    print(f"{provider}: {response[:100]}...")
Intelligent System Prompts
ChukLLM automatically generates optimized system prompts based on provider capabilities:
# Each provider gets optimized prompts
response = ask_claude_sync("Help me code", tools=tools)
# Claude gets: "You are Claude, an AI assistant created by Anthropic..."
response = ask_openai_sync("Help me code", tools=tools)
# OpenAI gets: "You are a helpful assistant with function calling..."
# Quick access to any model
chuk-llm ask_gpt "Your question"
chuk-llm ask_claude "Your question"
chuk-llm ask_ollama_llama3_2 "Your question"
# llama.cpp with automatic model resolution
chuk-llm ask "Your question" --provider llamacpp --model qwen3
chuk-llm ask "Your question" --provider llamacpp --model llama3.2
# Discover and test
chuk-llm discover ollama # Find new models
chuk-llm test llamacpp # Test llamacpp provider
chuk-llm test azure_openai # Test connection
chuk-llm providers # List all providers
chuk-llm models ollama # Show available models
chuk-llm functions # List all generated functions
# Advanced usage
chuk-llm ask "Question" --provider azure_openai --model gpt-4o-mini --json
chuk-llm ask "Question" --provider llamacpp --model qwen3 --no-stream
chuk-llm ask "Question" --stream --verbose
# Function calling / Tool use from CLI
chuk-llm ask "Calculate 15 * 4" --tools calculator_tools.py
chuk-llm stream "What's the weather?" --tools weather_tools.py --return-tool-calls
# Zero-install with uvx
uvx chuk-llm ask_claude "Hello world"
uvx chuk-llm ask "Question" --provider llamacpp --model qwen3chuk-llm is designed for production with negligible overhead:
| Operation | Time | Notes |
|---|---|---|
| Import | 14ms | 52x faster than eager loading |
| Client creation (cached) | 125µs | 112x faster, thread-safe |
| Request overhead | 50-140µs | <0.015% of typical API call |
- Automatic client caching - Thread-safe, 112x faster repeated operations
- Lazy imports - Only load what you use
- Connection pooling - Efficient HTTP/2 reuse
- Async-native - Built on asyncio for maximum throughput (see the concurrency sketch below)
- Smart caching - Model discovery results cached on disk
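To see the async-native design in practice, the sketch below times a few concurrent ask() calls against a sequential loop, using only ask and the standard library. The prompts are arbitrary, and the measured speedup depends entirely on provider latency:
import asyncio
import time
from chuk_llm import ask

PROMPTS = ["What is AI?", "Capital of Japan?", "Define entropy."]

async def sequential() -> float:
    start = time.perf_counter()
    for prompt in PROMPTS:
        await ask(prompt)
    return time.perf_counter() - start

async def concurrent() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(ask(prompt) for prompt in PROMPTS))
    return time.perf_counter() - start

async def main() -> None:
    seq = await sequential()
    con = await concurrent()
    print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s ({seq / con:.1f}x faster)")

asyncio.run(main())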
Run comprehensive benchmarks:
uv run python benchmarks/benchmark_client_registry.py
uv run python benchmarks/llm_benchmark.py
See PERFORMANCE_OPTIMIZATIONS.md for detailed analysis and micro-benchmarks.
ChukLLM uses a registry-driven, async-native architecture optimized for production use:
- Dynamic Registry - Models discovered and selected by capabilities, not names
- Pydantic V2 Native - Type-safe models throughout, no dictionary goop
- Async-First - Built on asyncio with sync wrappers for convenience
- Stateless Clients - Clients don't store conversation history; your application manages state
- Lazy Loading - Modules load on-demand for instant imports (14ms)
- Automatic Caching - Thread-safe client registry eliminates duplicate initialization
User Code
↓
import chuk_llm (14ms - lazy loading)
↓
get_client() (2µs - cached registry lookup)
↓
[Cached Client Instance]
↓
async ask() (~50µs - minimal overhead)
↓
Provider SDK (~50µs - efficient request building)
↓
HTTP Request (50-500ms - network I/O)
↓
Response Parsing (~50µs - orjson)
↓
Return to User
Total chuk-llm Overhead: ~150µs (<0.015% of API call)
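If you want to sanity-check the import figure on your own machine, a throwaway measurement like the following is enough (results vary with hardware, Python version, and disk cache):
import time

start = time.perf_counter()
import chuk_llm  # Lazy loading: submodules are deferred until first use
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"import chuk_llm took {elapsed_ms:.1f}ms")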
Important: Conversation history is NOT shared between calls. Each conversation is independent:
from chuk_llm.llm.client import get_client
from chuk_llm.core.models import Message
client = get_client("openai", model="gpt-4o")
# Conversation 1
conv1 = [Message(role="user", content="My name is Alice")]
response1 = await client.create_completion(conv1)
# Conversation 2 (completely separate)
conv2 = [Message(role="user", content="What's my name?")]
response2 = await client.create_completion(conv2)
# AI won't know the name - conversations are isolated!
Key Insights:
- ✅ Clients are stateless (safe to cache and share)
- ✅ Conversation state lives in YOUR application (see the sketch below)
- ✅ HTTP sessions shared for performance (connection pooling)
- ✅ No cross-conversation or cross-user leakage
- ✅ Thread-safe for concurrent use
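Because conversation state lives in your application, a common pattern is to keep your own message list and send the full history on every call. This sketch builds on the get_client/Message example above; it assumes create_completion returns a dict with a "response" key (as in the llama.cpp example) and that the Message model accepts a plain "assistant" role string:
from chuk_llm.llm.client import get_client
from chuk_llm.core.models import Message

client = get_client("openai", model="gpt-4o")
history: list[Message] = []

async def chat(user_text: str) -> str:
    """Append to app-owned history and send the full context each call."""
    history.append(Message(role="user", content=user_text))
    result = await client.create_completion(history)
    reply = result["response"]  # Assumption: dict result with a "response" key
    history.append(Message(role="assistant", content=reply))
    return reply

# await chat("My name is Alice")
# await chat("What's my name?")  # Now the model sees the earlier turn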
See CONVERSATION_ISOLATION.md for detailed architecture.
chuk-llm/
├── api/                  # Public API (ask, stream, conversation)
├── registry/             # Dynamic model registry (THE BRAIN)
│   ├── core.py           # ModelRegistry orchestrator
│   ├── models.py         # Pydantic models (ModelSpec, ModelCapabilities)
│   ├── sources/          # Discovery sources (OpenAI, Ollama, Gemini, etc.)
│   └── resolvers/        # Capability resolvers (Heuristic, YAML, APIs)
├── core/                 # Pydantic V2 models (Message, Tool, ContentPart)
│   ├── models.py         # Core Pydantic models
│   ├── enums.py          # Type-safe enums (Provider, Feature, etc.)
│   └── constants.py      # Constants
├── llm/
│   ├── providers/        # 15+ provider implementations
│   ├── client.py         # Client factory with registry integration
│   └── features.py       # Feature detection
├── configuration/        # Unified configuration system
└── client_registry.py    # Thread-safe client caching
chuk-llm is the canonical LLM layer for the entire CHUK ecosystem:
- chuk-ai-planner uses the registry to select planning vs drafting models by capability
- chuk-acp-agent uses capability-based policies per agent (e.g., "requires tools + 128k context")
- chuk-mcp-remotion uses it to pick video-script models with vision + long context
Instead of hardcoding "use GPT-4o", CHUK components declare what they need, and the registry finds the best available model.
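A downstream CHUK component might express such a policy roughly as follows. The function name and requirement values are hypothetical; only get_registry, find_best, and ask come from the API shown earlier:
from chuk_llm import ask
from chuk_llm.registry import get_registry

async def draft_plan(task: str) -> str:
    """Hypothetical planner step: declare needs, let the registry choose."""
    registry = await get_registry()
    model = await registry.find_best(
        requires_tools=True,    # the agent calls tools
        min_context=128_000,    # plans may reference large documents
        quality_tier="balanced",
    )
    return await ask(
        f"Draft a step-by-step plan for: {task}",
        provider=model.spec.provider,
        model=model.spec.name,
    )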
- Full Documentation
- Examples (33)
- Performance Optimizations
- Client Registry
- Lazy Imports
- Conversation Isolation
- Registry System
- Debug Tools - Test OpenAI-compatible API capabilities
- Migration Guide
- Contributing
| Feature | chuk-llm | LangChain | LiteLLM | OpenAI SDK |
|---|---|---|---|---|
| Import speed | ⚡ 14ms | 1-2s | 500ms+ | Fast |
| Client caching | ✅ Auto (112x) | ❌ | ❌ | ❌ |
| Auto-discovery | ✅ | ❌ | ❌ | ❌ |
| Native streaming | ✅ | ✅ | ✅ | ✅ |
| Function calling | ✅ Clean API | Complex | ✅ | ✅ |
| Session tracking | ✅ Built-in | | | |
| Session isolation | ✅ Guaranteed | | | |
| CLI included | ✅ | | ✅ | |
| Provider functions | ✅ Auto-generated | ❌ | ❌ | ❌ |
| Conversations | ✅ Branching | ✅ | | |
| Thread-safe | ✅ | | | |
| Async-native | ✅ | ✅ | ✅ | ✅ |
| Setup complexity | Simple | Complex | Simple | Simple |
| Dependencies | Minimal | Heavy | Moderate | Minimal |
| Performance overhead | <0.015% | ~2-5% | ~1-2% | Minimal |
| Command | Features | Use Case |
|---|---|---|
| pip install chuk_llm | Core + Session tracking | Development |
| pip install chuk_llm[redis] | + Redis persistence | Production |
| pip install chuk_llm[cli] | + Rich CLI formatting | CLI tools |
| pip install chuk_llm[all] | Everything | Full features |
MIT License - see LICENSE file for details.
- Issues
- Discussions
- Email
Built with ❤️ for developers who just want their LLMs to work.