Everything You Need to Know About LiteLLM Python SDK

This guide provides a comprehensive deep-dive into the LiteLLM Python SDK, designed to help you effectively integrate 100+ Large Language Models (LLMs) into your applications using a single, consistent, and powerful interface.

This guide was prepared by Yiğit for use with Cursor (think of it as an llms.txt for the LiteLLM Python SDK). For easy copying and pasting, use this Gist link: https://gist.github.com/yigitkonur/1ff98e1995fbc40fec538b0a56116fc4 (click Raw for a copy-paste-friendly view).

Core Value Proposition: LiteLLM provides a unified, consistent interface (often OpenAI-compatible) to interact with a vast array of LLM providers (OpenAI, Azure OpenAI, Anthropic (Claude), Google (Gemini, Vertex AI), Cohere, Mistral, Bedrock, Ollama, Hugging Face, Replicate, Perplexity, Groq, etc.). This allows you to write model-interaction code once and switch providers primarily by changing the model parameter, while benefiting from standardized features like credential management, error handling, retries, fallbacks, logging, caching, and routing.

Who Should Use This Guide:

  • Developers needing a comprehensive reference for all LiteLLM features, parameters, and utilities.
  • Engineers building complex applications requiring fine-grained control over LLM interactions across multiple providers.
  • Teams standardizing LLM access who need detailed documentation on configuration, routing, and error handling.
  • Advanced users integrating custom models, prompt templates, or using adapters for specific workflows.
  • Anyone encountering specific issues who needs detailed information on internal helpers, parameter mapping, or less common functionalities.

Structure of This Guide:
This document provides in-depth coverage of: Installation (including source), Authentication (all methods, extensive keys), Global Configuration (all settings detailed), Core API Functions (every function, detailed parameters, extensive examples), the Router (all parameters, strategies, management, inspection), Exceptions (comprehensive list), Cost Calculation (detailed methods), Client Budget Management (all methods), Utilities (every utility function detailed), and Constants.

(Note: While exhaustive, this guide reflects the state of LiteLLM at the time of writing. For the absolute latest function signatures, parameter nuances across providers, and potential new features, always refer to the official LiteLLM source code, inline docstrings (help(litellm.<function_name>)), and the official LiteLLM documentation website if available.)

Table of Contents

  1. Installation
    • Using Pip (Recommended)
    • Virtual Environments (Best Practice)
    • Optional Dependencies (Comprehensive List & Use Cases)
    • Installing from Source (For Contributors / Latest Dev)
    • Verification
  2. Quickstart: Comprehensive First Calls (Multi-Provider, Sync/Async, Detailed Output)
  3. Authentication: API Keys & Endpoints
    • Method 1: Environment Variables (Recommended, Exhaustive List)
    • Method 2: Setting Credentials in Code (litellm.*, Exhaustive List)
    • Method 3: Per-Call Credentials (Detailed Examples)
    • Priority Order & Resolution Logic
  4. Global Configuration (litellm.*)
    • API Keys & Endpoints (Cross-reference)
    • Callbacks & Logging (Input/Success/Failure Hooks, Built-in Integrations, turn_off_message_logging, log_raw_request_response, Logging Levels)
    • Retries & Timeouts (Detailed Parameters & Behavior)
    • Fallbacks (General, Context Window, Content Policy - Detailed Configuration)
    • Caching (litellm.Cache, All Backends: Local, Redis, Redis Semantic, Disk, S3 - Detailed Setup & Usage)
    • Model & Prompt Handling (Aliases, register_model, register_prompt_template, drop_params)
    • Global Budget Limits (litellm.max_budget, litellm._current_cost)
  5. Core API Functions (litellm.*)
    • Chat Completions (completion / acompletion)
      • Detailed Parameters (All common & less common args explained)
      • Return Object (ModelResponse, CustomStreamWrapper) Detailed Structure
      • Examples (Basic, Streaming, Multimodal/Vision, JSON Mode, Tools - Parallel & Sequential, Error Simulation)
    • Embeddings (embedding / aembedding)
      • Detailed Parameters (dimensions, encoding_format, input_type, etc.)
      • Return Object (EmbeddingResponse) Detailed Structure
      • Examples (OpenAI v2/v3, Cohere, Azure)
    • Image Generation (image_generation / aimage_generation)
      • Detailed Parameters (n, size, quality, style, response_format, SD params)
      • Return Object (ImageResponse) Detailed Structure
      • Examples (DALL-E 2/3, Bedrock SDXL, b64_json handling)
    • Audio Transcription (transcription / atranscription)
      • Detailed Parameters (language, prompt, response_format, temperature, timestamp_granularities)
      • Return Object (TranscriptionResponse) Detailed Structure (All formats)
      • Examples (Text, Verbose JSON with Timestamps)
    • Text-to-Speech (speech / aspeech)
      • Detailed Parameters (voice, response_format, speed)
      • Return Object (HttpxBinaryResponseContent) Detailed Usage (stream_to_file, read)
      • Example (Saving to File, Reading Bytes)
    • Content Moderation (moderation / amoderation)
      • Detailed Parameters (model)
      • Return Object (OpenAI Moderation Object) Detailed Structure
      • Example (Checking Multiple Inputs)
    • Text Completions (Legacy) (text_completion / atext_completion)
      • Detailed Parameters (max_tokens, stop, suffix, logprobs, echo, etc.)
      • Return Object (TextCompletionResponse, Stream Wrapper) Detailed Structure
      • Example (Instruct Model, Streaming)
    • Adapter Completions (adapter_completion / aadapter_completion)
      • Concept Explanation (Request/Response Translation)
      • Adapter Registration
      • Detailed Example (Custom Input/Output Schemas, Adapter Implementation)
    • Batch Completions (batch_completion / abatch_completion)
      • Detailed Parameters (requests, max_concurrent_requests, use_threadpool)
      • Return Value (List of Responses/Exceptions) Handling
      • Detailed Example (Processing Results)
    • Reranking (rerank / arerank)
      • Detailed Parameters (query, documents, top_n, rank_fields, return_documents)
      • Return Object (RerankResponse) Detailed Structure (results, meta)
      • Detailed Example (RAG Context Improvement)
    • OpenAI API Pass-through Functions (Files, Fine-tuning, Batch, Assistants - Concept & Conceptual Examples)
    • Health Checks (health_check / ahealth_check)
      • Detailed Parameters (mode)
      • Return Object (HealthCheckResponse) Structure
      • Detailed Example (Checking Multiple Endpoints)
  6. Router (litellm.Router)
    • Benefits of Using the Router (Detailed)
    • Initialization (__init__) - Exhaustive Parameters (model_list structure, All Routing Strategies Explained, Redis Params, Caching, Retries, Timeouts, Fallbacks, Cooldowns, Aliases)
    • Core Router Methods (completion, embedding, etc.) - Parameter Overrides (specific_deployment), Response Inspection (_hidden_params), Multi-Method Examples (Chat, Embedding, Streaming)
    • Deployment Management (Detailed add, update, delete, upsert, set_model_list Examples)
    • Getting Router Information (Detailed Examples: get_model_list, get_model_names, get_model_ids, get_deployment, get_model_group_info, get_available_deployments, get_model_group_usage, get_settings)
    • Advanced Router Features (Detailed flush_cache, reset, set_custom_routing_strategy Examples)
    • Router and Assistants API (Detailed Explanation)
  7. Handling Exceptions (litellm.exceptions.*)
    • Exception Hierarchy Overview
    • Detailed Descriptions of All Common & Specific Exceptions (including attributes like llm_provider, model, status_code)
    • Comprehensive Handling Example (Nested try...except, logging details)
  8. Cost Calculation (litellm.*)
    • litellm.completion_cost() (Exhaustive Parameter Details & Logic)
    • litellm.cost_per_token() (Detailed Explanation & Use Cases)
    • litellm.response_cost_calculator() (Internal Utility Explanation)
    • Comprehensive Examples (Response Objects, Explicit Tokens/Text, Embeddings, Custom Pricing)
  9. Client-Side Budget Management (litellm.BudgetManager)
    • Distinction from LiteLLM Proxy Budgets (Reinforced)
    • Initialization (__init__) Detailed Parameters
    • Exhaustive Method Descriptions (create_budget, update_cost, get_current_cost, get_total_budget, check_cost_and_update, projected_cost, is_valid_user, reset_cost, reset_on_duration, update_budget_all_users, get_users, get_model_cost, save_data)
    • Exhaustive Budget Management Example (Duration, Resets, Checks, Multiple Users)
  10. Utilities (litellm.utils.*)
    • Tokenizer Utilities (token_counter details, encode, decode, create_pretrained_tokenizer, create_tokenizer, openai_token_counter - Detailed Examples for each)
    • Model Information & Capability Checks (get_model_info details, get_max_tokens, All supports_X functions listed & explained, is_prompt_caching_valid_prompt example)
    • Parameter Handling & Validation (get_optional_params_* internal explanation, validate_environment detailed example, check_valid_key detailed example & warning)
    • Configuration & Registration (register_model detailed example & URL loading, register_prompt_template detailed example, read_config_args example, get_provider_fields example)
    • Testing & Mocking Utilities (load_test_model detailed example & warning, mock_completion_streaming_obj & async_mock_completion_streaming_obj detailed examples)
    • Miscellaneous Utilities (trim_messages detailed example, get_valid_models detailed example & warning, function_to_dict detailed example & reqs, return_raw_request detailed example & beta note, get_utc_datetime example)
  11. Reference: Constants (litellm.*) (More constants listed)

1. Installation

Ensure LiteLLM is correctly installed in your Python environment.

Using Pip (Recommended)

Install the latest stable version from the Python Package Index (PyPI):

pip install litellm

This installs the core library. For additional features, see Optional Dependencies.

Virtual Environments (Best Practice)

Isolating project dependencies is crucial. Use a virtual environment:

# 1. Create a virtual environment (e.g., in project folder)
python -m venv .venv # Or python3 -m venv .venv

# 2. Activate the environment
#    macOS/Linux (bash/zsh): source .venv/bin/activate
#    Windows (CMD):          .venv\Scripts\activate.bat
#    Windows (PowerShell):   .venv\Scripts\Activate.ps1
# Your terminal prompt should now indicate the active environment (e.g., '(.venv) your-prompt$')

# 3. Install within the activated environment
pip install litellm

Optional Dependencies (Comprehensive List & Use Cases)

Install these extras using pip install litellm[extra1,extra2,...] based on your needs:

  • redis: Required for Redis-backed caching (litellm.Cache(type="redis")) and stateful Router strategies (least-busy, latency-based-routing, usage-based-routing, persistent cooldowns). Needs a running Redis server.
  • numpydoc: Required only for the litellm.utils.function_to_dict utility, which converts Python functions with NumPy-style docstrings to OpenAI tool schemas.
  • boto3: Required for interacting with AWS Bedrock models and for using AWS S3 as a caching backend (litellm.Cache(type="s3")). Requires AWS credentials to be configured (typically via environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION_NAME or IAM roles).
  • vertex: Required for interacting with Google Vertex AI models (including Gemini on Vertex). Requires GCP authentication configured (API Key via GOOGLE_API_KEY or, preferably, Application Default Credentials setup via gcloud). Also needs VERTEX_PROJECT and VERTEX_LOCATION environment variables.
  • huggingface: Primarily needed for using the tokenizers library via litellm.utils.create_pretrained_tokenizer for loading custom Hugging Face tokenizers. May also be needed for specific direct HF model integrations if not using standard inference endpoints.
  • diskcache: Required for using local disk-based caching (litellm.Cache(type="disk")). Persists cache between script runs.
  • Observability/Logging Integrations: Install the relevant extra to enable logging to these platforms using simple string identifiers in litellm.success_callback / litellm.failure_callback. Requires setting corresponding API keys/endpoints as environment variables.
    • langfuse: Logs to Langfuse. (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST)
    • langsmith: Logs to LangSmith. (LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2=true, LANGCHAIN_PROJECT, LANGCHAIN_ENDPOINT)
    • traceloop: Logs to Traceloop/OpenLLMetry. (TRACELOOP_API_KEY, TRACELOOP_APP_NAME)
    • helicone: Logs to Helicone. (HELICONE_API_KEY)
    • promptlayer: Logs to PromptLayer. (PROMPTLAYER_API_KEY)
    • athina: Logs to Athina. (ATHINA_API_KEY)
    • lunary: Logs to Lunary. (LUNARY_APP_ID or LUNARY_PUBLIC_KEY)
    • supabase: For using Supabase as a logging or caching backend. Requires Supabase credentials (SUPABASE_URL, SUPABASE_KEY).
  • proxy: Installs dependencies often useful when interacting with the LiteLLM Proxy Server from the SDK, though core SDK usage doesn't strictly require it. Check Proxy documentation if needed.
  • test: Installs dependencies required to run LiteLLM's own test suite (pytest, pytest-asyncio, requests-mock, etc.). Useful for contributors.
  • dev: Installs development dependencies (linters like ruff, formatters like black, pre-commit, etc.). Useful for contributors.

Installing from Source (For Contributors / Latest Dev)

Use this method if you need the bleeding-edge version or plan to contribute code changes.

# 1. Clone the LiteLLM repository
git clone https://github.com/BerriAI/litellm.git
cd litellm

# 2. Create and activate a virtual environment (strongly recommended)
python -m venv .venv
source .venv/bin/activate # Or Windows equivalent

# 3. Install in editable mode (-e). This links the installed package to your source code.
#    This command installs core dependencies defined in pyproject.toml.
pip install -e .

# 4. Optionally, install extras needed for development or testing
#    You can install multiple extras. This example installs test, dev, redis, and boto3 extras.
pip install -e ".[test,dev,redis,boto3]"

Editable mode ensures that any changes you make to the local source code files are immediately reflected when you import litellm in Python, without needing to reinstall.

Verification

After installation (from pip or source), confirm LiteLLM is correctly installed and accessible in your Python environment:

python -c "import litellm; print(f'LiteLLM version: {litellm.__version__}')"
# Expected output: LiteLLM version: <your installed version number>

If this command runs without an ImportError or other exceptions, the installation is successful.

2. Quickstart: Comprehensive First Calls

This example provides a more detailed demonstration of LiteLLM's core capability: calling diverse LLM providers using a consistent interface. It includes synchronous calls to multiple providers, shows how to access response content and usage, includes basic timing, and provides a separate asynchronous example.

Prerequisites:

  • LiteLLM installed (pip install litellm).
  • Relevant optional dependencies installed if needed (e.g., litellm[vertex] for the Vertex AI example).
  • API Keys and necessary Endpoint information configured via environment variables (refer to the Authentication section for detailed variable names).
import os
import litellm
import asyncio
import time
import traceback
from typing import List, Dict, Any

# --- Ensure Environment Variables Are Set (Example Reminder) ---
# Check Authentication section for required variables like:
# OPENAI_API_KEY, ANTHROPIC_API_KEY,
# GOOGLE_APPLICATION_CREDENTIALS / GOOGLE_API_KEY, VERTEX_PROJECT, VERTEX_LOCATION,
# OLLAMA_API_BASE, GROQ_API_KEY

# --- Common Input ---
common_messages: List[Dict[str, str]] = [
    {"role": "system", "content": "You are a concise and helpful assistant."},
    {"role": "user", "content": "Explain the concept of vector databases in simple terms (max 3 sentences)."}
]

# --- Helper function for structured output ---
def process_and_print_response(provider: str, response_or_exception: Any, start_time: float):
    """Formats and prints the result of an API call."""
    duration = time.time() - start_time
    print(f"--- {provider} ({duration:.3f}s) ---")
    if isinstance(response_or_exception, litellm.ModelResponse):
        response = response_or_exception
        if response.choices and response.choices[0].message and response.choices[0].message.content:
            print("Content:", response.choices[0].message.content.strip())
            if response.usage:
                usage_str = f"Prompt={response.usage.prompt_tokens}, Completion={response.usage.completion_tokens}, Total={response.usage.total_tokens}"
                print(f"[Usage: {usage_str}]")
            # Check if cost calculation was added (e.g., by a callback)
            cost = response._hidden_params.get("cost")
            if cost is not None:
                print(f"[Cost: ${cost:.6f}]")
            if response.choices[0].finish_reason:
                print(f"[Finish Reason: {response.choices[0].finish_reason}]")
        else:
            print("ERROR: Received ModelResponse but no valid content found.")
            # print("DEBUG: Response Object:", response) # Uncomment for debugging
    elif isinstance(response_or_exception, Exception):
        error = response_or_exception
        print(f"FAILED: {type(error).__name__}")
        print(f"  Error Details: {str(error)}")
        # Optionally print more exception details if available
        if hasattr(error, 'llm_provider'): print(f"  Provider Context: {error.llm_provider}") # type: ignore
        if hasattr(error, 'model'): print(f"  Model Context: {error.model}") # type: ignore
    else:
        print(f"FAILED: Received unexpected result type: {type(response_or_exception)}")
        print(f"  Result: {response_or_exception}")
    print("-" * (len(provider) + 12))

# --- Synchronous API Calls ---
print(">>> Running Synchronous Examples <<<")
sync_providers_to_test = {
    "OpenAI GPT-3.5 Turbo": "gpt-3.5-turbo",
    "Anthropic Claude 3 Haiku": "claude-3-haiku-20240307",
    "Google Vertex AI Gemini Pro": "vertex_ai/gemini-pro",
    "Local Ollama Llama 3": "ollama/llama3"
}

sync_results = {}
for name, model_id in sync_providers_to_test.items():
    print(f"\nAttempting call to {name}...")
    start_sync = time.time()
    try:
        # Make the synchronous call
        response = litellm.completion(
            model=model_id,
            messages=common_messages,
            max_tokens=120,
            temperature=0.3,
            # Example of adding metadata for tracking/callbacks
            metadata={"call_type": "quickstart_sync", "provider_target": name}
        )
        sync_results[name] = response
    except Exception as e:
        # Catch potential errors during the call itself
        sync_results[name] = e
    # Process and print the result (or error) using the helper
    process_and_print_response(name, sync_results[name], start_sync)

# --- Asynchronous API Call Example ---
async def run_async_example():
    print("\n>>> Running Asynchronous Example (Groq Llama3 8b) <<<")
    # Requires GROQ_API_KEY environment variable
    async_model = "groq/llama3-8b-8192"
    print(f"\nAttempting async call to {async_model}...")
    start_async = time.time()
    response_async = None
    try:
        # Make the asynchronous call using 'acompletion'
        response_async = await litellm.acompletion(
            model=async_model,
            messages=common_messages,
            max_tokens=120,
            temperature=0.3,
            metadata={"call_type": "quickstart_async"}
        )
    except Exception as e:
        response_async = e
    # Process and print the result (or error)
    process_and_print_response(f"Async {async_model}", response_async, start_async)

# --- Run the Asynchronous Example ---
# Ensure GROQ_API_KEY is set in your environment before uncommenting
# print("\nStarting Async Execution...")
# try:
#     asyncio.run(run_async_example())
# except RuntimeError as e:
#     # Handle cases where asyncio event loop is already running (e.g., in Jupyter)
#     if "cannot run event loop while another loop is running" in str(e):
#         print("Note: Cannot run top-level asyncio.run in this environment (e.g., Jupyter).")
#         # Optionally, you could try nest_asyncio if needed for specific environments
#     else:
#         print(f"Error running async example: {e}")
#         traceback.print_exc()
# except Exception as main_async_e:
#     print(f"Error executing async function: {main_async_e}")
#     traceback.print_exc()

print("\nQuickstart Examples Complete.")

3. Authentication: API Keys & Endpoints

Securely providing credentials and specifying the correct API endpoints is fundamental to using LiteLLM with various providers. LiteLLM offers flexible methods to handle this.

Method 1: Environment Variables (Recommended)

Using environment variables is the most secure, flexible, and standard approach, especially for production systems and CI/CD pipelines. LiteLLM automatically detects and utilizes keys and endpoints set in the environment.

Setting Environment Variables:

  • Mechanism: Use your OS's standard method (shell export, Windows System Properties, Docker environment variables, secrets management tools, .env files).
  • Security: Never commit API keys or secrets directly into version control (like Git). Use .gitignore for .env files. Prefer managed secrets solutions for production.
  • Common Variable Names: (This is an extensive but not necessarily exhaustive list; always check specific provider documentation if unsure)

    • OpenAI: OPENAI_API_KEY, OPENAI_API_BASE (for proxies/compatible endpoints), OPENAI_ORGANIZATION (optional)
    • Azure OpenAI: AZURE_API_KEY, AZURE_API_BASE (e.g., https://<your-resource-name>.openai.azure.com/), AZURE_API_VERSION (e.g., 2024-02-15-preview). Also AZURE_AD_TOKEN if using Azure AD auth (litellm.use_azure_ad_token=True).
    • Anthropic: ANTHROPIC_API_KEY, ANTHROPIC_API_BASE (optional override)
    • Cohere: COHERE_API_KEY, COHERE_API_BASE (optional override)
    • Google (Vertex AI & AI Platform):
      • GOOGLE_API_KEY (for API Key based auth)
      • OR configure Application Default Credentials (ADC) via gcloud auth application-default login. LiteLLM checks ADC if GOOGLE_API_KEY is not set.
      • VERTEX_PROJECT (Your GCP Project ID - Required)
      • VERTEX_LOCATION (Your GCP Region, e.g., us-central1 - Required)
      • VERTEX_SERVICE_ACCOUNT_KEY_PATH (Optional path to service account JSON)
    • AWS Bedrock: Standard AWS SDK credentials mechanism:
      • AWS_ACCESS_KEY_ID
      • AWS_SECRET_ACCESS_KEY
      • AWS_SESSION_TOKEN (optional, for temporary credentials)
      • AWS_REGION_NAME (e.g., us-east-1 - Required)
      • OR IAM Role attached to the execution environment (e.g., EC2 instance, Lambda function).
    • Ollama: OLLAMA_API_BASE (e.g., http://localhost:11434)
    • Hugging Face: HUGGINGFACE_API_KEY (for Inference Endpoints, Hub models)
    • Mistral AI: MISTRAL_API_KEY
    • Together AI: TOGETHER_AI_KEY
    • Groq: GROQ_API_KEY
    • Replicate: REPLICATE_API_KEY
    • Perplexity AI: PERPLEXITY_API_KEY
    • OpenRouter: OPENROUTER_API_KEY, OPENROUTER_API_BASE (optional)
    • AI21: AI21_API_KEY
    • NLP Cloud: NLP_CLOUD_API_KEY
    • Aleph Alpha: ALEPH_ALPHA_API_KEY
    • Petals: PETALS_API_KEY (optional)
    • Baseten: BASETEN_API_KEY
    • DeepInfra: DEEPINFRA_API_KEY
    • VLLM: VLLM_API_BASE (for self-hosted VLLM instances)
    • ... (Check LiteLLM provider documentation for others)
  • Using .env Files: (Recommended for local/dev)

    1. pip install python-dotenv
    2. Create .env file in your project root (add to .gitignore).
    3. Add KEY=VALUE pairs, e.g., OPENAI_API_KEY="sk-..."
    4. Load at script start:

      from dotenv import load_dotenv
      load_dotenv()
      

LiteLLM Usage: No extra code needed. LiteLLM automatically checks os.environ when you make calls like litellm.completion(model="...", ...) based on the model string's inferred provider.
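
For reference, a minimal sketch that relies purely on environment variables (it assumes OPENAI_API_KEY is already set, optionally loaded from a .env file via python-dotenv):

import os
import litellm
from dotenv import load_dotenv  # optional; only needed if you keep keys in a .env file

load_dotenv()  # no-op if there is no .env file in the current directory

# No api_key argument below: LiteLLM infers the provider ("openai") from the
# model string and reads OPENAI_API_KEY from os.environ automatically.
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this snippet"

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)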

Method 2: Setting Credentials in Code (litellm.*)

Assign credential values directly to attributes of the litellm module. Use with extreme caution for secret keys. Better for non-secrets or controlled server environments.

  • Attribute Names (Extensive List):
    • litellm.api_key (Generic fallback)
    • litellm.openai_key
    • litellm.azure_key
    • litellm.anthropic_key
    • litellm.cohere_key
    • litellm.replicate_key
    • litellm.huggingface_key
    • litellm.together_ai_key
    • litellm.groq_key
    • litellm.mistral_api_key
    • litellm.openrouter_key
    • litellm.perplexity_key
    • litellm.petals_key
    • litellm.baseten_key
    • litellm.nlp_cloud_key
    • litellm.ai21_key
    • litellm.maritalk_key
    • litellm.deepinfra_key
    • litellm.aleph_alpha_key
    • litellm.api_base (Default base URL for all providers, often overridden)
    • litellm.api_version (Default API version, mainly Azure)
    • litellm.organization (OpenAI Org ID)
    • litellm.vertex_project, litellm.vertex_location, litellm.vertex_credentials (Path or Credentials object)
    • litellm.google_api_key
    • litellm.aws_access_key_id, litellm.aws_secret_access_key, litellm.aws_session_token, litellm.aws_region_name (Bedrock overrides)
    • litellm.use_azure_ad_token (bool, set to True to use Azure AD token - requires token to be passed via azure_ad_token param in call or specific setup)
    • (May be others for specific/newer providers - check source)

Example:

import litellm

# Set default Azure endpoint info (safer than setting the key)
litellm.api_base = "https://my-prod-resource.openai.azure.com/"
litellm.api_version = "2024-02-01"
print("Set global default Azure api_base and api_version.")

# Maybe set Vertex project/location if always using the same one
litellm.vertex_project = "my-gcp-project"
litellm.vertex_location = "europe-west2"
print("Set global default Vertex project and location.")

Method 3: Per-Call Credentials

Pass credentials as arguments directly to the LiteLLM function (completion, embedding, etc.). This offers the finest control and highest priority.

Example:

import litellm
import os

# Assuming base OpenAI key is in env var for most calls
# But make one call using a specific key and endpoint (e.g., testing a proxy)
response_override = litellm.completion(
    model="gpt-4", # Model name doesn't need provider prefix if OpenAI is default/intended
    messages=[{"role": "user", "content": "Test override"}],
    api_key=os.getenv("SPECIFIC_TEST_KEY", "sk-dummy"), # Use specific key
    api_base=os.getenv("SPECIFIC_ENDPOINT", "http://localhost:8000/v1"), # Target specific endpoint
    timeout=30 # Specific timeout for this call
)
print("Made call with per-call credential overrides.")

Priority Order

LiteLLM resolves credentials and endpoints by checking in this order:

  1. Per-Call Arguments: (e.g., api_key passed to litellm.completion) - Highest Priority
  2. Code Configuration: (e.g., litellm.openai_key, litellm.api_base)
  3. Environment Variables: (e.g., os.getenv("OPENAI_API_KEY")) - Lowest Priority

The first non-None value found in this sequence is used for the specific parameter (api_key, api_base, etc.) for that API call.
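
A small sketch illustrating this precedence (all key values below are placeholders, not real credentials; mock_response is used so no real API call is made):

import os
import litellm

os.environ["OPENAI_API_KEY"] = "sk-env-level-key"   # 3. lowest priority
litellm.openai_key = "sk-module-level-key"          # 2. middle priority

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which key is used?"}],
    api_key="sk-per-call-key",                      # 1. highest priority - this key would be sent
    mock_response="ok",                             # skip the real (and failing) API call in this sketch
)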

4. Global Configuration (litellm.*)

Customize LiteLLM's default behavior globally by setting attributes on the litellm module. These settings apply to all subsequent calls unless overridden on a per-call basis or by a litellm.Router instance.

API Keys & Endpoints (Cross-reference)

As detailed in the Authentication section, you can set default provider keys (litellm.openai_key, litellm.azure_key, etc.), a generic fallback key (litellm.api_key), a default base URL (litellm.api_base), default API version (litellm.api_version), and provider-specific configurations (like litellm.vertex_project).

Callbacks & Logging

Register functions or integrate with platforms to monitor, log, or modify LLM calls.

  • Callback Hooks: Define lists of functions or string identifiers for built-in integrations.
    • litellm.input_callback: List[Callable | str] - Executed before the API call. Receives kwargs (dict of call parameters). Can modify kwargs in-place. Use cases: logging inputs, dynamic parameter injection, input validation/redaction, setting trace IDs.
    • litellm.success_callback: List[Callable | str] - Executed after a successful API call. Receives kwargs (original call kwargs, potentially updated by input callbacks or internal logic like cost calculation), response_obj (the successful ModelResponse, EmbeddingResponse, etc.), start_time (datetime), end_time (datetime). Use cases: logging success details, cost, latency; triggering downstream actions; storing results.
    • litellm.failure_callback: List[Callable | str] - Executed after a failed API call (exception raised). Receives kwargs, response_obj (the Exception instance), start_time, end_time. Use cases: logging errors, sending alerts, analyzing failure patterns.
    • litellm.callbacks: List[Callable | str] - Convenience to add a callback to all three lists simultaneously.
  • Built-in Integrations (String Identifiers): LiteLLM supports numerous platforms directly. Add the platform name as a string to the callback lists (e.g., litellm.success_callback = ["helicone", my_func]). Requires relevant environment variables (API keys/endpoints for the platform) and potentially installing extras (e.g., litellm[langfuse]). Supported platforms include:
    • helicone, promptlayer, langfuse, langsmith, traceloop (OpenLLMetry), athina, lunary, supabase, weights_biases (W&B), dynamodb. (Check LiteLLM docs for the latest list and required env vars).
  • Logging Control:
    • litellm.turn_off_message_logging: bool = False - If True, message content (messages, input, prompt) is not passed to any callback's kwargs. kwargs["messages"] will be None or removed. Crucial for privacy/compliance if callbacks log externally.
    • litellm.log_raw_request_response: bool = False - If True, LiteLLM attempts to capture the raw HTTP request (URL, method, headers, body) and response (status, headers, body) and attaches them to an internal logging object accessible within callbacks via kwargs["litellm_logging_obj"].collected_data. This is primarily for deep debugging and may impact performance slightly. Callbacks need to be written to access this specific structure.
    • os.environ['LITELLM_LOG'] = "LEVEL" - Controls the verbosity of LiteLLM's internal logging messages printed to stderr. Levels: DEBUG (very verbose, includes request/response snippets), INFO (standard informational messages), WARNING (potential issues), ERROR (only errors). Recommended way to manage internal log level.

Example: Comprehensive Callback Setup

import litellm
import datetime
import time
import os
import logging
import uuid
import json
from typing import List, Dict, Any, Optional

# Configure standard Python logging for our callbacks
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
callback_logger = logging.getLogger("MyLiteLLMCallbacks")

# --- Callback Functions ---
def enrich_and_log_input(kwargs: dict):
    """Input callback: Adds trace ID, logs basic info."""
    trace_id = str(uuid.uuid4())[:8] # Shorter trace ID
    kwargs.setdefault('metadata', {})['trace_id'] = trace_id
    user = kwargs.get('metadata', {}).get('user_id', 'anon')
    model = kwargs.get('model', 'unknown')
    call_id = kwargs.get('litellm_call_id', 'unknown') # LiteLLM attaches a unique id to each call

    # Log basic info (INFO level)
    callback_logger.info(f"[Input] Trace={trace_id} User={user} CallID={call_id} Model={model}")

    # Log detailed input only if DEBUG level is enabled for callbacks
    if callback_logger.isEnabledFor(logging.DEBUG):
        # Be careful logging full messages in production!
        messages_preview = "REDACTED (message logging off)"
        if not litellm.turn_off_message_logging and 'messages' in kwargs:
             try:
                 messages_preview = json.dumps([{"role": m.get("role"), "content_preview": str(m.get("content"))[:50]+"..."} for m in kwargs['messages']])
             except: messages_preview = "[Error serializing messages]"
        callback_logger.debug(f"  Input Kwargs (Partial): {{'model': '{model}', 'stream': {kwargs.get('stream')}, 'temperature': {kwargs.get('temperature')}, 'messages': {messages_preview}}}")

def log_success(kwargs: dict, response_obj: Any, start_time: datetime.datetime, end_time: datetime.datetime):
    """Success callback: Logs result details, cost, latency."""
    duration_ms = (end_time - start_time).total_seconds() * 1000
    metadata = kwargs.get('metadata', {})
    trace_id = metadata.get('trace_id', 'N/A')
    user = metadata.get('user_id', 'anon')
    model_req = kwargs.get('model', 'unknown')
    model_resp = getattr(response_obj, 'model', model_req) # Model actually used
    cost = kwargs.get('response_cost', None) # Populated by LiteLLM's cost logger

    base_log = f"[Success] Trace={trace_id} User={user} ModelReq={model_req} ModelResp={model_resp} Latency={duration_ms:.0f}ms"
    if cost is not None: base_log += f" Cost=${cost:.6f}"

    # Extract usage info if available
    if hasattr(response_obj, 'usage') and response_obj.usage:
        usage = response_obj.usage
        base_log += f" Usage(P/C/T)={usage.prompt_tokens}/{usage.completion_tokens}/{usage.total_tokens}"

    # Extract finish reason if available
    finish_reason = "N/A"
    if hasattr(response_obj, 'choices') and response_obj.choices:
        finish_reason = getattr(response_obj.choices[0], 'finish_reason', 'N/A')
    base_log += f" FinishReason={finish_reason}"

    callback_logger.info(base_log)

    # Debug log raw request/response if enabled and collected
    if litellm.log_raw_request_response:
         log_obj = kwargs.get("litellm_logging_obj")
         if log_obj and hasattr(log_obj, 'collected_data'):
             # Log selectively, bodies can be huge
             raw_req = log_obj.collected_data.get('raw_request', {})
             raw_resp = log_obj.collected_data.get('raw_response', {})
             callback_logger.debug(f"  Raw Req: URL={raw_req.get('url')} Headers={list(raw_req.get('headers',{}).keys())} Body Snippet={str(raw_req.get('body'))[:100]}...")
             callback_logger.debug(f"  Raw Resp: Status={raw_resp.get('status_code')} Headers={list(raw_resp.get('headers',{}).keys())} Body Snippet={str(raw_resp.get('body'))[:100]}...")


def log_failure(kwargs: dict, response_obj: Any, start_time: datetime.datetime, end_time: datetime.datetime):
    """Failure callback: Logs error details."""
    duration_ms = (end_time - start_time).total_seconds() * 1000
    metadata = kwargs.get('metadata', {})
    trace_id = metadata.get('trace_id', 'N/A')
    user = metadata.get('user_id', 'anon')
    model_req = kwargs.get('model', 'unknown')
    error = response_obj # Exception object

    model_ctx = getattr(error, 'model', model_req) # Get model from error if available
    provider_ctx = getattr(error, 'llm_provider', 'unknown')
    status_code = getattr(error, 'status_code', 'N/A')
    error_type = type(error).__name__

    callback_logger.error(f"[Failure] Trace={trace_id} User={user} ModelReq={model_req} ModelCtx={model_ctx} Provider={provider_ctx} Latency={duration_ms:.0f}ms Status={status_code} Error={error_type}: {error}")

# --- Configure LiteLLM ---
print("Configuring detailed callbacks...")
litellm.input_callback = [enrich_and_log_input]
litellm.success_callback = [log_success] # Add built-in cost logger if needed: litellm.success_callback = ["cost", log_success]
litellm.failure_callback = [log_failure]
# litellm.turn_off_message_logging = True # Example: Turn off message content logging
litellm.log_raw_request_response = False # Example: Keep raw logging off by default
os.environ['LITELLM_LOG'] = 'INFO' # Control LiteLLM's internal logs

# --- Test Call ---
print("\nMaking test call with callbacks enabled...")
try:
    litellm.completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Test"}], metadata={"user_id": "cb_user"})
except Exception: pass # Let failure callback handle logging

try:
    litellm.completion(model="bad-model-name", messages=[{"role": "user", "content": "Test failure"}], metadata={"user_id": "cb_user_fail"})
except Exception: pass # Let failure callback handle logging

Managing Retries & Timeouts

Control LiteLLM's resilience to transient errors and slow responses.

  • litellm.num_retries: int = 3 (Default): Sets the default maximum number of times LiteLLM will retry a failed API call for specific retryable errors.
    • Retryable Errors Typically Include: Timeout (HTTP 408), RateLimitError (HTTP 429), APIConnectionError, ServiceUnavailableError (HTTP 503), InternalServerError (HTTP 500).
    • Retries usually implement exponential backoff with jitter internally.
    • Set to 0 to disable automatic retries globally. Can be overridden per-call (num_retries=...).
  • litellm.request_timeout: int = 600 (Default): Sets the default maximum time in seconds LiteLLM will wait for a response from the API provider for a single attempt before raising a litellm.exceptions.Timeout error.
    • This applies to each individual attempt, including retries.
    • Can be overridden per-call (timeout=...).
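
A short sketch combining the global defaults with per-call overrides (the model name is just an example):

import litellm

# Global defaults: retry transient failures twice, wait at most 60s per attempt
litellm.num_retries = 2
litellm.request_timeout = 60

try:
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Quick test"}],
        num_retries=0,   # per-call override: fail fast for this request
        timeout=10,      # per-call override: 10-second limit per attempt
    )
except litellm.exceptions.Timeout as e:
    print(f"Request timed out: {e}")
except litellm.exceptions.RateLimitError as e:
    print(f"Rate limited after retries: {e}")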

Fallbacks

Define strategies to automatically switch to backup models or providers if initial attempts fail.

  • litellm.fallbacks: List[Dict[str, List[str]]] = [] (Default): A list defining general fallback rules. If a call to a model matching a key in the dictionary fails (after exhausting num_retries), LiteLLM will attempt calls to the models/groups listed in the value list, in order.
    • Format: [{"model_or_group_to_fallback_from": ["first_fallback_model", "second_fallback_group", ...]}, {"*": ["default_fallback_if_all_else_fails"]}]
    • The key "*" acts as a catch-all default fallback for any model not explicitly listed as a key.
    • Used for any failure type that isn't explicitly handled by context window or content policy fallbacks.
  • litellm.context_window_fallbacks: List[Dict[str, List[str]]] = [] (Default): Same format as fallbacks, but this list is consulted only when a litellm.exceptions.ContextWindowExceededError occurs.
    • Use Case: Automatically route a request that failed due to prompt length to a model known to have a larger context window (e.g., gpt-4 -> gpt-4-turbo).
  • litellm.content_policy_fallbacks: List[Dict[str, List[str]]] = [] (Default): Same format, but consulted only when a litellm.exceptions.ContentPolicyViolationError occurs.
    • Use Case: Route potentially sensitive or problematic content (that got blocked by the primary model) to a different model, potentially one with different filtering levels or for logging/review, instead of just failing.

Example: Advanced Fallback Configuration

import litellm

litellm.num_retries = 1 # Retry primary model once

# General Fallbacks
litellm.fallbacks = [
    # If gpt-4-turbo fails (non-context/policy), try Opus, then Sonnet
    {"gpt-4-turbo": ["claude-3-opus-20240229", "claude-3-sonnet-20240229"]},
    # If gemini fails, try gpt-3.5
    {"gemini/gemini-1.5-pro-latest": ["gpt-3.5-turbo"]},
    # Default fallback for any other model failure
    {"*": ["gpt-3.5-turbo"]}
]

# Context Window Fallbacks
litellm.context_window_fallbacks = [
    {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]},
    {"gpt-4": ["gpt-4-turbo"]},
    {"claude-3-haiku-20240307": ["claude-3-sonnet-20240229"]}, # Haiku -> Sonnet
    {"claude-3-sonnet-20240229": ["claude-3-opus-20240229"]} # Sonnet -> Opus
]

# Content Policy Fallbacks (Example: Route to a model known for stricter alignment for review)
# litellm.content_policy_fallbacks = [
#     {"*": ["gpt-4-safe-review-model"]} # Fictional model name
# ]

print("Configured num_retries=1, general fallbacks, and context window fallbacks.")

# Test call simulation (requires keys for potential fallbacks)
# try:
#     response = litellm.completion(
#         model="gpt-4-turbo", messages=[...],
#         mock_response=litellm.exceptions.RateLimitError("Simulate failure") # Trigger general fallback
#     )
#     # Check response.model or response._hidden_params['model'] to see which model was used
# except Exception as e:
#     print(f"Call failed even after fallbacks: {e}")

Enabling Caching

Store LLM responses to reduce latency and costs for identical requests. Requires installing backend libraries (e.g., litellm[redis], litellm[diskcache]).

  • Global Configuration: Assign a configured litellm.Cache object to litellm.cache.
    • litellm.Cache(type: str, ttl: Optional[int] = None, **kwargs)
      • type: "local" (non-persistent in-memory dict), "redis", "redis-semantic" (semantic similarity search, experimental, requires specific setup), "disk", "s3".
      • ttl: Default Time-To-Live (seconds) for cache entries.
      • **kwargs: Backend-specific parameters:
        • redis/redis-semantic: host, port, password, db.
        • disk: directory (path to store cache files).
        • s3: s3_bucket_name, s3_region_name (requires AWS credentials configured).
  • Per-Call Usage:
    • Enable caching for a call: litellm.completion(..., caching=True) (requires litellm.cache to be set globally).
    • Disable caching for a call: litellm.completion(..., cache={"no-cache": True}).
    • Set TTL per call: litellm.completion(..., cache={"ttl": 120}).
    • Semantic Caching (beta, type="redis-semantic"): litellm.completion(..., cache={"similarity_threshold": 0.95}) - returns cached response if input messages are semantically similar above the threshold.
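
For comparison with the disk example below, a hedged sketch of a Redis-backed cache (requires pip install litellm[redis], a reachable Redis server, and an OpenAI key for the sample call; the connection values are placeholders):

import litellm

# Assumes a Redis server is reachable at these (placeholder) settings
litellm.cache = litellm.Cache(
    type="redis",
    host="localhost",
    port=6379,
    password=None,   # set if your Redis instance requires auth
    ttl=600,         # default TTL of 10 minutes for cached entries
)

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is Redis caching?"}],
    caching=True,    # opt this call into the globally configured cache
)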

Example: Disk Cache Setup and Usage

import litellm
import time
import os
import shutil

# Requires: pip install litellm[diskcache]

cache_directory = "./litellm_disk_cache_example"
print(f"--- Configuring Disk Cache in '{cache_directory}' ---")
try:
    # Ensure directory exists
    os.makedirs(cache_directory, exist_ok=True)
    litellm.cache = litellm.Cache(
        type="disk",
        directory=cache_directory,
        ttl=300 # Cache entries valid for 5 minutes
    )
    print("Disk cache configured successfully.")

    # --- Test Cache ---
    if litellm.cache:
        model="gpt-3.5-turbo"
        messages=[{"role": "user", "content": "Tell me about disk caching."}]
        print("\nMaking first call (expect cache miss)...")
        start1 = time.time()
        resp1 = litellm.completion(model=model, messages=messages, caching=True)
        print(f" Duration 1: {time.time() - start1:.3f}s")

        print("\nMaking second call (expect cache hit)...")
        start2 = time.time()
        resp2 = litellm.completion(model=model, messages=messages, caching=True)
        print(f" Duration 2: {time.time() - start2:.3f}s (should be much faster)")
        assert resp1.choices[0].message.content == resp2.choices[0].message.content

        print("\nMaking third call (bypass cache)...")
        start3 = time.time()
        resp3 = litellm.completion(model=model, messages=messages, cache={"no-cache": True})
        print(f" Duration 3: {time.time() - start3:.3f}s (should be like first call)")

        print("\nMaking fourth call with different TTL...")
        start4 = time.time()
        resp4 = litellm.completion(model=model, messages=messages, cache={"ttl": 10}) # Cache only 10 sec
        print(f" Duration 4: {time.time() - start4:.3f}s")

    else:
         print("Cache configuration failed, skipping usage tests.")

except ImportError:
    print("\nError: 'diskcache' package not installed. Run: pip install litellm[diskcache]")
except Exception as e:
    print(f"\nAn error occurred with disk cache: {e}")
finally:
    litellm.cache = None # Disable cache
    # Clean up cache directory
    if os.path.exists(cache_directory):
         try:
             shutil.rmtree(cache_directory)
             print(f"\nCleaned up cache directory: {cache_directory}")
         except Exception as clean_e:
              print(f"Error cleaning up cache directory: {clean_e}")

Model & Prompt Handling Settings

Customize how LiteLLM identifies models and formats prompts for specific cases.

  • litellm.model_alias_map: Dict[str, str] = {}: Define short aliases for longer model names.

    litellm.model_alias_map = {"opus": "claude-3-opus-20240229"}
    # Now use litellm.completion(model="opus", ...)
    
  • litellm.register_model(model_cost: Union[str, dict]): Dynamically add or update model metadata (context window, cost, provider, features) in litellm.model_cost. Essential for custom models, fine-tunes, or overriding outdated defaults. See Utilities section for detailed example.

  • litellm.register_prompt_template(model: str, roles={}, ...): Define custom prompt structures (like Llama-2 [INST]) for specific model names used in litellm.completion. Overrides default OpenAI formatting. See Utilities section for detailed example.

  • litellm.drop_params: bool = False: If set to True, LiteLLM will silently ignore optional parameters passed to API functions (e.g., temperature, tools) that are not supported by the target model/provider, instead of raising an UnsupportedParamsError. Default (False) raises the error.
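
A brief sketch combining an alias with drop_params (assumes an Anthropic key is configured; seed is used here only as an example of a parameter the target provider may not support):

import litellm

# Short alias for a long model identifier
litellm.model_alias_map = {"opus": "claude-3-opus-20240229"}

# Silently drop optional params the target provider does not support,
# instead of raising UnsupportedParamsError
litellm.drop_params = True

response = litellm.completion(
    model="opus",                 # resolved via model_alias_map
    messages=[{"role": "user", "content": "Hello"}],
    seed=42,                      # dropped if the provider does not support it
)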

Global Budget Limits (litellm.max_budget)

A simple, in-process global spending limit (USD).

  • litellm.max_budget: float = 0.0: If set > 0, LiteLLM tracks cumulative estimated cost in litellm._current_cost. Before making a call, it checks if _current_cost + estimated_call_cost > max_budget. If so, raises BudgetExceededError.
  • litellm._current_cost: float = 0.0: Internal counter. Reset manually if needed (litellm._current_cost = 0.0).
  • Note: Less robust than BudgetManager or LiteLLM Proxy features. Only tracks cost within the current Python process lifetime.
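
A minimal sketch of the in-process budget check (the dollar amount is arbitrary; this assumes BudgetExceededError is exposed on the top-level litellm module, as referenced above):

import litellm

litellm.max_budget = 0.01        # cap estimated spend for this process at $0.01
litellm._current_cost = 0.0      # reset the internal counter (optional)

try:
    for i in range(100):
        litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Request {i}"}],
        )
except litellm.BudgetExceededError:
    print(f"Budget exceeded. Estimated spend so far: ${litellm._current_cost:.6f}")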



5. Core API Functions (litellm.*)

These are the primary functions provided by the litellm module for interacting with various LLM modalities. They offer a unified interface over diverse provider APIs. Each typically has a synchronous and an asynchronous (a prefix) version.

(For the most precise function signatures, parameter descriptions, and provider-specific nuances, refer to the LiteLLM source code or use help(litellm.<function_name>).)

Chat Completions (litellm.completion / acompletion)

  • Purpose: The cornerstone function for generative text tasks using chat-based models. It manages conversation history (messages), supports advanced features like tool use and streaming, handles various parameters, and routes requests to the appropriate provider based on the model identifier.
  • Detailed Parameters:
    • model (str): Required. The unique identifier string for the target model. LiteLLM uses this string to determine the provider, endpoint, and applicable API rules. Examples: "gpt-4-turbo", "openai/gpt-4o", "azure/<your-deployment-name>", "anthropic/claude-3-opus-20240229", "cohere/command-r-plus", "google/gemini-1.5-pro-latest", "vertex_ai/gemini-1.5-flash-001", "mistral/mistral-large-latest", "groq/llama3-70b-8192", "bedrock/anthropic.claude-3-sonnet-v1:0", "huggingface/meta-llama/Llama-3-8b-chat-hf", "ollama/llama3", "replicate/meta/llama-2-70b-chat:...".
    • messages (List[Dict[str, Any]]): Required. A list of message dictionaries representing the conversation history. Each dict needs:
      • "role" (str): "system", "user", "assistant", or "tool" (for tool results).
      • "content" (Union[str, List[Dict]]):
        • For text: A string containing the message content.
        • For multimodal (vision) input: A list containing dictionaries specifying content types. Example: [{"type": "text", "text": "Describe this image"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..." or "https://...", "detail": "auto"|"low"|"high"}}]. Check model support (litellm.utils.supports_vision).
      • "tool_calls" (List[Dict], optional): Present on assistant messages when the model decides to call tools. Contains information about the functions to be called.
      • "tool_call_id" (str, optional): Required for messages with role: "tool". Must match the id from the corresponding assistant tool_calls entry.
    • stream (bool, optional, default=False): If True, returns an iterator/generator yielding response chunks (ModelResponse objects with delta). If False, returns the complete ModelResponse object after generation finishes.
    • temperature (float, optional, default=Provider default, often ~0.7-1.0): Controls randomness. Lower values (~0.1-0.3) make output more focused and deterministic; higher values (~1.0+) make it more creative and random. Usually between 0.0 and 2.0.
    • max_tokens (int, optional, default=Provider default): The maximum number of tokens to generate in the completion. Does not include prompt tokens. Be mindful of the model's total context window limit.
    • top_p (float, optional, default=Provider default, often 1.0): Nucleus sampling. Considers only the smallest set of tokens whose cumulative probability exceeds top_p. A value of 0.9 means only tokens comprising the top 90% probability mass are considered. Lower values restrict the sampling pool more. It's generally recommended to alter either temperature or top_p, not both.
    • stop (Union[str, List[str]], optional): One or more sequences where the API should stop generating further tokens. The returned text will not contain the stop sequence.
    • presence_penalty (float, optional, default=0.0): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. Supported by OpenAI and some others.
    • frequency_penalty (float, optional, default=0.0): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. Supported by OpenAI and some others.
    • logit_bias (Dict[int, float], optional): Modifies the probability of specific token IDs appearing in the completion. Keys are token IDs, values are bias adjustments (-100 to 100). Advanced usage for controlling output content. Requires knowing token IDs.
    • user (str, optional): A unique identifier representing the end-user making the request. This can help API providers monitor for and prevent abusive behavior. Highly recommended for production applications.
    • response_format (Dict[str, str], optional): Request a specific structure for the output. Standardized key is "type". Example: {"type": "json_object"} requests the model to output valid JSON. Support varies by model (check litellm.utils.supports_response_schema).
    • seed (int, optional): For models supporting deterministic outputs (like newer OpenAI models), providing the same seed and identical parameters should lead to mostly reproducible results (best-effort). Beta feature.
    • tools (List[Dict], optional): A list of tool definitions the model can choose to call. Each definition follows the OpenAI tool format (see Tool Use example below or litellm.utils.function_to_dict). Requires model support (litellm.utils.supports_function_calling).
    • tool_choice (Union[str, Dict], optional): Controls how the model uses the provided tools.
      • "none" (default if tools not provided): Model will not call any tool.
      • "auto" (default if tools provided): Model decides whether to call a tool or respond directly.
      • "required": Model must call one or more tools.
      • {"type": "function", "function": {"name": "my_specific_tool"}}: Forces the model to call the specified tool. Requires model support (litellm.utils.supports_tool_choice).
    • logprobs (bool, optional, default=False): Whether to return log probabilities for the output tokens. Support varies by model.
    • top_logprobs (int, optional): If logprobs is True, specifies the number of most likely tokens to return log probabilities for at each token position. Support varies by model.
    • LiteLLM Overrides & Controls (a combined per-call example appears at the end of the Examples list below):
      • api_key, api_base, api_version, azure_ad_token, aws_access_key_id, etc. (str, optional): Override authentication details (see Authentication section).
      • custom_llm_provider (str, optional): Manually specify the provider if LiteLLM cannot infer it correctly from the model string.
      • metadata (Dict, optional): Pass custom key-value data through the request lifecycle. Accessible in callbacks (kwargs["metadata"]) and potentially logged. Useful for tracing, user IDs, A/B testing flags, etc.
      • caching (bool | Dict, optional): Override global caching configuration for this call. Examples: True (use global cache), False (don't use cache), {"ttl": 300} (cache this response for 300s), {"no-cache": True} (force bypass cache). Requires litellm.cache to be configured globally.
      • num_retries (int, optional): Override litellm.num_retries for this call.
      • timeout (int | float, optional): Override litellm.request_timeout for this call.
      • fallbacks, context_window_fallbacks, content_policy_fallbacks (List[Dict], optional): Override global fallback lists for this specific call.
      • mock_response (str | ModelResponse | Exception, optional): For testing. Returns this mock instead of making a real API call. If an Exception is provided, it will be raised.
      • stream_options (Dict, optional): Provider-specific options related to streaming. Example for OpenAI: {"include_usage": True} attempts to include token usage data in the final chunk of the stream.
    • **kwargs: Pass-through for additional provider-specific parameters not directly mapped by LiteLLM (e.g., top_k for Cohere/Anthropic). Behavior depends on litellm.drop_params setting and provider support.
  • Return Object (stream=False): litellm.ModelResponse (Pydantic model, often subclassed like CompletionResponse)
    • id (str): Unique ID for the completion.
    • choices (List[Choice]): List of completion choices generated. Often n=1, so usually one item. Each Choice contains:
      • finish_reason (str): Why generation stopped ("stop", "length", "tool_calls", "content_filter", etc.).
      • index (int): Index of the choice (usually 0).
      • message (Message): The generated message object. Contains:
        • content (Optional[str]): The text content of the response.
        • role (str): Usually "assistant".
        • tool_calls (Optional[List[ToolCall]]): If the model called tools. Each ToolCall has id, type ("function"), and function (with name and arguments as a JSON string).
      • logprobs (Optional[Logprobs]): Log probability information if requested (logprobs=True). Contains content (list of Logprob objects with token, logprob, top_logprobs).
    • created (int): Unix timestamp of creation.
    • model (str): The actual model name that generated the response (useful if fallbacks occurred).
    • object (str): Object type, usually "chat.completion".
    • system_fingerprint (Optional[str]): Identifier for backend configuration used (OpenAI).
    • usage (Usage): Token counts: prompt_tokens, completion_tokens, total_tokens.
    • _hidden_params (Dict): Internal LiteLLM dictionary containing resolved call parameters, metadata, cost (if calculated by callback), deployment info (if using Router), etc. Useful for debugging and logging.
  • Return Object (stream=True): Stream Wrapper (litellm.CustomStreamWrapper / litellm.AsyncCustomStreamWrapper)
    • An iterator (sync) or async iterator (async).
    • Each item yielded is a litellm.ModelResponse chunk.
    • Chunks carry incremental changes via the delta attribute, i.e. chunk.choices[0].delta. Common delta fields:
      • delta.content (str): The next piece of generated text.
      • delta.role (str): Usually present only in the first chunk ("assistant").
      • delta.tool_calls (List[ToolCallChunk]): Information about tool calls being streamed. Each ToolCallChunk has index, id, type, and function (with name and arguments chunks).
    • The final chunk might contain finish_reason and potentially usage (if requested via stream_options={"include_usage": True} and supported).

Examples:

  • (Basic Sync/Async Calls): See Quickstart section.
  • (Detailed Tool Use - Parallel): See previous detailed Tool Use example.
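  • (Forcing a Specific Tool via tool_choice): A minimal sketch of the tool_choice override described above; the get_current_weather tool definition is a hypothetical example, not part of LiteLLM.

    import litellm
    import json

    # Hypothetical tool definition (illustrative only)
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    print("--- Forced Tool Call Example ---")
    try:
        response = litellm.completion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "What's the weather like in Paris?"}],
            tools=[weather_tool],
            # Force the model to call this specific tool instead of answering in free text
            tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
        )
        tool_calls = response.choices[0].message.tool_calls
        if tool_calls:
            print("Tool called:", tool_calls[0].function.name)
            print("Arguments:", json.loads(tool_calls[0].function.arguments))
        else:
            print("Model did not return a tool call.")
    except Exception as e: print(f"Forced tool call failed: {e}")
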
  • (Streaming - Async):

    import litellm
    import asyncio
    async def stream_example():
        print("--- Streaming Example (Async) ---")
        try:
            stream = await litellm.acompletion(
                model="gpt-4o", # Or other fast model
                messages=[{"role": "user", "content": "Write a short paragraph about the benefits of streaming API responses."}],
                stream=True,
                max_tokens=150,
                stream_options={"include_usage": True} # Request usage in final chunk (OpenAI)
            )
            full_text = ""
            print("Stream Output:")
            async for chunk in stream:
                # Check for content delta
                if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
                    text_chunk = chunk.choices[0].delta.content
                    print(text_chunk, end="", flush=True)
                    full_text += text_chunk
                # Check for finish reason or usage in the final chunk
                if chunk.choices and chunk.choices[0].finish_reason:
                     print(f"\n[Stream End] Finish Reason: {chunk.choices[0].finish_reason}")
                if hasattr(chunk, 'usage') and chunk.usage:
                     print(f"[Stream End] Usage: Prompt={chunk.usage.prompt_tokens}, Completion={chunk.usage.completion_tokens}")
            print("\n--- Full Reconstructed Text ---")
            print(full_text)
        except Exception as e: print(f"\nStreaming Error: {e}")
    # asyncio.run(stream_example())
    
  • (Multimodal/Vision Input):

    import litellm
    import base64
    
    # Function to encode image (replace with your image loading)
    def encode_image_to_base64(image_path):
        try:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode('utf-8')
        except Exception as e:
            print(f"Error encoding image {image_path}: {e}")
            return None
    
    image_path = "path/to/your/image.jpg" # Replace with actual image path
    base64_image = encode_image_to_base64(image_path)
    
    if base64_image:
        vision_messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. What objects are present?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high" # Use 'high' for detailed analysis, 'low' for faster/cheaper overview
                        }
                    }
                ]
            }
        ]
        print("\n--- Multimodal Vision Call (Conceptual) ---")
        # try:
        #     response = litellm.completion(
        #         model="gpt-4-vision-preview", # Or "gpt-4o", "claude-3-opus-20240229", "vertex_ai/gemini-pro-vision"
        #         messages=vision_messages,
        #         max_tokens=300
        #     )
        #     print("Vision Model Response:", response.choices[0].message.content)
        # except Exception as e: print(f"Vision call failed: {e}")
    else:
        print("Skipping vision example, image encoding failed.")
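  • (Vision with a Public Image URL): For publicly reachable images you can pass an HTTPS URL instead of Base64 data, as described under the messages parameter above. Conceptual sketch; the URL is a placeholder.

    import litellm
    url_vision_messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/path/to/public_image.jpg",  # Placeholder URL
                        "detail": "low"  # Cheaper/faster overview; use "high" for detailed analysis
                    }
                }
            ]
        }
    ]
    # try:
    #     response = litellm.completion(model="gpt-4o", messages=url_vision_messages, max_tokens=150)
    #     print("Vision (URL) Response:", response.choices[0].message.content)
    # except Exception as e: print(f"Vision URL call failed: {e}")
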
    
  • (JSON Mode):

    import litellm
    import json
    print("\n--- JSON Mode Example ---")
    json_messages = [
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": "Extract the name and job title from this text: 'Dr. Jane Doe is the Lead Scientist.'"}
    ]
    try:
        response = litellm.completion(
            model="gpt-4-turbo", # Or other model supporting JSON mode
            messages=json_messages,
            response_format={"type": "json_object"} # Request JSON output
        )
        json_content = response.choices[0].message.content
        print("Raw JSON Output:", json_content)
        # Validate and parse the JSON
        try:
            parsed_json = json.loads(json_content)
            print("Parsed JSON:", parsed_json)
            # Access data: print(parsed_json.get("name"))
        except json.JSONDecodeError:
            print("ERROR: Model did not return valid JSON despite request.")
    except Exception as e: print(f"JSON mode call failed: {e}")
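
  • (Per-Call Overrides - mock_response, retries, timeout, caching): A minimal sketch of the per-call control parameters listed above. mock_response keeps it runnable without an API key; caching=False only has an effect if litellm.cache has been configured globally.

    import litellm
    print("\n--- Per-Call Overrides Example ---")
    try:
        response = litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Say hello."}],
            num_retries=2,    # Override litellm.num_retries for this call only
            timeout=10,       # Override litellm.request_timeout (seconds) for this call
            caching=False,    # Bypass any globally configured cache for this call
            metadata={"request_source": "docs-example"},  # Visible to callbacks via kwargs["metadata"]
            mock_response="Hello! (mocked - no real API call was made)"  # Testing aid
        )
        print("Response:", response.choices[0].message.content)
    except Exception as e: print(f"Per-call override example failed: {e}")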
    

Embeddings (litellm.embedding / aembedding)

  • Purpose: Generate dense vector representations (embeddings) of text, capturing semantic meaning for tasks like search, RAG, clustering, and classification.
  • Detailed Parameters:
    • model (str): Required. Embedding model identifier (e.g., "text-embedding-3-small", "text-embedding-3-large", "text-embedding-ada-002", "cohere/embed-english-v3.0", "azure/<your-deployment-name>").
    • input (Union[str, List[str]]): Required. A single string or a list of strings to embed. Providers impose limits on batch size (number of strings) and total tokens per request.
    • dimensions (int, optional): For OpenAI text-embedding-3-* models. Specifies the desired output embedding dimension (e.g., 256, 512, 1024, 1536). Reducing dimensions can save storage/bandwidth but may slightly impact performance. Check OpenAI docs for supported values per model.
    • encoding_format (Literal["float", "base64"], optional): For OpenAI text-embedding-3-* models. Format for returning embeddings. "float" (default) returns standard lists of numbers. "base64" returns base64-encoded strings, which can be smaller for transmission but require decoding.
    • user (str, optional): End-user identifier for abuse monitoring.
    • input_type (str, optional, passed via **kwargs): For Cohere models. Specify input type for potentially better performance: "search_document", "search_query", "classification", "clustering".
    • LiteLLM Overrides & Controls: api_key, api_base, custom_llm_provider, metadata, caching, num_retries, timeout, mock_response.
  • Return Object: litellm.EmbeddingResponse
    • object (str): Usually "list".
    • data (List[Embedding]): A list of Embedding objects, one per input string, maintaining original order. Each Embedding contains:
      • object (str): Usually "embedding".
      • embedding (List[float] | str): The embedding vector (list of numbers if encoding_format="float", base64 string if "base64").
      • index (int): The original index of the input string corresponding to this embedding.
    • model (str): The model name used for the embedding.
    • usage (Usage): Token counts: prompt_tokens (tokens in the input text(s)), total_tokens (same as prompt_tokens for most embedding models).

Examples: OpenAI v3, Cohere, Azure

import litellm
import os
import asyncio
from typing import List, Union

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."
# export COHERE_API_KEY="..."
# export AZURE_API_KEY="..."
# export AZURE_API_BASE="https://..." # Your Azure embedding endpoint base
# export AZURE_API_VERSION="..."

input_texts: List[str] = ["The quick brown fox jumps over the lazy dog.", "LiteLLM provides a unified API."]

# --- OpenAI text-embedding-3-small with dimension/format options ---
print("--- OpenAI text-embedding-3-small ---")
try:
    response_oai_v3 = litellm.embedding(
        model="text-embedding-3-small",
        input=input_texts,
        dimensions=256, # Request smaller dimension
        encoding_format="float" # Request standard float output
    )
    print(f"Generated {len(response_oai_v3.data)} embeddings.")
    print(f"  Dimension: {len(response_oai_v3.data[0].embedding)}") # Should be 256
    print(f"  Format: {type(response_oai_v3.data[0].embedding[0])}") # Should be float
    print(f"  Usage: {response_oai_v3.usage}")
except Exception as e: print(f"FAILED: {e}")

# --- Cohere embed-english-v3.0 with input_type ---
print("\n--- Cohere embed-english-v3.0 ---")
try:
    response_cohere_emb = litellm.embedding(
        model="cohere/embed-english-v3.0",
        input=input_texts,
        input_type="search_document" # Pass cohere-specific param
    )
    print(f"Generated {len(response_cohere_emb.data)} embeddings.")
    print(f"  Dimension: {len(response_cohere_emb.data[0].embedding)}")
    print(f"  Usage: {response_cohere_emb.usage}")
except Exception as e: print(f"FAILED: {e}")

# --- Azure OpenAI Embedding (Conceptual Async) ---
async def run_azure_embedding():
    print("\n--- Azure OpenAI Embedding (Async) ---")
    azure_embedding_deployment = "my-azure-embedding-deploy" # Replace with your deployment name
    try:
        response_azure_emb = await litellm.aembedding(
            model=f"azure/{azure_embedding_deployment}",
            input=input_texts
            # LiteLLM uses AZURE_API_KEY, AZURE_API_BASE, AZURE_API_VERSION from env
        )
        print(f"Generated {len(response_azure_emb.data)} embeddings.")
        print(f"  Dimension: {len(response_azure_emb.data[0].embedding)}") # Depends on your Azure deployed model
        print(f"  Usage: {response_azure_emb.usage}")
    except Exception as e: print(f"FAILED: {e}")

# asyncio.run(run_azure_embedding()) # Uncomment to run
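
# --- Using the returned vectors: cosine similarity (illustrative sketch) ---
# A common next step is semantic similarity between embeddings. This sketch uses only the
# standard library and assumes the OpenAI call above succeeded with encoding_format="float".
import math

def cosine_similarity(vec_a: List[float], vec_b: List[float]) -> float:
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

try:
    sim = cosine_similarity(response_oai_v3.data[0].embedding, response_oai_v3.data[1].embedding)
    print(f"\nCosine similarity between the two input texts: {sim:.4f}")
except NameError:
    print("\nSkipping similarity demo: the OpenAI embedding call above did not succeed.")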


Image Generation (litellm.image_generation / aimage_generation)

  • Purpose: Generate images from textual descriptions (prompts) using generative image models like DALL-E 2, DALL-E 3, Stable Diffusion variants (via providers like Bedrock, Stability AI, Azure AI Studio), etc.
  • Detailed Parameters:
    • model (str): Required. Identifier for the target image generation model. Examples: "dall-e-3", "dall-e-2", "azure/<your-deployment-name>", "stability.stable-diffusion-xl-1024-v1-0" (Stability AI direct), "bedrock/stability.stable-diffusion-xl-v1" (Bedrock), "bedrock/amazon.titan-image-generator-v1".
    • prompt (str): Required. The textual description of the image you want to create. Prompt engineering is key here; be descriptive. Provider safety filters apply.
    • n (int, optional, default=1): The number of images to generate for the given prompt. Check provider documentation for limits (e.g., DALL-E 2 supports > 1, DALL-E 3 currently supports n=1 via API).
    • size (str, optional): The desired dimensions of the generated image(s). Supported values depend heavily on the model.
      • DALL-E 3: "1024x1024", "1792x1024", "1024x1792".
      • DALL-E 2: "256x256", "512x512", "1024x1024".
      • Stable Diffusion: Often accepts various dimensions, sometimes specific ones like "1024x1024", "512x512", etc. Bedrock might use height and width parameters instead. Consult provider docs.
    • quality (str, optional): For DALL-E 3. "standard" or "hd" (higher detail, potentially slower/more expensive).
    • style (str, optional): For DALL-E 3. "vivid" (hyper-realistic and dramatic) or "natural" (more realistic, less intense).
    • response_format (Literal["url", "b64_json"], optional): Specifies how image data is returned.
      • "url" (Default for OpenAI): Provides temporary HTTPS URLs pointing to the generated images. URLs typically expire after a short period (e.g., 1 hour). You need to download the image from the URL.
      • "b64_json": Returns the raw image data encoded as a Base64 string directly within the JSON response. This is common for providers like Bedrock, Stability AI. Requires decoding on the client side.
    • user (str, optional): A unique identifier for the end-user for abuse monitoring.
    • Provider-Specific Params (kwargs): Models like Stable Diffusion accept many additional parameters controlling the generation process. These are often passed via keyword arguments. Examples (names might vary slightly by provider/LiteLLM mapping):
      • height (int), width (int): Explicit dimensions (used by some SD providers).
      • cfg_scale (float): Classifier-Free Guidance scale. How strongly the image should conform to the prompt (e.g., 7-15).
      • style_preset (str): Predefined style options (e.g., "photorealistic", "cinematic", "anime"). Check provider docs.
      • steps (int): Number of diffusion steps (e.g., 30-100). More steps can increase detail but take longer.
      • seed (int): Seed for the random number generator for potentially reproducible results (if other params are identical).
      • negative_prompt (str): Describe what not to include in the image.
    • LiteLLM Overrides & Controls: api_key, api_base, custom_llm_provider, metadata, timeout, num_retries, mock_response.
  • Return Object: litellm.ImageResponse
    • created (int): Unix timestamp of creation.
    • data (List[Image]): A list where each item (Image object) represents one generated image. Each Image contains:
      • b64_json (Optional[str]): The Base64-encoded image data (if response_format="b64_json").
      • url (Optional[str]): The temporary URL for the image (if response_format="url").
      • revised_prompt (Optional[str]): The prompt might be rewritten by the provider (especially DALL-E 3) for safety or clarity. This field contains the revised version if applicable.

Examples: DALL-E 3 (URL), Bedrock SDXL (Base64)

import litellm
import os
import base64
import io
import time
import asyncio
# Optional: from PIL import Image # To decode/save/show base64 images

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."
# Assumes standard AWS Credentials configured for Bedrock

# --- DALL-E 3 Example (URL Response Format) ---
print("--- Generating Image with DALL-E 3 (Response Format: URL) ---")
dalle_prompt = "Pop art style illustration of a cat wearing sunglasses, riding a skateboard on a rainbow."
try:
    response_dalle = litellm.image_generation(
        model="dall-e-3",
        prompt=dalle_prompt,
        n=1, # DALL-E 3 API currently supports n=1
        size="1024x1024",
        quality="standard", # Or "hd"
        style="vivid",      # Or "natural"
        response_format="url"
    )
    print(f"DALL-E 3: Request successful. Created at: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(response_dalle.created))}")
    if response_dalle.data:
        image_info = response_dalle.data[0]
        print(f"  Image URL (expires soon): {image_info.url}")
        if image_info.revised_prompt:
            print(f"  Revised Prompt by DALL-E 3: {image_info.revised_prompt}")
        # In a real app, you would now download the image from image_info.url
    else:
        print("  No image data received in the response.")

except Exception as e:
    print(f"DALL-E 3 generation FAILED: {type(e).__name__} - {e}")


# --- Bedrock Stable Diffusion XL Example (Base64 Response Format, Async) ---
async def run_bedrock_sdxl_gen():
    print("\n--- Generating Image with Bedrock SDXL (Response Format: b64_json) ---")
    bedrock_prompt = "Epic fantasy landscape painting, waterfalls cascading down mossy cliffs into a hidden valley, dramatic lighting, style of Albert Bierstadt."
    output_image_path = "bedrock_sdxl_output.png"
    try:
        response_bedrock = await litellm.aimage_generation(
            model="bedrock/stability.stable-diffusion-xl-v1", # Use correct Bedrock model identifier
            prompt=bedrock_prompt,
            response_format="b64_json", # Request Base64 encoded data
            # Pass Stable Diffusion specific parameters via kwargs
            height=1024,
            width=1024,
            cfg_scale=8.0,
            steps=40,
            seed=45678,
            style_preset="fantasy-art" # Check Bedrock docs for valid style presets
        )
        print(f"Bedrock SDXL: Request successful. Created at: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(response_bedrock.created))}")
        if response_bedrock.data:
            b64_image_data = response_bedrock.data[0].b64_json
            if b64_image_data:
                print(f"  Received Base64 data (first 60 chars): {b64_image_data[:60]}...")
                # Decode and save the image
                try:
                    image_bytes = base64.b64decode(b64_image_data)
                    with open(output_image_path, "wb") as f:
                        f.write(image_bytes)
                    print(f"  Image successfully decoded and saved to: {output_image_path}")
                    # Optional: Open with PIL
                    # img = Image.open(io.BytesIO(image_bytes))
                    # img.show()
                except Exception as decode_err:
                    print(f"  ERROR: Failed to decode/save base64 image: {decode_err}")
            else:
                print("  ERROR: Response received but 'b64_json' field was empty.")
        else:
            print("  No image data received in the response.")

    except litellm.exceptions.AuthenticationError:
        print("Bedrock generation FAILED: Authentication Error. Ensure AWS credentials are properly configured (env vars, IAM role, etc.).")
    except Exception as e:
        print(f"Bedrock generation FAILED: {type(e).__name__} - {e}")

# asyncio.run(run_bedrock_sdxl_gen()) # Uncomment to run
# Remember to clean up the output image if desired:
# if os.path.exists(output_image_path): os.remove(output_image_path)
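
# --- Downloading an image returned as a URL (illustrative sketch) ---
# DALL-E URLs expire quickly, so you typically download the image right away. This sketch uses
# only the standard library and assumes response_dalle from the example above succeeded with
# response_format="url".
import urllib.request

def download_image(url: str, dest_path: str) -> bool:
    """Download an image from a (temporary) URL to a local file."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp, open(dest_path, "wb") as out_file:
            out_file.write(resp.read())
        return True
    except Exception as download_err:
        print(f"Download failed: {download_err}")
        return False

# if response_dalle.data and response_dalle.data[0].url:
#     download_image(response_dalle.data[0].url, "dalle3_output.png")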

Audio Transcription (litellm.transcription / atranscription)

  • Purpose: Convert spoken audio from an audio file into written text using speech-to-text models like OpenAI's Whisper.
  • Detailed Parameters:
    • model (str): Required. Identifier for the transcription model (e.g., "whisper-1", "azure/<your-deployment-name>").
    • file (FileTypes): Required. The audio file to transcribe. This argument accepts several types:
      • File Object (Recommended): An opened file handle in binary read mode (rb). Use a with open(...) as f: block to ensure the file is properly closed. Example: with open("meeting_audio.mp3", "rb") as audio_file: ... litellm.transcription(file=audio_file).
      • File Path (str): A string containing the path to the audio file. LiteLLM will attempt to open and read this file. Example: litellm.transcription(file="path/to/recording.wav").
      • Bytes: Raw bytes of the audio file content. Example: audio_bytes = request.data; litellm.transcription(file=audio_bytes).
    • language (str, optional): The language of the speech in the audio file, specified as an ISO-639-1 code (e.g., "en", "es", "fr", "de", "ja"). Providing the language significantly improves accuracy, especially for non-English audio or audio with accents. If omitted, the model attempts auto-detection.
    • prompt (str, optional): A textual prompt to provide context to the model, which can improve the transcription of specific words, names, acronyms, or guide the style/formatting. Example: "The speakers are discussing Project Nightingale and ACME Corp. They mention Dr. Evelyn Reed.".
    • response_format (Literal["json", "text", "srt", "verbose_json", "vtt"], optional, default="json"): The desired format for the transcription output.
      • "json": Returns a JSON object with a text field containing the full transcription.
      • "text": Returns the transcription as a single plain text string.
      • "srt": Returns the transcription formatted as a SubRip Text (SRT) subtitle file string, including timestamps.
      • "vtt": Returns the transcription formatted as a Web Video Text Tracks (WebVTT) subtitle file string, including timestamps.
      • "verbose_json": Returns a detailed JSON object containing the full text, language, duration, and potentially lists of segments and words with associated timestamps (if requested via timestamp_granularities).
    • temperature (float, optional, default=0.0): Sampling temperature between 0 and 1. Higher values (> 0) make the output more random, which can sometimes help with creative interpretations but may also lead to hallucinations or nonsensical text, especially if the audio quality is low. Lower values (close to 0) make the output more deterministic and conservative. For transcription, the default of 0.0 is usually best.
    • timestamp_granularities (List[Literal["word", "segment"]], optional, default=[]): When response_format="verbose_json", specify whether to include timestamps at the "word" level, "segment" level (longer chunks of speech), or both. Requesting word timestamps significantly increases the size and detail of the response. Example: timestamp_granularities=["word", "segment"].
    • LiteLLM Overrides & Controls: api_key, api_base, api_version (Azure), custom_llm_provider, metadata, timeout, num_retries, mock_response.
  • Return Object: litellm.TranscriptionResponse (or subclasses depending on format). The structure varies greatly based on response_format:
    • If response_format="text": May return just the str. Check isinstance.
    • If response_format="srt" or "vtt": Returns the formatted subtitle str.
    • If response_format="json" or "verbose_json": Returns a Pydantic model (or dict) with fields like:
      • text (str): The full transcribed text.
      • (Verbose Only) language (str): Detected language code.
      • (Verbose Only) duration (float): Duration of the audio in seconds.
      • (Verbose Only) segments (Optional[List[Dict]]): List of speech segments, each with id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob.
      • (Verbose Only) words (Optional[List[Dict]]): List of recognized words, each with word, start, end (if timestamp_granularities included "word").

Example: Text and Verbose JSON Transcription with Timestamps

import litellm
import os
import time
from typing import List, Dict, Optional

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."

# --- Create a dummy audio file if none exists ---
audio_file_path = "transcript_test.mp3" # Replace with your actual audio file
real_file_exists = os.path.exists(audio_file_path)
if not real_file_exists:
    print(f"Warning: Creating dummy audio file '{audio_file_path}' (transcription will likely fail).")
    # Create a minimal fake WAV file (Whisper needs more substantial audio)
    with open(audio_file_path, "wb") as f:
        f.write(b'RIFF$\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\x80>\x00\x00\x00\xfa\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00')
else:
    print(f"Using existing audio file: {audio_file_path}")

# --- 1. Simple Text Transcription ---
print("\n--- Transcribing (Output: Text) ---")
try:
    with open(audio_file_path, "rb") as audio_stream:
        start_time = time.time()
        transcript_obj = litellm.transcription(
            model="whisper-1",
            file=audio_stream,
            response_format="text",
            # language="en", # Optional: Specify language
            # prompt="Keywords: LiteLLM, Whisper" # Optional: Provide context
        )
        duration = time.time() - start_time
    print(f"Text Transcription Result ({duration:.2f}s):")
    # Handle potential string return for 'text' format
    if isinstance(transcript_obj, str):
        print(transcript_obj)
    elif hasattr(transcript_obj, 'text'):
        print(transcript_obj.text)
    else:
        print("ERROR: Could not extract text from response.")
        print("Raw Response:", transcript_obj)
except FileNotFoundError:
    print(f"ERROR: Audio file not found at '{audio_file_path}'")
except Exception as e:
    print(f"ERROR during text transcription: {type(e).__name__} - {e}")

# --- 2. Verbose JSON Transcription with Word Timestamps ---
print("\n--- Transcribing (Output: Verbose JSON with Word Timestamps) ---")
try:
    with open(audio_file_path, "rb") as audio_stream:
        start_time = time.time()
        verbose_response = litellm.transcription(
            model="whisper-1",
            file=audio_stream,
            response_format="verbose_json",
            timestamp_granularities=["word"] # Request word timestamps
        )
        duration = time.time() - start_time
    print(f"Verbose Transcription Result ({duration:.2f}s):")
    if hasattr(verbose_response, 'text'):
        print(f"  Full Text: {verbose_response.text[:100]}...") # Show snippet
        print(f"  Detected Language: {getattr(verbose_response, 'language', 'N/A')}")
        duration_val = getattr(verbose_response, 'duration', None)
        print(f"  Audio Duration: {duration_val:.2f}s" if isinstance(duration_val, (int, float)) else "  Audio Duration: N/A")

        if hasattr(verbose_response, 'words') and verbose_response.words:
            print("  Word Timestamps (First 10):")
            for word_info in verbose_response.words[:10]:
                 # word_info is dict: {'word': 'Hello', 'start': 0.5, 'end': 0.9}
                 print(f"    - '{word_info['word']}' ({word_info['start']:.2f}s - {word_info['end']:.2f}s)")
        else:
            print("  No word timestamp data returned.")
    else:
        print("ERROR: Verbose JSON response structure not recognized.")
        print("Raw Response:", verbose_response)

except FileNotFoundError:
    print(f"ERROR: Audio file not found at '{audio_file_path}'")
except Exception as e:
    print(f"ERROR during verbose transcription: {type(e).__name__} - {e}")

finally:
    # Clean up dummy file
    if not real_file_exists and os.path.exists(audio_file_path):
        os.remove(audio_file_path)
        print(f"\nCleaned up dummy file: {audio_file_path}")
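
# --- 3. SRT / VTT Subtitle Output (illustrative sketch) ---
# The same call can return subtitle-formatted text directly (response_format="srt" or "vtt").
# Conceptual only: it assumes a real audio file at audio_file_path; the dummy file created above
# will not produce usable output.
# try:
#     with open(audio_file_path, "rb") as audio_stream:
#         srt_result = litellm.transcription(
#             model="whisper-1",
#             file=audio_stream,
#             response_format="srt"  # Or "vtt" for WebVTT
#         )
#     srt_text = srt_result if isinstance(srt_result, str) else getattr(srt_result, "text", "")
#     with open("transcript_test.srt", "w", encoding="utf-8") as f:
#         f.write(srt_text)
#     print("SRT subtitles saved to transcript_test.srt")
# except Exception as e:
#     print(f"ERROR during SRT transcription: {type(e).__name__} - {e}")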

Text-to-Speech (litellm.speech / aspeech)

  • Purpose: Synthesize natural-sounding spoken audio from input text using Text-to-Speech (TTS) models (e.g., OpenAI TTS, Azure Cognitive Services TTS).
  • Detailed Parameters:
    • model (str): Required. Identifier for the TTS model (e.g., "tts-1", "tts-1-hd" for OpenAI; "azure/<your-deployment-name>" for Azure, or use standard Azure voice names with the appropriate provider config).
    • input (str): Required. The text string to be converted into speech. Provider limits apply (e.g., OpenAI TTS models have a limit around 4096 characters). Longer texts need to be chunked and synthesized separately.
    • voice (str): Required. Specifies the voice to use for synthesis. Values depend on the provider:
      • OpenAI: "alloy", "echo", "fable", "onyx", "nova", "shimmer".
      • Azure: The exact voice short name (e.g., "en-US-JennyNeural", "es-ES-ElviraNeural"). Check Azure documentation for available voices.
    • response_format (str, optional): The desired audio encoding format for the output. Defaults vary by provider.
      • OpenAI Default: "mp3". Supported: "opus" (low latency, variable bitrate), "aac" (good compression, digital audio), "flac" (lossless compression), "wav" (uncompressed PCM), "pcm" (raw uncompressed PCM).
      • Check provider docs for formats supported by other TTS models.
    • speed (float, optional, default=1.0): Controls the speed of the generated speech relative to the normal speed. Supported range depends on the provider.
      • OpenAI: 0.25 (slower) to 4.0 (faster).
    • LiteLLM Overrides & Controls: api_key, api_base, api_version (Azure), custom_llm_provider, metadata, timeout, mock_response.
  • Return Object: litellm.types.llms.openai.HttpxBinaryResponseContent (or similar binary response wrapper)
    • Crucial Usage Note: This object represents the streaming HTTP response containing the binary audio data. You cannot access the audio directly as an attribute. You must use its methods, ideally within a with statement to ensure proper resource handling:
      • response.stream_to_file(file_path: str): Efficiently writes the incoming audio stream directly to the specified file path. This is the recommended way to save the audio.
      • response.read() -> bytes: Reads the entire audio content into a bytes object in memory. Use this if you need to process the bytes directly (e.g., send over network, analyze). Be cautious with very large audio files as this loads everything into RAM.
      • response.iter_bytes(chunk_size=...) -> Iterator[bytes]: Provides an iterator yielding chunks of bytes, suitable for custom streaming logic.
      • response.close(): Explicitly closes the response stream (handled automatically by with statement).
  • Use Case: Generating voiceovers for videos or presentations, creating audiobooks or podcasts from text, building interactive voice response (IVR) systems or voice assistants, providing audio feedback in applications, enabling accessibility features (read-aloud).

Example: Generating Speech and Handling the Response

import litellm
import os
import time
from typing import Optional

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."

text_for_speech = """
LiteLLM makes generating speech simple. You provide text, choose a model and voice,
and receive audio data. Remember to handle the binary response correctly using methods
like stream_to_file or read.
"""
output_directory = "./tts_output"
output_filename_mp3 = os.path.join(output_directory, f"litellm_speech_{int(time.time())}.mp3")
output_filename_flac = os.path.join(output_directory, f"litellm_speech_{int(time.time())}.flac")

# Ensure output directory exists
os.makedirs(output_directory, exist_ok=True)

print("--- Generating Speech (OpenAI TTS-1) ---")

# --- 1. Generate MP3 (Default) and save using stream_to_file ---
print(f"\nGenerating MP3 output, saving to {output_filename_mp3}...")
try:
    # Use 'with' statement for resource management
    with litellm.speech(
        model="tts-1",       # Standard quality, lower latency
        input=text_for_speech,
        voice="shimmer",     # Choose a voice
        # response_format="mp3" # Default for OpenAI
        speed=1.0           # Normal speed
    ) as response_mp3:
        # Check status code (optional, but good practice)
        if hasattr(response_mp3, 'http_response') and response_mp3.http_response.status_code == 200:
             print("  HTTP request successful. Streaming to file...")
             response_mp3.stream_to_file(output_filename_mp3)
             print(f"  MP3 speech successfully saved.")
        else:
             status = getattr(response_mp3, 'http_response', None)
             status_code = getattr(status, 'status_code', 'N/A')
             print(f"  ERROR: HTTP request failed with status {status_code}")
             # You might want to read error content if available: error_content = response_mp3.read()

except litellm.exceptions.AuthenticationError:
    print("  ERROR: Authentication failed. Check OPENAI_API_KEY.")
except Exception as e:
    print(f"  ERROR during MP3 generation/saving: {type(e).__name__} - {e}")


# --- 2. Generate FLAC (Higher Quality) and read bytes ---
print(f"\nGenerating FLAC output, reading bytes into memory...")
audio_bytes_flac: Optional[bytes] = None
try:
    with litellm.speech(
        model="tts-1-hd",      # Higher definition model
        input="Generating high definition audio in FLAC format.",
        voice="onyx",
        response_format="flac", # Request FLAC format
        speed=0.9              # Slightly slower
    ) as response_flac:
         if hasattr(response_flac, 'http_response') and response_flac.http_response.status_code == 200:
             print("  HTTP request successful. Reading bytes...")
             # Read the entire audio data into memory
             audio_bytes_flac = response_flac.read()
             print(f"  Successfully read {len(audio_bytes_flac)} bytes of FLAC data.")
             # Now you could process audio_bytes_flac (e.g., send elsewhere, analyze)
             # Optionally save the bytes manually:
             # with open(output_filename_flac, "wb") as f:
             #    f.write(audio_bytes_flac)
             # print(f"  Manually saved FLAC bytes to {output_filename_flac}")
         else:
             status = getattr(response_flac, 'http_response', None)
             status_code = getattr(status, 'status_code', 'N/A')
             print(f"  ERROR: HTTP request failed with status {status_code}")

except litellm.exceptions.AuthenticationError:
     print("  ERROR: Authentication failed. Check OPENAI_API_KEY.")
except Exception as e:
    print(f"  ERROR during FLAC generation/reading: {type(e).__name__} - {e}")

# --- Optional Cleanup ---
# print("\nConsider cleaning up generated files in:", output_directory)
# if os.path.exists(output_filename_mp3): os.remove(output_filename_mp3)
# if os.path.exists(output_filename_flac): os.remove(output_filename_flac)
# if not os.listdir(output_directory): os.rmdir(output_directory)
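
# --- 3. Streaming audio chunks with iter_bytes (illustrative sketch) ---
# Useful when forwarding audio incrementally (e.g., to a websocket or playback buffer) instead
# of saving a file. The chunk handling below just counts bytes.
print("\nGenerating speech and iterating over raw byte chunks...")
try:
    with litellm.speech(
        model="tts-1",
        input="Streaming audio chunk by chunk.",
        voice="alloy",
    ) as response_stream:
        total_bytes = 0
        for audio_chunk in response_stream.iter_bytes(chunk_size=4096):
            # In a real application, write each chunk to a socket or audio buffer here
            total_bytes += len(audio_chunk)
        print(f"  Streamed {total_bytes} bytes of audio in 4KB chunks.")
except Exception as e:
    print(f"  ERROR during chunked streaming: {type(e).__name__} - {e}")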


Content Moderation (litellm.moderation / amoderation)

  • Purpose: Classify input text against safety policies to detect potentially harmful or inappropriate content. Currently, this function primarily acts as a wrapper for the OpenAI Moderation API.
  • Detailed Parameters:
    • input (Union[str, List[str]]): Required. The text content(s) to be evaluated. Providing a list of strings may result in multiple internal calls depending on the underlying API's batching support (OpenAI's endpoint supports single string input).
    • model (str, optional): The specific moderation model version to use.
      • "text-moderation-latest": (Default) Always points to OpenAI's recommended latest moderation model.
      • "text-moderation-stable": Points to the current stable version (less likely to change unexpectedly).
      • "text-moderation-007" (or other specific versions): Use a specific dated model version if needed, though generally latest or stable are preferred.
    • LiteLLM Overrides & Controls: api_key (OpenAI key), api_base (if using Azure OpenAI for moderation or a proxy), custom_llm_provider (should typically be "openai" or "azure" if applicable), metadata, timeout, num_retries, mock_response.
  • Return Object (Matches OpenAI SDK): openai.types.moderation.ModerationCreateResponse (Pydantic model or compatible dict)
    • id (str): A unique identifier for the moderation request.
    • model (str): The specific moderation model version used (e.g., "text-moderation-007").
    • results (List[Result]): A list containing moderation results, typically one item corresponding to the input string. Each Result object contains:
      • flagged (bool): True if the input text was flagged in any category, False otherwise. This is the primary indicator of potentially problematic content.
      • categories (Categories): An object (or dict) where keys are category names and values are booleans indicating if that category was triggered (True) or not (False). Categories include:
        • hate
        • hate_threatening (Hate speech also involving violence/threats)
        • harassment
        • harassment_threatening (Harassment also involving violence/threats)
        • self_harm
        • self_harm_intent (Expressing intent for self-harm)
        • self_harm_instructions (Providing instructions for self-harm)
        • sexual
        • sexual_minors (Sexual content involving minors)
        • violence
        • violence_graphic (Depicting graphic violence)
      • category_scores (CategoryScores): An object (or dict) where keys are the same category names and values are floating-point confidence scores (typically 0.0 to 1.0, though not strictly bounded) indicating the model's confidence level that the text belongs to that category. Higher scores indicate higher confidence. Note: Use the boolean categories flags for actual policy violation decisions, not just the raw scores, as thresholds are applied internally by OpenAI.
  • Use Case: Pre-screening user prompts before sending them to generative models, filtering user-generated content (comments, posts), ensuring chatbot responses comply with safety guidelines, implementing content warning systems.

Example: Moderating Multiple Text Inputs

import litellm
import os
from typing import List, Union

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."

inputs_to_moderate: List[str] = [
    "What a wonderful day filled with sunshine and puppies!", # Safe
    "I strongly dislike this product, it's terrible.", # Negative sentiment, likely safe
    "This input contains explicit threats of violence towards a group.", # Should be flagged
    "Learn how to build a website using Python and Flask." # Safe, technical
]

print("--- Moderating Multiple Text Inputs ---")

moderation_results = []
for index, text_input in enumerate(inputs_to_moderate):
    print(f"\nModerating Input #{index + 1}: '{text_input[:60]}...'")
    try:
        # Call moderation for each input individually
        response = litellm.moderation(
            input=text_input,
            model="text-moderation-latest" # Use the latest model
        )

        if response and response.results:
            result_data = response.results[0] # Get the result for this input
            moderation_results.append({
                "input_index": index,
                "input_text": text_input,
                "flagged": result_data.flagged,
                "categories": dict(result_data.categories), # Convert Pydantic model to dict
                "scores": dict(result_data.category_scores) # Convert Pydantic model to dict
            })
            # Print immediate feedback
            print(f"  Flagged: {result_data.flagged}")
            if result_data.flagged:
                categories_dict = dict(result_data.categories)
                scores_dict = dict(result_data.category_scores)
                triggered = {k: f"{scores_dict.get(k, 0.0):.4f}" for k, flag in categories_dict.items() if flag}
                print(f"  Triggered Categories (Score): {triggered}")
        else:
            print("  ERROR: Invalid or empty response received from moderation API.")
            moderation_results.append({"input_index": index, "input_text": text_input, "error": "Invalid Response"})

    except litellm.exceptions.AuthenticationError:
        print("  ERROR: Authentication failed. Check OPENAI_API_KEY.")
        moderation_results.append({"input_index": index, "input_text": text_input, "error": "AuthenticationError"})
        # break # Option to stop if auth fails
    except Exception as e:
        print(f"  ERROR during moderation: {type(e).__name__} - {e}")
        moderation_results.append({"input_index": index, "input_text": text_input, "error": str(e)})

# --- Process Overall Results ---
print("\n--- Moderation Summary ---")
flagged_count = sum(1 for r in moderation_results if isinstance(r, dict) and r.get('flagged'))
error_count = sum(1 for r in moderation_results if isinstance(r, dict) and r.get('error'))
print(f"Total Inputs: {len(inputs_to_moderate)}")
print(f"Flagged Inputs: {flagged_count}")
print(f"Errors: {error_count}")

# Example: Filter flagged inputs for review
flagged_inputs = [r for r in moderation_results if isinstance(r, dict) and r.get('flagged')]
if flagged_inputs:
    print("\nFlagged Inputs for Review:")
    for item in flagged_inputs:
         print(f"  Index {item['input_index']}: '{item['input_text'][:80]}...' Categories: {[k for k,v in item['categories'].items() if v]}")
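
# --- Illustrative sketch: moderation as a gate before a completion call ---
# safe_completion is not part of LiteLLM; it simply refuses to call the generative model when
# the moderation endpoint flags the user input (see the pre-screening use case above).
def safe_completion(user_text: str, model: str = "gpt-3.5-turbo") -> dict:
    mod = litellm.moderation(input=user_text, model="text-moderation-latest")
    if mod.results and mod.results[0].flagged:
        triggered = [k for k, flag in dict(mod.results[0].categories).items() if flag]
        return {"blocked": True, "triggered_categories": triggered}
    completion_response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": user_text}]
    )
    return {"blocked": False, "content": completion_response.choices[0].message.content}

# print(safe_completion("Tell me a fun fact about octopuses."))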

Text Completions (Legacy) (litellm.text_completion / atext_completion)

  • Purpose: Interact with older completion-style models (like OpenAI's gpt-3.5-turbo-instruct) or custom endpoints that use a simple prompt-in, text-out format rather than the structured chat message format. This is generally not recommended for newer chat-optimized models.
  • Detailed Parameters:
    • model (str): Required. Identifier for the text completion model (e.g., "gpt-3.5-turbo-instruct", "azure/<your-deployment-name>").
    • prompt (Union[str, List[str]]): Required. The input prompt string(s). While OpenAI's endpoint accepts a list for batching, standard usage is a single string.
    • max_tokens (int, optional, default=16): Maximum number of tokens to generate in the completion.
    • temperature (float, optional, default=1.0): Controls randomness (0.0-2.0).
    • top_p (float, optional, default=1.0): Nucleus sampling.
    • n (int, optional, default=1): How many completions to generate for each prompt.
    • stream (bool, optional, default=False): If True, returns a generator yielding completion chunks.
    • logprobs (int, optional, default=None): Include log probabilities for the logprobs most likely tokens at each position. E.g., logprobs=5.
    • echo (bool, optional, default=False): If True, include the prompt text in the response choices alongside the completion.
    • stop (Union[str, List[str]], optional): Sequence(s) where generation should stop.
    • presence_penalty (float, optional, default=0.0): Penalize new tokens based on presence (-2.0 to 2.0).
    • frequency_penalty (float, optional, default=0.0): Penalize new tokens based on frequency (-2.0 to 2.0).
    • best_of (int, optional, default=1): Server-side parameter. Generates best_of completions and returns the one with the highest log probability per token. Caution: token consumption scales with best_of, since every candidate completion is generated server-side. When used with n, best_of must be greater than n.
    • suffix (str, optional, default=None): Text to append to the end of the generation (after the main completion).
    • user (str, optional): End-user identifier.
    • LiteLLM Overrides & Controls: api_key, api_base, custom_llm_provider, metadata, caching, num_retries, timeout, mock_response.
  • Return Object (stream=False): litellm.TextCompletionResponse (Pydantic model or compatible dict)
    • id (str): Unique ID.
    • object (str): Usually "text_completion".
    • created (int): Unix timestamp.
    • model (str): Model name used.
    • choices (List[TextChoice]): List of completion choices (length depends on n). Each TextChoice contains:
      • text (str): The generated completion text.
      • index (int): Index of the choice.
      • logprobs (Optional[Logprobs]): Log probability information if requested. Contains fields like tokens (list of token strings), token_logprobs (list of log probabilities), top_logprobs (list of dicts mapping token string to logprob), text_offset (list of character offsets).
      • finish_reason (str): "stop" or "length".
    • usage (Usage): Token counts (prompt_tokens, completion_tokens, total_tokens).
  • Return Object (stream=True): Stream Wrapper (litellm.TextCompletionStreamWrapper / litellm.AsyncTextCompletionStreamWrapper)
    • An iterator (sync) or async iterator (async).
    • Each item yielded is a TextCompletionResponse chunk.
    • Chunks contain changes, typically within chunk.choices[0].text (the next piece of text) or chunk.choices[0].logprobs. The final chunk usually contains the finish_reason.

Example: Using Text Completion (Instruct Model)

import litellm
import os

# Required Keys in Environment:
# export OPENAI_API_KEY="sk-..."

instruct_model = "gpt-3.5-turbo-instruct"
instruct_prompt = """Translate the following English text to French:
English: Hello, how are you today?
French:"""

print(f"--- Text Completion Example ({instruct_model}) ---")
try:
    response = litellm.text_completion(
        model=instruct_model,
        prompt=instruct_prompt,
        max_tokens=50,
        temperature=0.3,
        stop=["\n"], # Stop at the first newline
        echo=False # Don't include the prompt in the output text
    )

    print(f"Prompt:\n{instruct_prompt}")
    if response.choices:
        completion_text = response.choices[0].text.strip()
        finish_reason = response.choices[0].finish_reason
        print(f"\nCompletion:\n{completion_text}")
        print(f"[Finish Reason: {finish_reason}]")
    else:
        print("\nERROR: No completion choices received.")

    if response.usage:
        print(f"[Usage: Prompt={response.usage.prompt_tokens}, Completion={response.usage.completion_tokens}]")

except litellm.exceptions.NotFoundError:
     print(f"\nERROR: Model '{instruct_model}' not found or not available for text completion.")
except Exception as e:
    print(f"\nERROR during text completion: {type(e).__name__} - {e}")
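
# --- Streaming Text Completion (illustrative sketch) ---
# The same function supports stream=True (see the stream wrapper description above); new text
# arrives in chunk.choices[0].text.
print(f"\n--- Streaming Text Completion ({instruct_model}) ---")
try:
    stream = litellm.text_completion(
        model=instruct_model,
        prompt="List three primary colors:",
        max_tokens=30,
        temperature=0.3,
        stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].text:
            print(chunk.choices[0].text, end="", flush=True)
    print()  # Final newline once the stream ends
except Exception as e:
    print(f"ERROR during streaming text completion: {type(e).__name__} - {e}")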

Adapter Completions (litellm.adapter_completion / aadapter_completion)

  • Purpose: A powerful feature for advanced users needing to integrate LiteLLM with systems or workflows that use custom request and response formats different from LiteLLM's standard (OpenAI-like) structures. Adapters act as bi-directional translators.
  • Concept:

    1. Define Custom Schemas: Specify the exact structure of your input request (what adapter_completion will receive) and your desired output format (what adapter_completion will return). Pydantic models are excellent for this.
    2. Implement Adapter Class: Create a Python class inheriting from litellm.types.adapter.AdapterV1. This class needs:
      • A unique id: str attribute.
      • translate_completion_input_params(self, kwargs: dict) -> dict: This method receives the **kwargs passed to litellm.adapter_completion. It must validate these against your custom input schema and translate them into a dictionary containing standard litellm.completion parameters (model, messages, temperature, etc.).
      • translate_completion_output_params(self, response: litellm.ModelResponse) -> Any: This method receives the standard litellm.ModelResponse object resulting from the underlying litellm.completion call (made using the translated input params). It must translate this standard response into your desired custom output format (e.g., your Pydantic output model). It can return any type.
    3. Register Adapter: Create an instance of your adapter class and register it with LiteLLM before calling adapter_completion:

      my_adapter = MyCustomAdapter()
      if not any(a['id'] == my_adapter.id for a in litellm.adapters):
          litellm.adapters.append({"id": my_adapter.id, "adapter": my_adapter})
      
  • Detailed Parameters (litellm.adapter_completion):

    • adapter_id (str): Required. The unique id string of your registered adapter class instance.
    • **kwargs: Required. All keyword arguments expected by your adapter's translate_completion_input_params method (matching your custom input schema). Do NOT pass standard LiteLLM params like model or messages here unless your adapter specifically expects them.
  • Returns: Any. The value returned by your adapter's translate_completion_output_params method. The type and structure are entirely defined by your adapter implementation.

  • Use Case: Integrating LiteLLM into existing Flask/FastAPI applications with specific request bodies, calling LiteLLM from systems expecting a different response structure, adding complex pre-processing or post-processing logic tied to specific request formats.

Example: Implementing and Using a Simple Adapter

import litellm
from litellm import ModelResponse, Choices, Message, Usage # For mocking response
from litellm.types.adapter import AdapterV1 # Base class for adapter
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any
import json  # Needed for json.dumps on the final adapter response below

# --- 1. Define Custom Schemas ---
class InternalRequest(BaseModel):
    """Input schema expected by our adapter."""
    task_id: str
    user_prompt: str
    target_language: str = "English"
    max_output_len: int = Field(default=100, gt=0) # Pydantic validation

class InternalResponse(BaseModel):
    """Output schema returned by our adapter."""
    request_task_id: str
    llm_response_text: str
    language_used: str
    llm_model_used: Optional[str] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None

# --- 2. Implement Adapter Class ---
class TranslationAdapter(AdapterV1):
    id: str = "internal-translator-v1" # Unique ID

    def translate_completion_input_params(self, kwargs: dict) -> dict:
        """Translate from InternalRequest format to litellm.completion format."""
        print(f"\n[Adapter {self.id}] Received input kwargs: {kwargs}")
        # Validate input using our Pydantic model
        try:
            internal_input = InternalRequest(**kwargs)
        except Exception as e:
             print(f"[Adapter {self.id}] Input validation failed: {e}")
             raise ValueError(f"Invalid input for adapter {self.id}: {e}") from e

        # Construct standard litellm messages
        messages = [
            {"role": "system", "content": f"You are a translator. Translate the user's text to {internal_input.target_language}."},
            {"role": "user", "content": internal_input.user_prompt}
        ]

        # Map to litellm.completion parameters
        litellm_params = {
            "model": "gpt-3.5-turbo", # Adapter decides the underlying model
            "messages": messages,
            "max_tokens": internal_input.max_output_len,
            "temperature": 0.2, # Fixed temperature for translation
            # Pass original task_id through metadata for tracking
            "metadata": {"adapter_id": self.id, "original_task_id": internal_input.task_id}
        }
        print(f"[Adapter {self.id}] Translated to litellm params: {litellm_params}")
        return litellm_params

    def translate_completion_output_params(self, response: litellm.ModelResponse) -> Optional[InternalResponse]:
        """Translate from litellm.ModelResponse format to InternalResponse format."""
        print(f"\n[Adapter {self.id}] Received LiteLLM response (Type: {type(response)}). Translating output...")
        # Extract original task_id from metadata passed through
        original_task_id = response._hidden_params.get('metadata', {}).get('original_task_id', 'UNKNOWN_TASK')
        target_language = "Unknown" # Ideally, we'd get this from the input translation state if needed

        if isinstance(response, ModelResponse) and response.choices and response.choices[0].message.content:
            # Create our custom response object
            try:
                internal_output = InternalResponse(
                    request_task_id=original_task_id,
                    llm_response_text=response.choices[0].message.content.strip(),
                    language_used=target_language, # Placeholder - could be refined
                    llm_model_used=response.model,
                    input_tokens=response.usage.prompt_tokens if response.usage else None,
                    output_tokens=response.usage.completion_tokens if response.usage else None
                )
                print(f"[Adapter {self.id}] Translated to InternalResponse: {internal_output.model_dump()}")
                return internal_output
            except Exception as e:
                 print(f"[Adapter {self.id}] Error creating InternalResponse: {e}")
                 # Decide how to handle translation failure - return None or raise?
                 return None # Returning None in this case
        else:
            print(f"[Adapter {self.id}] Translation failed: No valid content in LiteLLM response.")
            return None

# --- 3. Register the Adapter ---
translator_adapter_instance = TranslationAdapter()
if not any(a['id'] == translator_adapter_instance.id for a in litellm.adapters):
    litellm.adapters.append({"id": translator_adapter_instance.id, "adapter": translator_adapter_instance})
    print(f"\nAdapter '{translator_adapter_instance.id}' registered.")
else:
    print(f"\nAdapter '{translator_adapter_instance.id}' already registered.")


# --- 4. Use adapter_completion ---
print("\n--- Calling litellm.adapter_completion ---")
# Define the input matching the InternalRequest schema
adapter_input_data = {
    "task_id": "translate-eng-fr-123",
    "user_prompt": "Hello world, how are you?",
    "target_language": "French",
    "max_output_len": 40
}

try:
    # Mock the underlying litellm.completion response for testing the adapter flow
    mock_llm_response = ModelResponse(
        id="cmpl-mock-adapter", model="gpt-3.5-turbo", choices=[
            Choices(message=Message(content=" Bonjour le monde, comment allez-vous?"), index=0)
        ], usage=Usage(prompt_tokens=35, completion_tokens=9, total_tokens=44),
        _hidden_params={"metadata": {"adapter_id": "internal-translator-v1", "original_task_id": "translate-eng-fr-123"}}
    )

    # Call adapter_completion with the adapter ID and the custom input data
    final_response: Optional[InternalResponse] = litellm.adapter_completion(
        adapter_id=translator_adapter_instance.id,
        mock_response=mock_llm_response, # Mock the call *after* input translation
        **adapter_input_data # Pass custom input as kwargs
    )

    print("\nFinal Response Received from adapter_completion:")
    if isinstance(final_response, InternalResponse):
        print(json.dumps(final_response.model_dump(), indent=2))
    else:
        print(f"Adapter did not return the expected InternalResponse object. Received: {final_response}")

except ValueError as e:
    print(f"\nERROR: Input validation likely failed in adapter: {e}")
except KeyError as e:
     print(f"\nERROR: Adapter ID '{e}' not found. Was it registered?")
except Exception as e:
    print(f"\nERROR during adapter_completion call: {type(e).__name__} - {e}")

Batch Completions (litellm.batch_completion / litellm.abatch_completion)

  • Purpose: Efficiently send multiple chat completion requests concurrently. Ideal for processing large volumes of independent text generation tasks, significantly improving throughput over sequential requests.
  • Detailed Parameters:
    • requests (List[Union[Dict, litellm.types.completion.CompletionRequest]]): Required. A list where each element is a dictionary defining a single completion request. Each dictionary must include "model" and "messages", and can include any other valid parameter for litellm.completion specific to that request (e.g., temperature, max_tokens, tools, metadata, api_key, api_base, etc.).
    • max_concurrent_requests (int, optional): The maximum number of requests allowed to be in flight simultaneously. LiteLLM tries to determine a sensible default based on system limits (e.g., asyncio loop limits, OS open file limits), often around 100-1000. Tuning this might be necessary based on system resources and provider rate limits.
    • use_threadpool (bool, optional, default=False): For synchronous batch_completion only. If True, uses a concurrent.futures.ThreadPoolExecutor for concurrency instead of the default asyncio event loop approach. This can be useful in certain synchronous application contexts, but asyncio (used by abatch_completion and the default batch_completion) generally offers better performance for I/O-bound tasks like API calls.
  • Return Value: List[Union[litellm.ModelResponse, Exception]]
    • A list containing the results for each request, in the same order as the input requests list.
    • Each element is either:
      • A successful litellm.ModelResponse object if that specific request completed without errors.
      • An Exception object (e.g., RateLimitError, AuthenticationError, APIConnectionError) if that specific request failed.
    • The batch_completion / abatch_completion call itself generally only raises an exception if there's a fundamental issue setting up the batch process (e.g., invalid requests structure), not for individual request failures. You must iterate through the returned list to check the status of each request.
  • Use Case: Large-scale data processing (classification, summarization, extraction), parallel prompt evaluation, A/B testing variations, generating personalized content in bulk.

Example: Processing Batch Results

import litellm
import os
import asyncio
import time
from typing import List, Dict, Any, Union

# Assume API keys set in environment (OpenAI, Anthropic used)

# --- Define Batch Requests ---
list_of_requests: List[Dict[str, Any]] = []
for i in range(5): # Create 5 requests
    list_of_requests.append({
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": f"Request {i+1}: Tell me a unique fact."}],
        "max_tokens": 30, "temperature": 0.6 + i*0.1, # Vary temperature slightly
        "metadata": {"batch_index": i, "attempt_id": f"fact_batch_{int(time.time())}"}
    })
list_of_requests.append({ # Add a request to a different provider
    "model": "claude-3-haiku-20240307",
    "messages": [{"role": "user", "content": "Request 6: What is Anthropic known for?"}],
    "max_tokens": 40, "metadata": {"batch_index": 5}
})
list_of_requests.append({ # Add a failing request
    "model": "gpt-4", "messages": [{"role": "user", "content": "Fail this."}],
    "api_key": "invalid-key", "metadata": {"batch_index": 6}
})

print(f"--- Sending Batch of {len(list_of_requests)} Requests (Async) ---")

async def run_and_process_batch():
    start_batch = time.time()
    batch_results: List[Union[litellm.ModelResponse, Exception]] = []
    try:
        batch_results = await litellm.abatch_completion(
            requests=list_of_requests,
            max_concurrent_requests=5 # Limit concurrency slightly for demo
        )
    except Exception as batch_e:
        print(f"CRITICAL ERROR during batch execution: {batch_e}")
        return # Stop if the whole batch call fails

    duration_batch = time.time() - start_batch
    print(f"\nBatch processing finished in {duration_batch:.2f} seconds.")

    # --- Process Individual Results ---
    successes = []
    failures = []
    print("\n--- Individual Request Results ---")
    for i, result in enumerate(batch_results):
        original_request = list_of_requests[i]
        model_requested = original_request.get("model", "N/A")
        metadata = original_request.get("metadata", {})
        print(f"\nResult for Request Index {i} (Model: {model_requested}, Meta: {metadata}):")

        if isinstance(result, litellm.ModelResponse):
            print(f"  Status: SUCCESS")
            successes.append({
                "index": i,
                "model": result.model,
                "response": result.choices[0].message.content[:80].strip() + "...",
                "usage": result.usage
            })
            print(f"  Response: {successes[-1]['response']}")
        elif isinstance(result, Exception):
            print(f"  Status: FAILED")
            failures.append({
                "index": i,
                "model": model_requested, # Model from request, might not be in exception
                "error_type": type(result).__name__,
                "error_message": str(result)
            })
            print(f"  Error: {failures[-1]['error_type']} - {failures[-1]['error_message']}")
        else:
            print(f"  Status: UNEXPECTED TYPE - {type(result)}")
            failures.append({
                "index": i, "model": model_requested, "error_type": "UnknownResultType", "error_message": str(result)
            })

    print("\n--- Final Batch Summary ---")
    print(f"Total attempted: {len(list_of_requests)}")
    print(f"Successful: {len(successes)}")
    print(f"Failed: {len(failures)}")
    # You can now work with the 'successes' and 'failures' lists

# asyncio.run(run_and_process_batch()) # Uncomment to run

Reranking (litellm.rerank / litellm.arerank)

  • Purpose: Improve search relevance or context quality by reordering a list of documents based on their semantic relevance to a specific query, using specialized reranking models (like Cohere Rerank). Typically used in RAG pipelines after initial document retrieval.
  • Detailed Parameters:
    • model (str): Required. Identifier for the reranking model (e.g., "cohere/rerank-english-v2.0", "cohere/rerank-multilingual-v2.0").
    • query (str): Required. The search query, question, or topic to rank documents against.
    • documents (List[Union[str, Dict[str, str]]]): Required. The list of documents retrieved from the initial search phase (e.g., vector DB). Can be:
      • List of strings: ["doc text 1", "doc text 2", ...]
      • List of dictionaries: [{"text": "doc text 1", "id": "d1"}, {"text": "doc text 2", "id": "d2"}, ...] (Recommended if you have metadata like IDs).
    • top_n (int, optional): Return only the top top_n most relevant documents. If omitted, all documents are returned, sorted by relevance.
    • rank_fields (List[str], optional, default=["text"]): If documents is a list of dictionaries, specify the key(s) within each dictionary that contain the text content to be used for ranking.
    • return_documents (bool, optional, default=True): Whether to include the original document content in the response results. Set to False to only receive indices and relevance scores.
    • LiteLLM Overrides & Controls: api_key (often required, e.g., COHERE_API_KEY), api_base, custom_llm_provider (usually "cohere"), metadata, timeout, num_retries, mock_response.
  • Return Object: litellm.RerankResponse
    • id (str): Unique response ID.
    • results (List[RerankResult]): List of RerankResult objects, sorted by relevance_score (highest first). Each RerankResult contains:
      • document (Optional[Dict]): The original document dictionary (if input was dicts and return_documents=True) or {"text": "..."} (if input was strings and return_documents=True). None if return_documents=False.
      • index (int): The original 0-based index of this document in the input documents list.
      • relevance_score (float): The relevance score assigned by the model (higher is better). Range depends on model (e.g., 0.0-1.0 for Cohere).
    • meta (Dict): Metadata, potentially including billing info like meta['billed_units']['input_tokens'] for Cohere.

Example: Improving RAG Context with Rerank

import litellm
import os
from typing import List, Dict, Union

# Required Keys in Environment:
# export COHERE_API_KEY="..."

# --- Simulate Initial Retrieval (e.g., from Vector Search) ---
user_query: str = "What were the main challenges in the Apollo moon landing missions?"
retrieved_docs_rag: List[Dict[str, str]] = [
    {"id": "nasa-report-1972", "text": "The Apollo program successfully landed humans on the Moon, overcoming significant engineering hurdles related to the Saturn V rocket and lunar module descent."}, # Relevant
    {"id": "space-race-overview", "text": "The Space Race between the USA and USSR spurred rapid advancements in rocketry and space exploration technology throughout the 1960s."}, # Context, less specific
    {"id": "apollo13-incident", "text": "Apollo 13 faced a critical oxygen tank explosion, requiring immense ingenuity from mission control and the crew to return safely to Earth. This highlighted onboard system risks."}, # Highly relevant challenge
    {"id": "lunar-geology-summary", "text": "Samples returned by Apollo missions provided invaluable data about the Moon's geological history and composition."}, # Less relevant to challenges
    {"id": "computer-systems-apollo", "text": "The Apollo Guidance Computer (AGC) had limited memory and processing power by modern standards, presenting software development and operational challenges."}, # Relevant challenge
    {"id": "budget-politics-apollo", "text": "Securing consistent funding and navigating the political landscape were ongoing challenges for the long-term Apollo program."}, # Relevant challenge
    {"id": "mars-rover-tech", "text": "Modern Mars rovers utilize advanced autonomous navigation systems not available during the Apollo era."} # Irrelevant
]
print(f"Retrieved {len(retrieved_docs_rag)} initial documents for RAG.")

# --- Use Rerank to select the best context ---
print("\n--- Reranking retrieved documents for relevance ---")
try:
    rerank_response = litellm.rerank(
        model="cohere/rerank-english-v2.0",
        query=user_query,
        documents=retrieved_docs_rag,
        top_n=3 # Select the top 3 most relevant documents for final context
    )

    print(f"\nTop {len(rerank_response.results)} Reranked Documents:")
    final_context_docs = []
    if rerank_response.results:
        for rank, result in enumerate(rerank_response.results):
            print(f"\nRank {rank + 1}: Score={result.relevance_score:.4f}, Original Index={result.index}")
            if result.document: # Check if document data was returned
                 doc_id = result.document.get('id', 'N/A')
                 doc_text = result.document.get('text', '')
                 print(f"  ID: {doc_id}, Text: {doc_text[:150]}...")
                 final_context_docs.append(doc_text) # Collect text for final LLM call
            else:
                 print("  (Document content not returned)")
        # Expected Order: Likely apollo13-incident, computer-systems-apollo, budget-politics-apollo / nasa-report-1972

        # --- Use the reranked context in a final Completion call ---
        print("\n--- Using Top 3 Reranked Docs as Context for Final Completion ---")
        context_str = "\n\n".join([f"Document {i+1}:\n{doc}" for i, doc in enumerate(final_context_docs)])
        final_messages = [
            {"role": "system", "content": "Based ONLY on the following documents, answer the user's question concisely."},
            {"role": "user", "content": f"CONTEXT DOCUMENTS:\n{context_str}\n\nQUESTION: {user_query}"}
        ]
        # print("Final Prompt Preview:\n", json.dumps(final_messages, indent=2)) # Uncomment to see prompt
        # final_answer_response = litellm.completion(model="gpt-4-turbo", messages=final_messages, max_tokens=200)
        # print("\nFinal LLM Answer based on Reranked Context:")
        # print(final_answer_response.choices[0].message.content)

    else:
        print("Reranking returned no results.")

except litellm.exceptions.AuthenticationError:
    print("\nERROR: Reranking failed. Check COHERE_API_KEY.")
except Exception as e:
    print(f"\nERROR during reranking or final completion: {type(e).__name__} - {e}")

OpenAI API Pass-through Functions

LiteLLM provides direct wrappers for many specialized OpenAI APIs, allowing you to use them while potentially benefiting from LiteLLM's configuration handling (keys, base URLs for Azure) and features like the Router.

  • Concept: These functions (litellm.create_file, litellm.list_fine_tuning_jobs, litellm.create_run, etc.) are designed to accept parameters and return response objects that directly mirror the official OpenAI Python SDK.
  • Reference OpenAI Docs: For the exact function signatures, required parameters, optional arguments, and the detailed structure of the response objects (usually Pydantic models from the openai library), you must refer to the official OpenAI API Reference for the relevant API (Files, Fine-tuning, Batch, Assistants v1/v2).
  • LiteLLM Additions: LiteLLM adds its standard override parameters (api_key, api_base, timeout, metadata, num_retries, etc.) to these calls.
  • Use Cases:
    • Files API: Uploading (create_file), listing (list_files), retrieving (retrieve_file), deleting (delete_file), getting content (retrieve_file_content) for use with Assistants, Fine-tuning, Batch.
    • Fine-tuning API: Creating (create_fine_tuning_job), listing (list_fine_tuning_jobs), retrieving (retrieve_fine_tuning_job), canceling (cancel_fine_tuning_job), listing events (list_fine_tuning_job_events) for custom model training.
    • Batch API: Creating (create_batch), retrieving (retrieve_batch), canceling (cancel_batch), listing (list_batches) asynchronous batch processing jobs.
    • Assistants API (v1/v2): Building stateful conversational agents by managing Assistants (create_assistants, retrieve_assistants, etc.), Threads (create_thread, etc.), Messages (create_message, list_messages, etc.), Runs (create_run, retrieve_run, submit_tool_outputs_to_run, etc.), and Run Steps.

Conceptual Examples (Illustrating Call Patterns - Refer to OpenAI Docs for Details):

import litellm
import os
import time

# Assumes OPENAI_API_KEY or AZURE_* vars are set

# --- File API: List Files ---
print("--- File API: List Files (Conceptual) ---")
# try:
#     my_files = litellm.list_files(purpose="assistants") # Filter by purpose if needed
#     print(f"Found {len(my_files.data)} files with purpose 'assistants'.")
#     # for file_obj in my_files.data: print(f"  - ID: {file_obj.id}, Name: {file_obj.filename}")
# except Exception as e: print(f"Failed: {e}")

# --- Fine-tuning API: List Jobs ---
print("\n--- Fine-tuning API: List Jobs (Conceptual) ---")
# try:
#     ft_jobs = litellm.list_fine_tuning_jobs(limit=5) # Get recent jobs
#     print(f"Found {len(ft_jobs.data)} fine-tuning jobs.")
#     # for job in ft_jobs.data: print(f"  - ID: {job.id}, Model: {job.fine_tuned_model}, Status: {job.status}")
# except Exception as e: print(f"Failed: {e}")

# --- Batch API: Retrieve Batch Job ---
print("\n--- Batch API: Retrieve Job (Conceptual) ---")
# batch_id_to_check = "batch_xxxxxxxxxxxx" # Replace with a real Batch ID
# try:
#     batch_status = litellm.retrieve_batch(batch_id=batch_id_to_check)
#     print(f"Batch Job {batch_status.id} Status: {batch_status.status}")
#     # print(f"  Input File ID: {batch_status.input_file_id}")
#     # print(f"  Output File ID: {batch_status.output_file_id}") # Available when completed
# except Exception as e: print(f"Failed: {e}")

# --- Assistants API: List Assistants ---
print("\n--- Assistants API: List Assistants (Conceptual) ---")
# try:
#     my_assistants = litellm.list_assistants(limit=10, order="desc")
#     print(f"Found {len(my_assistants.data)} assistants.")
#     # for assistant in my_assistants.data: print(f"  - ID: {assistant.id}, Name: {assistant.name}, Model: {assistant.model}")
# except Exception as e: print(f"Failed: {e}")

Health Checks (litellm.health_check / litellm.ahealth_check)

  • Purpose: Perform a simple check to verify if a specific model endpoint is reachable and if the provided credentials are valid for a minimal interaction. Useful for monitoring and configuration validation.
  • Detailed Parameters:
    • model (str): Required. The model identifier string for the endpoint to check (e.g., "gpt-3.5-turbo", "azure/my-healthcheck-deploy").
    • mode (Literal["completion", "embedding"], optional, default="completion"): Specifies the type of minimal API call to make for the check. "completion" usually tries a very short completion; "embedding" tries a very short embedding.
    • LiteLLM Overrides & Controls: api_key, api_base, api_version, custom_llm_provider, timeout.
  • Return Object: litellm.utils.HealthCheckResponse (TypedDict)
    • healthy (bool): True if the minimal API call succeeded (e.g., received HTTP 200 OK), indicating reachability and valid authentication. False otherwise.
    • error_message (Optional[str]): If healthy is False, contains a string representation of the error encountered (e.g., "AuthenticationError: Incorrect API key provided", "NotFoundError: The model xyz does not exist", "APIConnectionError: Connection refused").

Example: Checking Multiple Endpoints

import litellm
import os
import asyncio
from typing import Dict, List

# Required Keys in Environment for models being checked

async def run_detailed_health_checks():
    endpoints_to_check: List[Dict[str, str]] = [
        {"name": "OpenAI GPT-3.5", "model": "gpt-3.5-turbo"},
        {"name": "Azure GPT-4 (Example)", "model": "azure/your-gpt4-deployment"}, # Replace with your deployment
        {"name": "Anthropic Claude Haiku", "model": "claude-3-haiku-20240307"},
        {"name": "Invalid Model Name", "model": "this-model-does-not-exist-at-all"},
        {"name": "Ollama Local", "model": "ollama/llama3"} # Needs Ollama running + OLLAMA_API_BASE
    ]

    print("--- Running Detailed Health Checks (Async) ---")
    health_results: Dict[str, Dict] = {}

    for endpoint_info in endpoints_to_check:
        name = endpoint_info["name"]
        model_id = endpoint_info["model"]
        print(f"\nChecking: {name} ({model_id})")
        try:
            # Use async health check
            status = await litellm.ahealth_check(model=model_id, timeout=15) # 15s timeout
            health_results[name] = status
            print(f"  Result -> Healthy: {status.get('healthy')}")
            if not status.get('healthy'):
                print(f"            Error: {status.get('error_message')}")
        except Exception as e:
            # Catch errors raised by the health check call itself (e.g., an internal LiteLLM issue)
            print(f"  Health check call itself failed: {type(e).__name__} - {e}")
            health_results[name] = {"healthy": False, "error_message": f"Health check function error: {e}"}

    print("\n--- Health Check Summary ---")
    for name, status in health_results.items():
         health_str = "✅ Healthy" if status.get('healthy') else f"❌ Unhealthy ({status.get('error_message', 'Unknown error')})"
         print(f"  {name:<30}: {health_str}")

# asyncio.run(run_detailed_health_checks()) # Uncomment to run

(Final sections: Router, Exceptions, Cost, Budget, Utilities, Constants will follow with exhaustive detail)


6. Router (litellm.Router)

The litellm.Router class is a sophisticated component for managing and routing requests across a pool of multiple LLM API deployments. It provides advanced capabilities for high availability, load balancing, performance/cost optimization, and centralized management of diverse endpoints.

Benefits of Using the Router (Detailed)

  • High Availability & Failover: Define multiple deployments (e.g., different Azure regions, OpenAI + Anthropic keys) for the same logical model group. If one deployment fails (due to API errors, rate limits, timeouts after retries), the Router automatically routes subsequent requests for that group to the next available healthy deployment. You can also configure explicit fallback groups (e.g., if the entire "gpt-4-pool" fails, automatically try the "claude-prod-pool").
  • Load Balancing: Distribute incoming requests across multiple identical deployments within a group. This is crucial for scaling beyond the rate limits (RPM/TPM) of a single API key or endpoint. Strategies include:
    • simple-shuffle: Randomly distributes load (stateless).
    • least-busy: Intelligently sends requests to the deployment currently handling the lowest load relative to its configured TPM/RPM limits (stateful, requires Redis).
    • usage-based-routing: Similar to least-busy, focuses on distributing load evenly based on usage percentage (stateful, requires Redis).
  • Performance Optimization (latency-based-routing): Continuously track the average request latency for each deployment within a group. Route incoming requests to the deployment with the lowest recent latency. Adapts to changing network conditions or provider performance (stateful, requires Redis).
  • Cost Optimization (cost-based-routing): Define costs per token for each deployment. The Router estimates the cost of an incoming request (based on prompt token count) for each available deployment in the group and routes it to the cheapest option. Useful when deployments for the same capability have different pricing (e.g., different cloud provider models, spot instances).
  • Centralized Endpoint & Credential Management: Define all your LLM deployments, including their specific models, API keys, base URLs, versions, and capacity limits (tpm/rpm) within the model_list configuration passed to the Router. Simplifies managing diverse endpoints.
  • Health Checks & Cooldowns: The Router automatically performs background health checks on deployments (if Redis is configured). Deployments that fail consecutively (allowed_fails) are temporarily removed from the routing pool for a configured cooldown_time, preventing requests from being sent to known unhealthy endpoints.
  • A/B Testing & Gradual Rollouts: Can be configured (often via custom strategies or careful model_list setup) to route specific percentages of traffic to different model versions or deployment configurations.
  • Unified Interface: Interact with all your routed deployments using the same familiar router.completion, router.embedding, etc., methods, abstracting away the underlying complexity. (A minimal, Redis-free sketch follows this list.)
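
Example: Minimal Stateless Router Sketch

Before the full parameter reference below, here is a minimal, Redis-free sketch: two deployments share one logical group name and are load-balanced with the default stateless simple-shuffle strategy. The model names and environment variables are illustrative; adapt them to your own deployments.

import litellm
import os

# Two deployments grouped under the logical name "chat" (illustrative models/keys)
minimal_router = litellm.Router(
    model_list=[
        {
            "model_name": "chat",
            "litellm_params": {"model": "gpt-4o-mini", "api_key": os.getenv("OPENAI_API_KEY")},
        },
        {
            "model_name": "chat",
            "litellm_params": {"model": "claude-3-haiku-20240307", "api_key": os.getenv("ANTHROPIC_API_KEY")},
        },
    ],
    routing_strategy="simple-shuffle",  # stateless default, no Redis required
    num_retries=2,
)

# Requests target the group name; the router picks one of the deployments.
# response = minimal_router.completion(
#     model="chat",
#     messages=[{"role": "user", "content": "Hello from the router!"}],
# )
# print(response.choices[0].message.content)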

Initialization (__init__)

Create and configure a Router instance. This is where you define the pool of deployments and the core routing behavior.

  • Exhaustive Parameters:
    • model_list (List[Dict]): Required. The heart of the Router configuration. A list where each dictionary defines one deployment. Key fields within each deployment dictionary:
      • "model_name" (str): Required. A logical name or alias for a group of deployments (e.g., "production-chat", "embedding-cluster", "gpt4-eu-deployments"). Requests made to the router using this name will be routed among the deployments sharing this model_name.
      • "litellm_params" (Dict): Required. A dictionary containing the parameters needed to call this specific deployment using the underlying litellm.* functions.
        • Must include "model": The actual provider-specific model identifier string (e.g., "gpt-4o", "azure/my-deployment-name-123", "bedrock/anthropic.claude-3-sonnet-v1:0").
        • Include Credentials/Endpoints: Provide "api_key", "api_base", "api_version", provider-specific keys ("aws_region_name", "vertex_project", etc.) here if they are unique to this deployment and should not be picked up from global litellm.* settings or environment variables. If credentials are the same for all deployments in a group and set globally/in env, they might not be needed here (though explicit is often clearer).
      • "tpm" (int, optional): Tokens Per Minute capacity limit estimated for this specific deployment endpoint/key. Used by "least-busy" and "usage-based-routing" strategies. If omitted, the router might try to infer from litellm.model_cost but explicitly setting it is better for load balancing.
      • "rpm" (int, optional): Requests Per Minute capacity limit estimated for this specific deployment. Used by "least-busy" and "usage-based-routing".
      • "model_info" (Dict, optional): Can contain arbitrary metadata about the deployment. A key field is "id" (str): A unique identifier for this specific deployment instance within the router. If not provided, LiteLLM assigns a UUID. Providing explicit IDs (e.g., "azure-eastus-gpt4-key1") makes management easier (delete_deployment, get_deployment, specific_deployment override).
      • "deployment_tags" (List[str], optional): Tags associated with this deployment, used for advanced tag-based routing (Enterprise feature).
      • "weight" (float, optional): Used by weighted-random routing strategy to influence selection probability.
      • "latency" (float, optional): Initial latency estimate (seconds). latency-based-routing will update this dynamically.
      • "cost_per_token" (Dict[str, float], optional): Pre-define {"input_cost_per_token": ..., "output_cost_per_token": ...} for this deployment, overriding litellm.model_cost. Used by cost-based-routing.
    • routing_strategy (str, optional, default="simple-shuffle"): Defines the algorithm for selecting the next available deployment within a requested model_name group.
      • "simple-shuffle": Randomly shuffles the list of healthy deployments in the group and picks the first one. Stateless, good default.
      • "least-busy": Selects the healthy deployment currently processing the fewest tokens/requests relative to its declared tpm/rpm limits. Aims to keep load balanced based on capacity. Requires Redis.
      • "latency-based-routing": Selects the healthy deployment with the lowest exponentially weighted moving average (EWMA) latency based on recent requests. Adapts to performance changes. Requires Redis.
      • "cost-based-routing": Estimates the cost of the current request for each healthy deployment (using prompt_tokens and deployment cost_per_token info) and selects the cheapest.
      • "usage-based-routing": Selects the healthy deployment with the lowest current usage percentage (based on recent TPM/RPM vs limits). Requires Redis.
      • "weighted-random": Selects randomly but weighted by the weight parameter defined in the deployment's model_list entry.
      • Custom function name: If you registered a custom strategy using set_custom_routing_strategy.
    • Redis Configuration (Required for stateful strategies, cooldowns, usage tracking):
      • redis_host (str, optional): Hostname or IP of your Redis server.
      • redis_port (int, optional): Port number of your Redis server (default 6379).
      • redis_password (str, optional): Password for your Redis server.
      • redis_url (str, optional): Alternative way to provide connection info as a single Redis URL string (e.g., "redis://:password@hostname:port/0"). Overrides individual host/port/password if provided.
    • cache_responses (bool, optional, default=False): Enables response caching via the router. Requires Redis connection parameters to be set (uses Redis as the cache backend). Responses are cached based on request parameters (model, messages, etc.).
    • num_retries (int, optional, default=Value of litellm.num_retries): Default number of retries per deployment before considering it failed for the current request (triggering routing to the next deployment or fallback).
    • timeout (float, optional, default=Value of litellm.request_timeout): Default request timeout in seconds per deployment attempt.
    • fallbacks (List[Dict[str, List[str]]], optional): Router-level fallback rules between model groups. Example: [{"group_A": ["group_B", "group_C"]}, {"*": ["default_group"]}].
    • context_window_fallbacks, content_policy_fallbacks (List[Dict], optional): Similar fallback lists triggered only by specific error types.
    • allowed_fails (int, optional): Number of consecutive failures from a specific deployment before it is temporarily marked unhealthy (put in cooldown). Requires Redis.
    • cooldown_time (float, optional, default=litellm.DEFAULT_COOLDOWN_TIME_SECONDS e.g., 300): Duration (seconds) a deployment stays in cooldown. Requires Redis.
    • set_verbose (bool, optional, default=False): Enables detailed logging messages from the Router about health checks, deployment selection, cooldowns, etc., to stderr. Useful for debugging routing logic.
    • model_group_alias (Dict[str, str], optional): Define aliases mapping a user-friendly name (key) to an internal model_name group defined in model_list (value). Example: {"chat": "chat-prod-main"}. Allows users to call router.completion(model="chat", ...) which routes to the "chat-prod-main" group.
    • router_ttl (float, optional): Time-to-live in seconds for Redis keys used by the router (e.g., latency, usage stats). Defaults to None (keys persist).
    • enable_health_checks (bool, optional, default=True if Redis configured, else False): Whether the router should perform periodic background health checks on deployments. Requires Redis.
    • health_check_interval (int, optional, default=litellm.ROUTER_DEFAULT_HEALTH_CHECK_INTERVAL e.g., 60): Frequency (seconds) of background health checks.
  • Returns: A configured litellm.Router instance.

Example: Comprehensive Router Initialization

import litellm
import os
from typing import List, Dict

# Assume API keys are set in environment

# Define Deployments with details for different strategies
deployments: List[Dict] = [
    # --- Production Chat Group ---
    {
        "model_name": "prod-chat",
        "litellm_params": {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
        "tpm": 60000, "rpm": 500, "model_info": {"id": "openai-gpt4o-1"},
        "cost_per_token": {"input_cost_per_token": 0.000005, "output_cost_per_token": 0.000015} # $5/$15 per 1M
    },
    {
        "model_name": "prod-chat",
        "litellm_params": {"model": "azure/my-gpt4o-deploy", "api_key": os.getenv("AZURE_API_KEY"), "api_base": os.getenv("AZURE_API_BASE"), "api_version": os.getenv("AZURE_API_VERSION")},
        "tpm": 50000, "rpm": 400, "model_info": {"id": "azure-gpt4o-eastus"},
        "cost_per_token": {"input_cost_per_token": 0.000006, "output_cost_per_token": 0.000018} # Slightly different Azure cost
    },
    # --- Production Embedding Group ---
    {
        "model_name": "prod-embed",
        "litellm_params": {"model": "text-embedding-3-large", "api_key": os.getenv("OPENAI_API_KEY")},
        "tpm": 1000000, "rpm": 3000, "model_info": {"id": "openai-embed-large"},
        "cost_per_token": {"input_cost_per_token": 0.00000013, "output_cost_per_token": 0.0} # $0.13 per 1M input
    },
    # --- Fallback/Cheaper Chat Group ---
    {
        "model_name": "backup-chat",
        "litellm_params": {"model": "claude-3-haiku-20240307", "api_key": os.getenv("ANTHROPIC_API_KEY")},
        "tpm": 80000, "rpm": 1000, "model_info": {"id": "anthropic-haiku-backup"},
        "cost_per_token": {"input_cost_per_token": 0.00000025, "output_cost_per_token": 0.00000125} # $0.25/$1.25 per 1M
    }
]

# Configure Router with Redis for stateful features
print("--- Initializing Router with Redis and Multiple Strategies Possible ---")
try:
    router = litellm.Router(
        model_list=deployments,
        # Choose a stateful strategy - e.g., least-busy
        routing_strategy="least-busy",
        # Or latency-based-routing, or cost-based-routing etc.

        # Redis connection for state
        redis_host=os.getenv("REDIS_HOST", "127.0.0.1"),
        redis_port=int(os.getenv("REDIS_PORT", 6379)),
        redis_password=os.getenv("REDIS_PASSWORD", None),

        # Enable response caching via Redis
        cache_responses=True,
        # cache_kwargs={"ttl": 3600}, # Optional: Cache TTL if different from default

        # Fallback rules
        fallbacks=[{"prod-chat": ["backup-chat"]}],

        # Cooldown settings
        allowed_fails=5,     # Mark unhealthy after 5 consecutive fails
        cooldown_time=120,   # Keep unhealthy for 120 seconds

        # General settings
        num_retries=2,       # Retry each deployment twice before failing/routing
        timeout=300,         # 5 minute timeout per attempt
        set_verbose=True     # Enable detailed router logging
    )
    print("\nRouter initialized successfully with Redis.")
    print(f"Strategy: {router.routing_strategy}, Cache Enabled: {router.cache_responses}")
    print(f"Model Groups: {router.get_model_names()}")

except ImportError:
     print("\nERROR: Router init failed. Requires 'redis' extra. Run: pip install litellm[redis]")
except Exception as e:
    print(f"\nERROR initializing router (Is Redis running and accessible?): {type(e).__name__} - {e}")

Core Router Methods (completion, embedding, etc.)

Use the router instance's methods, mirroring the core litellm.* API, to make routed requests.

  • Usage: router.completion(...), router.aembedding(...), router.aimage_generation(...), etc.
  • model Parameter: Use the model_name (group alias) from your model_list (e.g., "prod-chat").
  • Other Parameters: Pass all standard arguments for the corresponding task (messages, input, prompt, temperature, stream, etc.).
  • Parameter Overrides:
    • specific_deployment (str): Pass a deployment's unique model_info["id"] (e.g., "azure-gpt4o-eastus") to force the request to that specific deployment, bypassing the routing strategy.
    • metadata, caching, num_retries, timeout, fallbacks: Can be overridden per-call on the router method, just like on core litellm.* functions.
  • Return Value: Same type as the core function (ModelResponse, EmbeddingResponse, etc.). The response object's _hidden_params dictionary is crucial for understanding routing:
    • _hidden_params['deployment'] (Dict): Contains the full configuration dictionary of the deployment that ultimately successfully processed the request (including its model_name, litellm_params, model_info, etc.). Useful for logging and debugging which endpoint was used.

Example: Router Calls (Chat, Embedding, Streaming, Override)

import litellm
import asyncio
import os

# --- Assume 'router' is initialized from the comprehensive example above ---

async def run_router_calls():
    if 'router' not in globals() or not isinstance(router, litellm.Router):
         print("Router not initialized, skipping calls.")
         return

    # --- 1. Routed Chat Completion ---
    print("\n--- 1. Router Chat Completion ('prod-chat' group) ---")
    try:
        response_chat = await router.acompletion(
            model="prod-chat", # Target the group
            messages=[{"role": "user", "content": "How does the router select a deployment?"}],
            max_tokens=100
        )
        dep_id = response_chat._hidden_params.get('deployment', {}).get('model_info', {}).get('id', 'N/A')
        print(f"Routed chat completed using deployment ID: {dep_id}")
        print(f"Response: {response_chat.choices[0].message.content[:80]}...")
    except Exception as e: print(f"FAILED: {e}")

    # --- 2. Routed Embedding ---
    print("\n--- 2. Router Embedding ('prod-embed' group) ---")
    try:
        response_embed = await router.aembedding(
            model="prod-embed", # Target the embedding group
            input=["Text to embed via router."]
        )
        dep_id = response_embed._hidden_params.get('deployment', {}).get('model_info', {}).get('id', 'N/A')
        print(f"Routed embedding completed using deployment ID: {dep_id}")
        print(f"Received {len(response_embed.data)} embedding(s).")
    except Exception as e: print(f"FAILED: {e}")

    # --- 3. Routed Streaming Completion ---
    print("\n--- 3. Router Streaming Completion ('prod-chat' group) ---")
    try:
        stream = await router.acompletion(
            model="prod-chat",
            messages=[{"role": "user", "content": "Stream a short explanation of load balancing."}],
            stream=True, max_tokens=100
        )
        print("Streaming Output:")
        full_stream_text = ""
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
                full_stream_text += chunk.choices[0].delta.content
        print("\nStream finished.")
        # Check which deployment was used for the stream
        # Note: Accessing _hidden_params might require collecting the stream or specific handling
        # For simplicity, we assume it's accessible after stream completion if stored appropriately by LiteLLM
        # dep_id_stream = getattr(stream, '_hidden_params', {}).get('deployment', {}).get('model_info',{}).get('id','N/A')
        # print(f"(Stream likely used deployment: {dep_id_stream})") # This access pattern might change
    except Exception as e: print(f"\nStreaming FAILED: {e}")

    # --- 4. Force Request to a Specific Deployment ---
    target_deployment_id = "azure-gpt4o-eastus" # Use an ID from your model_list
    print(f"\n--- 4. Forcing Call to Specific Deployment: {target_deployment_id} ---")
    try:
        response_specific = await router.acompletion(
            model="prod-chat", # Still need group name for context, but override matters
            messages=[{"role": "user", "content": "Confirm you are the Azure deployment."}],
            specific_deployment=target_deployment_id # Force this deployment
        )
        dep_id = response_specific._hidden_params.get('deployment', {}).get('model_info', {}).get('id', 'N/A')
        print(f"Call forced to deployment ID: {dep_id} (Should match target)")
        print(f"Response: {response_specific.choices[0].message.content[:80]}...")
        assert dep_id == target_deployment_id
    except Exception as e: print(f"FAILED: {e}")

# asyncio.run(run_router_calls()) # Uncomment to run

Deployment Management

Dynamically add, update, or remove deployments from the router's model_list while the application is running.

  • Methods:
    • router.set_model_list(new_model_list: List[Dict]): Replaces the entire list. Use carefully.
    • router.add_deployment(deployment: Union[Dict, Deployment]): Adds one new deployment config (dict or litellm.Deployment object). Returns the added Deployment object or None.
    • router.delete_deployment(id: str): Removes a deployment by its unique model_info["id"]. Returns the deleted deployment dict or None.
    • router.upsert_deployment(deployment: Union[Dict, Deployment]): Adds if ID is new, updates if ID exists. Returns the added/updated Deployment object.

Example: Adding, Updating, and Deleting Deployments

import litellm
import os
from litellm.types.router import Deployment # Optional type hint

# --- Assume 'router' is initialized from the comprehensive example ---

if 'router' in locals() and isinstance(router, litellm.Router):
    print("\n--- Dynamic Deployment Management ---")
    print(f"Initial Groups: {router.get_model_names()}")
    print(f"Initial IDs: {router.get_model_ids()}")

    # --- 1. Add a new Cohere deployment to a new group ---
    cohere_deployment_dict = {
        "model_name": "cohere-prod", # New group
        "litellm_params": {"model": "command-r-plus", "api_key": os.getenv("COHERE_API_KEY")},
        "tpm": 90000, "rpm": 1000,
        "model_info": {"id": "cohere-cmdr-plus"}
    }
    print(f"\nAdding deployment '{cohere_deployment_dict['model_info']['id']}'...")
    try:
        added_dep = router.add_deployment(cohere_deployment_dict)
        if added_dep: print(f"  Success. New Groups: {router.get_model_names()}")
        else: print("  Failed (maybe ID exists?).")
    except Exception as e: print(f"  Error adding: {e}")

    # --- 2. Update an existing deployment's TPM/RPM ---
    target_update_id = "openai-gpt4o-1"
    print(f"\nUpdating deployment '{target_update_id}'...")
    try:
        current_dep = router.get_deployment(target_update_id)
        if current_dep:
            # Modify the Pydantic object directly
            current_dep.tpm = 75000
            current_dep.rpm = 600
            # Use upsert to apply changes
            updated_dep = router.upsert_deployment(current_dep)
            if updated_dep:
                print(f"  Success. Updated TPM: {updated_dep.tpm}, RPM: {updated_dep.rpm}")
            else: print("  Upsert failed.")
        else: print(f"  Deployment '{target_update_id}' not found.")
    except Exception as e: print(f"  Error updating: {e}")

    # --- 3. Delete a deployment ---
    target_delete_id = "azure-gpt4o-eastus"
    print(f"\nDeleting deployment '{target_delete_id}'...")
    try:
        deleted_dict = router.delete_deployment(id=target_delete_id)
        if deleted_dict:
            print(f"  Success. Deleted deployment model: {deleted_dict['litellm_params']['model']}")
            print(f"  Remaining IDs: {router.get_model_ids()}")
        else: print(f"  Deployment '{target_delete_id}' not found.")
    except Exception as e: print(f"  Error deleting: {e}")

else:
    print("\nRouter object not found. Skipping dynamic management examples.")

Getting Router Information

Inspect the router's configuration, health status (requires Redis), usage metrics (requires Redis), and deployment details.

  • Methods:
    • router.get_model_list(model_name=None) -> List[Dict]: Get raw deployment dicts (optionally filtered by group).
    • router.get_model_names() -> List[str]: Get list of unique group names.
    • router.get_model_ids(model_name=None) -> List[str]: Get list of unique deployment IDs (optionally filtered).
    • router.get_deployment(model_id: str) -> Optional[Deployment]: Get the full Deployment object for an ID.
    • router.get_model_group_info(model_group: str) -> Optional[ModelInfo]: Get aggregated info (max tokens, total capacity, providers, features) for a group.
    • router.get_available_deployments(model_group: str) -> List[Deployment]: Get list of healthy deployments in a group. Requires Redis.
    • router.get_model_group_usage(model_group: str) -> Tuple[int, int]: (Async) Get estimated current (tpm, rpm) usage for a group. Requires Redis & recent activity.
    • router.get_settings() -> Dict: Get router's init config settings (strategy, retries, redis settings, etc.).

Example: Comprehensive Router Inspection

import litellm
import asyncio

# --- Assume 'router' is initialized from the comprehensive example ---

if 'router' in locals() and isinstance(router, litellm.Router):
    print("\n--- Comprehensive Router Inspection ---")

    # 1. Settings
    settings = router.get_settings()
    print("\nRouter Settings:")
    print(f"  Strategy: {settings.get('routing_strategy')}")
    print(f"  Retries: {settings.get('num_retries')}, Timeout: {settings.get('timeout')}s")
    print(f"  Cooldown Time: {settings.get('cooldown_time')}s, Allowed Fails: {settings.get('allowed_fails')}")
    print(f"  Redis Connected: {bool(settings.get('redis_host') or settings.get('redis_url'))}")
    print(f"  Cache Enabled: {settings.get('cache_responses')}")
    print(f"  Health Checks Enabled: {settings.get('enable_health_checks')}")

    # 2. Groups and Deployments
    groups = router.get_model_names()
    print(f"\nModel Groups: {groups}")
    for group in groups:
        print(f"\nDetails for Group: '{group}'")
        group_ids = router.get_model_ids(model_name=group)
        print(f"  Deployment IDs: {group_ids}")
        try:
            group_info = router.get_model_group_info(group)
            if group_info:
                 print(f"  Aggregated Info:")
                 print(f"    Max Input Tokens: {group_info.max_input_tokens}") # type: ignore
                 print(f"    Total TPM: {group_info.tpm}, Total RPM: {group_info.rpm}") # type: ignore
                 print(f"    Providers: {group_info.providers}") # type: ignore
            else: print("    Could not get aggregated info.")
        except Exception as e: print(f"    Error getting group info: {e}")

        # Check available deployments (requires Redis)
        try:
            available = router.get_available_deployments(group)
            print(f"  Available (Healthy) Deployments: {[d.model_info.id for d in available]} (Requires Redis state)")
        except Exception as e: print(f"    Could not get available deployments (Requires Redis state): {e}")

    # 3. Specific Deployment Detail
    test_id = "openai-gpt4o-1" # Use an ID from your list
    print(f"\nDetails for Specific Deployment ID: '{test_id}'")
    deployment = router.get_deployment(test_id)
    if deployment:
        print(f"  Model Name (Group): {deployment.model_name}")
        print(f"  Underlying Model: {deployment.litellm_params['model']}")
        print(f"  Configured TPM: {deployment.tpm}, RPM: {deployment.rpm}")
    else: print(f"  Deployment '{test_id}' not found.")

    # 4. Check Usage (Async, Requires Redis & Activity)
    async def check_all_usage():
        print("\nChecking Current Group Usage (Async, Requires Redis & Activity)...")
        usage_results = {}
        for group in router.get_model_names():
            try:
                usage = await router.get_model_group_usage(group)
                usage_results[group] = usage
                print(f"  Group '{group}': Current Usage (TPM={usage[0]}, RPM={usage[1]})")
            except Exception as e:
                print(f"  Could not get usage for group '{group}': {e}")
        return usage_results
    # usage_data = asyncio.run(check_all_usage()) # Uncomment to run

else:
    print("\nRouter object not found. Skipping inspection examples.")

Advanced Router Features

  • router.flush_cache(): Clears the Redis response cache if cache_responses=True.

    # if router.cache_responses:
    #    print("Flushing router cache...")
    #    router.flush_cache()
    
  • router.reset(): Resets internal state stored in Redis (cooldowns, latency EWMA, usage counters). Does not affect the model_list configuration.

    # print("Resetting router state...")
    # router.reset() # Requires Redis to have effect
    
  • router.set_custom_routing_strategy(strategy_name: str, strategy_func: Callable): Register your own Python function to handle deployment selection logic; a minimal sketch follows this list.

    • strategy_func signature: def my_strategy(model_group: str, healthy_deployments: List[Deployment], **kwargs) -> Deployment:
    • The function must return one of the healthy_deployments. kwargs may contain context from the original call.
    • Register: router.set_custom_routing_strategy("my_logic", my_strategy_function)
    • Use: Initialize a new router instance with routing_strategy="my_logic".
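
The following is a minimal sketch of a custom strategy, assuming the signature described above (model_group, healthy_deployments, **kwargs). It prefers the deployment with the highest declared tpm and breaks ties randomly; the function name, strategy name, and commented registration flow mirror the steps listed above and are illustrative, not a fixed API.

import random
import litellm
from typing import List
from litellm.types.router import Deployment  # type used in the signature described above

def prefer_high_capacity(model_group: str, healthy_deployments: List[Deployment], **kwargs) -> Deployment:
    """Select the healthy deployment with the highest declared TPM; random tie-break."""
    if not healthy_deployments:
        raise ValueError(f"No healthy deployments available for group '{model_group}'")
    max_tpm = max((d.tpm or 0) for d in healthy_deployments)
    candidates = [d for d in healthy_deployments if (d.tpm or 0) == max_tpm]
    return random.choice(candidates)

# Register the strategy, then initialize a router that uses it
# (assumes 'deployments' from the earlier initialization example):
# router.set_custom_routing_strategy("prefer-high-capacity", prefer_high_capacity)
# capacity_router = litellm.Router(model_list=deployments, routing_strategy="prefer-high-capacity")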

Router and Assistants API

LiteLLM's Router can manage calls to the OpenAI Assistants API (and potentially Azure OpenAI Assistants). When you use router methods like router.create_run, router.add_message, etc., the routing logic applies based on the model specified in the Assistant object itself, or on a model override passed when creating the Run.

  • Configuration: Define deployments in the model_list for the models your Assistants will use (e.g., gpt-4-turbo, azure/my-assistants-model). Give them appropriate group names (model_name).
  • Usage:

    # 1. Create Assistant (can be done outside router or via router)
    # assistant = litellm.create_assistants(model="gpt-4-turbo", ...)
    # assistant_id = assistant.id
    
    # 2. Create Thread, Add Message (can use router or litellm.*)
    # thread = router.create_thread()
    # router.add_message(thread_id=thread.id, ...)
    
    # 3. Create Run using the Router
    # The router looks at the assistant's model ("gpt-4-turbo" in this case),
    # finds the corresponding group name(s) in its model_list,
    # and applies the routing strategy to select a deployment for this run.
    # run = router.create_run(
    #     thread_id=thread.id,
    #     assistant_id=assistant_id
    #     # model="my-routed-group" # Optional: Override assistant's model with a router group
    # )
    

This allows you to have failover, load balancing, etc., for your Assistant interactions if you have multiple suitable deployments configured in the router.

7. Handling Exceptions (litellm.exceptions.*)

Robust error handling is critical. LiteLLM provides a hierarchy of custom exceptions, often mirroring OpenAI's structure and adding useful context like llm_provider and model. Catching specific exceptions allows for tailored recovery strategies.

Exception Hierarchy Overview

  • litellm.exceptions.LiteLLMException (Base class for most LiteLLM specific errors)
  • litellm.exceptions.APIError (Inherits LiteLLMException, base for provider API errors)
    • AuthenticationError (HTTP 401/403 - Invalid Key/Credentials)
    • PermissionDeniedError (HTTP 403 - Key valid but lacks permission)
    • BadRequestError (HTTP 400 - Malformed request/params)
      • ContextWindowExceededError (Input too long)
      • InvalidRequestError (General invalid structure/params)
      • ContentPolicyViolationError (Blocked by safety filters)
      • JSONSchemaValidationError (Tool arguments invalid)
      • UnsupportedParamsError (drop_params=False and unsupported param passed)
    • RateLimitError (HTTP 429 - RPM/TPM limit exceeded)
    • NotFoundError (HTTP 404 - Model/Resource not found)
    • ConflictError (HTTP 409 - Resource conflict, e.g., already exists)
    • UnprocessableEntityError (HTTP 422 - Understandable but unprocessable, e.g., bad tool args structure)
    • InternalServerError (HTTP 500 - Provider server-side error)
    • APIConnectionError (Network issue connecting to the API, e.g., DNS or connection failure; treated as ~HTTP 500)
    • Timeout (HTTP 408 - Request timed out)
    • ServiceUnavailableError (HTTP 503 - Provider overloaded/unavailable)
  • litellm.exceptions.BudgetExceededError (Client-side budget limit reached)
  • litellm.exceptions.RejectedRequestError (Blocked by pre-call hook/guardrail)

Detailed Descriptions

  • AuthenticationError: Check API keys, credentials, potential permission issues. Ensure keys match the target provider/model. For Azure, check key, base URL, and version match the deployment.
  • PermissionDeniedError: Key is valid but doesn't have access to the specific model or action. Check API key permissions in the provider's dashboard.
  • RateLimitError: Implement exponential backoff and retry logic. Consider distributing load with litellm.Router or requesting quota increases from the provider. Check Retry-After headers if available on the exception object.
  • ContextWindowExceededError: The prompt (messages + potentially tools) exceeds the model's maximum token limit. Use litellm.utils.token_counter to check length beforehand and litellm.utils.trim_messages to shorten the history before retrying, or use litellm.context_window_fallbacks. The exception carries input_tokens and max_tokens attributes. (See the trim-and-retry sketch after this list.)
  • InvalidRequestError/BadRequestError: Usually indicates a problem with the parameters sent (wrong type, invalid value, missing required field). Check the error message details and the API documentation for the specific model/provider.
  • ContentPolicyViolationError: The input or generation triggered the provider's safety filters. Review the content, adjust prompts, or use litellm.content_policy_fallbacks. The error might contain details on the category flagged.
  • NotFoundError: Double-check the model string for typos. Ensure the model exists, is available in the specified region (for Azure/Bedrock/Vertex), and that the API key has access rights.
  • Timeout: Increase the timeout parameter (globally via litellm.request_timeout or per-call), check network connectivity, or consider if the request is too complex for the default timeout. Retrying might help for transient issues.
  • APIConnectionError: Usually client-side network issues (DNS, firewall, proxy, unreachable endpoint). Check network settings and ensure the api_base (if used) is correct and reachable.
  • ServiceUnavailableError/InternalServerError: Issues on the provider's end. Usually transient. Retrying with backoff is the best strategy. Check the provider's status page.
  • BudgetExceededError: Raised client-side by litellm.max_budget or litellm.BudgetManager. Stop sending requests for the affected user/scope or increase the budget.
  • UnsupportedParamsError: Remove the unsupported parameter(s) mentioned in the error message from your call, or set litellm.drop_params = True globally if you want LiteLLM to silently ignore them.
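
Example: Trim-and-Retry on Context Window Errors

A minimal sketch of the ContextWindowExceededError recovery pattern described above, using litellm.utils.token_counter and litellm.utils.trim_messages as mentioned in that bullet. The helper function, retry flow, and model name are illustrative.

import litellm
from litellm.utils import token_counter, trim_messages

def complete_with_trimming(model: str, messages: list, **kwargs):
    """Attempt a completion; on ContextWindowExceededError, trim the history and retry once."""
    print(f"Prompt size: {token_counter(model=model, messages=messages)} tokens")
    try:
        return litellm.completion(model=model, messages=messages, **kwargs)
    except litellm.exceptions.ContextWindowExceededError:
        trimmed = trim_messages(messages, model=model)  # drop oldest turns to fit the model's limit
        print(f"Trimmed to {token_counter(model=model, messages=trimmed)} tokens; retrying once...")
        return litellm.completion(model=model, messages=trimmed, **kwargs)

# long_history = [{"role": "user", "content": "..."}]  # a conversation that may exceed the limit
# response = complete_with_trimming("gpt-3.5-turbo", long_history, max_tokens=100)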

Comprehensive Handling Example

import litellm
import time
import traceback
import logging

# Configure logging for errors
error_logger = logging.getLogger("LiteLLM_Error_Handler")
logging.basicConfig(level=logging.INFO)

def make_robust_call(call_func, *args, max_retries=2, initial_backoff=1, **kwargs):
    """Wrapper to handle common LiteLLM errors with retries and backoff."""
    retries = 0
    backoff = initial_backoff
    while retries <= max_retries:
        try:
            # Make the actual LiteLLM call (e.g., litellm.completion)
            result = call_func(*args, **kwargs)
            error_logger.info(f"Call successful on attempt {retries + 1}.")
            return result # Success!
        # --- Specific Retryable Errors ---
        except (litellm.exceptions.RateLimitError,
                litellm.exceptions.Timeout,
                litellm.exceptions.APIConnectionError,
                litellm.exceptions.ServiceUnavailableError,
                litellm.exceptions.InternalServerError) as e:
            error_logger.warning(f"Attempt {retries + 1}/{max_retries + 1} failed: Caught retryable error {type(e).__name__}: {e}")
            if retries == max_retries:
                error_logger.error("Max retries exceeded.")
                raise e # Re-raise the exception after final retry
            error_logger.info(f"Retrying in {backoff:.2f} seconds...")
            time.sleep(backoff)
            retries += 1
            backoff *= 2 # Exponential backoff
        # --- Non-Retryable Errors (Log and re-raise or handle) ---
        except litellm.exceptions.AuthenticationError as e:
            error_logger.critical(f"CRITICAL: Authentication Error - Check credentials! {e}")
            raise e
        except litellm.exceptions.PermissionDeniedError as e:
            error_logger.critical(f"CRITICAL: Permission Denied - Check API key permissions! {e}")
            raise e
        except litellm.exceptions.ContextWindowExceededError as e:
            error_logger.error(f"Context Window Exceeded for model {e.model}. Input tokens: {e.input_tokens}, Max: {e.max_tokens}. {e}")
            # Potential Action: Trigger message trimming logic here before raising/returning
            raise e # Or return a specific signal to trim and retry outside
        except litellm.exceptions.ContentPolicyViolationError as e:
             error_logger.warning(f"Content Policy Violation - Request blocked. {e}")
             raise e # Or handle based on policy (e.g., return safe response)
        except litellm.exceptions.BadRequestError as e: # Catch InvalidRequest, UnsupportedParams etc.
             error_logger.error(f"Bad Request Error (400 level) - Check parameters/input format. {type(e).__name__}: {e}")
             raise e
        except litellm.exceptions.NotFoundError as e:
            error_logger.error(f"Not Found Error (404) - Check model name/endpoint/permissions. {e}")
            raise e
        except litellm.exceptions.BudgetExceededError as e:
            error_logger.warning(f"Budget Exceeded - Call blocked. {e}")
            raise e
        # --- Catch other LiteLLM specific errors ---
        except litellm.exceptions.LiteLLMException as e:
             error_logger.error(f"Caught other LiteLLM Exception: {type(e).__name__}: {e}")
             raise e
        # --- Catch any other unexpected exceptions ---
        except Exception as e:
            error_logger.exception(f"Caught UNEXPECTED non-LiteLLM error during API call:") # Logs traceback
            raise e # Re-raise unexpected errors

# --- Example Usage of the Wrapper ---
print("\n--- Testing Robust Call Wrapper ---")
# Simulate a call that might initially face a rate limit
try:
     result = make_robust_call(
         litellm.completion, # Pass the function to call
         model="gpt-3.5-turbo", # Arguments for the function
         messages=[{"role": "user", "content": "Test robust call"}],
         max_tokens=10,
         # For real use, remove mock_response. A plain string mock always succeeds, so it
         # cannot exercise the retry path; to test retries, patch the call (e.g., with
         # unittest.mock) so it raises RateLimitError on the first attempts, then succeeds.
         mock_response="Success after simulated retries!"  # Simplified mock
     )
     # print("Wrapper call successful:", result.choices[0].message.content)
except Exception as e:
     print(f"Wrapper call ultimately failed: {type(e).__name__}")

# Simulate a non-retryable error
try:
    make_robust_call(
        litellm.completion, model="gpt-4", messages=[{"role":"user","content":"test"}],
        mock_response=litellm.exceptions.AuthenticationError("Wrapper test bad key.")
    )
except litellm.exceptions.AuthenticationError:
    print("Wrapper correctly handled and re-raised non-retryable AuthenticationError.")
except Exception as e:
    print(f"Wrapper test failed unexpectedly: {type(e).__name__}")

Explanation: The wrapper above demonstrates a realistic error-handling flow: transient errors (rate limits, timeouts, connection and server errors) are retried with exponential backoff, while non-retryable errors are logged by category and re-raised so the caller can decide how to recover.

The remaining sections cover Cost Calculation, Client-Side Budget Management, Utilities, and Constants.

8. Cost Calculation (litellm.*)

LiteLLM provides utilities to estimate the monetary cost (in USD) of LLM API calls based on token usage and per-token pricing information. This pricing is primarily sourced from LiteLLM's internal mapping (litellm.model_cost, often loaded from bundled JSON data) but can be dynamically updated or overridden.

litellm.completion_cost(...) -> float

This is the primary function for estimating the cost of a completed LLM call (chat completion, embedding generation, etc.). It intelligently determines the model, token counts, and applies the corresponding pricing.

  • Logic:
    1. Identify Model: Determines the model name (e.g., "gpt-4-turbo") either directly from the completion_response object or the model parameter.
    2. Determine Token Counts: Extracts prompt_tokens and completion_tokens from the completion_response.usage attribute. If not available, it calculates them using litellm.utils.token_counter based on the provided text inputs (prompt, messages, completion, etc.). If explicit prompt_tokens and completion_tokens are passed, they are used directly. For embeddings (call_type="embedding"), typically only prompt_tokens are relevant.
    3. Find Pricing: Looks up the input_cost_per_token and output_cost_per_token for the identified model in litellm.model_cost. This lookup can be bypassed by providing custom_cost_per_token.
    4. Calculate Cost: Computes (prompt_tokens * input_cost_per_token) + (completion_tokens * output_cost_per_token). For embeddings, it's typically just prompt_tokens * input_cost_per_token.
  • Exhaustive Parameters:
    • completion_response (Union[litellm.ModelResponse, litellm.EmbeddingResponse, litellm.ImageResponse, litellm.TextCompletionResponse, Dict], optional): (Recommended Input) The response object returned by a successful LiteLLM API call. If provided, LiteLLM attempts to automatically extract model, usage.prompt_tokens, and usage.completion_tokens.
    • model (str, optional): The model identifier string (e.g., "gpt-4o", "azure/my-deploy"). Required if completion_response is not provided or lacks necessary info. Used for both pricing lookup and potentially internal token counting if text is provided.
    • Text Inputs (Used if token counts not in completion_response and explicit counts not given):
      • prompt (str, optional): Input text for legacy text_completion.
      • completion (str, optional): Output text from legacy text_completion.
      • messages (List[Dict], optional): Input messages for completion.
      • input_text (Union[str, List[str]], optional): Input(s) for embedding.
      • output_text (str, optional): Output text if completion_response structure is non-standard or only text is available.
    • Explicit Token Counts (Overrides other methods):
      • prompt_tokens (int, optional): Manually provide the number of input tokens.
      • completion_tokens (int, optional): Manually provide the number of generated output tokens.
    • call_type (Literal["completion", "embedding", "image_generation", "text_completion"], optional, default="completion"): Specifies the type of API call, important as cost structures differ (e.g., embeddings typically only charge for input). Inferred if completion_response is provided.
    • custom_llm_provider (str, optional): Explicitly specify the provider (e.g., "openai", "azure") to help LiteLLM resolve the model name to the correct pricing entry, especially if using aliases or non-standard model strings.
    • custom_cost_per_token (Dict[str, float], optional): A dictionary to override LiteLLM's internal pricing map for this specific calculation. Format: {"input_cost_per_token": float, "output_cost_per_token": float}. Essential for models not in LiteLLM's map or for using custom pricing tiers.
    • custom_cost_per_second (float, optional): For experimental support of models priced per second of processing time rather than tokens. Provide cost in USD per second.
  • Returns: float: The estimated cost of the API call in USD. Some versions return 0.0 instead of raising when pricing cannot be determined.
  • Raises: litellm.exceptions.NotFoundError (or, depending on version, a generic Exception) if the model cannot be found in litellm.model_cost and custom_cost_per_token is not supplied.

litellm.cost_per_token(...) -> Tuple[float, float]

A lower-level utility that directly calculates prompt cost and completion cost based only on provided token counts and the model's known price per token.

  • Purpose: Calculate cost components when precise token counts are already known (e.g., from external counters, provider billing APIs) without needing text input or response objects.
  • Parameters:
    • model (str): Required. Model identifier string for price lookup.
    • prompt_tokens (int): Required. Number of input tokens.
    • completion_tokens (int): Required. Number of output tokens.
    • custom_llm_provider (str, optional): Explicit provider name override.
    • custom_cost_per_token (Dict[str, float], optional): Override internal pricing map with {"input_cost_per_token": ..., "output_cost_per_token": ...}.
  • Returns: Tuple[float, float]: A tuple containing (prompt_cost_in_usd, completion_cost_in_usd).
  • Raises: litellm.exceptions.NotFoundError (or, depending on version, a generic Exception) if model pricing isn't found and no custom cost is given.

litellm.response_cost_calculator(...) -> Optional[float]

(Internal Utility) This function is primarily designed for internal use, especially within LiteLLM's built-in logging callbacks (like the "cost" callback string). It acts as a wrapper around litellm.completion_cost, attempting to extract all necessary arguments (response_object, model, call_type, token counts, etc.) directly from the kwargs dictionary passed into the callback functions.

  • Users should generally prefer calling litellm.completion_cost directly in application code for clearer control over inputs. Knowing this function exists, however, explains how LiteLLM's callbacks can automatically attach cost information (kwargs["response_cost"]) to the callback arguments, as sketched below.
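
In practice this means a custom success callback can read the pre-computed cost instead of recalculating it. Below is a minimal, hedged sketch: the four-argument callback signature and the "response_cost" key follow LiteLLM's documented custom-callback pattern, but treat the exact behavior as version-dependent.

import litellm

def log_cost_callback(kwargs, completion_response, start_time, end_time):
    # LiteLLM's logging layer attaches the computed cost to kwargs["response_cost"] for
    # successful calls with known pricing; fall back to completion_cost() otherwise.
    cost = kwargs.get("response_cost")
    if cost is None:
        try:
            cost = litellm.completion_cost(completion_response=completion_response)
        except Exception:
            cost = 0.0
    print(f"Call to {kwargs.get('model')} cost ~${cost:.6f}")

litellm.success_callback = [log_cost_callback]

# response = litellm.completion(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": "hi"}],
# )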

Comprehensive Cost Calculation Examples

import litellm
import time  # needed for time.time() in the mock responses below
from litellm import ModelResponse, EmbeddingResponse, Usage, Choices, Message, Embedding # For mock objects
from typing import Optional, Dict, Tuple, List, Union

print("--- Comprehensive Cost Calculation Examples ---")

# --- 1. Cost from CompletionResponse (Recommended) ---
print("\n1. Cost from CompletionResponse object:")
try:
    mock_resp_compl = ModelResponse(
        model="gpt-4o", # Use a known model with pricing
        usage=Usage(prompt_tokens=15000, completion_tokens=2000),
        choices=[Choices(message=Message(content="..."), index=0, finish_reason="stop")],
        id="cmpl-xyz", created=int(time.time()), object="chat.completion"
    )
    cost1 = litellm.completion_cost(completion_response=mock_resp_compl)
    print(f"  Cost for '{mock_resp_compl.model}' response: ${cost1:.6f}")
    # Manual Check GPT-4o ($5/$15 per 1M): (15000/1M * 5) + (2000/1M * 15) = 0.075 + 0.03 = $0.105
    print(f"  Manual Check: Expected ~${0.105:.6f}")
except Exception as e: print(f"  Error: {e}")

# --- 2. Cost from EmbeddingResponse ---
print("\n2. Cost from EmbeddingResponse object:")
try:
    mock_resp_embed = EmbeddingResponse(
        model="text-embedding-ada-002", # OpenAI Ada v2 ($0.10 / 1M)
        usage=Usage(prompt_tokens=500000, total_tokens=500000),
        data=[Embedding(embedding=[0.1]*1536, index=0, object="embedding")], object="list"
    )
    # Explicitly set call_type for clarity, though it might be inferred
    cost2 = litellm.completion_cost(completion_response=mock_resp_embed, call_type="embedding")
    print(f"  Cost for '{mock_resp_embed.model}' response: ${cost2:.6f}")
    # Manual Check: (500000 / 1000000 * 0.10) = $0.05
    print(f"  Manual Check: Expected ~${0.050000:.6f}")
except Exception as e: print(f"  Error: {e}")

# --- 3. Cost from Explicit Token Counts (using completion_cost) ---
print("\n3. Cost from Explicit Tokens (using completion_cost):")
model3 = "mistral/mistral-large-latest" # Mistral Large ($8/$24 per 1M)
p_tokens3, c_tokens3 = 30000, 5000
try:
    cost3 = litellm.completion_cost(
        model=model3, prompt_tokens=p_tokens3, completion_tokens=c_tokens3
    )
    print(f"  Cost for '{model3}' ({p_tokens3}p + {c_tokens3}c tokens): ${cost3:.6f}")
    # Manual Check: (30000/1M * 8) + (5000/1M * 24) = 0.24 + 0.12 = $0.36
    print(f"  Manual Check: Expected ~${0.360000:.6f}")
except Exception as e: print(f"  Error: {e}")

# --- 4. Cost from Explicit Token Counts (using cost_per_token) ---
print("\n4. Cost Components from Explicit Tokens (using cost_per_token):")
try:
    prompt_cost4, completion_cost4 = litellm.cost_per_token(
        model=model3, prompt_tokens=p_tokens3, completion_tokens=c_tokens3
    )
    total_cost4 = prompt_cost4 + completion_cost4
    print(f"  Components for '{model3}' ({p_tokens3}p + {c_tokens3}c tokens):")
    print(f"    Prompt Cost: ${prompt_cost4:.6f}")     # Expected: 0.24
    print(f"    Completion Cost: ${completion_cost4:.6f}") # Expected: 0.12
    print(f"    Total Cost: ${total_cost4:.6f}")         # Expected: 0.36
    assert abs(total_cost4 - cost3) < 1e-9 # Should match completion_cost result
except Exception as e: print(f"  Error: {e}")

# --- 5. Cost from Text Inputs (Internal Token Counting) ---
print("\n5. Cost from Text Inputs (requires token counting):")
model5 = "gpt-3.5-turbo"
messages5 = [{"role": "user", "content": "Calculate cost based on this message text." * 5}]
completion_text5 = "This is the simulated response text for cost calculation." * 3
try:
    # Provide model, messages, and output text
    cost5 = litellm.completion_cost(
        model=model5, messages=messages5, output_text=completion_text5
    )
    # Verify with manual token count + cost_per_token (optional)
    prompt_t5 = litellm.token_counter(model=model5, messages=messages5)
    completion_t5 = litellm.token_counter(model=model5, text=completion_text5)
    cost5_manual = sum(litellm.cost_per_token(model=model5, prompt_tokens=prompt_t5, completion_tokens=completion_t5))
    print(f"  Cost for '{model5}' based on text inputs: ${cost5:.8f}")
    print(f"  Manual Check: Tokens=(P:{prompt_t5}, C:{completion_t5}), Cost=${cost5_manual:.8f}")
    assert abs(cost5 - cost5_manual) < 1e-9
except Exception as e: print(f"  Error: {e}")

# --- 6. Cost with Custom Pricing Override ---
print("\n6. Cost with Custom Pricing Override:")
model6 = "my-company/special-finetune-v3" # Does not need to be in litellm.model_cost
custom_pricing = {"input_cost_per_token": 0.000002, "output_cost_per_token": 0.000004} # $2/$4 per 1M
p_tokens6, c_tokens6 = 50000, 10000
try:
    cost6 = litellm.completion_cost(
        model=model6, # Model name still useful context, but price is overridden
        prompt_tokens=p_tokens6, completion_tokens=c_tokens6,
        custom_cost_per_token=custom_pricing
    )
    print(f"  Cost for '{model6}' with custom pricing: ${cost6:.6f}")
    # Manual Check: (50000 * 0.000002) + (10000 * 0.000004) = 0.1 + 0.04 = $0.14
    print(f"  Manual Check: Expected $0.140000")
except Exception as e: print(f"  Error: {e}")

# --- 7. Cost Calculation Failure (Unknown Model) ---
print("\n7. Cost Calculation Failure (Unknown Model):")
try:
    cost7 = litellm.completion_cost(model="completely-unknown-model-xyz", prompt_tokens=100, completion_tokens=10)
    print(f"  Cost calculated (unexpected): ${cost7}") # Should not reach here unless model exists
except litellm.exceptions.NotFoundError as e:
    print(f"  Caught expected NotFoundError: {e}")
except Exception as e: print(f"  Caught unexpected error: {e}")

9. Client-Side Budget Management (litellm.BudgetManager)

Provides a class for basic, client-side tracking and enforcement of spending limits per user or project. Data is typically stored locally in a JSON file (user_cost.json).

❗ Crucial Distinction: The BudgetManager is intended for local development, testing, simple scripts, or single-instance applications. It is NOT suitable for production environments requiring robust, persistent, scalable, or multi-user/multi-instance budget control. For those scenarios, use the LiteLLM Proxy Server, which offers far more advanced and reliable budgeting features managed centrally.

Initialization (__init__)

  • litellm.BudgetManager(project_name: Optional[str]=None, client_type: str = "local", api_base: Optional[str] = None, headers: Optional[dict] = None)
    • project_name (str, optional): An identifier for the budget scope. Primarily relevant if using the experimental client_type="hosted" to interact with a (usually custom) backend service for budget management. For the default "local" client, this parameter currently doesn't significantly affect behavior or filenames.
    • client_type (str, default="local"): Determines storage backend.
      • "local": Stores budget data (user budgets, current costs, reset times) in a local JSON file named user_cost.json in the current working directory. Data persists between script runs if the file is not deleted. Prone to race conditions if multiple processes access the same file.
      • "hosted": (Experimental/Uncommon) Intended to interact with a remote HTTP API (specified by api_base and headers) for budget operations. Requires a compatible backend service; not a standard LiteLLM feature.
    • api_base (str, optional): Endpoint URL for the budget management API (only used if client_type="hosted").
    • headers (dict, optional): Authentication headers for the budget management API (only used if client_type="hosted").

Exhaustive Method Descriptions

(Operates on the data stored in user_cost.json for client_type="local")

  • Budget Creation & Info:
    • create_budget(total_budget: float, user: str, duration: Optional[Literal["daily", "weekly", "monthly", "yearly"]] = None, created_at: Optional[datetime.datetime] = None, budget_id: Optional[str] = None): Creates or updates the budget entry for a user. If the user exists, updates total_budget, duration, etc.
      • total_budget: Spending limit in USD.
      • user: Unique user ID string.
      • duration: If set (e.g., "daily"), enables time-based automatic cost resets via reset_on_duration or update_budget_all_users.
      • created_at: UTC datetime object setting the budget's start time (for duration resets). Defaults to now if None.
      • budget_id: Optional custom ID for the budget entry.
    • get_total_budget(user: str) -> float: Returns the allocated total_budget for the user (0.0 if user not found).
    • is_valid_user(user: str) -> bool: Checks if a budget entry exists for the user.
    • get_users() -> List[str]: Returns a list of all user IDs tracked in the budget file.
    • get_budget(user: str, budget_id: Optional[str] = None) -> Optional[Dict]: Retrieves the full budget dictionary for a user (including total_budget, current_cost, duration, created_at, last_reset_at).
  • Cost Tracking & Updates:
    • update_cost(user: str, cost: float = 0.0, completion_obj=None, model=None, input_text=None, output_text=None, call_type="completion", prompt_tokens=None, completion_tokens=None) -> bool: Adds cost to the user's current_cost. Either provide the calculated cost directly, or provide enough info for litellm.completion_cost to calculate it (completion_obj or model + text/tokens). Saves data asynchronously to file. Returns True on success, False if user not found.
    • check_cost_and_update(user: str, potential_cost: float = 0.0, completion_obj=None, model=None, input_text=None, output_text=None, call_type="completion", prompt_tokens=None, completion_tokens=None) -> bool: Atomically checks if adding the potential_cost (calculated if not provided directly) would exceed the budget. If not, adds the cost to current_cost, saves, and returns True. If it exceeds, returns False without updating the cost. Recommended check before making calls.
    • get_current_cost(user: str) -> float: Returns the user's current accumulated spending since the last reset.
    • projected_cost(model: str, messages: list, user: str) -> float: Estimates the user's total cost if the prompt cost of the next potential call were added to their current cost (see the short sketch after this list).
    • get_model_cost(model_name: str) -> Optional[Dict[str, float]]: Retrieves stored cost-per-token info for a model (if populated during previous update_cost calls).
  • Resetting Costs:
    • reset_cost(user: str): Manually sets the user's current_cost back to 0.0 and saves.
    • reset_on_duration(user: str): Checks if the user has a duration set and if that period has elapsed since last_reset_at (or created_at). If so, resets current_cost to 0.0, updates last_reset_at, saves, and returns True. Otherwise returns False.
    • update_budget_all_users(): Iterates through all users and calls reset_on_duration for each. Intended to be called periodically (e.g., via a daily cron job or scheduler).
  • Data Persistence:
    • save_data(): Manually triggers saving the current in-memory budget state to the user_cost.json file. Note that update_cost and reset methods usually trigger saves asynchronously already.
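
The exhaustive example below exercises most of these methods but not projected_cost; here is a minimal sketch of the intended pre-call check (argument names follow the description above; treat exact behavior as version-dependent).

import litellm

budget_manager = litellm.BudgetManager(project_name="projected_cost_demo")
budget_manager.create_budget(total_budget=0.05, user="demo_user")

messages = [{"role": "user", "content": "Summarize the quarterly report in three bullet points."}]
try:
    projected = budget_manager.projected_cost(model="gpt-3.5-turbo", messages=messages, user="demo_user")
    if projected <= budget_manager.get_total_budget("demo_user"):
        print(f"Projected total spend ${projected:.6f} stays within budget - OK to call.")
    else:
        print("Projected spend would exceed the budget - block the call or pick a cheaper model.")
except Exception as e:
    print(f"Could not project cost: {e}")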

Exhaustive Budget Management Example

import litellm
import time
import os
import datetime
import uuid
import json # To inspect the file
import traceback
from typing import Optional

# --- Setup ---
# The local BudgetManager persists its data to user_cost.json in the current working
# directory; remove any previous file so each run starts from a clean state.
budget_file_path = "user_cost.json"
if os.path.exists(budget_file_path): os.remove(budget_file_path)

print("--- Exhaustive Budget Manager Example ---")
budget_manager = litellm.BudgetManager(project_name="exhaustive_demo")
print(f"BudgetManager initialized (local storage file: {budget_file_path})")

# --- Define Users and Budgets ---
user_daily = f"user_daily_{uuid.uuid4().hex[:6]}"
user_monthly = f"user_monthly_{uuid.uuid4().hex[:6]}"
user_unlimited = f"user_unlimited_{uuid.uuid4().hex[:6]}"

print("\n--- Creating Budgets ---")
# Daily Budget: Starts Yesterday
start_daily = litellm.utils.get_utc_datetime() - datetime.timedelta(days=1)
budget_manager.create_budget(user=user_daily, total_budget=0.02, duration="daily", created_at=start_daily)
print(f"Created Daily Budget ($0.02) for {user_daily}, starting {start_daily.date()}")

# Monthly Budget
budget_manager.create_budget(user=user_monthly, total_budget=1.50, duration="monthly")
print(f"Created Monthly Budget ($1.50) for {user_monthly}")

# Unlimited Budget (High value, effectively no limit for demo)
budget_manager.create_budget(user=user_unlimited, total_budget=9999.00)
print(f"Created 'Unlimited' Budget ($9999.00) for {user_unlimited}")

print(f"\nTracked Users: {budget_manager.get_users()}")

# --- Simulate Spending & Checks ---
print("\n--- Simulating Spending ---")

def simulate_call(user, model, p_tokens, c_tokens):
    """Simulates a call and updates budget, returning if allowed."""
    print(f"\nSimulating call for {user} (Model: {model}, P: {p_tokens}, C: {c_tokens})")
    cost = 0.0
    allowed = False
    try:
        cost = litellm.completion_cost(model=model, prompt_tokens=p_tokens, completion_tokens=c_tokens)
        print(f"  Estimated Cost: ${cost:.6f}")
        print(f"  Current Cost Before: ${budget_manager.get_current_cost(user):.6f} / ${budget_manager.get_total_budget(user):.2f}")
        # Use check_cost_and_update
        if budget_manager.check_cost_and_update(user=user, potential_cost=cost):
             print(f"  ACTION: Call ALLOWED. Cost Updated.")
             allowed = True
        else:
             print(f"  ACTION: Call BLOCKED (Budget Exceeded). Cost Not Updated.")
             allowed = False
        print(f"  Current Cost After: ${budget_manager.get_current_cost(user):.6f}")
    except litellm.exceptions.NotFoundError:
        print(f"  WARNING: Cannot calculate cost for model {model}, budget not updated.")
        allowed = True  # Allow the call if cost is unknown; the alternative is to block it
    except Exception as e:
        print(f"  ERROR during budget check/update: {e}")
        allowed = False # Block on error
    time.sleep(0.1) # Allow async save potential time
    return allowed

# User Daily Spending
simulate_call(user_daily, "gpt-3.5-turbo", 10000, 1000) # ~ $0.0065 - Should pass
simulate_call(user_daily, "gpt-4o", 1000, 500)         # ~ $0.0125 - Should pass (Total ~0.019)
simulate_call(user_daily, "gpt-3.5-turbo", 5000, 500)  # ~ $0.00325 - Should FAIL (Total would be ~0.02225 > 0.02)

# User Monthly Spending
simulate_call(user_monthly, "claude-3-sonnet-20240229", 50000, 10000) # ~ $0.165 - Should pass
simulate_call(user_monthly, "gpt-4-turbo", 20000, 3000)              # ~ $0.29 - Should pass (Total ~0.455)

# --- Check Resets ---
print("\n--- Checking Budget Resets ---")
# User Daily started yesterday, duration is daily, so it should reset
reset_occurred_daily = budget_manager.reset_on_duration(user=user_daily)
print(f"Reset occurred for {user_daily} (Daily)?: {reset_occurred_daily}") # Expect True
print(f"Cost for {user_daily} after reset check: ${budget_manager.get_current_cost(user_daily):.6f}") # Expect 0.0

# User Monthly started today, duration monthly, should not reset yet
reset_occurred_monthly = budget_manager.reset_on_duration(user=user_monthly)
print(f"Reset occurred for {user_monthly} (Monthly)?: {reset_occurred_monthly}") # Expect False
print(f"Cost for {user_monthly} after reset check: ${budget_manager.get_current_cost(user_monthly):.6f}") # Expect previous value (~0.455)

# Run global reset update (will reset user_daily again if called on same day, but cost is already 0)
print("\nRunning update_budget_all_users()...")
budget_manager.update_budget_all_users()
print("Finished update_budget_all_users().")
print(f"Cost for {user_daily} after global update: ${budget_manager.get_current_cost(user_daily):.6f}")
print(f"Cost for {user_monthly} after global update: ${budget_manager.get_current_cost(user_monthly):.6f}")

# --- Inspect Budget File (Optional) ---
budget_file_to_check = "user_cost.json"
if os.path.exists(budget_file_to_check):
    print(f"\nContent of {budget_file_to_check}:")
    try:
        with open(budget_file_to_check, 'r') as f:
            data = json.load(f)
            print(json.dumps(data, indent=2))
    except Exception as e:
        print(f"Error reading budget file: {e}")
else:
    print(f"\nBudget file {budget_file_to_check} not found.")

# --- Clean up ---
if os.path.exists(budget_file_to_check): os.remove(budget_file_to_check)
print(f"\nCleaned up budget file: {budget_file_to_check}")

10. Utilities (litellm.utils.*)

LiteLLM includes a wide array of utility functions in litellm.utils to assist with common tasks related to LLM interactions.

Tokenizer Utilities

Functions for encoding/decoding text and counting tokens based on model-specific tokenizers. Requires tiktoken (installed with LiteLLM) and potentially tokenizers (pip install tokenizers or pip install litellm[huggingface]) for non-OpenAI models.

  • litellm.utils.token_counter(model: str = "", text: Optional[Union[str, List[str]]] = None, messages: Optional[List[Dict]] = None, count_response_tokens: bool = False, tools: Optional[List] = None, tool_choice: Optional[Any] = None, use_default_image_token_count: bool = False, default_token_count: Optional[int] = None, custom_tokenizer: Optional[dict] = None) -> int

    • Primary function for token counting. Estimates tokens for text, chat messages (with role/formatting overhead for models like OpenAI's), multimodal image inputs, and tool definitions. Automatically selects the appropriate tokenizer based on model unless custom_tokenizer is provided.
    • Key Args: model (essential), messages (preferred for chat), text, tools. use_default_image_token_count uses fixed cost for images. default_token_count provides fallback on error.
    • Returns: Integer token count.
    from litellm.utils import token_counter
    # Example: Count multimodal message tokens for GPT-4o
    vision_msgs = [{"role": "user","content": [{"type": "text", "text": "What's in this image?"},{"type": "image_url", "image_url": {"url": "placeholder", "detail": "high"}}]}]
    try:
        # Note: Actual count depends on image resolution if detail='high'
        # LiteLLM uses heuristics if only placeholder URL provided
        count_vision = token_counter(model="gpt-4o", messages=vision_msgs)
        print(f"Estimated tokens for GPT-4o vision message (high detail heuristic): {count_vision}")
    except Exception as e: print(f"Token count error: {e}")
    
  • litellm.utils.encode(model: str = "", text: str = "", custom_tokenizer: Optional[dict] = None) -> List[int]

    • Converts text to a list of token IDs.
    • Requires correct model or custom_tokenizer (from create_..._tokenizer).
    from litellm.utils import encode
    ids = encode(model="claude-3-haiku-20240307", text="Tokenize using Anthropic's method.")
    print(f"Claude 3 Haiku Token IDs: {ids}")
    
  • litellm.utils.decode(model: str = "", tokens: List[int] = [], custom_tokenizer: Optional[dict] = None) -> str

    • Converts a list of token IDs back to text.
    from litellm.utils import decode
    # Assuming 'ids' from previous example
    decoded_text = decode(model="claude-3-haiku-20240307", tokens=ids)
    print(f"Decoded Claude 3 Haiku text: '{decoded_text}'")
    
  • litellm.utils.create_pretrained_tokenizer(identifier: str, revision: str = "main", auth_token: Optional[str] = None) -> dict

    • Loads tokenizer from Hugging Face Hub. Requires pip install tokenizers.
    • identifier: HF repo ID (e.g., "mistralai/Mistral-7B-Instruct-v0.1").
    • Returns {'type': 'huggingface_tokenizer', 'tokenizer': tokenizer_object}.
    # Requires: pip install tokenizers
    from litellm.utils import create_pretrained_tokenizer, encode
    # try:
    #    tokenizer_dict = create_pretrained_tokenizer("bert-base-uncased")
    #    print("Loaded BERT tokenizer.")
    #    # Use it: bert_ids = encode(text="Encode with BERT.", custom_tokenizer=tokenizer_dict)
    #    # print(f"BERT Tokens: {bert_ids}")
    # except ImportError: print("Install 'tokenizers' for this example.")
    # except Exception as e: print(f"Error: {e}")
    
  • litellm.utils.create_tokenizer(json_str: str) -> dict

    • Creates a tokenizers.Tokenizer from its JSON string representation. Requires pip install tokenizers.
  • litellm.utils.openai_token_counter(messages: List, model: str, ...) -> int

    • (Internal) Implements OpenAI's specific chat token counting rules (overhead per message, name, etc.). Called by token_counter.

Model Information & Capability Checks

Retrieve metadata and check feature support based on litellm.model_cost; a combined usage sketch follows the list below.

  • litellm.utils.get_model_info(model: str, custom_llm_provider=None) -> ModelInfo

    • Retrieves a comprehensive ModelInfo TypedDict (context window, costs, provider, supports_X flags, etc.). Raises an exception (NotFoundError in recent versions) if the model is unknown.
  • litellm.utils.get_max_tokens(model: str) -> Optional[int]

    • Convenience function for getting the total context window size.
  • litellm.utils.supports_X(model: str, ...) -> bool Functions:

    • Check feature support: supports_function_calling, supports_parallel_function_calling, supports_tool_choice, supports_system_messages, supports_vision, supports_embedding_image_input, supports_prompt_caching, supports_response_schema (JSON mode), supports_native_streaming, supports_web_search, supports_audio_input, supports_pdf_input, supports_audio_output. Returns False if model unknown or feature unsupported.
  • litellm.utils.is_prompt_caching_valid_prompt(model: str, messages: list, tools=None) -> bool

    • Checks if token count (messages + tools) meets provider prompt caching threshold (>= 1024 tokens).
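
A combined usage sketch for these helpers (the model name is only an example; lookups for unmapped models raise, hence the broad try/except):

from litellm.utils import get_model_info, get_max_tokens, supports_function_calling, supports_vision

model = "gpt-4o"
try:
    info = get_model_info(model)  # ModelInfo TypedDict; raises if the model isn't in litellm.model_cost
    print(f"Provider: {info.get('litellm_provider')}, max tokens: {info.get('max_tokens')}")
    print(f"Context window via get_max_tokens: {get_max_tokens(model)}")
    print(f"Supports function calling? {supports_function_calling(model=model)}")
    print(f"Supports vision? {supports_vision(model=model)}")
except Exception as e:
    print(f"Model info lookup failed: {e}")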

Parameter Handling & Validation

Mostly internal helpers, plus environment/key checks.

  • litellm.utils.get_optional_params* (Internal): Translate standard OpenAI params to provider-specific formats.
  • litellm.utils.validate_environment(model=None, api_key=None, api_base=None) -> dict

    • Checks if required env vars (keys, endpoints) for a model/provider are set. Returns {'keys_in_environment': bool, 'missing_keys': List[str]}.
    from litellm.utils import validate_environment
    # Example: Check if Bedrock env vars are likely set
    # bedrock_check = validate_environment(model="bedrock/claude-v2")
    # if not bedrock_check['keys_in_environment']:
    #     print(f"Missing Bedrock env vars: {bedrock_check['missing_keys']}")
    # else:
    #     print("Required Bedrock env vars appear to be set.")
    
  • litellm.utils.check_valid_key(model: str, api_key: str) -> bool

    • Makes real API call to test key validity for a model. Use sparingly. Returns True on success, False on AuthenticationError/other failures.
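
A minimal sketch for check_valid_key, kept commented out like the other real-call examples in this guide because it issues a live (small) request:

import os
from litellm.utils import check_valid_key

# is_valid = check_valid_key(model="gpt-3.5-turbo", api_key=os.environ.get("OPENAI_API_KEY", ""))
# print(f"Key valid for gpt-3.5-turbo? {is_valid}")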

Configuration & Registration

Dynamically configure LiteLLM's knowledge.

  • litellm.utils.register_model(model_cost: Union[str, dict]) -> dict

    • Add/update model info (cost, context, provider, features) in litellm.model_cost. Input is dict or URL to JSON config.
    from litellm.utils import register_model, get_model_info
    # Register override for gpt-4 context window (example only)
    register_model({"gpt-4": {"max_tokens": 8192, "litellm_provider": "openai", "mode":"chat", "input_cost_per_token": 0.00003, "output_cost_per_token": 0.00006}})
    print(f"GPT-4 Max Tokens (after override): {get_model_info('gpt-4').get('max_tokens')}")
    
  • litellm.utils.register_prompt_template(model: str, roles={}, ...) -> dict

    • Define custom prompt structures (e.g., Llama-2 [INST]). Overrides the default OpenAI-style formatting for the specified model name. The roles dict maps each role name to {"pre_message": str, "post_message": str}; see the sketch after this list.
  • litellm.utils.read_config_args(config_path: str) -> dict

    • Reads JSON config file from config_path into a dict. Raises FileNotFoundError or JSONDecodeError.
  • litellm.utils.get_provider_fields(custom_llm_provider: str) -> List[ProviderField]

    • (Beta/Limited) Gets structured info on required config fields for databricks, ollama, azure_ai. Used for UI generation.
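
For register_prompt_template, here is a hedged sketch of a Llama-2-style [INST] template for a hypothetical self-hosted model name (the roles/pre_message/post_message shape follows the description above; initial_prompt_value/final_prompt_value are optional keywords in LiteLLM's documented usage, so verify against your installed version):

from litellm.utils import register_prompt_template

register_prompt_template(
    model="my-org/llama-2-70b-chat",  # hypothetical custom model identifier
    roles={
        "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n [/INST]\n"},
        "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
        "assistant": {"pre_message": "", "post_message": "\n"},
    },
    initial_prompt_value="",  # text prepended to the fully rendered prompt
    final_prompt_value="",    # text appended to the fully rendered prompt
)
# Subsequent completion calls with model="my-org/llama-2-70b-chat" will use this template.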

Testing & Mocking Utilities

Helpers for testing LiteLLM integrations.

  • litellm.utils.load_test_model(model: str, num_calls=100, ...) -> dict

    • Basic concurrent load test using batch_completion. Makes real API calls. Returns summary dict.
    from litellm.utils import load_test_model
    # print("Running load test (3 calls)...")
    # # Ensure key for 'gpt-3.5-turbo' is set
    # result = load_test_model(model="gpt-3.5-turbo", num_calls=3, force_timeout=20)
    # print(f"Load test status: {result['status']}, Time: {result.get('total_response_time'):.2f}s")
    
  • litellm.utils.mock_completion_streaming_obj(...) -> Generator

    • Creates sync generator mimicking litellm.completion(stream=True) output from a string or Exception. Unit test sync stream consumers.
  • litellm.utils.async_mock_completion_streaming_obj(...) -> AsyncGenerator

    • Creates async generator mimicking litellm.acompletion(stream=True) output. Unit test async stream consumers.

Miscellaneous Utilities

Other helpers.

  • litellm.utils.trim_messages(messages: List[Dict], model=None, ...) -> Union[List[Dict], Tuple[List[Dict], int]]

    • Reduces the message list/content to fit the model's context window, preserving the most recent messages. Returns the trimmed list, or optionally (trimmed_list, remaining_tokens); see the sketch after this list.
  • litellm.utils.get_valid_models(check_provider_endpoint=False, ...) -> List[str]

    • Lists model names likely usable based on set environment API keys. check_provider_endpoint=True verifies via (potentially slow/costly) API calls.
  • litellm.utils.function_to_dict(input_function: Callable) -> dict

    • Converts Python function (with type hints & NumPy docstring) to OpenAI Tool schema. Requires pip install litellm[numpydoc].
  • litellm.utils.return_raw_request(endpoint: CallTypes, kwargs: dict) -> RawRequestTypedDict

    • (Beta) Debugging utility. Intercepts call, returns dict with prepared url, headers, json_data. Uses fake key, expects failure.
  • litellm.utils.get_utc_datetime() -> datetime.datetime

    • Gets current time as timezone-aware UTC datetime object.
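
A short sketch for trim_messages (the return_response_tokens keyword is how the optional tuple return is usually requested; confirm against your installed version):

from litellm.utils import trim_messages

long_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this transcript: " + ("blah " * 5000)},
]
# Trim to fit gpt-3.5-turbo's context window and also get the remaining token budget.
trimmed, remaining_tokens = trim_messages(
    long_messages, model="gpt-3.5-turbo", return_response_tokens=True
)
print(f"Messages kept after trimming: {len(trimmed)}, tokens left for the response: {remaining_tokens}")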

11. Reference: Constants (litellm.*)

LiteLLM exposes several lists and default values as module-level constants.

  • Provider & Model Lists:
    • litellm.provider_list: List[str] - All recognized provider names.
    • litellm.models_by_provider: Dict[str, List[str]] - Maps provider name to list of known model strings.
    • Provider-specific lists: litellm.open_ai_chat_completion_models, litellm.open_ai_embedding_models, litellm.open_ai_completion_models, litellm.anthropic_models, litellm.cohere_models, litellm.vertex_models, litellm.bedrock_models, litellm.huggingface_models, litellm.ollama_models, litellm.mistral_models, litellm.groq_models, litellm.replicate_models, litellm.perplexity_models, litellm.openrouter_models, litellm.ai21_models, litellm.aleph_alpha_models, litellm.baseten_models, litellm.nlp_cloud_models, litellm.deepinfra_models, etc.
  • Default Settings:
    • litellm.DEFAULT_MAX_RETRIES (int): Default value for litellm.num_retries.
    • litellm.DEFAULT_REQUEST_TIMEOUT (int): Default value for litellm.request_timeout (seconds).
    • litellm.AZURE_DEFAULT_API_VERSION (str): Default Azure API version used if not specified.
    • litellm.ROUTER_DEFAULT_HEALTH_CHECK_INTERVAL (int): Default Router health check interval (seconds).
    • litellm.ROUTER_MAX_FALLBACKS (int): Default max fallbacks Router will attempt.
    • litellm.DEFAULT_COOLDOWN_TIME_SECONDS (int): Default Router deployment cooldown time (seconds).
  • Parameter Lists:
    • litellm.OPENAI_CHAT_COMPLETION_PARAMS: List[str] - Standard OpenAI chat parameters LiteLLM recognizes.
    • litellm.ALL_AVAILABLE_PARAMS: List[str] - Broader list of parameters potentially recognized across providers.
  • Internal Mappings:
    • litellm.model_cost: Dict[str, Dict] - The internal dictionary mapping model names to their properties (cost, context window, provider, etc.). Can be modified via register_model.
    • litellm.custom_prompt_dict: Dict[str, Dict] - Stores custom prompt templates registered via register_prompt_template.

import litellm

print(f"Number of listed providers: {len(litellm.provider_list)}")
# Example: Check if 'groq' is listed
print(f"Is 'groq' in provider list? {'groq' in litellm.provider_list}")

print("\nSample of OpenAI Chat Models:")
print(litellm.open_ai_chat_completion_models[:5])

print(f"\nDefault Router Max Fallbacks: {litellm.ROUTER_MAX_FALLBACKS}")