AI Agent Behavior Versioning and Evaluation in Practice

As AI agents move from prototypes to production, one of the most important — and most overlooked — aspects of working with them is how to test and evaluate their different behaviors.
Developers are constantly tweaking prompts, adjusting tools, or changing logic. QA engineers are tasked with verifying that responses are accurate, relevant, and aligned with product expectations. Yet, most teams lack a structured way to experiment, compare, and improve AI agents across different versions.
This guide outlines the core challenges and introduces a solution based on isolating agent versions, logging structured evaluations, and making experimentation reproducible and measurable.
The Challenge: Experimentation Without Chaos
AI agents' behavior changes based on:
- Prompt instructions
- Tool availability
- Model versions
- System context or user inputs
When you have multiple versions of an agent — say one that summarizes and another that provides detailed analysis — it's critical to compare them side-by-side. But teams often face several issues:
- Data contamination: Logs and test results from different agents get mixed together, making it hard to know which version produced which response.
- Lack of reproducibility: There's no easy way to re-run a previous experiment exactly as it was.
- No structured QA: There is no consistent feedback on quality signals such as latency, token count, or relevance.
- No baseline comparison: Teams can’t measure progress across versions with consistent criteria.
These problems slow down iteration, introduce risk in production deployments, and leave QA teams guessing whether an agent is truly ready.
A Better Approach: Isolated Agent Versioning and Structured Evaluation
To solve this, we propose a workflow where each version of an AI agent is treated as a first-class experiment, with its own:
- Configuration (prompts, tools, goals)
- Interaction logs
- Quality metrics
This allows teams to compare behaviors in a clean, consistent, and repeatable way.
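For intuition, here is a minimal Python sketch of what a single versioned experiment could capture; the field names are illustrative and simply mirror the database schema introduced later in this guide:

# Illustrative only: one agent version's configuration plus the metrics logged per interaction
agent_experiment = {
    "version": "v1",
    "config": {
        "prompt_template": "Summarize input in 2-3 sentences with only key insights.",
        "tools": [],
        "goal": "Concise summarization",
    },
    "logs": [
        {
            "user_input": "Summarize the latest IBM news.",
            "agent_response": "IBM announced ...",
            "latency": 1.8,            # seconds
            "response_length": 42,     # words in the response
            "keyword_hit": True,       # did the response mention expected keywords?
            "heuristic_success": True  # did it meet the heuristic definition of success?
        }
    ],
}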
Here’s how it works: using a separate database branch in Neon Serverless Postgres for each version makes experimentation safe, testable, and reversible. We create our AI agents using the Azure AI Agent Service. Explore the source code at https://github.com/neondatabase-labs/neon-azure-multi-agent-evaluation.
Multi-agent Versioning Setup Instructions
Prerequisites
Before you start, make sure you have:
- Python 3.9+
- An Azure subscription (create one for free)
- Azure AI Developer RBAC role (assign it)
Step 1: Create Neon Postgres on Azure
- Visit the Neon resource creation page on Azure Marketplace.
- Fill in the resource details and deploy it.
- After the resource is created, go to the Neon Serverless Postgres Organization service and click on the Portal URL. This brings you to the Neon Console.
- Click "New Project".
- Choose your region and project name (e.g., “Multiversion AI Agents”).
Step 2: Create Two Branches for Versioning
In the Neon Console:
- Go to your project > Branches
- Create branch v1 from production
- Create branch v2 from production
After both branches are created successfully, copy the Neon connection strings for each branch and note them down. You can find the connection details in the Connection Details widget on the Neon Dashboard.
NEON_DB_CONNECTION_STRING_V1=postgresql://[user]:[password]@[hostname]/[dbname]?sslmode=require
NEON_DB_CONNECTION_STRING_V2=postgresql://[user]:[password]@[hostname]/[dbname]?sslmode=require
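Optionally, verify that both branches are reachable before moving on. Here is a minimal sketch using psycopg2 (the same driver the project uses later), assuming the two connection strings above are exported as environment variables:

import os
import psycopg2

# Sanity check: connect to each branch and print its Postgres version
for var in ("NEON_DB_CONNECTION_STRING_V1", "NEON_DB_CONNECTION_STRING_V2"):
    with psycopg2.connect(os.environ[var]) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print(var, "->", cur.fetchone()[0])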
Step 3: Create Database Schemas
Create database schemas to store agent configs and metric logs. Each branch will log data separately for each agent version. Run the following two SQL scripts for each branch using SQL Editor in the Neon Console.
- Agent configs table stores agent configurations where each agent version has a unique set of parameters:
- Prompt template
- List of tools
- Agent goal (e.g. summarization, classification)
CREATE TABLE agent_configs (
id SERIAL PRIMARY KEY,
agent_name TEXT,
version TEXT,
prompt_template TEXT,
tools TEXT[],
goal TEXT,
created_at TIMESTAMP DEFAULT now()
);
- Agent logs table stores interactions with QA metrics, where every response from an agent is logged with detailed information, including:
- How long the response took
- Number of words returned
- Whether important keywords were mentioned
- Whether the behavior met a heuristic definition of success
CREATE TABLE agent_logs (
id SERIAL PRIMARY KEY,
config_id INT REFERENCES agent_configs(id),
user_input TEXT,
agent_response TEXT,
tool_used TEXT,
success BOOLEAN,
response_length INT,
latency FLOAT,
keyword_hit BOOLEAN,
heuristic_success BOOLEAN,
created_at TIMESTAMP DEFAULT now()
);
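To see how the two tables work together, here is a minimal sketch (with illustrative values, not code from the reference project) that registers a v1 configuration and logs one interaction against it using psycopg2:

import os
import psycopg2

conn = psycopg2.connect(os.environ["NEON_DB_CONNECTION_STRING_V1"])
cur = conn.cursor()

# Register the v1 configuration once per branch
cur.execute(
    """INSERT INTO agent_configs (agent_name, version, prompt_template, tools, goal)
       VALUES (%s, %s, %s, %s, %s) RETURNING id;""",
    ("summarizer", "v1", "Summarize input in 2-3 sentences with only key insights.",
     [], "Concise summarization"),
)
config_id = cur.fetchone()[0]

# Log a single interaction with its QA metrics, linked by config_id
cur.execute(
    """INSERT INTO agent_logs (config_id, user_input, agent_response, tool_used, success,
                               response_length, latency, keyword_hit, heuristic_success)
       VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);""",
    (config_id, "Summarize the latest IBM news.", "IBM announced ...", None,
     True, 42, 1.8, True, True),
)
conn.commit()
cur.close()
conn.close()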
Step 4: Set Up Azure AI Agent Project
Create a new hub and project in the Azure AI Foundry portal by following the guide in the Microsoft docs. You also need to deploy a model like GPT-4o.
You only need the Project connection string and the Model Deployment Name from the Azure AI Foundry portal. You can find the connection string in the overview for your project, under Project details > Project connection string.
Once you have all the values on hand (the two Neon connection strings, the Project connection string, and the Model Deployment Name), you are ready to set up the Python project and create an agent with the Python SDK.
Step 5: Clone Python Project and Install Dependencies
git clone https://github.com/neondatabase-labs/neon-azure-multi-agent-evaluation.git
cd neon-azure-multi-agent-evaluation
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
Step 6: Set Up Environment Variables
Create a .env file in the root directory:
AGENT_VERSION=v1
NEON_DB_CONNECTION_STRING_V1=your_neon_connection_string_branch_v1
NEON_DB_CONNECTION_STRING_V2=your_neon_connection_string_branch_v2
PROJECT_CONNECTION_STRING=your_azure_project_connection_string
AZURE_OPENAI_DEPLOYMENT_NAME=your_azure_openai_model
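The script reads these values from the environment at runtime. If you adapt the code and need to load the .env file yourself, a minimal sketch with python-dotenv (assuming that package is installed) looks like this:

import os
from dotenv import load_dotenv

# Load the variables from .env into the process environment
load_dotenv()

# AGENT_VERSION decides which Neon branch this run writes to
version = os.getenv("AGENT_VERSION", "v1").upper()
db_url = os.getenv(f"NEON_DB_CONNECTION_STRING_{version}")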
Step 7: Run the Python Script to Generate Agent Metric Logs
The script agents.py defines different behaviors for the two agent versions: one generates concise summaries, while the other provides detailed outputs using an external tool. It selects the Neon branch dynamically based on the agent version. This setup lets teams compare how prompt engineering and tool usage affect the quality and relevance of AI responses. It also keeps logs clean, avoids accidental overlap, and allows rollback or re-analysis of specific runs.
...
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

# Pick the Neon branch connection string that matches the active agent version
NEON_DB_URL = os.getenv(
    f"NEON_DB_CONNECTION_STRING_{os.getenv('AGENT_VERSION').upper()}"
)
conn = psycopg2.connect(NEON_DB_URL)
cursor = conn.cursor(cursor_factory=RealDictCursor)
...
tools = FunctionTool([search_ibm_news])
toolset = ToolSet()
toolset.add(tools)

# Each version gets its own prompt, tool list, and goal
agent_version = os.getenv("AGENT_VERSION", "v1").lower()
if agent_version == "v1":
    prompt_template = "Summarize input in 2–3 sentences with only key insights."
    tools_used = []
    goal = "Concise summarization"
    toolset_used = toolset
else:
    prompt_template = "Summarize content with full detail using the available tools."
    tools_used = ["query_summaries"]
    goal = "Detailed summarization with tools"
    toolset_used = toolset

# Create the agent in Azure AI Agent Service, tagged with its version
agent = project_client.agents.create_agent(
    model=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    name=f"summarizer-{agent_version}-{datetime.now().strftime('%H%M%S')}",
    description=f"{goal} agent",
    instructions=prompt_template,
    toolset=toolset_used,
)
print(f"