Building a Robust AI Guardrails System with OpenAI, Part II
In today's AI landscape, ensuring responsible **and** safe interactions with language models has become as important as the capabilities of the models themselves. Implementing effective guardrails is no longer optional; it's essential for any organization deploying AI systems. This blog explores how to build a comprehensive guardrails system using OpenAI's newly released Agents SDK tools to filter both user inputs and AI-generated outputs.
Why Guardrails Matter
AI systems without proper safeguards can inadvertently generate harmful, biased, or inappropriate content. A well-designed guardrails system serves as a dual-layer protection mechanism:
- Input filtering prevents users from prompting the AI with harmful or inappropriate requests
- Output screening ensures that even if problematic inputs slip through, the AI's responses remain safe and appropriate
Implementation Overview
Our implementation leverages OpenAI's moderation API alongside custom filtering logic. Here's the complete code for a practical guardrails system:
import json
import openai
import os
from typing import Dict, List, Any, Optional

# Set up OpenAI client (replace with your own API key)
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


class GuardrailsSystem:
    def __init__(self):
        # Define input guardrails
        self.input_topics_to_avoid = ["weapons", "illegal activities", "exploitation"]
        # Define output guardrails
        self.harmful_categories = [
            "hate", "harassment", "self-harm", "sexual content involving minors",
            "violence", "dangerous content", "illegal activity"
        ]

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        """Check if the user input contains topics we want to avoid."""
        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=user_input)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the input was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "valid": False,
                "reason": f"Input contains potentially harmful content: {', '.join(flagged_categories)}"
            }

        # Perform additional custom checks for topics to avoid
        for topic in self.input_topics_to_avoid:
            if topic in user_input.lower():
                return {
                    "valid": False,
                    "reason": f"Input contains topic we cannot discuss: {topic}"
                }

        return {"valid": True}

    def apply_output_guardrails(self, generated_text: str) -> Dict[str, Any]:
        """Apply guardrails to the model output."""
        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=generated_text)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the output was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "safe": False,
                "reason": f"Output contains potentially harmful content: {', '.join(flagged_categories)}",
                "output": "I cannot provide that information as it may violate content guidelines."
            }

        # Additional custom checks could be added here
        return {"safe": True, "output": generated_text}

    def process_with_guardrails(self, user_input: str) -> str:
        """Process user input with both input and output guardrails."""
        # 1. Apply input guardrails
        input_validation = self.validate_input(user_input)
        if not input_validation["valid"]:
            return f"Sorry, I cannot respond to that request. {input_validation['reason']}"

        # 2. Generate response with the model
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_input}
                ],
                temperature=0.7,
                max_tokens=500
            )
            generated_text = completion.choices[0].message.content

            # 3. Apply output guardrails
            output_check = self.apply_output_guardrails(generated_text)
            if output_check["safe"]:
                return output_check["output"]
            else:
                return f"I generated a response but it didn't pass our safety checks. {output_check['reason']}"
        except Exception as e:
            return f"An error occurred: {str(e)}"


# Example usage
def main():
    guardrails = GuardrailsSystem()

    # Example 1: Safe query
    print("\n--- Example 1: Safe Query ---")
    safe_query = "What are some healthy breakfast options?"
    print(f"User: {safe_query}")
    response = guardrails.process_with_guardrails(safe_query)
    print(f"Assistant: {response}")

    # Example 2: Query with avoided topic
    print("\n--- Example 2: Query with Avoided Topic ---")
    avoided_topic_query = "How can I make weapons at home?"
    print(f"User: {avoided_topic_query}")
    response = guardrails.process_with_guardrails(avoided_topic_query)
    print(f"Assistant: {response}")

    # Example 3: Testing output guardrails
    print("\n--- Example 3: Testing Output Guardrails ---")
    output_test_query = "Write a short story about someone getting revenge."
    print(f"User: {output_test_query}")
    response = guardrails.process_with_guardrails(output_test_query)
    print(f"Assistant: {response}")


if __name__ == "__main__":
    main()
Key Components of the System
1. Input Validation
The validate_input method provides two layers of protection:
- OpenAI Moderation API: Leverages OpenAI's content moderation system to detect potentially harmful content across multiple categories.
- Custom Topic Filtering: Adds a second layer to catch specific topics you want your application to avoid, even if they aren't flagged by the moderation API.
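One caveat worth noting: the custom topic check in validate_input uses simple substring matching, so it can fire on benign inputs or miss phrasings it wasn't written for. Below is a minimal sketch of a stricter whole-word matcher; the helper name contains_avoided_topic and the regex approach are illustrative choices, not part of the implementation above.

import re
from typing import List, Optional

def contains_avoided_topic(user_input: str, topics: List[str]) -> Optional[str]:
    """Return the first avoided topic found as a whole word or phrase, or None.

    Illustrative helper: word-boundary matching avoids accidental substring hits,
    e.g. a topic will not match inside an unrelated longer token.
    """
    lowered = user_input.lower()
    for topic in topics:
        # re.escape keeps multi-word topics such as "illegal activities" safe to embed
        if re.search(rf"\b{re.escape(topic)}\b", lowered):
            return topic
    return None

# Inside validate_input, the substring loop could then become:
# topic = contains_avoided_topic(user_input, self.input_topics_to_avoid)
# if topic:
#     return {"valid": False, "reason": f"Input contains topic we cannot discuss: {topic}"}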
2. Output Screening
The apply_output_guardrails method ensures that even if a seemingly innocent prompt leads to problematic content, that content won't reach the end user. This is crucial because language models can sometimes generate unexpected outputs.
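If the binary flagged decision is too permissive for a given deployment, the moderation response also exposes per-category scores that can be compared against a threshold. The sketch below is one possible stricter variant; it reuses the client and typing imports from the listing above, and the function name and the 0.4 cutoff are assumptions chosen to illustrate the idea, not recommended values.

def screen_output_with_threshold(generated_text: str, threshold: float = 0.4) -> Dict[str, Any]:
    """Stricter variant of apply_output_guardrails based on per-category scores.

    Sketch only: the threshold is arbitrary here and should be tuned against
    real traffic before relying on it.
    """
    results = client.moderations.create(input=generated_text).results[0]
    scores = results.category_scores.model_dump()
    over_threshold = [
        category for category, score in scores.items()
        if score is not None and score >= threshold
    ]
    if over_threshold:
        return {
            "safe": False,
            "reason": f"Output scored above {threshold} in: {', '.join(over_threshold)}",
            "output": "I cannot provide that information as it may violate content guidelines."
        }
    return {"safe": True, "output": generated_text}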
3. Complete Processing Pipeline
The process_with_guardrails method ties everything together:
- First, it validates the user input
- If valid, it sends the request to the OpenAI model
- Before returning the response, it checks the output for safety issues
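For a quick way to exercise the full pipeline beyond the scripted examples in main(), a small interactive loop works well. The sketch below assumes the GuardrailsSystem class above is defined in the same module; the function name run_chat_loop is illustrative.

def run_chat_loop() -> None:
    """Illustrative REPL: every turn passes through input validation, generation, and output screening."""
    guardrails = GuardrailsSystem()
    print("Type a message (or 'quit' to exit).")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in {"quit", "exit"}:
            break
        print(f"Assistant: {guardrails.process_with_guardrails(user_input)}")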
Real-World Applications
This guardrails system can be integrated into various applications:
- Customer support chatbots: Ensure responses remain professional and appropriate
- Educational tools: Filter inappropriate student queries and keep answers age-appropriate
- Content generation applications: Prevent creation of harmful or policy-violating content
- Internal enterprise tools: Maintain professional standards even in employee-facing systems
Enhancing the System
The basic implementation can be extended in several ways:
- Topic-specific guardrails: Add specialized filters for particular domains
- User context awareness: Adjust guardrails based on user age, location, or other factors
- Feedback mechanisms: Allow users to report problematic responses that slip through
- Audit logging: Track and analyze both rejected inputs and outputs to improve the system
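As a concrete illustration of the last point, audit logging can be layered on without touching the core class. The sketch below subclasses GuardrailsSystem and writes one JSON line per exchange; the class name AuditedGuardrailsSystem, the log format, and the rejection heuristic are assumptions, not part of the implementation above.

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("guardrails.audit")

class AuditedGuardrailsSystem(GuardrailsSystem):
    """Illustrative subclass that records every exchange as a JSON line for later review."""

    def process_with_guardrails(self, user_input: str) -> str:
        response = super().process_with_guardrails(user_input)
        audit_logger.info(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_input": user_input,
            "response": response,
            # The guardrails' rejection messages start with fixed prefixes,
            # so a simple startswith check is enough to tag them here
            "rejected": response.startswith(
                ("Sorry, I cannot respond", "I generated a response but")
            ),
        }))
        return response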
Conclusion
Building effective AI guardrails is a responsibility that all developers working with generative AI models must take seriously. By implementing a dual-layered approach that screens both inputs and outputs, we can harness the power of large language models while dramatically reducing the risk of harmful content.
The system demonstrated here provides a solid foundation that can be customized to meet the specific needs of your application. As AI capabilities continue to advance, so too should our approaches to ensuring these systems operate within appropriate boundaries.
Thanks
Sreeni Ramadorai