Building a Robust AI Guardrails System with OpenAI-Part II

In today's AI landscape, ensuring responsible and safe interactions with language models has become as important as the capabilities of the models themselves. Implementing effective guardrails is no longer optional—it's essential for any organization deploying AI systems. This blog explores how to build a comprehensive guardrails system using OpenAI's newly released Agents SDK tools to filter both user inputs and AI-generated outputs.

Why Guardrails Matter

AI systems without proper safeguards can inadvertently generate harmful, biased, or inappropriate content. A well-designed guardrails system serves as a dual-layer protection mechanism:

  1. Input filtering prevents users from prompting the AI with harmful or inappropriate requests
  2. Output screening ensures that even if problematic inputs slip through, the AI's responses remain safe and appropriate

Implementation Overview

Our implementation leverages OpenAI's moderation API alongside custom filtering logic. Here's the complete code for a practical guardrails system:

import os
from typing import Any, Dict

import openai

# Set up the OpenAI client (reads the API key from the OPENAI_API_KEY environment variable)
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

class GuardrailsSystem:
    def __init__(self):
        # Define input guardrails
        self.input_topics_to_avoid = ["weapons", "illegal activities", "exploitation"]

        # Define output guardrails
        # (Kept for reference and possible custom checks; the Moderation API
        # reports its own category names when content is screened.)
        self.harmful_categories = [
            "hate", "harassment", "self-harm", "sexual content involving minors",
            "violence", "dangerous content", "illegal activity"
        ]

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        """Check if the user input contains topics we want to avoid."""

        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=user_input)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the input was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items() 
                if flagged
            ]
            return {
                "valid": False,
                "reason": f"Input contains potentially harmful content: {', '.join(flagged_categories)}"
            }

        # Perform additional custom checks for topics to avoid
        for topic in self.input_topics_to_avoid:
            if topic in user_input.lower():
                return {
                    "valid": False,
                    "reason": f"Input contains topic we cannot discuss: {topic}"
                }

        return {"valid": True}

    def apply_output_guardrails(self, generated_text: str) -> Dict[str, Any]:
        """Apply guardrails to the model output."""

        # Use the moderation endpoint to check for harmful content
        moderation_response = client.moderations.create(input=generated_text)

        # Extract the results
        results = moderation_response.results[0]

        # Check if the output was flagged
        if results.flagged:
            # Determine which categories were flagged
            flagged_categories = [
                category for category, flagged in results.categories.model_dump().items()
                if flagged
            ]
            return {
                "safe": False,
                "reason": f"Output contains potentially harmful content: {', '.join(flagged_categories)}",
                "output": "I cannot provide that information as it may violate content guidelines."
            }

        # Additional custom checks could be added here

        return {"safe": True, "output": generated_text}

    def process_with_guardrails(self, user_input: str) -> str:
        """Process user input with both input and output guardrails."""

        # 1. Apply input guardrails
        input_validation = self.validate_input(user_input)
        if not input_validation["valid"]:
            return f"Sorry, I cannot respond to that request. {input_validation['reason']}"

        # 2. Generate response with the model
        try:
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_input}
                ],
                temperature=0.7,
                max_tokens=500
            )

            generated_text = completion.choices[0].message.content

            # 3. Apply output guardrails
            output_check = self.apply_output_guardrails(generated_text)

            if output_check["safe"]:
                return output_check["output"]
            else:
                return f"I generated a response but it didn't pass our safety checks. {output_check['reason']}"

        except Exception as e:
            return f"An error occurred: {str(e)}"

# Example usage
def main():
    guardrails = GuardrailsSystem()

    # Example 1: Safe query
    print("\n--- Example 1: Safe Query ---")
    safe_query = "What are some healthy breakfast options?"
    print(f"User: {safe_query}")
    response = guardrails.process_with_guardrails(safe_query)
    print(f"Assistant: {response}")

    # Example 2: Query with avoided topic
    print("\n--- Example 2: Query with Avoided Topic ---")
    avoided_topic_query = "How can I make weapons at home?"
    print(f"User: {avoided_topic_query}")
    response = guardrails.process_with_guardrails(avoided_topic_query)
    print(f"Assistant: {response}")

    # Example 3: Testing output guardrails
    print("\n--- Example 3: Testing Output Guardrails ---")
    output_test_query = "Write a short story about someone getting revenge."
    print(f"User: {output_test_query}")
    response = guardrails.process_with_guardrails(output_test_query)
    print(f"Assistant: {response}")

if __name__ == "__main__":
    main()

Key Components of the System

1. Input Validation

The validate_input method provides two layers of protection:

  • OpenAI Moderation API: Leverages OpenAI's content moderation system to detect potentially harmful content across multiple categories.
  • Custom Topic Filtering: Adds a second layer to catch specific topics you want your application to avoid, even if they aren't flagged by the moderation API (see the sketch after this list)
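
Both layers can be exercised directly, as in the short sketch below. This is illustrative only: it assumes a valid OPENAI_API_KEY is set (the moderation check calls the API), and the sample prompts are made up.

guardrails = GuardrailsSystem()

# A benign question should pass both the moderation check and the custom topic filter
print(guardrails.validate_input("What are some healthy breakfast options?"))
# Expected shape: {'valid': True}

# A request mentioning an avoided topic is caught by the custom filter,
# even if the moderation endpoint does not flag it
print(guardrails.validate_input("Tell me about weapons."))
# Expected shape: {'valid': False, 'reason': 'Input contains topic we cannot discuss: weapons'}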

2. Output Screening

The apply_output_guardrails method ensures that even if a seemingly innocent prompt leads to problematic content, that content won't reach the end user. This is crucial because language models can sometimes generate unexpected outputs.
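
The screening step can also be run on its own, which is handy for testing. The sketch below is a minimal example; the sample string is hypothetical, and the second comment simply describes what the caller would receive if the moderation endpoint flagged the text.

guardrails = GuardrailsSystem()

# Benign text passes moderation and is returned unchanged
result = guardrails.apply_output_guardrails("Oatmeal with fresh fruit makes a healthy breakfast.")
print(result["safe"])    # True
print(result["output"])  # the original text

# If the moderation endpoint flags the text, result["safe"] is False and
# result["output"] is the generic fallback message defined in the method,
# so the flagged text never reaches the end user.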

3. Complete Processing Pipeline

The process_with_guardrails method ties everything together:

  1. First, it validates the user input
  2. If valid, it sends the request to the OpenAI model
  3. Before returning the response, it checks the output for safety issues

Real-World Applications

This guardrails system can be integrated into various applications:

  • Customer support chatbots: Ensure responses remain professional and appropriate (a chat-loop sketch follows this list)
  • Educational tools: Filter both inappropriate student queries and ensure age-appropriate answers
  • Content generation applications: Prevent creation of harmful or policy-violating content
  • Internal enterprise tools: Maintain professional standards even in employee-facing systems
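
To make the chatbot case concrete, here is a minimal console loop that routes every turn through process_with_guardrails. It is a sketch rather than production code and assumes the GuardrailsSystem class above is defined in the same file or imported.

def run_console_chatbot():
    """Minimal console chat loop that applies guardrails to every turn."""
    guardrails = GuardrailsSystem()
    print("Type 'quit' to exit.")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in {"quit", "exit"}:
            break
        if not user_input:
            continue
        # Each turn goes through input validation, generation, and output screening
        print(f"Assistant: {guardrails.process_with_guardrails(user_input)}")

# run_console_chatbot()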

Enhancing the System

The basic implementation can be extended in several ways:

  • Topic-specific guardrails: Add specialized filters for particular domains
  • User context awareness: Adjust guardrails based on user age, location, or other factors
  • Feedback mechanisms: Allow users to report problematic responses that slip through
  • Audit logging: Track and analyze both rejected inputs and outputs to improve the system (a minimal sketch follows)
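
As one illustration of the audit-logging idea, the subclass below records every rejected input and output. The logger name and log format are assumptions for this sketch, not part of the system above.

import logging

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("guardrails.audit")

class AuditedGuardrailsSystem(GuardrailsSystem):
    """GuardrailsSystem variant that logs rejected inputs and outputs for later review."""

    def validate_input(self, user_input: str) -> Dict[str, Any]:
        result = super().validate_input(user_input)
        if not result["valid"]:
            # Log only the rejection reason; avoid storing raw user text if it may be sensitive
            audit_logger.info("Rejected input: %s", result["reason"])
        return result

    def apply_output_guardrails(self, generated_text: str) -> Dict[str, Any]:
        result = super().apply_output_guardrails(generated_text)
        if not result["safe"]:
            audit_logger.info("Rejected output: %s", result["reason"])
        return result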

Conclusion

Building effective AI guardrails is a responsibility that all developers working with generative AI models must take seriously. By implementing a dual-layered approach that screens both inputs and outputs, we can harness the power of large language models while dramatically reducing the risk of harmful content.

The system demonstrated here provides a solid foundation that can be customized to meet the specific needs of your application. As AI capabilities continue to advance, so too should our approaches to ensuring these systems operate within appropriate boundaries.

Thanks
Sreeni Ramadorai