Build the Smartest AI Bot You’ve Ever Seen — A 7B Model + Web Search, Right on Your Laptop
Summary:
RAG Web is a Python-based application that combines web search and natural language processing to answer user queries. It uses DuckDuckGo for retrieving web search results and a Hugging Face Zephyr-7B-beta model for generating answers based on the retrieved context.
Web Search RAG Architecture
In this POC, the user query is sent as input to an external web search. The implementation uses the DuckDuckGo service to avoid the API-key and security limitations of more capable search engines such as Google. The search results (as context) are then sent, together with the original user query, to the language model (HuggingFaceH4/zephyr-7b-beta), which summarizes the context, extracts the answer, and outputs it to the user.
Deployment Instructions
1. Clone / Copy the Project

```bash
git clone https://github.com/alexander-uspenskiy/rag_web
cd rag_web
```

2. Create and Activate Virtual Environment

```bash
python3 -m venv venv
source venv/bin/activate
```

3. Install Requirements

```bash
pip install -r requirements.txt
```

4. Run the Script

```bash
python rag_web.py
```
How the Script Works
This is a lightweight Retrieval-Augmented Generation (RAG) implementation using:
• A 7B language model (Zephyr) from Hugging Face
• DuckDuckGo for real-time web search (no API key needed)
Code Breakdown
1. Imports and Setup
```python
from transformers import pipeline
from duckduckgo_search import DDGS
import textwrap
import re
```
- transformers: From Hugging Face, used to load and interact with the LLM.
- DDGS: DuckDuckGo’s Python interface for search queries.
- textwrap: Used for formatting the output neatly.
- re: Regular expressions to clean the model’s output.
2. Web Search Function
```python
def search_web(query, num_results=3):
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=num_results)
        return [r['body'] for r in results]
```
- Purpose: Takes a user query and performs a web search.
- How it works: Uses the DDGS().text(...) method to fetch search results.
- Returns: A list of snippet texts (just the bodies, without links/titles).
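A quick usage check (the query here is arbitrary and the returned snippets depend on live DuckDuckGo results):

```python
# Illustrative call to search_web defined above; snippet text varies per search.
snippets = search_web("Zephyr 7B language model", num_results=2)
print(f"Got {len(snippets)} snippets")
if snippets:
    print(snippets[0][:100])  # preview the first 100 characters of the first snippet
```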
3. Context Generation
```python
def get_context(query):
    snippets = search_web(query)
    context = " ".join(snippets)
    return textwrap.fill(context, width=120)
```
- Combines all snippet results into one big context paragraph.
- Applies word wrapping to improve readability (optional for model input but nice for debugging/logging).
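For example, to inspect what the model will actually receive as context (the content depends on live search results):

```python
# Illustrative check of the combined, wrapped context string.
ctx = get_context("current stable Python version")
print(f"Context length: {len(ctx)} characters")
print(ctx[:120])  # first wrapped line
```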
4. Model Initialization
```python
qa_pipeline = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    tokenizer="HuggingFaceH4/zephyr-7b-beta",
    device_map="auto"
)
```
- Loads Zephyr-7B, a chat-tuned model from Hugging Face.
- device_map="auto" lets Hugging Face offload model parts across available hardware (e.g., MPS or CUDA).
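As a side note (this is an optional variant, not part of the original script), loading the model in half precision can roughly halve memory use on CUDA or Apple Silicon:

```python
# Optional variant, an assumption rather than the author's code:
# half-precision weights reduce the memory footprint on CUDA / MPS devices.
import torch
from transformers import pipeline

qa_pipeline = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.float16,
    device_map="auto",
)
```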
5. Question Answering Function
```python
def answer_question(query):
```
a) Get Context
```python
context = get_context(query)
```
- Performs search and prepares the retrieved content.
b) Prepare Prompt
```python
prompt = f"""[CONTEXT]
{context}
[QUESTION]
{query}
[ANSWER]
"""
```
This RAG-style prompt provides the model:
- [CONTEXT] = retrieved text from the web
- [QUESTION] = user’s query
- [ANSWER] = expected model output
c) Generate Answer
```python
response = qa_pipeline(prompt, max_new_tokens=128, do_sample=True)
```
- The model generates text following the [ANSWER] tag.
- do_sample=True allows some creativity/randomness.
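If you prefer more repeatable answers, sampling can be turned off (an option, not something the original script does):

```python
# Greedy decoding instead of sampling gives more deterministic output.
response = qa_pipeline(prompt, max_new_tokens=128, do_sample=False)
```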
d) Post-processing
```python
answer_raw = response[0]['generated_text'].split('[ANSWER]')[-1].strip()
answer = re.sub(r"<[^>]+>", "", answer_raw)
```
- Strips the prompt from the output.
- Removes any stray XML/HTML-style tags the model might emit.
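To illustrate the tag-stripping step on a made-up string:

```python
import re

raw = "According to the context, the answer is <b>42</b>.</s>"
clean = re.sub(r"<[^>]+>", "", raw)
print(clean)  # -> "According to the context, the answer is 42."
```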
6. User Interaction Loop
```python
if __name__ == "__main__":
```
- Opens a CLI loop.
- Reads user input from the terminal.
- Runs the full search + answer pipeline.
- Displays the answer and continues unless the user types exit or quit.
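The article does not reproduce the loop body, so here is a minimal sketch of what it might look like (the prompt wording and exit handling are assumptions, and answer_question is assumed to return the cleaned answer):

```python
if __name__ == "__main__":
    # Minimal CLI loop sketch: read a query, answer it, repeat until exit/quit.
    while True:
        query = input("Ask a question (or 'exit'/'quit' to stop): ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        print(answer_question(query))
```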
Architecture Summary
```
[User Query]
      ↓
DuckDuckGo Search API
      ↓
[Web Snippets]
      ↓
[CONTEXT] + [QUESTION] Prompt
      ↓
Zephyr 7B (Hugging Face)
      ↓
[Generated Answer]
      ↓
Display in Terminal
```
Why Zephyr-7B?
Zephyr is a family of instruction-tuned, open-weight language models developed by Hugging Face. It's designed to be helpful, honest, and harmless — and small enough to run on consumer hardware.
Key Characteristics
| Feature | Description |
|---|---|
| Model Size | 7 billion parameters |
| Architecture | Based on Mistral-7B (dense transformer, grouped-query attention) |
| Tuning | Fine-tuned using DPO (Direct Preference Optimization) |
| Context Length | Supports up to 8,192 tokens |
| Hardware | Runs locally on M1/M2 Macs, GPUs, or even CPUs with quantization |
| Use Case | Optimized for dialogue, instruction following, and chat |
Why I Picked Zephyr for This Script
- Open weights — no API keys, no rate limits
- Runs on laptop — 7B is small enough for consumer devices
- Instruction-tuned — great at handling prompts containing context and questions
- Friendly outputs — fine-tuned to be helpful and safe
- Easy integration — via the Hugging Face transformers pipeline
Compared to Other Models
| Model | Pros | Cons |
|---|---|---|
| Zephyr-7B | Open, chat-tuned, lightweight | Slightly less fluent than GPT-4 |
| GPT-3.5/4 | Top-tier reasoning | Closed, pay-per-use, no local use |
| Mistral-7B | High-speed base model | Needs fine-tuning for QA/chat |
| LLaMA 2 7B | Open and popular | Less optimized for chat out of the box |
Final Thoughts on the Model
Zephyr-7B hits the sweet spot between performance, privacy, and portability. It gives you GPT-style interaction with full local control — and when combined with web search, it becomes a surprisingly capable assistant.
If you're building a local AI assistant or just want to experiment with RAG pipelines without burning through API tokens, Zephyr-7B is a strong starting point.
Usage example
You can see how the RAG pipeline searches for real-time data, adds it to the context, and sends it to the model so that the model can generate an answer:
Performance Optimization
While the baseline implementation is functional and responsive, several optimizations can improve performance:
- Model Quantization: Use 4-bit or 8-bit quantized versions of the model with bitsandbytes to reduce memory usage and inference time (a minimal loading sketch follows this list).
- Streaming Inference: Implement token streaming for faster perceived response times.
- Caching Search Results: Avoid redundant queries by caching recent DuckDuckGo results locally.
- Async Execution: Use asyncio to parallelize web search and token generation.
- Prompt Truncation: Dynamically trim the context to fit within the model's token limits, prioritizing relevance.
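A minimal sketch of the 4-bit loading mentioned above, assuming a CUDA GPU and the bitsandbytes package (none of this is part of the original POC):

```python
# Hedged sketch: 4-bit quantized loading of Zephyr-7B via bitsandbytes (CUDA only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "HuggingFaceH4/zephyr-7b-beta"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
```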
Future Enhancements for Enterprise RAG
To scale this into an enterprise-grade RAG system, consider the following enhancements:
- Vector Search Integration: Combine web search with a hybrid retrieval layer built on vector embeddings (e.g., FAISS, Weaviate, Pinecone); see the sketch after this list.
- Knowledge Base Sync: Sync data from private sources like Confluence, Notion, SharePoint, or document stores.
- Multi-turn Memory: Add a conversation memory layer using a session buffer or vector memory for context retention.
- User Feedback Loop: Incorporate thumbs-up/down voting to improve results and fine-tune retrieval relevance.
- Security & Auditability: Wrap API access and logging in enterprise security layers (SSO, encryption, RBAC).
- Scalability: Run inference via model serving tools like vLLM, TGI, or TorchServe with GPU acceleration and autoscaling.
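As a starting point for the vector-search item above, here is a rough sketch using sentence-transformers and FAISS (the library and embedding model choices are assumptions, not part of the POC):

```python
# Hedged sketch: embed web snippets and retrieve the closest ones with FAISS.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

snippets = search_web("retrieval augmented generation", num_results=5)  # from the POC above
vectors = embedder.encode(snippets, convert_to_numpy=True)

index = faiss.IndexFlatL2(vectors.shape[1])   # exact L2 index over snippet embeddings
index.add(vectors)

query_vec = embedder.encode(["How does RAG work?"], convert_to_numpy=True)
_, ids = index.search(query_vec, k=min(2, len(snippets)))
top_snippets = [snippets[i] for i in ids[0]]   # most relevant snippets for the prompt context
```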
Summary
This article explores how to build a lightweight Retrieval-Augmented Generation (RAG) assistant using a 7B parameter open-source language model (Zephyr-7B) and real-time web search via DuckDuckGo.
The solution runs locally, requires no external APIs, and leverages Hugging Face's transformers library to deliver intelligent, contextual responses to user queries.
Zephyr-7B was chosen for its balance of performance and portability. It is instruction-tuned, easy to run on consumer hardware, and excels in structured question-answering tasks. When paired with live search results, it creates a powerful, self-contained research assistant.
This project is ideal for developers looking to experiment with local LLMs, build RAG prototypes, or create privacy-respecting AI tools without relying on paid cloud APIs.
The full implementation, code walkthrough, and architecture are detailed above.
Use the GitHub repository to get the POC code: [https://github.com/alexander-uspenskiy/rag_web](https://github.com/alexander-uspenskiy/rag_web)
Happy Coding!