Creating Your Own ChatGPT on a Free VPS — Simple and Fast!
This article is intended for developers with basic knowledge of Python and Docker. We will explore how to deploy a model from Hugging Face on a free VPS server and create an HTTP service to interact with it. I will try to explain the topic in an accessible way and provide practical examples so you can easily apply the knowledge in practice. Even if you don’t plan to delve deep into the details — just follow the instructions, and in 15 minutes, you’ll have a working service.
Friends, free VPS servers are rare, and those that exist often lack the power to deploy neural networks. But the good news is — it’s still possible to deploy a neural network on limited resources! In this guide, I’ll show you practical methods to run models even on modest free servers.
We’ll use Python as the language for the project, along with the FastAPI library to create the API service, and Docker for easy deployment.
What you’ll need:
- A HuggingFace account
- Your favorite IDE
Start by creating a new Space
- Go to Hugging Face Spaces.
- Click New Space
- Enter a name for your Space
- Select Docker as the SDK (this is mandatory: FastAPI will not run in the Gradio template).
- Choose Blank as the Docker template; we will define the image ourselves.
- Click “Create”
- Clone the created repository:
git clone https://huggingface.co/spaces/your_account_name/your_space_name
Define the project structure
Open the cloned repository in your favorite IDE and create the following structure:
/your_project_root
│
├── .gitattributes
├── Dockerfile
├── README.md
├── main.py
└── requirements.txt
Fig.1 Project structure
Define the Dockerfile
FROM python:3.9-slim

WORKDIR /app

RUN apt-get update && \
    apt-get install -y --no-install-recommends git g++ make && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

ENV HF_HOME=/tmp/huggingface-cache
ENV TOKENIZERS_PARALLELISM=false

EXPOSE 7860

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
Fig.2 Dockerfile
The Dockerfile is straightforward, but I’d like to highlight the environment variables:
- ENV HF_HOME=/tmp/huggingface-cache: sets the HF_HOME variable, which Hugging Face libraries use to cache downloaded models and tokenizers. Here the cache goes into the temporary directory /tmp, which is always writable inside the Space container.
- ENV TOKENIZERS_PARALLELISM=false: disables tokenizer multithreading to avoid the parallelism warnings and fork-related issues Hugging Face tokenizers can run into.
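If you want to confirm at runtime that the cache really ends up under /tmp, a quick check inside the container can use the huggingface_hub package that transformers already depends on (a sketch, not required for the service):

from huggingface_hub import constants

# With HF_HOME=/tmp/huggingface-cache set, the hub cache resolves under that directory
print(constants.HF_HUB_CACHE)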
Add dependencies to the project
For our app to work, we’ll need the following dependencies:
fastapi==0.109.0
uvicorn==0.27.0
torch==2.2.1 --index-url https://download.pytorch.org/whl/cpu
transformers==4.40.2
accelerate==0.29.3
sentencepiece==0.2.0
numpy==1.26.4
protobuf==3.20.3
Fig.3 List of dependencies
Start writing the API service
We’ve chosen TinyLlama/TinyLlama-1.1B-Chat-v1.0 as our model. It fits well within our limitations and offers decent performance. However, you’re free to choose any other model that suits your needs.
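Before wiring the model into the service, it can help to look at the chat format it was trained on. A minimal local sketch (the messages are just an example; apply_chat_template is part of the Transformers tokenizer API):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Build a prompt in the chat format the model expects
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Docker?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # prints the <|system|>/<|user|>/<|assistant|> markup used by TinyLlama-Chat

The service below keeps things simple and passes the raw prompt straight to the pipeline, but formatting prompts this way usually improves answer quality. Now, the service itself: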
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import numpy as np

# Check NumPy version
assert np.__version__.startswith('1.'), f"Incompatible NumPy version: {np.__version__}"

app = FastAPI()


class RequestData(BaseModel):
    prompt: str
    max_tokens: int = 50


MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

try:
    # Load the model with explicit device_map
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float32,
        device_map="auto",
        low_cpu_mem_usage=True
    )
    # Create a pipeline without specifying a device
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer
    )
except Exception as e:
    print(f"Model loading error: {str(e)}")
    generator = None


@app.post("/generate")
async def generate_text(request: RequestData):
    if not generator:
        raise HTTPException(status_code=503, detail="Model not loaded")
    try:
        output = generator(
            request.prompt,
            max_new_tokens=min(request.max_tokens, 100),
            do_sample=False,
            num_beams=1,
            temperature=0.7,
        )
        return {"response": output[0]["generated_text"]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health_check():
    return {"status": "ok" if generator else "unavailable"}
Fig.4 Main program code
This program demonstrates how to create a service that accepts POST requests at /generate and returns a JSON response. Pay special attention to this block:
output = generator(
    request.prompt,
    max_new_tokens=min(request.max_tokens, 100),
    do_sample=False,
    num_beams=1,
    temperature=0.7,
)
Fig.5 List of options
These parameters control text generation in the Hugging Face Transformers library: max_new_tokens caps how many new tokens are generated (here limited to 100), and do_sample=False with num_beams=1 selects plain greedy decoding, in which case temperature has no effect (Transformers only uses it when sampling is enabled). For a full description of all generation options and their values, refer to the official library documentation.
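For illustration, if you wanted more varied answers, a sampling configuration could look roughly like this (a sketch, not part of the service above; all parameter names are standard Transformers generation arguments):

output = generator(
    request.prompt,
    max_new_tokens=min(request.max_tokens, 100),
    do_sample=True,    # enable sampling; without this, temperature and top_p are ignored
    temperature=0.7,   # values below 1.0 make the output more focused, above 1.0 more random
    top_p=0.9,         # nucleus sampling: only consider tokens covering the top 90% of probability
)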
Now, commit and push all the changes.
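If you want to sanity-check the image before pushing, you can build and run it locally (assuming Docker is installed; the image name is arbitrary):

docker build -t tinyllama-api .
docker run --rm -p 7860:7860 tinyllama-api

Then push everything to the Space repository:

git add .
git commit -m "Add FastAPI service with TinyLlama"
git push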
Once the push completes, Hugging Face automatically starts building the container for your Space. Go to:
https://huggingface.co/spaces/your_account_name/your_space_name?logs=container
You can monitor the build process. Once it’s complete, you can send your first request:
curl -X POST "https://your_account_name-your_space_name.hf.space/generate" \
-H "Content-Type: application/json" \
-d '{"prompt":"What is weather?"}'
Fig.6 curl request
If everything is done correctly, you’ll receive a response like this:
{"response":"What is Docker?\n\nDocker is a tool for creating and deploying Linux-based containers. It allows you to build and deploy applications on Linux."}
Fig.7 Response to the request
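If you prefer calling the service from Python instead of curl, a small client could look like this (a sketch; it assumes the requests package is installed and uses the same Space URL as above):

import requests

# Minimal client for the /generate endpoint
url = "https://your_account_name-your_space_name.hf.space/generate"
payload = {"prompt": "What is Docker?", "max_tokens": 80}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["response"])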
Congratulations! You’ve successfully deployed your own neural network on a VPS, and it’s now linked to your account.
Conclusion
In conclusion, it’s worth noting that free Spaces on Hugging Face run on shared CPU/GPU and come with several limitations:
- Sleep mode: if the API receives no requests for about 48 hours, the Space goes to “sleep”, and the first request after waking it up is slow (roughly 30-60 seconds).
- Request timeout: requests that take longer than 1-2 minutes are terminated automatically.
- No GPU: models run on CPU, which can make heavier requests slow; free GPU time is only granted to active Spaces and is not available 24/7.
- Resource limits: about 1 CPU core and ~1 GB of RAM.
- Auto-deletion: Hugging Face may delete Spaces that have been inactive for over 90 days.
How to avoid “sleep mode”?
- Regular requests: send any request at least once a day, for example from cron (see the sample entry below) or a service like UptimeRobot.
- Notifications: set up monitoring so you receive alerts from Hugging Face when errors occur.
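A minimal crontab entry on any machine you control could look like this (the URL is your Space URL from the curl example; the schedule is up to you):

# Ping the /health endpoint once a day at 09:00 to keep the Space awake
0 9 * * * curl -s https://your_account_name-your_space_name.hf.space/health > /dev/null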
Alternatives
- If you need a free and simple solution → Hugging Face Spaces + FastAPI.
- If you need 24/7 uptime without “sleeping” → Google Cloud Run or Fly.io.
- If you need GPU and low latency → Hugging Face Inference Endpoints.
What’s next?
To restrict access to your service, you can add authorization.
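For example, a minimal API-key check using FastAPI’s built-in security helpers might look like this (a sketch that extends main.py above; the header name and SECRET_API_KEY value are placeholders you would choose yourself, ideally stored as a Space secret rather than in code):

from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

SECRET_API_KEY = "change-me"  # placeholder; read it from an environment variable in practice
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(api_key_header)):
    # Reject requests that do not carry the expected key
    if api_key != SECRET_API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")

@app.post("/generate", dependencies=[Depends(verify_api_key)])
async def generate_text(request: RequestData):
    ...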
To make the model faster on CPU, try quantization — this reduces its size and speeds up requests without significant loss of accuracy.
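As a rough illustration of the idea (not part of the service above, and the speedup varies by model), PyTorch’s dynamic quantization converts the linear layers to int8 for CPU inference:

import torch

# Dynamic int8 quantization of the model's linear layers, applied after loading on CPU
quantized_model = torch.quantization.quantize_dynamic(
    model,              # the AutoModelForCausalLM loaded in main.py
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,
)

Whether the generation pipeline works unchanged with the quantized model depends on the model and library versions, so treat this as a starting point rather than a drop-in replacement.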
When the project is ready for production, deploy it using Hugging Face Inference Endpoints or Google Cloud Run — these services simplify scaling and infrastructure management.
If you liked this article, you know what to do — subscribe, like, and share. It’s the best support for the author. This was Yuri Dubovitsky from the channel “Your Code Is Not Ready Yet, Sir”.
But our code is ready. See you next time!