Step-by-Step Guide to Build and Deploy an LLM-Powered Chat with Memory in Streamlit

And monitor your API usage on Google Cloud Console The post Step-by-Step Guide to Build and Deploy an LLM-Powered Chat with Memory in Streamlit appeared first on Towards Data Science.

May 2, 2025 - 01:33

In this post, I’ll show you step by step how to build and deploy a chat powered with LLM — Gemini — in Streamlit and monitor the API usage on Google Cloud Console. Streamlit is a Python framework that makes it super easy to turn your Python scripts into interactive web apps, with almost no front-end work.

Recently, I built a project, bordAI — a chat assistant powered by LLM integrated with tools I developed to support embroidery projects. After that, I decided to start this series of posts to share tips I’ve learned along the way.

Here’s a quick summary of the post:

1 to 6 — Project Setup

7 to 13 — Building the Chat

14 to 15— Deploy and Monitor the app

1. Create a New GitHub repository

Go to GitHub and create a new repository.

2. Clone the repository locally

→ Execute this command in your terminal to clone it:

git clone

3. Set Up a Virtual Environment (optional)

A Virtual Environment is like a separate space on your computer where you can install a specific version of Python and libraries without affecting the rest of your system. This is useful because different projects might need different versions of the same libraries.

→ To create a virtual environment:

pyenv virtualenv 3.9.14 chat-streamlit-tutorial

→ To activate it:

pyenv activate chat-streamlit-tutorial

4. Project Structure

A project structure is just a way to organize all the files and folders for your project. Ours will look like this:

chat-streamlit-tutorial/
│
├── .env
├── .gitignore
├── app.py
├── functions.py
├── requirements.txt
└── README.md

.env→ file where you store your API key (not pushed to GitHub)
.gitignore → file where you list the files or folders for git to ignore
app.py → main streamlit app
functions.py → custom functions to better organize the code
requirements.txt → list of libraries your project needs
README.md → file that explains what your project is about

→ Execute this inside your project folder to create these files:

touch .env .gitignore app.py functions.py requirements.txt

→ Inside the file .gitignore, add:

.env
__pycache__/

→ Add this to the requirements.txt:

streamlit
google-generativeai
python-dotenv

→ Install dependencies:

pip install -r requirements.txt

5. Get API Key

An API Key is like a password that tells a service you have permission to use it. In this project, we’ll use the Gemini API because they have a free tier, so you can play around with it without spending money.

Go to https://aistudio.google.com/
Create or log in to your account.
Click on “Create API Key“, create it, and copy it.

Don’t set up billing if you just want to use the free tier. It should say “Free” under “Plan”, just like here:

We’ll use gemini-2.0-flash in this project. It offers a free tier, as you can see in the table below:

Screenshot by the author from https://aistudio.google.com/plan_information

15 RPM = 15 Requests per minute
1,000,000 TPM = 1 Million Tokens Per Minute
1,500 RPD = 1,500 Requests Per Day

Note: These limits are accurate as of April 2025 and may change over time.

Just a heads up: if you are using the free tier, Google may use your prompts to improve their products, including human reviews, so it’s not recommended to send sensitive information. If you want to read more about this, check this link.

6. Store your API Key

We’ll store our API Key inside a .env file. A .env file is a simple text file where you store secret information, so you don’t write it directly in your code. We don’t want it going to GitHub, so we have to add it to our .gitignore file. This file determines which files git should literally ignore when you push your changes to the repository. I’ve already mentioned this in part 4, “Project Structure”, but just in case you missed it, I’m repeating it here.

This step is really important, don’t forget it!
→ Add this to .gitignore:

.env
__pycache__/

→ Add the API Key to .env:

API_KEY= "your-api-key"

If you’re running locally, .env works fine. However, if you’re deploying in Streamlit later, you will have to use st.secrets. Here I’ve included a code that can work in both scenarios.

→Add this function to your functions.py:

import streamlit as st
import os
from dotenv import load_dotenv

def get_secret(key):
    """
    Get a secret from Streamlit or fallback to .env for local development.

    This allows the app to run both on Streamlit Cloud and locally.
    """
    try:
        return st.secrets[key]
    except Exception:
        load_dotenv()
        return os.getenv(key)

→ Add this to your app.py:

import streamlit as st
import google.generativeai as genai
from functions import get_secret

api_key = get_secret("API_KEY")

7. Choose the model

I chose gemini-2.0-flash for this project because I think it’s a great model with a generous free tier. However, you can explore other model options that also offer free tiers and choose your preferred one.

Pro: models designed for high–quality outputs, including reasoning and creativity. Generally used for complex tasks, problem-solving, and content generation. They are multimodal — this means they can process text, image, video, and audio for input and output.
Flash: models projected for speed and cost efficiency. Can have lower-quality answers compared to the Pro for complex tasks. Generally used for chatbots, assistants, and real-time applications like automatic phrase completion. They are multimodal for input, and for output is currently just text, other features are in development.
Lite: even faster and cheaper than Flash, but with some reduced capabilities, such as it is multimodal only for input and text-only output. Its main characteristic is that it is more economical than the Flash, ideal for generating large amounts of text within cost restrictions.

This link has plenty of details about the models and their differences.

Here we are setting up the model. Just replace “gemini-2.0-flash” with the model you’ve chosen.

→ Add this to your app.py:

genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")

8. Build the chat

First, let’s discuss the key concepts we’ll use:

st.session_state: this works like a memory for your app. Streamlit reruns your script from top to bottom every time something changes — when you send a message or click a button — so normally, all the variables would be reset. This allows Streamlit to remember values between reruns. However, if you refresh your web page you’ll lose the session_state.
st.chat_message(name, avatar): Creates a chat bubble for a message in the interface. The first parameter is the name of the message author, which can be “user”, “human”, “assistant”, “ai”, or str. If you use user/human and assistant/ai, it already has default avatars of user and bot icons. You can change this if you want to. Check out the documentation for more details.
st.chat_input(placeholder): Displays an input box at the bottom for the user to type messages. It has many parameters, so I recommend you check out the documentation.

First, I’ll explain each part of the code separately, and after I’ll show you the whole code together.

This initial step initializes your session_state, the app’s “memory”, to keep all the messages within one session.

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

Next, we’ll set the first default message. This is optional, but I like to add it. You could add some initial instructions if suitable for your context. Every time Streamlit runs the page and st.session_state.chat_history is empty, it’ll append this message to the history with the role “assistant”.

if not st.session_state.chat_history:
    st.session_state.chat_history.append(("assistant", "Hi! How can I help you?"))

In my app bordAI, I added this initial message giving context and instructions for my app:

For the user part, the first line creates the input box. If user_message contains content, it writes it to the interface and then appends it to chat_history.

user_message = st.chat_input("Type your message...")

if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))

Now let’s add the assistant part:

system_prompt is the prompt sent to the model. You could just send the user_message in place of full_input (look at the code below). However, the output might not be precise. A prompt provides context and instructions about how you want the model to behave, not just what you want it to answer. A good prompt makes the model’s response more accurate, consistent, and aligned with your goals. In addition, without telling how our model should behave, it’s vulnerable to prompt injections.

Prompt injection is when someone tries to manipulate the model’s prompt in order to alter its behavior. One way to mitigate this is to structure prompts clearly and delimit the user’s message within triple quotes.

We’ll start with a simple and unclear system_prompt and in the next session we’ll make it better to compare the difference.

full_input: here, we’re organizing the input, delimiting the user message with triple quotes (“””). This doesn’t prevent all prompt injections, but it is one way to create better and more reliable interactions.
response: sends a request to the API, storing the output in response.
assistant_reply: extracts the text from the response.

Finally, we use st.chat_message() combined to write() to display the assistant reply and append it to the st.session_state.chat_history, just like we did with the user.

if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))
    
    system_prompt = f"""
    You are an assistant.
    Be nice and kind in all your responses.
    """
    full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

    response = model.generate_content(full_input)
    assistant_reply = response.text

    st.chat_message("assistant").write(assistant_reply)
    st.session_state.chat_history.append(("assistant", assistant_reply))

Now let’s see everything together!

→ Add this to your app.py:

import streamlit as st
import google.generativeai as genai
from functions import get_secret

api_key = get_secret("API_KEY")
genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if not st.session_state.chat_history:
    st.session_state.chat_history.append(("assistant", "Hi! How can I help you?"))

user_message = st.chat_input("Type your message...")

if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))

    system_prompt = f"""
    You are an assistant.
    Be nice and kind in all your responses.
    """
    full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

    response = model.generate_content(full_input)
    assistant_reply = response.text

    st.chat_message("assistant").write(assistant_reply)
    st.session_state.chat_history.append(("assistant", assistant_reply))

To run and test your app locally, first navigate to the project folder, then execute the following command.

→ Execute in your terminal:

cd chat-streamlit-tutorial
streamlit run app.py

Yay! You now have a chat running in Streamlit!

9. Prompt Engineering

Prompt Engineering is a process of writing instructions to get the best possible output from an AI model.

There are plenty of techniques for prompt engineering. Here are 5 tips:

Write clear and specific instructions.
Define a role, expected behavior, and rules for the assistant.
Give the right amount of context.
Use the delimiters to indicate user input (as I explained in part 8).
Ask for the output in a specified format.

These tips can be applied to the system_prompt or when you’re writing a prompt to interact with the chat assistant.

Our current system prompt is:

system_prompt = f"""
You are an assistant.
Be nice and kind in all your responses.
"""

It is super vague and provides no guidance to the model.

No clear direction for the assistant, what kind of help it should provide
No specification of the role or what is the topic of the assistance
No guidelines for structuring the output
No context on whether it should be technical or casual
Lack of boundaries

We can improve our prompt based on the tips above. Here’s an example.

→ Change the system_prompt in the app.py:

system_prompt = f"""
You are a friendly and a programming tutor.
Always explain concepts in a simple and clear way, using examples when possible.
If the user asks something unrelated to programming, politely bring the conversation back to programming topics.
"""
full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

If we ask “What is python?” to the old prompt, it just gives a generic short answer:

With the new prompt, it provides a more detailed response with examples:

Try changing the system_prompt yourself to see the difference in the model outputs and craft the ideal prompt for your context!

10. Choose Generate Content Parameters

There are many parameters you can configure when generating content. Here I’ll demonstrate how temperature and maxOutputTokens work. Check the documentation for more details.

temperature: controls the randomness of the output, ranging from 0 to 2. The default is 1. Lower values produce more deterministic outputs, while higher values produce more creative ones.
maxOutputTokens: the maximum number of tokens that can be generated in the output. A token is approximately four characters.

To change the temperature dynamically and test it, you can create a sidebar slider to control this parameter.

→ Add this to app.py:

temperature = st.sidebar.slider(
    label="Select the temperature",
    min_value=0.0,
    max_value=2.0,
    value=1.0
)

→ Change the response variable to:

response = model.generate_content(
    full_input,
    generation_config={
        "temperature": temperature,
        "max_output_tokens": 1000
    }
)

The sidebar will look like this:

Try adjusting the temperature to see how the output changes!

11. Display chat history

This step ensures that you keep track of all the exchanged messages in the chat, so you can see the chat history. Without this, you’d only see the latest messages from the assistant and user each time you send something.

This code accesses everything appended to chat_history and displays it in the interface.

→ Add this before the if user_message in app.py:

for role, message in st.session_state.chat_history:
    st.chat_message(role).write(message)

Now, all the messages within one session are kept visible in the interface:

Obs: I tried to ask a non-programming question, and the assistant tried to change the subject back to programming. Our prompt is working!

12. Chat with memory

Besides having messages stored in chat_history, our model isn’t aware of the context of our conversation. It is stateless, each transaction is independent.

To solve this, we have to pass all this context inside our prompt so the model can reference previous messages exchanged.

Create context which is a list containing all the messages exchanged until that moment. Adding lastly the most recent user message, so it doesn’t get lost in the context.

system_prompt = f"""
You are a friendly and knowledgeable programming tutor.
Always explain concepts in a simple and clear way, using examples when possible.
If the user asks something unrelated to programming, politely bring the conversation back to programming topics.
"""
full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

context = [
    *[
        {"role": role, "parts": [{"text": msg}]} for role, msg in st.session_state.chat_history
    ],
    {"role": "user", "parts": [{"text": full_input}]}
]

response = model.generate_content(
    context,
    generation_config={
        "temperature": temperature,
        "max_output_tokens": 1000
    }
)

Now, I told the assistant that I was working on a project to analyze weather data. Then I asked what the theme of my project was and it correctly answered “weather data analysis”, as it now has the context of the previous messages.

If your context gets too long, you can consider summarizing it to save costs, since the more tokens you send to the API, the more you’ll pay.

13. Create a Reset Button (optional)

I like adding a reset button in case something goes wrong or the user just wants to clear the conversation.

You just need to create a function to set de chat_history as an empty list. If you created other session states, you should set them here as False or empty, too.

→ Add this to functions.py:

def reset_chat():
    """
    Reset the Streamlit chat session state.
    """
    st.session_state.chat_history = []
    st.session_state.example = False # Add others if needed

→ And if you want it in the sidebar, add this to app.py:

from functions import get_secret, reset_chat

if st.sidebar.button("Reset chat"):
    reset_chat()

It will look like this:

Everything together:

import streamlit as st
import google.generativeai as genai
from functions import get_secret, reset_chat

api_key = get_secret("API_KEY")
genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")

temperature = st.sidebar.slider(
    label="Select the temperature",
    min_value=0.0,
    max_value=2.0,
    value=1.0
)

if st.sidebar.button("Reset chat"):
    reset_chat()

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if not st.session_state.chat_history:
    st.session_state.chat_history.append(("assistant", "Hi! How can I help you?"))

for role, message in st.session_state.chat_history:
    st.chat_message(role).write(message)

user_message = st.chat_input("Type your message...")

if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))

    system_prompt = f"""
    You are a friendly and a programming tutor.
    Always explain concepts in a simple and clear way, using examples when possible.
    If the user asks something unrelated to programming, politely bring the conversation back to programming topics.
    """
    full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

    context = [
        *[
            {"role": role, "parts": [{"text": msg}]} for role, msg in st.session_state.chat_history
        ],
        {"role": "user", "parts": [{"text": full_input}]}
    ]

    response = model.generate_content(
        context,
        generation_config={
            "temperature": temperature,
            "max_output_tokens": 1000
        }
    )
    assistant_reply = response.text

    st.chat_message("assistant").write(assistant_reply)
    st.session_state.chat_history.append(("assistant", assistant_reply))

14. Deploy

If your repository is public, you can deploy with Streamlit for free.

MAKE SURE YOU DO NOT HAVE API KEYS ON YOUR PUBLIC REPOSITORY.

First, save and push your code to the repository.

→ Execute in your terminal:

git add .
git commit -m "tutorial chat streamlit"
git push origin main

Pushing directly into the main isn’t a best practice, but since it’s just a simple tutorial, we’ll do it for convenience.

Go to your streamlit app that is running locally.
Click on “Deploy” at the top right.
In Streamlit Community Cloud, click “Deploy now”.
Fill out the information.

5. Click on “Advanced settings” and write API_KEY="your-api-key", just like you did with the .env file.

6. Click “Deploy”.

All done! If you’d like, check out my app here! Read More