Real-time log analytics architectures that actually scale
Storing and searching logs is a problem as old as software. Many tools exist to do it. It's solved. But this post focuses on a different problem.
What if your users want to explore the logs? What if you have billions of logs?
Fundamentally, this is a problem of scale in multi-tenant architectures. It's something that Vercel must solve with runtime logs, for example.
Here, I share some things that my teammates and I have learned from working with massive amounts of data across multiple domains over the last few years, and that I think could be useful if you're building something similar.
If you are reading this, you might know that Tinybird is a platform designed from the ground up to handle scale, and it is used every day in the real world for solving some of these problems (Vercel Observability, for example, is built on Tinybird).
In this post, however, I will try to focus on the technical challenges of storing and retrieving logs in multi-tenant, user-facing systems and how to solve them in a tool-agnostic way. If you are interested in a complete Tinybird implementation of these principles, check out this multi-tenant logs analytics template.
Ingestion: How do I get logs from my application to my analytical system?
When building log analytics (regardless of scale), the first challenge you are going to face is getting your logs from where they are generated to where they are stored and ultimately analyzed. I am going to go over a few options that range in complexity, so that you can evaluate which solution best fits your needs.
Option 1: Just Log to Log
This is the simplest approach: applications write logs directly to a log management system. The application is responsible for log formatting, buffering, and delivery.
This approach is suitable for simple applications with moderate logging needs and where operational simplicity is a priority.
It's a simple, cost-effective approach with no additional infrastructure required. However, the application itself has to handle buffering, retries, and backpressure, which limits scalability and makes delivery harder to observe.
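To make this concrete, here is a minimal sketch (in Python, purely as an illustration) of an application shipping its own logs over HTTP. The endpoint, token, and field names are placeholders for whatever your log management system expects; Tinybird's Events API, for example, accepts batched NDJSON over HTTP.

```python
import json
import logging
import time
import urllib.request

# Placeholder endpoint and token: substitute whatever your log management
# system expects for NDJSON/JSON ingestion over HTTP.
COLLECTOR_URL = "https://logs.example.com/ingest"
COLLECTOR_TOKEN = "YOUR_INGEST_TOKEN"


class HTTPLogHandler(logging.Handler):
    """Formats records as JSON and ships them in small NDJSON batches."""

    def __init__(self, batch_size: int = 50):
        super().__init__()
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        req = urllib.request.Request(
            COLLECTOR_URL,
            data="\n".join(self.buffer).encode(),
            headers={"Authorization": f"Bearer {COLLECTOR_TOKEN}"},
            method="POST",
        )
        urllib.request.urlopen(req, timeout=5)  # the app owns delivery: no agent to fall back on
        self.buffer = []


logger = logging.getLogger("checkout-service")
logger.setLevel(logging.INFO)
logger.addHandler(HTTPLogHandler())
logger.info("order created")
```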
Option 2: Log to Log with a Sidecar
This approach involves deploying a sidecar container/application alongside your application that handles log collection and forwarding. The sidecar is a common architectural pattern in microservices where a helper container is deployed alongside the main application container.
Components
- Application Container
  - Writes logs to stdout/stderr or files
  - Shares a volume with the sidecar container
  - Requires minimal changes to application code
- Shared Volume
  - Mounted by both containers
  - Acts as a buffer for logs
  - Can be configured for size limits and rotation
- Sidecar Container
  - Runs a logging agent (e.g., Fluentd, Filebeat, Vector)
  - Reads logs from the shared volume
  - Handles log processing, enrichment, and forwarding
  - Can implement retry logic and buffering
This solution is suitable for applications that are already containerized and where you want to avoid changing the application code.
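In practice the sidecar is usually an off-the-shelf agent like Vector or Fluentd that you configure rather than code. Purely to illustrate what that agent is doing for you, here is a toy Python sidecar that tails a log file on the shared volume and forwards batches with a simple retry; the path and collector URL are placeholders.

```python
import time
import urllib.request

# Toy stand-in for a real logging agent (Vector, Fluentd, Filebeat, ...), only
# to illustrate the sidecar's job. It assumes the application writes one JSON
# log per line to a file on the shared volume; path and URL are placeholders.
LOG_PATH = "/var/log/app/app.log"
COLLECTOR_URL = "https://logs.example.com/ingest"


def forward(batch):
    """Send a batch of log lines, retrying a few times before giving up."""
    body = "\n".join(batch).encode()
    for attempt in range(3):
        try:
            req = urllib.request.Request(COLLECTOR_URL, data=body, method="POST")
            urllib.request.urlopen(req, timeout=5)
            return
        except OSError:
            time.sleep(2 ** attempt)  # simple exponential backoff


def tail_and_forward(batch_size: int = 100):
    batch = []
    with open(LOG_PATH) as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                if batch:  # flush whatever we have while the app is quiet
                    forward(batch)
                    batch = []
                time.sleep(0.5)
                continue
            # Enrichment (host, region, tenant id, ...) would happen here.
            batch.append(line.rstrip("\n"))
            if len(batch) >= batch_size:
                forward(batch)
                batch = []


if __name__ == "__main__":
    tail_and_forward()
```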
Option 3: A hybrid approach
A hybrid approach can work in multiple ways:
- Direct mode: Applications send logs directly to a central collector.
- Sidecar mode: Each application has its own sidecar that sends logs to a central collector.
- Gateway mode: Applications send logs to a gateway that forwards them to a central collector. Useful for multi-region deployments.
This approach lets you start simple (like Log to Log) but gives you the flexibility to evolve to more complex patterns (like sidecars or gateways) as your needs grow, all while using the same core technology.
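To sketch the gateway mode from the list above (only as a sketch, not a production design), here is a minimal regional gateway in Python: it accepts NDJSON batches from applications in its region, tags each line with the region, and forwards them to a central collector. The URLs, port, and REGION variable are assumptions made for the example.

```python
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch of the gateway mode. Applications in one region POST NDJSON
# batches to this gateway; it tags each line with the region and forwards the
# batch to the central collector. URL, port, and REGION are placeholders.
CENTRAL_COLLECTOR_URL = "https://logs-central.example.com/ingest"
REGION = os.environ.get("REGION", "eu-west-1")


class GatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        lines = self.rfile.read(length).decode().splitlines()

        # Enrich every log line with the region before forwarding.
        enriched = []
        for line in lines:
            row = json.loads(line)
            row["region"] = REGION
            enriched.append(json.dumps(row))

        req = urllib.request.Request(
            CENTRAL_COLLECTOR_URL,
            data="\n".join(enriched).encode(),
            method="POST",
        )
        urllib.request.urlopen(req, timeout=5)
        self.send_response(202)  # tell the application its batch was accepted
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), GatewayHandler).serve_forever()
```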
I won't tell you exactly how to do any of this for your specific application, but if you do intend to use Tinybird to store and analyze logs, we have a few useful guides on how to ingest data from tools like Vector.dev, AWS Kinesis, or Google Cloud Pub/Sub (or you can browse the full set of ingestion guides for the one that fits your needs best).
Storage: How do I handle multi-tenancy?
In this context, multi-tenancy is a common scenario: You have an app or system that provides a service to many different users, and you need to ensure that each user’s log data is only served to them. This may seem to be a retrieval problem more than a storage problem, but as with most data problems, it starts with storage.
Replicate your infra for each client?
The most obvious way to ensure client separation is to replicate your system's architecture once for each client. This comes with a notable downside: inefficiency. You need to size each deployment for the particular load of its client, which can be a problem, especially if your SLAs call for stable service during load spikes.
Separate storage and compute?
Another option would be to separate compute and storage, reusing the same compute capacity for all of your clients while storing their data separately. This can be done at many levels, but in practice, it often means sending each client's data to a different table in your DBMS.
This could be a practical solution, but given enough clients (scale strikes again), you might find yourself with millions of tables, each with the same structure for its logs. DBMSs tend not to like having that many tables.
Keep all your logs in one table
Finally, another solution (and the one that usually makes the most sense) is to keep all of your logs in the same shared table and tag each log with a unique id for the client that generated it. That way, when you read a log, your system knows exactly who it can and cannot be served to.
This solution does not have the table scalability problem described above; now you are only limited by the number of rows/logs that your DBMS can reasonably handle given your needs, i.e., how fast you need to retrieve your data (hint: nobody likes "slow").
But you have to be careful about something. If you put all clients' data in one table, you swap one problem for another: storage becomes simple, and retrieval becomes complex. Imagine that client "A" sends a request to your system, and it needs to read some of their data. If you don't use any special strategy to index the data, you would need to go through every log and check whether its owner is "A".
This means that the query has an algorithmic complexity dependent on the total number of logs that you have. Since all your clients’ logs are in the same place, and you have to scan every log to identify what belongs to whom, a client with 10 logs would take a similar amount of time to perform the query as a client with 10 billion logs (and neither of them would be happy).
This is where you could think about indexing the data by client. That would probably make things better, depending on the distribution of your data, but the reality is that your index is going to be very big, since it needs to point to every single log that belongs to every single user. Ideally, a reference to where the data is stored is smaller than the data itself, but you still have to run that lookup on a real machine, and that means going to storage, retrieving some logs, jumping to some other section of the storage, retrieving some more logs, and repeating until you are done for that client. At a big enough scale, this is not going to cut it either.
So how could you make this more efficient? Well, a good solution would be to organize your data so that all of a client's logs are close to each other on disk, which has two implications:
- First, you do not need to store a reference in your index to every single line for each client; you just need to know where their data starts, and the rest sits right next to it (until you get to the "next" client). Your index is now only as big as the number of clients you have, and therefore efficient at much greater scale.
- Second, retrieval from storage is more efficient when the data you are looking for is contiguous. With this layout, you've essentially solved the problem: the cost of retrieving one client's logs is (almost) independent of every other client's volume.
Now, I'm not going to go into detail about how to implement this on multiple systems, but I will mention that this is really easy to do with Tinybird. Tinybird uses columnar storage; you can set the way that your data is indexed and sorted (i.e. organized) in storage by a particular column (in this case the client's unique id), and use row-level security policies so that when you retrieve a client’s data, the system only needs to go to the index, see where the data for that client starts and ends, and get it from storage in one go.
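If you want to see the mechanics without any particular database, here is a toy Python sketch of what a sorting key gives you: logs laid out by (client_id, timestamp), an index with exactly one entry per client, and retrieval as a single contiguous read. A real columnar store maintains this layout on disk for you; the data and field names are made up for the example.

```python
from itertools import groupby

# Toy illustration of what a sorting key buys you; a real columnar database
# keeps this layout on disk for you. Field names are illustrative.
raw_logs = [
    {"client_id": "acme", "timestamp": "2025-01-01T10:00:02Z", "message": "POST /checkout"},
    {"client_id": "zeta", "timestamp": "2025-01-01T10:00:01Z", "message": "GET /api"},
    {"client_id": "acme", "timestamp": "2025-01-01T10:00:00Z", "message": "GET /"},
]

# Store logs sorted by (client_id, timestamp): each client's rows are contiguous.
logs = sorted(raw_logs, key=lambda r: (r["client_id"], r["timestamp"]))

# The index now needs one entry per client (where its block starts and ends),
# not one entry per log line.
index = {}
start = 0
for client, block in groupby(logs, key=lambda r: r["client_id"]):
    length = sum(1 for _ in block)
    index[client] = (start, start + length)
    start += length


def logs_for(client_id):
    """One index lookup, then a single contiguous read."""
    lo, hi = index.get(client_id, (0, 0))
    return logs[lo:hi]


# Only touches acme's slice, no matter how many other clients (or logs) exist.
print(logs_for("acme"))
```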
Retrieval: How do I filter to give my user a quick glance of their logs?
A good logs explorer makes it simple to filter by things like environment, request path, status codes, etc.
The problem of logs retrieval in a multi-tenant system is a very practical problem that anyone looking at logs faces. When you are exploring your logs, you do not usually want to just read them one by one; it would be way too much for a person to process.
Instead, you probably want to quickly see stuff that gives you a high-level view of what your system is doing, like "How many log lines do I have for the different levels (e.g., DEBUG, WARNING, ERROR)?", or "How many 200s has my service returned?" to get some sense of the usage it is getting, or "How many 500s has my service returned?" to identify any problems. Or even more complex stuff, like "Which path on my website is the most popular or generating the most errors?".
How do you do this? Generally, it starts by defining a time range to look for the logs, for example, the last 7 days (usually the freshest data is the most relevant). And then, for that time range, you want to provide some view - a chart or list - that answers questions like those above.
The most direct way to calculate these metrics is to get them on the fly - to query the underlying logs storage table and pass all the filters and aggregations in the query. However, scale is here to ruin your solution yet again, and there will be a point where doing this every single time that you need those stats is not practical, performant, or cheap.
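For contrast, this is roughly what that naive on-the-fly aggregation looks like when written out as a plain scan (a Python sketch with illustrative field names, not how you would actually run it): the work grows with the number of log rows, which is exactly what stops scaling.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# The naive on-the-fly version, spelled out as a scan to make the cost explicit.
# `logs` is assumed to be an iterable of dicts with illustrative field names:
# {"tenant_id": ..., "timestamp": datetime, "level": ..., "status": ...}.
def levels_last_7_days(logs, tenant_id):
    since = datetime.now(timezone.utc) - timedelta(days=7)
    counts = Counter()
    for row in logs:  # work grows with the number of log rows, not with the answer size
        if row["tenant_id"] != tenant_id or row["timestamp"] < since:
            continue
        counts[row["level"]] += 1  # e.g. Counter({"WARNING": 340, "ERROR": 12, ...})
    return counts
```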
Pre-calculate your aggregates
So here's another trick: pre-calculate those stats beforehand. Now I can hear you say, "but I do not know the exact time frame that my user will want", and thus you cannot pre-calculate the exact values. However, not all is lost.
The simplest approach is to pre-calculate aggregations over fixed time windows and use those to compute the stats for the time window you want to show your user. This can be as simple as calculating daily totals (these are called rollups): when your user looks at a 7-day time frame, you take your 7 pre-aggregated daily values, sum them, and, voilà, you've got your fancy stats.
Now, the number of values that you have to read to get the total no longer depends on the number of logs in your system, only on the aggregation window that you choose and the time frame your user wants to see.
In the last-7-days example, you'd need to read just 7 values regardless of the actual number of rows you are counting. You could have a trillion logs every day, but you'd still read 7 values, not 7 trillion.
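Here is a small Python sketch of that rollup idea, with made-up names: a daily counts table keyed by (tenant, day, level) that is updated as logs arrive, and a 7-day question answered by summing at most 7 pre-aggregated values per level. In a real deployment this would typically be a materialized view or scheduled rollup maintained by the database, not application code.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

# Sketch of the rollup idea. In a real system this would be a materialized
# view or a scheduled rollup job maintained by your database; here it is just
# a dict keyed by (tenant_id, day, level) holding daily counts.
daily_counts = defaultdict(int)


def on_log_ingested(tenant_id, day, level):
    """Update the rollup incrementally as each log arrives."""
    daily_counts[(tenant_id, day, level)] += 1


def levels_last_7_days(tenant_id, today):
    """Answer the question from 7 days of pre-aggregated values per level,
    regardless of how many raw log rows those days contain."""
    counts = Counter()
    for offset in range(7):
        day = today - timedelta(days=offset)
        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
            counts[level] += daily_counts[(tenant_id, day, level)]
    return counts


# Usage: ingest a couple of logs, then read back a 7-day summary.
on_log_ingested("acme", date(2025, 1, 7), "ERROR")
on_log_ingested("acme", date(2025, 1, 3), "WARNING")
print(levels_last_7_days("acme", today=date(2025, 1, 7)))
```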