Distributed Model Serving Patterns

Intro

Every company's goal is to make money, and AI models are increasingly seen as an integral part of the business. As machine learning models move from experimentation to production, serving them becomes a challenge, and serving them at scale becomes an even larger one. Having a model with high accuracy isn't enough - we need infrastructure that is robust, efficient, and scalable. In this article, I will dive deeper into the main model serving patterns. This should be useful for anyone building ML platform systems that need to operate reliably under a large number of users (and requests) or large volumes of data.

What is Model Serving?

Model serving is the process of loading a previously trained machine learning model with the ultimate goal of generating predictions or, more generally, performing inference on new and unseen data.
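As a minimal illustration, the sketch below loads a previously trained model and runs inference on new data. It assumes a scikit-learn model saved with joblib; the file name and feature format are hypothetical.

```python
# Minimal model-serving sketch: load a trained artifact once, then reuse it
# for inference. Assumes a scikit-learn model saved with joblib; the file
# name "face_labeler.joblib" is a placeholder.
import joblib

model = joblib.load("face_labeler.joblib")   # load the trained model at startup

def predict(features):
    """Run inference on new, unseen data."""
    return model.predict([features])[0]
```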

Replicated Services Pattern

Imagine a very simple prediction server. You have a use case where users upload their photos or videos, and a trained ML model automatically labels the people in them. In general, such an API should be stateless: each request is processed independently and treated as a completely new transaction, without the server knowing anything about the client. At a small scale, it would be possible to run predictions on a single node. However, a growing number of user requests will inevitably lead to delays in getting responses, as those requests are processed in sequence. To solve this bottleneck, the Replicated Services pattern is used.
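To make the single-node case concrete, here is a minimal sketch of a stateless prediction server, using Flask purely for illustration. Everything the server needs arrives with the request, and no client state is kept between calls; the model file and payload shape are hypothetical.

```python
# A stateless prediction endpoint: the response depends only on the incoming
# request, never on previous interactions with the same client.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("face_labeler.joblib")   # hypothetical trained model

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()             # e.g. {"features": [...]}
    label = model.predict([payload["features"]])[0]
    return jsonify({"label": str(label)})

if __name__ == "__main__":
    app.run(port=8000)
```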

The Replicated Services pattern is what is typically meant when discussing Web Server Scalability 101. The core concept is adding multiple instances of the same model server; an instance here is a copy (or replica) of the original web server with a different address. Because the API is stateless, adding or removing equivalent servers (i.e., Horizontal Scaling) allows inference to scale seamlessly and ensures High Availability (HA). To ensure that requests go to the appropriate servers, Load Balancers are deployed.
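The routing logic itself is simple; the toy sketch below rotates requests across identical, stateless replicas in round-robin fashion. The replica addresses are placeholders, and a real deployment would use a dedicated load balancer (an L7 proxy or a cloud load balancer) rather than application code.

```python
# Round-robin routing across identical replicas of the same model server.
import itertools
import requests

REPLICAS = [
    "http://replica-1:8000",   # placeholder addresses for the replicas
    "http://replica-2:8000",
    "http://replica-3:8000",
]
_next_replica = itertools.cycle(REPLICAS)

def route(payload):
    """Forward an inference request to the next replica in rotation."""
    target = next(_next_replica)
    return requests.post(f"{target}/predict", json=payload, timeout=5).json()
```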

In addition to the above, the Replicated Services pattern helps reduce latency: replicas can be placed closer to a client's geographic location, minimizing round-trip time.

Sharded Services Pattern

In the previous pattern, the goal was to distribute many requests so that clients get responses quickly. However, inference in the ML domain very commonly involves large datasets. The expectation of serving large amounts of data is one of the core differences between a regular web server and a web server designed for ML inference. Thus, it is common to rely on yet another serving pattern, called the Sharded Services pattern.

The Replicated Services pattern uses fixed, identical computational resources. Regular web servers are not expected to do computationally intensive work, but that is exactly what is expected of ML-specific servers, so those resources become a bottleneck. In the Sharded Services pattern, a large request is divided into smaller pieces, where each piece (or segment) is processed independently by a model server shard, which is a partition of a larger model server. After each segment is processed, the results are merged into the final output.
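The sketch below illustrates that scatter-gather flow under a few assumptions: the shard endpoints are placeholders, each shard exposes the same /predict route, and the large batch is simply split by position before the partial results are merged.

```python
# Scatter-gather over model server shards: split a large request into
# segments, process each segment on its own shard, then merge the results.
from concurrent.futures import ThreadPoolExecutor
import requests

SHARDS = ["http://shard-0:8000", "http://shard-1:8000", "http://shard-2:8000"]

def split(batch, n):
    """Divide a large request into n roughly equal segments."""
    size = (len(batch) + n - 1) // n
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def infer_on_shard(shard_url, segment):
    resp = requests.post(f"{shard_url}/predict", json={"items": segment}, timeout=30)
    return resp.json()["labels"]              # hypothetical response shape

def predict_large_batch(batch):
    segments = split(batch, len(SHARDS))
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(infer_on_shard, SHARDS, segments)
    # Merge per-shard outputs into the final result.
    return [label for part in partials for label in part]
```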

The Sharded Services pattern is useful not only for large datasets but also in cases where each shard is responsible for a specific task (for example, a Natural Language Processing model on one shard and a Computer Vision model on another). Another use case is sharding by data characteristics, such as geographic region.

One of the core concepts of this pattern is the sharding function. A sharding function acts as an intelligent router that determines which shard a sub-request should go to. It is conceptually very similar to the hash functions used in more traditional distributed applications. The important characteristics of a sharding function are listed below, followed by a short sketch of one:

  1. Uniform distribution: It is important to distribute load evenly to prevent "hot shards" that become overloaded while others are underutilized.

  2. Minimal resharding impact: If the number of shards changes, the workload redistribution should be minimized. This is conceptually similar to consistent hashing algorithms like Ring Hash.

  3. Context awareness: For ML workloads specifically, the sharding function needs to understand model-specific characteristics, routing requests based on input size, computational complexity, and/or data characteristics.

It is also worth noting that, unlike the purely stateless routing of the Replicated Services pattern, the load balancer in a sharded setup typically employs more stateful algorithms that take information about the client or request into account when choosing a shard.
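As a sketch, the sharding function below hashes a stable request key (a user ID or region, for example) so the same key always lands on the same shard. It only shows the routing idea; a production setup would more likely use consistent hashing (a hash ring) so that changing the number of shards moves as few keys as possible.

```python
# A simple hash-based sharding function: stable, uniformly distributed routing
# of request keys to shard indices.
import hashlib

NUM_SHARDS = 4   # illustrative shard count

def shard_for(request_key: str) -> int:
    """Map a request key (e.g. a user ID or geographic region) to a shard."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```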

Event-Driven Processing Pattern

The Event-Driven Processing pattern complements the previous ones. In this pattern, the system operates on demand, allocating resources only when inference requests arrive rather than maintaining constantly active services. This approach leverages a shared resource pool where compute capacity is dynamically borrowed based on current load, enabling efficient utilization across the entire infrastructure. A critical consideration in this architecture is implementing robust defenses against denial-of-service, as both accidental (buggy clients) and malicious overuse can overwhelm the system. Protection mechanisms typically include rate limiting to control request-processing velocity, along with queuing systems that buffer excess requests and process them at a manageable pace without losing data.
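A minimal sketch of the buffering side of this pattern is shown below: incoming inference events are pushed onto a bounded queue and drained at a fixed rate, so bursts are smoothed out instead of overwhelming the workers. The handler, queue size, and rate are placeholders.

```python
# Bounded queue plus a crude rate limit: excess requests are buffered (or
# rejected with back-pressure) rather than allowed to overwhelm the system.
import queue
import threading
import time

MAX_REQUESTS_PER_SECOND = 10
pending = queue.Queue(maxsize=1_000)       # bounded buffer for incoming events

def handle_inference(event):
    """Placeholder for the actual model invocation."""
    print("processing", event)

def submit(event) -> bool:
    """Accept an event if the buffer has room; otherwise signal back-pressure."""
    try:
        pending.put_nowait(event)
        return True
    except queue.Full:
        return False                       # caller can retry later or shed load

def worker():
    while True:
        event = pending.get()              # blocks until work arrives
        handle_inference(event)
        pending.task_done()
        time.sleep(1 / MAX_REQUESTS_PER_SECOND)   # limit processing velocity

threading.Thread(target=worker, daemon=True).start()
```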

Conclusion

I hope you enjoyed reading this post. In future ones, I will dive deeper into the rest of the Machine Learning Infrastructure setup.