Understanding Amazon Bedrock Pricing: From On-Demand to Fine-Tuning

As generative AI continues to revolutionize industries, Amazon Bedrock emerges as a pivotal platform, providing seamless access to a plethora of foundation models (FMs) from leading AI providers such as Anthropic, Meta, Mistral AI, and Amazon itself. Its serverless architecture and unified API simplify the deployment of AI applications. However, understanding its pricing nuances is crucial for optimizing both performance and cost.
Model Inference
When utilizing foundation models (FMs) in Amazon Bedrock for inference, there are two primary approaches: On-Demand and Provisioned Throughput.
On-Demand
In the On-Demand model, Amazon Bedrock operates on a pay-as-you-go basis, making it ideal for scenarios where usage patterns are unpredictable. For instance, if you're launching a new LLM application without a clear forecast of user engagement, this model offers flexibility without long-term commitments. Each foundation model (FM) available through Bedrock has its own pricing structure based on token usage. When the model is invoked, Bedrock calculates the number of input and output tokens processed and multiplies these by the respective per-token rates defined for that model.
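To make the token math concrete, here is a minimal sketch (in Python with boto3) that calls a model through Bedrock's Converse API and reads back the token counts the service reports for that request; the region and model ID are assumptions and may need adjusting in your account, for example to an inference profile.

```python
import boto3

# Region and model ID are assumptions; some accounts must invoke Nova models
# through an inference profile such as "us.amazon.nova-micro-v1:0".
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-micro-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize Amazon Bedrock pricing in one sentence."}]}],
)

# On-demand billing multiplies these counts by the model's per-token rates.
usage = response["usage"]
print("input tokens:", usage["inputTokens"], "output tokens:", usage["outputTokens"])
```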
Prompt caching
In addition to standard token pricing, Amazon Bedrock also offers a prompt caching feature. This allows repeated prompts within a short window to be served from cache, reducing both latency and cost—especially useful when parts of your input remain the same across multiple requests.
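As a rough sketch of how this looks in code, the Converse API lets you mark the end of a reusable prefix with a cachePoint block; model support, regional availability, and the minimum cacheable prompt size vary, so treat the model ID below as an assumption.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Static instructions go before the cache checkpoint so repeated requests that share
# this prefix can be served from cache. (In practice the shared prefix must meet the
# model's minimum token threshold to actually be cached.)
system = [
    {"text": "You are a support assistant for Example Corp. Answer in two sentences or fewer."},
    {"cachePoint": {"type": "default"}},
]

response = client.converse(
    modelId="amazon.nova-micro-v1:0",  # assumption: a prompt-caching-enabled model in your region
    system=system,
    messages=[{"role": "user", "content": [{"text": "Where can I download my latest invoice?"}]}],
)

# Cache hits are reported as cacheReadInputTokens; first writes as cacheWriteInputTokens.
usage = response["usage"]
print(usage.get("cacheReadInputTokens", 0), usage.get("cacheWriteInputTokens", 0))
```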
Let’s take a look at the current pricing for Amazon Nova Micro on the Bedrock pricing page. (Note: Pricing is subject to change, so it’s always a good idea to refer to the official AWS Bedrock pricing page for the latest rates.)
For example, Amazon Nova Micro—a lightweight text generation model—charges $0.000035 per 1,000 input tokens and $0.00014 per 1,000 output tokens when used in on-demand mode. If a portion of your prompt is cached, the cached input tokens are charged at a reduced rate of $0.00000875 per 1,000, offering substantial savings for repeated instructions or context. When running batch inference, input and output costs drop even further to $0.0000175 and $0.00007 per 1,000 tokens, respectively—making it a cost-efficient choice for large-scale jobs. While these prices seem small, they can quickly add up when you’re processing thousands of requests per day.
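To see how those per-1,000-token rates translate into a daily bill, here is a back-of-the-envelope calculation using the on-demand, cached, and batch rates quoted above; the request volume and token counts per request are assumptions chosen purely for illustration.

```python
# Nova Micro rates quoted above (USD per 1,000 tokens)
INPUT_RATE = 0.000035
OUTPUT_RATE = 0.00014
CACHED_INPUT_RATE = 0.00000875
BATCH_INPUT_RATE = 0.0000175
BATCH_OUTPUT_RATE = 0.00007

# Assumed workload: 10,000 requests per day, 1,000 input + 300 output tokens each
requests_per_day = 10_000
input_tokens, output_tokens = 1_000, 300

def daily_cost(in_rate, out_rate):
    per_request = (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate
    return per_request * requests_per_day

print(f"On-demand:  ${daily_cost(INPUT_RATE, OUTPUT_RATE):.2f}/day")
print(f"Batch:      ${daily_cost(BATCH_INPUT_RATE, BATCH_OUTPUT_RATE):.2f}/day")

# If 800 of the 1,000 input tokens per request are served from cache:
cached_per_request = (800 / 1000) * CACHED_INPUT_RATE + (200 / 1000) * INPUT_RATE \
                     + (output_tokens / 1000) * OUTPUT_RATE
print(f"With cache: ${cached_per_request * requests_per_day:.2f}/day")
```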
In addition to text-based models, Amazon Bedrock includes support for image and video generation, with pricing based on output type and quality. For example, generating images through Amazon Nova Canvas or Stability AI models costs a few cents per image, depending on resolution and quality level—higher resolutions or premium outputs cost more.
Batch processing - potential to reduce inference costs
If you plan to handle a high volume of prompts or images in one scheduled run, using batch inference can help reduce the cost per token or per image. Let's say you have 1,000 customer support transcripts that you want to summarize. Instead of sending each document individually—which can be both time-consuming and more expensive—you can use batch inference to process them all at once. Each document is treated as a separate prompt within a single batch job. The main advantage is a reduced per-token cost compared to on-demand inference, which makes batch inference well-suited for scheduled or background tasks that don't require real-time output.
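For reference, submitting such a job is a single API call; below is a minimal sketch using the CreateModelInvocationJob operation, where the S3 locations, IAM role, and model ID are placeholders and the input object is assumed to be a JSONL file with one recordId and modelInput payload per transcript.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder role ARN and S3 locations; the input object is a JSONL file in which
# each line carries a recordId plus the modelInput payload for one transcript.
response = bedrock.create_model_invocation_job(
    jobName="transcript-summaries-batch-01",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    modelId="amazon.nova-micro-v1:0",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/transcripts.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)
print(response["jobArn"])
```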
Provisioned Throughput
For applications that require consistent, high-performance inference—especially in production environments—Provisioned Throughput is a valuable option. Unlike the on-demand model where you pay per token, Provisioned Throughput reserves dedicated capacity for your chosen foundation model, ensuring low latency and predictable response times. You are billed by the hour for each provisioned model unit, regardless of how much you use it, which makes this approach best suited to steady, high-volume workloads. Bedrock also offers discounts based on commitment: the longer the reservation (e.g., a 1-month or 6-month term), the lower the hourly rate.
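To illustrate, reserving capacity happens once up front rather than per request; the sketch below uses the CreateProvisionedModelThroughput operation, and the model ID, unit count, and commitment term are assumptions for illustration.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# One model unit of dedicated capacity; billing runs from the moment the capacity is
# created, whether or not requests are sent. Omit commitmentDuration for the
# no-commitment (higher) hourly rate.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="prod-nova-micro",
    modelId="amazon.nova-micro-v1:0",  # assumption: a base or custom model you have access to
    modelUnits=1,
    commitmentDuration="OneMonth",     # or "SixMonths" for a larger discount
)
print(response["provisionedModelArn"])
```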
Which Pricing Model Should You Choose?
If you're just starting out or expect fluctuating usage, On-Demand gives you the flexibility to pay only for what you use—perfect for development, experimentation, or unpredictable traffic. If you’re processing large volumes of requests in scheduled jobs, Batch Inference offers the same flexibility with better cost-efficiency. For steady, production-level workloads that demand consistent performance and low latency, Provisioned Throughput is the most reliable choice, especially when combined with long-term commitments for additional savings.
Fine-Tuning: Customizing Models
When you want to fine-tune a model in Amazon Bedrock, the cost structure differs from standard inference and comes with a few additional components:
Training Cost: For text models, you're charged per 1,000 tokens processed during training. For image or multimodal models, pricing is typically based on the number of images used (see the job-submission sketch after this list).
Storage Fee: After fine-tuning, the custom model is stored in your account, and a monthly storage fee applies.
Inference Cost: You can’t run fine-tuned models in on-demand mode. Instead, you must use Provisioned Throughput, which is billed hourly—even if the model isn’t actively being used.
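For context, the training cost above comes from a customization job, which is kicked off with a single API call; the sketch below uses the CreateModelCustomizationJob operation with placeholder ARNs, bucket names, and illustrative hyperparameters (valid hyperparameter keys depend on the base model being tuned).

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder role and bucket names; the training data is a JSONL dataset, and the
# hyperparameter keys shown here are illustrative and vary by base model.
response = bedrock.create_model_customization_job(
    jobName="nova-micro-support-tune-01",
    customModelName="nova-micro-support-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.nova-micro-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/fine-tuning/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/fine-tuning/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},
)
print(response["jobArn"])
```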
For example, let’s consider fine-tuning Amazon Nova Micro using a small dataset.
Pricing for model customization (fine-tuning)
Let's say you're fine-tuning a model with 100,000 tokens (about 75,000 words or 150+ pages of content). That's still on the small side for deep fine-tuning, but it's a realistic starting point; a combined cost estimate follows the list below.
- Training Cost (One-time): You're charged based on the number of tokens processed during training. → Example: 100,000 tokens × $0.001 per 1,000 tokens = $0.10
- Model Storage (Monthly): Once the model is fine-tuned, storing it incurs a fixed monthly cost. → Example: $1.95 per month
- Provisioned Throughput for Inference (Hourly): Fine-tuned models must use provisioned throughput—you pay even if no requests are made. → Example: $108.15 per hour
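Putting the three components together, here is a quick estimate of the first month's bill under an assumed usage pattern; the 730-hour, always-on assumption is ours and shows why the provisioned-throughput line dominates the total.

```python
# Rates from the worked example above
TRAINING_RATE_PER_1K = 0.001   # USD per 1,000 training tokens
STORAGE_PER_MONTH = 1.95       # USD per custom model per month
PROVISIONED_PER_HOUR = 108.15  # USD per hour of provisioned throughput

training_tokens = 100_000
hours_running = 730            # assumption: capacity kept up around the clock for a month

training_cost = (training_tokens / 1000) * TRAINING_RATE_PER_1K   # one-time
throughput_cost = PROVISIONED_PER_HOUR * hours_running
first_month_total = training_cost + STORAGE_PER_MONTH + throughput_cost

print(f"Training (one-time):    ${training_cost:.2f}")
print(f"Storage (monthly):      ${STORAGE_PER_MONTH:.2f}")
print(f"Provisioned throughput: ${throughput_cost:,.2f}")
print(f"First month total:      ${first_month_total:,.2f}")
```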
Special Case Pricing: What You Should Know
When exploring Amazon Bedrock's pricing structure, it's essential to be aware of certain exceptional costs that can significantly impact your overall expenditure. Beyond the standard charges for on-demand usage and provisioned throughput, there are additional fees associated with model customization. For instance, fine-tuning a model on your proprietary data incurs costs based on the number of tokens processed during training. Moreover, once a model is fine-tuned, storing it attracts a monthly storage fee. These costs are separate from the inference charges and can accumulate over time, especially if multiple custom models are maintained.
Another area to consider is the inference of fine-tuned models. Unlike base models that can be used on-demand, fine-tuned models require provisioned throughput, meaning you need to reserve dedicated capacity, which is billed hourly regardless of usage. This can lead to higher costs, particularly if the reserved capacity isn't fully utilized. Additionally, importing models trained outside of Bedrock may involve compatibility evaluations and associated fees. It's crucial to factor in these exceptional costs when planning your AI infrastructure to avoid unexpected charges.