How Geekbench 6 Multicore Is Broken by Design

As a developer, performance is very important to me (clearly, I am not a front-end dev - hah!). It's crucial for the company I work for, too, affecting both cost and user experience. I've been regularly performing and publishing cloud VM CPU comparisons to share my insights. Although my primary tool is my own DKbench suite, I tend to include Geekbench 5 in the comparison, mostly due to the abundance of available published results. This is despite Geekbench 6 being available for some time now. You might wonder why I haven't switched: Simply put, I found that Geekbench 6 multi-core is fundamentally "broken", and I thought I'd explain why.
Geekbench 6 Fails at Multi-Core Scaling
Geekbench 6 barely scales on multi-core systems. This is unlike Geekbench 5, which, although it does not scale linearly, continuously benefits from additional cores. To demonstrate, here is a comparison using Google Cloud C3D VMs with SMT disabled (vCPU = full core), showing the scaling behaviour of DKbench, Geekbench 5 and Geekbench 6 across 2 to 180 cores:
The dotted line is theoretical max / ideal scaling.
For clarity, here's the DKbench vs Geekbench 6 data in table form, along with Geekbench 6's efficiency versus ideal scaling:
Cores | DKbench Scaling | Geekbench 6 Scaling | Geekbench 6 % of Ideal |
---|---|---|---|
2 | 2.0 | 1.8 | 89.91% |
4 | 4.0 | 3.2 | 79.92% |
8 | 7.9 | 4.9 | 61.27% |
16 | 15.2 | 7.9 | 49.54% |
32 | 30.4 | 10.5 | 32.69% |
48 | 45.5 | 11.4 | 23.66% |
64 | 60.0 | 12.1 | 18.84% |
90 | 82.6 | 12.1 | 13.46% |
180 | 158.8 | 10.3 | 5.73% |
Geekbench 6 scaling starts poorly and pretty much flattens out at 32-64 cores, at which point you get around 10-12x the performance of a single core. Shockingly, performance even declines at the high end, with 180 cores performing worse than 32!
For comparison, Geekbench 5 manages 63x single-core performance on those 180 cores (and DKbench a cool 159x).
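(For reference, the "% of Ideal" column is simply the measured multi-core scaling divided by the core count. A quick sketch of the arithmetic, using the rounded scaling values from the table, so the percentages differ marginally from the table's, which come from the unrounded scores:)

```python
# Efficiency vs ideal scaling: measured multi-core scaling / core count.
rows = [(2, 1.8), (32, 10.5), (180, 10.3)]  # (cores, Geekbench 6 scaling)
for cores, scaling in rows:
    print(f"{cores:>3} cores: {100 * scaling / cores:.1f}% of ideal")
```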
Geekbench 6's Shared Task Model
Let's dive into some technical details to figure out the reason behind this behaviour. Geekbench helpfully publishes some internal details, and there is a "Multi-Threading" section which explains:
Geekbench 6 uses a “shared task” model for multi-threading, rather than the “separate task” model used in earlier versions of Geekbench. The “shared task” approach better models how most applications use multiple cores.
Basically, they say that in previous versions of Geekbench, multi-core mode created more work to give to each core separately. In Geekbench 6, there is a single task that they try to serve with multiple threads communicating with each other, which is indeed how, for example, Photoshop would apply a filter to an image using multiple cores.
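To make the distinction concrete, here is a minimal sketch of the two models (the `blur_rows` stand-in and the image representation are made up for illustration; this is not Geekbench's actual code):

```python
from concurrent.futures import ProcessPoolExecutor

def blur_rows(rows):
    # Stand-in for real per-pixel work (e.g. an image filter kernel).
    return [sum(r) for r in rows]

def separate_task_model(images, workers):
    # Geekbench 5 style: each worker gets its own independent image,
    # so more cores simply means more images processed in parallel.
    with ProcessPoolExecutor(workers) as pool:
        return list(pool.map(blur_rows, images))

def shared_task_model(image, workers):
    # Geekbench 6 style: one image is split into chunks that the workers
    # process cooperatively, and the partial results are merged at the end.
    chunk = max(1, len(image) // workers)
    parts = [image[i:i + chunk] for i in range(0, len(image), chunk)]
    with ProcessPoolExecutor(workers) as pool:
        merged = []
        for part in pool.map(blur_rows, parts):
            merged.extend(part)
        return merged
```

The shared model pays coordination costs (splitting the work, synchronising, merging the results) that grow with the number of workers, which is exactly where scaling can fall apart if implemented poorly.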
There are some fundamental issues with this approach:
- Home usage: While the "shared task" approach may closely model a (specific) single app, typical user environments involve multiple apps running at the same time, while the OS runs dozens of background tasks.
- Server usage: The "shared task" idea is even less relevant for most server use cases, which are built around parallel tasks (e.g., processing multiple users/images/requests concurrently) rather than single tasks processed faster.
What Should a CPU Benchmark Measure?
Let's go back to the basics for a moment. What exactly is a CPU benchmark for? I'd say a useful CPU benchmark typically does one of two things:
- Application-Specific Benchmark: Tests the performance of specific software. Ideal when workloads are predictable. In a multi-core context, this type of benchmarking will tell you whether you can expect performance gains for your app by simply adding cores.
- Generic Benchmark: Measures general CPU capability by stressing all parts of the CPU with diverse workloads, offering insights into performance across various scenarios. In multi-core mode, these tests should similarly load all cores of the CPU, exposing any limitations of the processor design (lower all-core boost speeds, thermal throttling, etc.).
There are some benchmarks that fall in between, i.e. specific applications that happen to be quite good at stressing a single core or all cores of a CPU. Cinebench is one such example.
Geekbench traditionally fit the generic benchmark category, providing useful rough comparisons where the overall score usually agreed with what I would get from my custom benchmarks. Geekbench 6 breaks this in multi-core mode; for me, it is no longer a generic benchmark:
Geekbench 6 Multicore simply measures the performance of Geekbench's particular implementation of very specific workloads.
Poor Implementation of Multi-threaded Workloads
Even if we accept the premise of the "shared task" model, Geekbench 6 does a notably poor job implementing it, leading to decreasing performance when adding cores. From the internals document it seems that their approach to multi-core scaling often involves arbitrary and fixed scaling of workloads, typically setting multi-core tasks at exactly four times the single-core workload, regardless of CPU size. This approach explains the respectable 80% scaling observed up to 4 cores. Realistically, competent multi-threaded software would dynamically scale concurrent workloads to match available cores.
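To spell that out, here is a hypothetical sketch of my reading of the internals document (the `plan_workload` function and the unit counts are invented for illustration, not Geekbench's code):

```python
def plan_workload(single_core_units, cores, fixed_factor=4):
    # Fixed plan (what the internals doc describes): the multi-core run is
    # always 4x the single-core workload, regardless of how many cores exist.
    fixed = single_core_units * fixed_factor
    # Dynamic plan (what scalable multi-threaded software typically does):
    # issue enough concurrent work to keep every available core busy.
    dynamic = single_core_units * cores
    return fixed, dynamic

for cores in (4, 32, 180):
    fixed, dynamic = plan_workload(100, cores)
    # On the fixed plan, a 180-core machine still only has 4 cores' worth of
    # work to spread around; the extra cores mostly add coordination overhead.
    print(f"{cores:>3} cores: fixed = {fixed} units, dynamic = {dynamic} units")
```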
The Text Processing benchmark
It gets worse than this. I looked for the least scalable benchmark in the suite, which (surprisingly, as I was expecting some sort of non-parallelizable algorithm) is "Text Processing":
Cores | Text Processing Scaling |
---|---|
2 | 1.182 |
4 | 1.303 |
8 | 1.346 |
16 | 1.280 |
32 | 1.300 |
48 | 1.278 |
64 | 1.277 |
90 | 1.279 |
180 | 1.274 |
This is so bizarre. Their "text processing" representative benchmark scales to only about 1.35x single-core performance, peaking at 8 cores, then declining afterward.
It's even more bizarre if you read the internals doc:
The Text Processing workload loads numerous files, parses the contents using regular expressions, stores metadata in a SQLite database, and finally exports the content to a different format. It models typical text processing tasks that manipulate, analyze, and transform data to reformat it for publication and to gain insights. The input and output files are stored using an in-memory encrypted file system.
[...] and processes 190 Markdown files as its input.
So they describe it as processing 190 Markdown files (using regular expressions, storing metadata in SQLite, and exporting results) while taking virtually no advantage of parallel processing! The nearly flat scaling strongly suggests a severe implementation bottleneck, e.g. something like a poorly managed global write lock on the SQLite database that serializes the whole process. There are no further details to figure out exactly what they did wrong, but they certainly did it very wrong!
This benchmark literally implies CPUs with more than 4 cores provide no benefits for "text processing" tasks...
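For a sense of just how little parallel work this implies, a back-of-the-envelope Amdahl's law fit reproduces the observed plateau if you assume only about a quarter of the workload runs in parallel (the actual cause inside Geekbench is anyone's guess):

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's law: the serial fraction caps overall speedup,
    # no matter how many cores you add.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

for cores in (2, 8, 32, 180):
    print(f"{cores:>3} cores: {amdahl_speedup(0.25, cores):.2f}x")
# -> 1.14x, 1.28x, 1.32x, 1.33x: in the same ballpark as the ~1.2-1.35x
#    measured above, consistent with ~75% of the run being serialized.
```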
Conclusion
Geekbench 6’s multi-core benchmark is not merely flawed; it's fundamentally broken, and mostly by design. Its adoption of the "shared task" model, combined with a poor implementation of it, makes it an ineffective representation of real-world multi-core CPU performance. For more realistic and scalable benchmarks, I'd say stick with Geekbench 5, or try others - maybe even give DKbench a try; there's even a Docker version.