Introducing Forte
- 2025-06-27
Your scientists were so preoccupied with whether or not they could... They didn't stop to think if they should.
Lately I've been hacking on forte, an experimental re-implementation of rayon-core built for Bevy.
I'm super pleased with how it is turning out, so I thought I'd share some details about what's been done so far, and where things are going.
Introduction
Ask anyone how to do multi-threading or parallelism in Rust, and they'll tell you to use rayon.
It's standard, it's reliable, it's what you use:
take a collection, slap on a par_iter, and boom, instant parallelism.
Over the last decade, the Rust community has steadily refined rayon into one of the most important and impressive parts of the ecosystem.
And, because rayon is so beloved and so widely used, most people are a little surprised when I mention I am rewriting part of it.
To be clear, I'm not touching rayon itself.
The actual rayon crate is concerned only with high-level parallelism and the parallel iterator API.
What I've been focused on is rayon-core, which provides the thread pool that rayon runs on top of.
Compared to rayon, I've found that forte can have orders-of-magnitude lower CPU usage, in exchange for a small performance penalty.
I've successfully tuned forte to beat rayon for small and mid-sized workloads, but it tends to lag a bit behind rayon at larger scales.
But most importantly, forte has first-class support for async!
Because forte is both a traditional parallel scheduler (like rayon) and a complete async executor, you can combine synchronous and asynchronous Rust in ways that are typically difficult or impossible.
Want to give it a shot?
Right now, the easiest way to test-drive forte is probably via the forte-rayon-compat crate.
If you apply a Cargo patch redirecting rayon-core to that crate, you can run rayon directly on top of forte.
This isn't the recommended way to use the crate though; forte was built primarily for applications that need direct thread-pool access.
Is This Hubris?
Rolling a custom version of a reliable, stable, and mature library is generally considered a Bad Idea™. Rewrites are slow and cost precious development hours. They usually introduce new bugs and can sometimes re-introduce old ones. In open source, forks fragment the ecosystem and can divert maintainer attention. And all that's assuming the end result is even a clear improvement on some axis! It's all too easy to burn a few hundred hours painstakingly rediscovering why the original authors did what they did, that they were basically correct, and that their choices can't be improved upon.
Which is all to say, anyone who embarks on a rewrite naturally has some hard questions to answer:
- What needs prompted this rewrite?
- Could you have made the existing solution serve your needs?
- Are you sure this isn't just another sad case of Not Invented Here Syndrome?
I'm going to try to answer all these questions, but first, I'd like to give some context about what forte is designed to replace: Bevy's Task Pool abstraction.
What's a Task Pool?
Bevy is a popular open-source game engine written in Rust.
It initially used rayon for parallelism, but way back in 2020 Bevy switched over to bevy_tasks, Bevy's home-grown scheduling crate.
We liked rayon, but we found that it...
- Lacked Control. Rayon was a closed box, with no unsafe escape hatches.
- Had High Overhead. Calling into rayon comes with high costs in time and CPU utilization, making it hard to adopt for workloads of unknown or variable size.
- Complicated Async Support. Rayon didn't support async, and using an async executor in combination with rayon had significant pitfalls.
All of these issues were listed more or less verbatim on the PR removing Bevy's rayon dependency.
Now, five years later, very little has changed.
rayon is still very locked down, it still has an overhead problem, and it's still easy to blow your foot off when you mix it with async.
This sluggishness is due in part to rayon's admirable commitment to stability:
each major release of rayon_core comes with significant risks1, and the maintainers are (rightfully) reticent to merge changes that would break their ecosystem.
I can't fault rayon for any of this, really.
It was simply designed to do one thing well, and Bevy mostly isn't that thing.
Unfortunately, the replacement, bevy_tasks, has not been without its own slew of issues.
It is built around the idea of Task Pools2, which are more or less just thread pools where each worker is an async-executor3.
The API is primarily async, and it requires most synchronous jobs to be described as futures.
It does have minimal support for parallel-iteration, but compared to rayon it is incomplete, unintuitive, and slow.
Moving to bevy_tasks did lower CPU-use overhead, but it introduced new execution-time overhead and worsened issues with load-balancing.
I said earlier that it was built on async-executor, but that isn't quite right.
When we needed Bevy to run on the web, a temporary solution4 was added to pass tasks off to wasm_bindgen_futures instead.
And more recently, when moving Bevy towards no_std support, we also added a way to swap to edge-executor for task execution.
So now there's a wide surface area of different feature flags and platforms, each pulling in different crates and using different execution mechanisms.
It's hard to manage, and harder still to test.
To me, bevy_tasks has always felt very cobbled together.
Everything is built from off-the-shelf crates, which in principle sounds like a blessing: less code to maintain.
But in practice it has mostly led to a lot of complex integration work, without us really gaining any decisive control over what each part does.
We don't actually have that much more control than we did with rayon.
How Do You Solve A Problem Like bevy_tasks?
As you may have noticed, I am not a fan of bevy_tasks.
The code is over-complicated, under-documented, abuses async for synchronous work, and is filled with subtle platform-specific footguns and incompatibilities.
I have run into serious soundness issues with its web implementation, and I am concerned there are more.
But its worst sin is how Bevy uses the Task Pool.
For some complex reasons I won't get into now, Bevy supplies not one, not two, but three different thread pools, which must compete for CPU cores.
As a consequence of this design, work scheduled on one pool cannot utilize the entire CPU, even if the other pools are sitting idle.
People have been trying to fix/refactor/replace bevy_tasks for literally years.
At one point we tried building something like par_iter on top of bevy_tasks.
But we found that the Task Pool lacked the performance and the low-level control necessary to make it work.
We tried embedding async-executor into a rayon-core task pool and exposing it through bevy_tasks, but this only added more overhead, messed with the internal workings of rayon-core, and ended up being a sizable performance regression.
We tried a lot of stuff.
None of it worked.
No combination of existing off-the-shelf crates can be made to serve our needs.
bevy_tasks remains the thread-pool du jour because, despite its faults, nothing else can do what it does.
It is the best of all our fairly-lousy options.
Taking The Next Step
I had just started contributing to Bevy as the last round of exploratory work on re-integrating rayon with bevy_tasks was petering out.
I was only vaguely aware of the work, and I really just wanted us to go back to using rayon directly.
So I started looking at the specific blockers, and I found that most of them did have solutions:
- A new technique called "Heartbeat Scheduling" had been developed as an alternative to work-stealing, first published in a paper, then popularized by a Zig parallelism library called spice, and finally ported to Rust as chili. This Rust port had some promising benchmarks compared to rayon, and the author @dragostis was talking about building a rayon integration.
- Niko Matsakis outlined a comprehensive approach to adding async to rayon-core. This happened to use async_task, one of the libraries we rely on for bevy_tasks, so it was clearly compatible with Bevy's async needs.
- In another comment, Josh Stone provided a prototype of a block_on method for rayon. This seemed to be the main other thing required to turn rayon-core into a fully-featured async executor.
- The rayon-core codebase is surprisingly small, simple, and clean. If we wanted maximum control, it wouldn't be too hard to fork and slim it down into something tiny and easily maintainable.
I thought the solution was pretty obvious: fork rayon_core, throw out anything we didn't need (to make it maintainable), do some house-keeping, add the async support, and steal the heartbeat-scheduling model from chili.
And, over the course of six months, that's more or less what I did with forte.
Nothing I did was novel, or particularly interesting.
All I did was look around at what other people5 had done, and take the next logical step.
Using forte
Since I mentioned API refactors, I want to briefly introduce some of the important features of the forte API.
There's a lot of shared DNA with its older brother rayon_core, but there are also some significant departures.
The most obvious difference is that all thread-pools must be static, and that forte::ThreadPool has a const constructor.
This makes creating a new thread-pool really simple.
```rust
use forte::ThreadPool;

// You can kiss OnceLock goodbye.
static THREAD_POOL: ThreadPool = ThreadPool::new();
```
As an aside, this change also let me significantly simplify the internal job execution logic, and effectively eliminate the use of Arc<T> (which rayon_core relies heavily on).
How does this work? Obviously, you can't spawn threads in a const-context, so thread pools are empty by default.
Before using a thread pool, you'll probably want to add some workers to it; the typical approach populates the pool with one worker-thread per CPU core.
You can also use populate to add a single worker, depopulate to shut down all current workers, or resize_to if you want a specific number of workers.
Thread pools can be resized at any point, even while in use (although only one resize operation can happen at a time).
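To make that concrete, here is a rough sketch using only the methods named above (the exact signatures are assumptions on my part, not confirmed API):

```rust
// A sketch of pool management; exact signatures are assumptions.
THREAD_POOL.populate();     // add a single worker
THREAD_POOL.resize_to(8);   // resize to a specific number of workers
THREAD_POOL.depopulate();   // shut down all current workers
```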
The best way to interact with a thread-pool is through a Worker.
There are several different ways to access workers, but the easiest is probably via the ThreadPool::with_worker6 method.
```rust
// The closure signature here is assumed for illustration.
THREAD_POOL.with_worker(|worker| { /* ... */ });
```
This registers the current thread as a temporary worker on the thread pool, and creates a new thread-local worker context.
If the thread is already a member of the pool, it just looks up the existing worker context.
If you know a thread is a member of some pool but you don't know which, you can use Worker::with_current to access the current local worker context.
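As a sketch of what that might look like (the closure argument and the Option wrapper are assumptions on my part, not confirmed API):

```rust
// Assumption: with_current hands back Some(worker) only when the current
// thread is registered with a pool.
Worker::with_current(|worker| {
    if let Some(worker) = worker {
        // ... queue work on the local worker context ...
    }
});
```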
You can also manually acquire a "lease" on a thread pool and occupy it to create a new worker.
```rust
// Lease some space from the thread pool, then occupy the lease to become a
// worker. The exact form of the occupy call is an assumption.
let lease = THREAD_POOL.claim_lease();
Worker::occupy(lease, |worker| { /* ... */ });
```
A worker is like a local "view" onto a thread pool; you can queue work onto it, and it will make sure it gets executed. It also provides access to a selection of multi-threading calls:
- block_on: Replaces bevy_tasks::block_on.
- join: Replaces both rayon_core::join and rayon_core::join_context.
- scope: A blocking/async hybrid of rayon::scope and TaskPool::scope.
- spawn: Replaces rayon_core::spawn.
- spawn_future: Replaces TaskPool::spawn.
- spawn_async: Like spawn_future but takes a closure that returns a future.
- yield_local: Replaces rayon_core::yield_local.
- yield_now: Replaces rayon_core::yield_now.
These are also exposed on ThreadPool and as first-class functions, but under the hood these are just proxies that manage the Worker for you.
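Roughly, that means these three calls end up in the same place. This is only a sketch: the free-function path and the closure arguments are assumptions, and a() and b() are stand-ins for real work.

```rust
// A sketch of three ways to reach the same join; exact paths and closure
// arguments are assumptions, and a()/b() stand in for real work.
worker.join(|_| a(), |_| b());       // on a Worker you already hold
THREAD_POOL.join(|_| a(), |_| b());  // on the ThreadPool, which manages a worker for you
forte::join(|_| a(), |_| b());       // as a first-class function
```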
The most important of these are probably join and spawn_async.
The former takes two closures and executes them in parallel; this is what lets rayon work its parallel iterator magic.
As an example, here's how you can use join to divide a slice into chunks which can be operated upon in parallel.
Notice how you don't have to rely on a specific ThreadPool static to make this work; you just accept a &Worker ref.
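Something along these lines (a sketch under assumptions: the Worker import path, the closures receiving a worker argument, and the chunk-size threshold are all illustrative, not confirmed API):

```rust
use forte::Worker;

// Recursively sum a slice by splitting it in half and joining the halves in
// parallel. The closure signatures passed to `join` are assumptions.
fn par_sum(worker: &Worker, slice: &[i64]) -> i64 {
    if slice.len() <= 1024 {
        return slice.iter().sum();
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    let (a, b) = worker.join(|w| par_sum(w, left), |w| par_sum(w, right));
    a + b
}
```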
This should be familiar to anyone who has used rayon_core before, just as spawn_async will be familiar to anyone who has used async Rust.
In case you need a refresher, here's an example of how one might use spawn_async to calculate a checksum.
```rust
// Spawning an async job returns a task (from async_task).
// The closure-returning-a-future signature and compute_checksum() are
// illustrative stand-ins.
let task = worker.spawn_async(|| async { compute_checksum().await });

// If you want to halt the job, you can cancel it or drop the task:
// task.cancel().await;

// If you don't care about the result but still want it to run, you can detach it instead:
// task.detach();

// In an async context, you can get the result by awaiting the task:
// let result = task.await;

// Outside of an async context, you can use block_on instead.
let result = worker.block_on(task);
```
Now let's move into something slightly less standard: scoped tasks.
Hybrid Scopes
Scopes exist to get around the 'static lifetime bound (which you generally need if you are spawning work with an unbounded lifetime).
The forte::scope API is just like std::thread::scope.
You pass in a closure over &Scope, and all work started within the closure must complete before forte::scope will yield back to the caller.
This makes it possible to borrow stack-allocated data from outside the scope-closure, safe in the knowledge that the stack won't be popped while in use.
Here's an example of what this looks like:
```rust
let mut a = vec![1, 2, 3];
let mut x = 0;
// The closure signatures here are assumptions for illustration.
worker.scope(|scope| {
    scope.spawn(|_| println!("borrowed from outside: {a:?}"));
    scope.spawn(|_| x += a.len());
});
// Both jobs will run before we reach this point.
println!("x = {x}");
```
Scopes are tricky. Historically, Rust has had some difficulty implementing scopes properly (though this isn't an issue for my implementation). But for async scopes specifically, there's also a pretty well-known problem called The Scoped-Task Trilemma. The gist of it is:
Concurrency, Parallelizability, Borrowing — pick two.
Under this framework, API calls fall into three broad categories:
- Concurrency + Parallelizability: Child tasks proceed both independently and concurrently with the parent task.
  - std::thread::spawn (non-async)
  - bevy_tasks::TaskPool::spawn (async)
- Parallelizability + Borrowing: Child tasks proceed independently with the parent task while borrowing from it.
  - std::thread::scope (non-async)
  - rayon::join (non-async)
  - rayon::scope (non-async)
  - bevy_tasks::TaskPool::scope (async)
- Borrowing + Concurrency: Child tasks proceed concurrently with the parent task while borrowing from it.
  - futures::select! (async)
  - futures::stream::FuturesUnordered (async)
Near the end of the article introducing the "Trilemma", withoutboats remarks:
When it comes to “Parallelizability + Borrowing” we see that no async APIs exist at all. This maybe isn’t surprising, because async is all about enabling concurrency. But there’s a missing API here that seems obvious to me: an API which starts an executor across a thread pool, and blocks until all of the tasks you spawn onto it are complete.
Both bevy_tasks and forte provide this "missing" API.
And with forte, scopes can be used to spawn both blocking and async jobs.
```rust
// Sketch: the scope method names and the channel choice are assumptions.
let (sender, receiver) = async_channel::unbounded();
worker.scope(|scope| {
    // A blocking job and an async job sharing the same scope.
    scope.spawn(move |_| sender.send_blocking(42).unwrap());
    scope.spawn_future(async move { dbg!(receiver.recv().await); });
});
```
Support for scoped-async was a hard blocker for Bevy, since it uses this part of bevy_tasks extensively.
But there are many situations where I think scoped-async could be useful to the broader Rust community, especially when combined with blocking calls like join.
Mixing Blocking and Concurrent Rust
I think of forte as a hybrid scheduler, because it's designed to run both blocking jobs and futures.
Usually, making blocking calls inside of a future is a Bad Idea™.
For example, calling rayon::scope in an async context will usually block our async executor.
In this case, there's a critical safety requirement that prevents rayon from yielding back to the caller until all the work spawned on the pool is complete.
But with forte, there's an escape hatch: our scheduler is our executor, so we just continue executing async tasks while waiting for our blocking work.
That means it's totally reasonable to mix async and blocking calls to a worker.
```rust
// A sketch (function names are illustrative): a blocking join inside an async
// task. On forte this doesn't stall the executor; the worker keeps running
// other async tasks while it waits.
worker.block_on(async {
    let input = load_input().await;                      // async work
    worker.join(|_| analyze(&input), |_| index(&input)); // blocking work
});
```
There are some dangers here. Forte cannot guarantee that your code will be lock-free, and you do have to take care during your implementation.
Compatibility with rayon_core
If you've used rayon_core before, the forte API will probably feel familiar.
But there are enough significant differences (especially the lack of a default thread pool) that it can't really be called a drop-in replacement.
That also means that rayon can't run directly on forte.
Hmmm... directly?
Does that mean I can run rayon on forte indirectly?
Yes.
I've written a crate called forte-rayon-compat that mocks rayon_core but redirects the calls to forte.
Only about half the API works7, but it's the half that rayon uses for parallel iterators.
Since there's a wide array of applications built using rayon, this crate gives us an easy (if slightly suspect) way to benchmark the performance of forte against rayon_core.
Benchmark Methodology
So how does forte perform? There's a variety of benchmarks we can use to assess that:
- chili has a fork_join benchmark that can be used to assess join overhead against rayon, and including forte was trivial.
- The rayon_demo benchmarks can be used through forte-rayon-compat to compare with rayon.
- Griffin (author of the amazing obvhs crate) has written a parallel radix-sort benchmark for forte, rayon, bevy_tasks, and chili, called pool_racing.
- bevy_tasks has a few benchmarks as well, which I have ported over to compare with forte.
All of the performance numbers I'm about to give were either sampled with divan or aggregated across a large number of runs.
It's still not super scientific8 but I am confident that the results are more than just noise.
Tree Traversal Benchmark
The chili crate sports a single benchmark, on which it performs very favorably compared to rayon.
It's a simple traversal of a balanced binary tree using join, and is designed to assess overhead.
Nodes | Serial | Rayon-Core | Chili | Forte |
---|---|---|---|---|
1023 | 1.24 μs | 52.69 μs | 3.01 μs | 4.54 μs |
16777215 | 27.91 ms | 10.80 ms | 6.15 ms | 5.91 ms |
134217727 | 318.50 ms | 97.47 ms | 47.36 ms | 49.88 ms |
Nodes | Serial | Rayon-Core | Chili | Forte |
---|---|---|---|---|
1023 | 1.20 μs | 27.87 μs | 2.46 μs | 3.45 μs |
16777215 | 27.41 ms | 10.46 ms | 5.35 ms | 5.67 ms |
134217727 | 312.00 ms | 93.83 ms | 40.51 ms | 47.7 ms |
Both chili and forte handle it well, with chili in the lead.
It is worth pointing out that almost no real work is being done in this benchmark, and it's essentially just assessing latency.
From the serial pass, we can estimate that the time-per-item is about 1.1 nanoseconds.
Let's increase that to about 10 nanoseconds (about 40 clock-cycles) and run the tests again.
Nodes | Serial | Rayon-Core | Chili | Forte |
---|---|---|---|---|
1023 | 10.94 μs | 57.61 μs | 11.80 μs | 12.36 μs |
16777215 | 151.20 ms | 21.23 ms | 19.80 ms | 17.55 ms |
134217727 | 1230.00 ms | 167.80 ms | 147.90 ms | 138.40 ms |
Nodes | Serial | Rayon-Core | Chili | Forte |
---|---|---|---|---|
1023 | 10.74 μs | 29.87 μs | 10.37 μs | 10.41 μs |
16777215 | 149.60 ms | 20.72 ms | 16.50 ms | 17.11 ms |
134217727 | 1214.00 ms | 165.70 ms | 128.30 ms | 135.4 ms |
Surprisingly little changes: forte narrowly takes the lead, and though the gap closes a bit, rayon is firmly in last place.
This does make it look like rayon::join has problems with both latency and throughput.
To be fair to rayon, this is still a bit of a pathological case.
When left to its own devices, rayon spawns fewer jobs, in part to compensate for deficiencies it has in situations like this.
Let's look at some of rayon's own benchmarks to see what effect this has.
Rayon's Benchmarks
The rayon repo has an app called rayon-demo that bundles together several different benchmarks.
By replacing rayon-core with forte-rayon-compat (via a Cargo patch), it's also possible to run these benchmarks with forte.
Benchmark | Serial | Rayon (Core) | Rayon (Forte) |
---|---|---|---|
Mergesort | 11.46 s | 1.08 s | 1.27 s |
Quicksort | 17.09 s | 2.20 s | 2.39 s |
Sieve | 3.16 s | 56.64 ms | 58.47 ms |
Matmul | 0.84 s | 52.10 ms | 76.66 ms |
Here we see rayon take the lead, as we would expect.
Clearly, it is possible to get around the issue we saw in the last test.
I think forte's performance here is reasonable, given the lack of work-stealing.
Though it is a clear regression, especially on the Matmul example, it's still within the range that I would consider "usable".
When comparing to rayon, I am more interested in CPU utilization.
Luckily, rayon has a "Game of Life" demo that reports this.
Backend | Mean Time | CPU Usage |
---|---|---|
Serial | 33.28 ms | 8.5 % |
Rayon (Core) | 18.35 ms | 38.4 % |
Rayon (Core) Bridged | 125.07 ms | 119.9 % |
Rayon (Forte) | 29.68 ms | 7.7 % |
Rayon (Forte) Bridged | 43.34 ms | 8.2 % |
There's a lot to talk about here. Clearly, forte uses significantly less CPU.
For the parallel iterators, performance is slightly worse, but interestingly, for bridged iterators (sequential iterators which rayon converts to parallel), performance actually improved!
One thing I should mention is that, because forte uses "Heartbeat Scheduling", the runtime has a tuning parameter called the Heartbeat Interval. This is, more or less, the rate at which individual workers distribute load to their peers. In my tests, I've been able to get good results with intervals as low as 5 µs (or 200 kHz). Varying this parameter seems to skew the performance distribution, but has a pretty small effect on the mean. It does, however, seem to have a very pronounced effect on CPU utilization. To produce the low CPU usage in this demo, I turned it down to 500 µs (or 2 kHz).
I suspect the reason forte performs so much better on bridged work is that rayon ends up spawning more individual tasks (and does less chunking).
So this is where the performance deficit we saw in the tree-traversal benchmarks really starts to have an appreciable effect.
Now that we've set the scene with some artificial benchmarks, let’s look at how these different schedulers perform on something like a real-world application.
Pool-Racing Benchmark
The obvhs crate is a really excellent bounding-volume-hierarchy library, written by @DGriffin91.
Griffin also runs a repo called tray_tracing, with benchmarks of obvhs and other bvh implementations.
Now he's been nice enough to put together pool_racing, for benchmarking how obvhs performs on different thread pools.
We'll use this to compare the performance of forte, rayon, and chili when used to construct a bvh from scenes of varying complexity.
Phase | Serial | Rayon (Core) | Rayon (Forte) | Forte | Chili |
---|---|---|---|---|---|
Init | 50 μs | 496 μs | 20 μs | 129 μs | |
Sort | 54 μs | 198 μs | 122 μs | 109 μs | |
Build | 149 μs | 1.15 ms | 485 μs | 644 μs |
Phase | Serial | Rayon (Core) | Rayon (Forte) | Forte | Chili |
---|---|---|---|---|---|
Init | 80 μs | 701 μs | 37 μs | 109 μs | |
Sort | 35 μs | 222 μs | 41 μs | 19 μs | |
Build | 159 μs | 1.52 ms | 377 μs | 501 μs |
Phase | Serial | Rayon (Core) | Rayon (Forte) | Forte | Chili |
---|---|---|---|---|---|
Init | 18.88 ms | 9.69 ms | 11.31 ms | 12.21 ms | 17.13 ms |
Sort | 92.60 ms | 14.45 ms | 20.36 ms | 37.39 ms | 86.20 ms |
Build | 406.02 ms | 64.41 ms | 66.31 ms | 83.42 ms | 137.88 ms |
Here again rayon decisively takes the lead, with the naive forte and chili implementations lagging further behind.
There are a few things to note here.
- The performance of rayon is similar when running on rayon-core or forte-rayon-compat.
- We still see forte win on small to midsize workloads.
- Across the board, chili performed significantly worse.
To me, it looks like the internal batching magic happening within rayon is the main factor in this test, rather than the thread-pool.
These benchmarks also seem to bear out the idea that rayon is throughput-focused rather than latency-focused.
Since forte is definitely latency-focused, I'm not at all upset with second place.
I also suspect it will be possible to further improve forte's performance here... but that's a topic for another article.
So forte isn't quite as fast as rayon-core when used through rayon, but it is kinda close.
That's nice, but all that really matters is how it stacks up against bevy_tasks.
Bevy-Tasks Iteration Benchmark
When looking at bevy_tasks, there are two things we should benchmark: parallel iteration and async execution.
The Bevy repo already has some parallel iteration benchmarks.
The first one measures the overhead of iterating a list, in chunks of a hundred.
Like the tree-traversal test, we do next to no work within the iterator.
Length | Serial | Bevy-Tasks | Rayon | Forte |
---|---|---|---|---|
100 | 84.75 ns | 3.06 μs | 2.03 μs | 388.6 ns |
1,000 | 773.6 ns | 6.06 μs | 40.26 μs | 1.37 μs |
10,000 | 7.708 μs | 55.14 μs | 49.53 μs | 12.38 μs |
100,000 | 76.59 μs | 870.60 μs | 97.35 μs | 75.71 μs |
1,000,000 | 772.9 μs | 9.30 ms | 170.10 μs | 326.60 μs |
10,000,000 | 7.64 ms | 101.20 ms | 468.30 μs | 1.23 ms |
And... wow, uh, bevy_tasks is really slow!
Slower than serial iteration!!
What's going on here?!
We're looking at iteration, so we should expect to see linear scaling, and indeed we do.
In theory, going parallel just changes the scaling factor; the parallel runtime should be approximately proportional to the serial runtime divided by the number of cores.
So, on a test like this, I'd expect all the parallel cases to scale more slowly than the serial implementation.
Both forte and rayon do seem to be scaling at somewhere under half the rate of the serial version.
Strangely, bevy_tasks appears to be scaling almost identically to the serial version.
Wait... since bevy_tasks also seems to have a higher initial cost, does that mean serial processing will always be faster?
Has bevy_tasks been just totally useless this entire time?
No, it's not as bad as it looks.
Clearly bevy_tasks has some really astonishingly high per-item overhead.
When the work is basically zero, that does mean it's always slower than serial iteration.
But as the workload grows, the benefits of parallelization will still eventually become greater than the cost.
We can actually work out exactly when this happens.
From the sequential case, it looks like visiting each item in the list costs ~0.7 ns.
With bevy_tasks the per-item overhead is closer to 10 ns.
Let's imagine adding ~100 ns of work while visiting each item, and let's assume we have ten-million items.
Dividing that work by the number of cores (in my case 13) and adding the overhead gives an average per-item completion time of around 17.7 ns when parallelized, and an expected total runtime of approximately 177 ms.
This is far better than the expected serial runtime (more than a second), but far worse than the theoretical optimum of ~77 ms.
On my machine, you need each task to take at least ~10 ns for bevy_tasks to beat serial iteration.
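Here's a quick sketch of that break-even arithmetic, using the per-item estimates from above:

```rust
// Back-of-the-envelope break-even estimate using the numbers from the prose.
let cores = 13.0_f64;   // cores on my test machine
let visit = 0.7e-9;     // seconds to visit one item serially
let overhead = 10e-9;   // estimated bevy_tasks per-item overhead
// Serial cost per item is `work + visit`; parallel cost is `work / cores + overhead`.
// Setting the two equal and solving for `work`:
let break_even = (overhead - visit) / (1.0 - 1.0 / cores);
println!("break-even work per item: {:.1} ns", break_even * 1e9); // ≈ 10 ns
```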
We can double-check this by running the tests again and adding a 10 ns wait to each item.
Length | Serial | Bevy-Tasks | Rayon | Forte |
---|---|---|---|---|
100 | 1.07 μs | 4.84 μs | 4.62 μs | 1.421 μs |
1,000 | 10.42 μs | 14.06 μs | 70.26 μs | 12.92 μs |
10,000 | 102.40 μs | 111.00 μs | 82.26 μs | 77.46 μs |
100,000 | 1.02 ms | 1.265 ms | 200.80 μs | 318.80 μs |
1,000,000 | 10.13 ms | 13.44 ms | 1.19 ms | 1.298 ms |
10,000,000 | 101.30 ms | 140.10 ms | 10.65 ms | 10.65 ms |
It's close.
Even at 10 ns we are not quite at the break-even point for bevy_tasks, but spend a bit longer on each item and we'd get there... eventually... probably.
For the ten-million item test, neither forte nor rayon quite reach the theoretically optimal 7.7 ms.
But they are not far off.
And forte puts up a very good show here, actually improving relative to rayon compared to the previous test.
The cost of bevy_tasks is probably fine; after all, 10 ns is not huge even by CPU standards.
But I'd be willing to bet that the vast majority of iterators won't take that long per-item.
And even if they did, that's just the break-even time.
To actually see statistically significant performance benefits, you probably need to be spending at least 25 ns on each item, and at that point we're talking about maybe a hundred clock-cycles.
It's patently obvious that this isn't what bevy_tasks was built for.
After all, it's primarily an async runtime.
If we want to evaluate it properly, we must also have a look at its async performance.
Bevy-Tasks Async Benchmark
TODO
Performance Summary
That was a lot of numbers, and I wouldn't blame you if you just skipped all of it to read the conclusions. I would summarize the results as follows:
- rayon is truly amazing, but due to being highly throughput-focused it leaves performance on the table in some situations.
- chili is really interesting, but doesn't generalize and does not degrade gracefully.
- bevy_tasks is extremely slow compared to any of the other offerings.
- forte generalizes better than chili and can keep up with rayon fairly well on larger workloads, despite being more latency-focused.
I think these results are really promising for forte.
It's not going to kill rayon_core any time soon, but depending on your needs I do think it presents a viable alternative.