Internet Place

Introducing Forte

- 2025-06-27


Your scientists were so preoccupied with whether or not they could...
They didn't stop to think if they should.

Lately I've been hacking on forte, an experimental re-implementation of rayon-core built for Bevy. I'm super pleased with how it is turning out, so I thought I'd share some details about what's been done so far, and where things are going.

Introduction

Ask anyone how to do multi-threading or parallelism in Rust, and they'll tell you to use rayon. It's standard, it's reliable, it's what you use: take a collection, slap on a par_iter, and boom, instant parallelism. Over the last decade, the Rust community has steadily refined rayon into one of the most important and impressive parts of the ecosystem. And, because rayon is so beloved and so widely used, most people are a little surprised when I mention I am rewriting part of it.
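
For anyone who hasn't seen it, the classic rayon one-liner looks something like this (a minimal sketch using the rayon prelude; the data and the closure are just placeholders):

use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    // Swap `iter()` for `par_iter()` and the map/sum is spread across all cores.
    let sum: u64 = data.par_iter().map(|x| x * x).sum();
    println!("{sum}");
}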

To be clear, I'm not touching rayon itself. The actual rayon crate is concerned only with high-level parallelism and the parallel iterator API. What I've been focused on is rayon-core, which provides the thread pool that rayon runs on top of.

Compared to rayon, I've found that forte can have orders-of-magnitude lower CPU usage, in exchange for a small performance penalty. I've successfully tuned forte to beat rayon for small and mid-sized workloads, but it tends to lag a bit behind rayon at larger scales. But most importantly, forte has first-class support for async! Because forte is both a traditional parallel scheduler (like rayon) and a complete async executor, you can combine synchronous and asynchronous Rust in ways that are typically difficult or impossible.

Want to give it a shot? Right now, the easiest way to test-drive forte is probably via the forte-rayon-compat crate. If you apply a Cargo patch redirecting rayon-core to that crate, you can run rayon directly on top of forte. This isn't the recommended way to use the crate though; forte was built primarily for applications that need direct thread-pool access.

Is This Hubris?

Rolling a custom version of a reliable, stable, and mature library is generally considered a Bad Idea™. Rewrites are slow and cost precious development hours. They usually introduce new bugs and can sometimes re-introduce old ones. In open source, forks fragment the ecosystem and can divert maintainer attention. And all that's assuming the end result is even a clear improvement on some axis! It's all too easy to burn a few hundred hours painstakingly rediscovering why the original authors did what they did, that they were basically correct, and their choices can't be improved upon.

Which is all to say, anyone who embarks on a rewrite naturally has some hard questions to answer.

I'm going to try to answer all these questions, but first, I'd like to give some context about what forte is designed to replace: Bevy's Task Pool abstraction.

What's a Task Pool?

Bevy is a popular open-source game engine written in Rust. It initially used rayon for parallelism, but way back in 2020 Bevy switched over to bevy_tasks, Bevy's home-grown scheduling crate. We liked rayon, but we found that it...

  1. Lacked Control. Rayon was a closed box, with no unsafe escape hatches.

  2. Had High Overhead. Calling into rayon comes with high costs in time and CPU utilization, making it hard to adopt for workloads of unknown or variable size.

  3. Complicated Async Support. Rayon didn't support async, and using an async executor in combination with rayon had significant pitfalls.

All of these issues were listed more or less verbatim on the PR removing Bevy's rayon dependency. Now, five years later, very little has changed. rayon is still very locked down, it still has an overhead problem, and it's still easy to blow your foot off when you mix it with async. This sluggishness is due in part to rayon's admirable commitment to stability: Each major release of rayon_core comes with significant risks1, and the maintainers are (rightfully) reticent to merge changes that would break their ecosystem. I can't fault rayon for any of this, really. It was simply designed to do one thing well, and Bevy mostly isn't that thing.

Unfortunately, the replacement, bevy_tasks, has not been without its own slew of issues. It is built around the idea of Task Pools2, which are more or less just thread pools where each worker is an async-executor3. The API is primarily async, and it requires most synchronous jobs to be described as futures. It does have minimal support for parallel-iteration, but compared to rayon it is incomplete, unintuitive, and slow. Moving to bevy_tasks did lower CPU-use overhead, but it introduced new execution-time overhead and worsened issues with load-balancing.

I said earlier that it was built on async-executor, but that isn't quite right. When we needed Bevy to run on the web, a temporary solution4 was added to pass tasks off to wasm_bindgen_futures instead. And more recently, when moving Bevy towards no_std support, we also added a way to swap to edge-executor for task execution. So now there's a wide surface area of different feature flags and platforms, each pulling in different crates and using different execution mechanisms. It's hard to manage, and harder still to test.

To me, bevy_tasks has always felt very cobbled together. Everything is built from off-the-shelf crates, which in principle sounds like a blessing: Less code to maintain. But in practice it has mostly led to a lot of complex integration work, without us really gaining any decisive control over what each part does. We don't actually have that much more control than we did with rayon.

How Do You Solve A Problem Like bevy_tasks?

As you may have noticed, I am not a fan of bevy_tasks. The code is over-complicated, under-documented, abuses async for synchronous work, and is filled with subtle platform-specific footguns and incompatibilities. I have run into serious soundness issues with its web implementation, and I am concerned there are more. But its worst sin is how Bevy uses the Task Pool. For some complex reasons I won't get into now, Bevy supplies not one, not two, but three different thread pools, which must compete for CPU cores. As a consequence of this design, work scheduled on one pool cannot utilize the entire CPU, even if the other pools are sitting idle.

People have been trying to fix/refactor/replace bevy_tasks for literally years. At one point we tried building something like par_iter on top of bevy_tasks. But we found that the Task Pool lacked the performance and the low-level control necessary to make it work. We tried embedding async-executor into a rayon-core task pool and exposing it through bevy_tasks, but this only added more overhead, messed with the internal workings of rayon-core, and ended up being a sizable performance regression.

We tried a lot of stuff. None of it worked. No combination of existing off-the-shelf crates can be made to serve our needs. bevy_tasks remains the thread pool du jour because, despite its faults, nothing else can do what it does. It is the best of all our fairly-lousy options.

Taking The Next Step

I had just started contributing to Bevy as the last round of exploratory work on re-integrating rayon with bevy_tasks was petering out. I was only vaguely aware of the work, and I really just wanted us to go back to using rayon directly. So I started looking at the specific blockers, and I found that most of them did have solutions:

  1. A new technique called "Heartbeat Scheduling" had been developed as an alternative to work-stealing, first published in a paper, then popularized by a Zig parallelism library called spice and finally ported to Rust as chili. This Rust port had some promising benchmarks compared to rayon, and the author @dragostis was talking about building a rayon integration.

  2. Niko Matsakis outlined a comprehensive approach to adding async to rayon-core. This happened to use async_task, one of the libraries we rely on for bevy_tasks, so it was clearly compatible with bevy's async needs.

  3. In another comment, Josh Stone provided a prototype of a block_on method for rayon. This seemed to be the main other thing required to turn rayon-core into a fully-featured async executor.

  4. The rayon-core codebase is surprisingly small, simple, and clean. If we wanted maximum control, it wouldn't be too hard to fork and slim it down into something tiny and easily maintainable.

I thought the solution was pretty obvious: Fork rayon_core, throw out anything we didn't need (to make it maintainable), do some house-keeping, add the async support and steal the heartbeat-scheduling model from chili. And, over the course of six months, that's more or less what I did with forte. Nothing I did was novel, or particularly interesting. All I did was look around at what other people5 had done, and take the next logical step.

Using forte

Since I mentioned API refactors, I want to briefly introduce some of the important features of the forte API. There's a lot of shared DNA with its older brother rayon_core, but there are also some significant departures.

The most obvious difference is that all thread-pools must be static, and that forte::ThreadPool has a const constructor. This makes creating a new thread-pool really simple.

use forte::ThreadPool;

// You can kiss OnceLock goodbye.
static THREAD_POOL: ThreadPool = ThreadPool::new();

As an aside, this change also let me significantly simplify the internal job execution logic, and effectively eliminate the use of Arc<T> (which rayon_core relies heavily on).

How does this work? Obviously, you can't spawn threads in a const-context, so thread pools are empty by default. Before using a thread pool, you'll probably want to add some workers to it.

fn main() {
    // ... bla bla bla
    THREAD_POOL.resize_to_available();
    // ... parallel bla bla bla
}

This populates the pool with one worker-thread per CPU core. You can also use populate to add a single worker, depopulate to shut down all current workers, or resize_to if you want a specific number of workers. Thread pools can be resized at any point, even while in use (although only one resize operation can happen at a time).
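
Based on the method names above, resizing a pool might look roughly like this (a minimal sketch; the exact signatures, such as resize_to taking a plain number, are assumptions on my part):

use forte::ThreadPool;

static THREAD_POOL: ThreadPool = ThreadPool::new();

fn main() {
    THREAD_POOL.populate();            // add a single worker
    THREAD_POOL.resize_to(4);          // scale to a specific number of workers (assumed signature)
    THREAD_POOL.resize_to_available(); // one worker per CPU core
    THREAD_POOL.depopulate();          // shut down all current workers
}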

The best way to interact with a thread-pool is through a Worker. There are several different ways to access workers, but the easiest is probably the ThreadPool::with_worker6 method.

THREAD_POOL.with_worker(|worker| {
    // ... do some work
});

This registers the current thread as a temporary worker on the thread pool, and creates a new thread-local worker context. If the thread is already a member of the pool, it just looks up the existing worker context. If you know a thread is a member of some pool but you don't know which, you can use Worker::with_current to access the current local worker context.
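
As a rough illustration, that might look like the following (a hypothetical sketch; whether the closure receives an Option<&Worker> or something else is an assumption, not confirmed API):

use forte::Worker;

fn run_on_whatever_pool_owns_this_thread() {
    // Hypothetical signature: the Option<&Worker> argument is assumed.
    Worker::with_current(|worker: Option<&Worker>| {
        if let Some(worker) = worker {
            // Queue work onto whichever pool this thread belongs to.
            worker.join(|_| { /* left half */ }, |_| { /* right half */ });
        }
    });
}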

You can also manually acquire a "lease" on a thread pool and occupy it to create a new worker.

let lease = THREAD_POOL.claim_lease(); // Lease some space from the thread pool.
Worker::occupy(lease, |worker| {
    // ... do some work
});

A worker is like a local "view" onto a thread pool; you can queue work onto it, and it will make sure it gets executed. It also provides access to a selection of multi-threading calls.

These are also exposed on ThreadPool and as first-class functions, but under the hood these are just proxies that manage the Worker for you. The most important of these are probably join and spawn_async. The former takes two closures and executes them in parallel; this is what lets rayon work its parallel iterator magic.

As an example, here's how you can use join to divide a slice into chunks which can be operated upon in parallel. Notice how you don't have to rely on a specific ThreadPool static to make this work; you just accept a &Worker ref.

fn chunk_mut<T, F>(worker: &Worker, data: &mut [T], max_size: usize, func: &F)
where
    T: Send + Sync,
    F: Fn(&mut [T]) + Send + Sync,
{
    // If the current chunk is less than or equal to the max size
    if data.len() <= max_size {
        // Then we can apply the function
        func(data);
    } else {
        // Otherwise, split the chunk into halves
        let split_index = data.len() / 2;
        let (head, tail) = data.split_at_mut(split_index);
        // And recurse in parallel on each half
        worker.join(
            |worker| chunk_mut(worker, head, max_size, func),
            |worker| chunk_mut(worker, tail, max_size, func),
        );
    }
}

This should be familiar to anyone who has used rayon_core before, just as spawn_async will be familiar to anyone who has used async Rust. In case you need a refresher, here's an example of how one might use spawn_async to calculate a checksum.

// Spawning an async job returns a task (from async_task)
let task = worker.spawn_async(|| async {
    let response = reqwest::get("https://internet.place").await?;
    let text = response.text().await?;
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    let hash = hasher.finish();
    Ok(hash)
});

// If you want to halt the job, you can cancel it or drop the task.
task.cancel();

// If you don't care about the result but still want it to run, you can detach it instead.
task.detach();

// In an async context, you can get the result by awaiting the task.
let result = task.await;

// Outside of an async context, you can use block_on instead.
let result = block_on(task);

Now let's move into something slightly less standard: scoped tasks.

Hybrid Scopes

Scopes exist to get around the 'static lifetime bound (which you generally need if you are spawning work with an unbounded lifetime). The forte::scope API is just like std::thread::scope. You pass in a closure over &Scope, and all work started within the closure must complete before forte::scope will yield back to the caller. This makes it possible to borrow stack-allocated data from outside the scope-closure, safe in the knowledge that the stack won't be popped while in use. Here's an example of what this looks like:

let mut a = vec![1, 2, 3];
let mut x = 0;

worker.scope(|scope| {
    scope.spawn(|| {
        println!("hello from the first scoped thread");
        // We can borrow `a` here.
        dbg!(&a);
    });
    scope.spawn(|| {
        println!("hello from the second scoped thread");
        // We can even mutably borrow `x` here,
        // because no other threads are using it.
        x += a[0] + a[2];
    });
    // Unlike rayon, the scope closure is always executed
    // on the current thread.
    println!("hello from the main thread");
});
// Both jobs will run before we reach this point
println!("goodbye from the main thread");

Scopes are tricky. Historically, Rust has had some difficulty implementing scopes properly (though this isn't an issue for my implementation). But for async scopes specifically, there's also a pretty well-known problem called The Scoped-Task Trilemma. The gist of it is:

Concurrency, Parallelizability, Borrowing — pick two.

Under this framework, API calls fall into three broad categories.

Near the end of the article introducing the "Trilemma", withoutboats remarks:

When it comes to “Parallelizability + Borrowing” we see that no async APIs exist at all. This maybe isn’t surprising, because async is all about enabling concurrency. But there’s a missing API here that seems obvious to me: an API which starts an executor across a thread pool, and blocks until all of the tasks you spawn onto it are complete.

Both bevy_tasks and forte provide this "missing" API. And with forte, scopes can be used to spawn both blocking and async jobs.

let (sender, receiver) = async_channel::unbounded();
worker.scope(|scope| {
    // Spawn a blocking job that sends a message over the channel
    scope.spawn(|| {
        sender.send_blocking("Hello").unwrap();
    });
    // Spawn an async job to receive and print the message
    let task = scope.spawn_async(|| async {
        let msg = receiver.recv().await.unwrap();
        println!("{msg}");
    });
    task.detach();
});

Support for scoped-async was a hard blocker for Bevy, since it uses this part of bevy_tasks extensively. But there are many situations where I think scoped-async could be useful to the broader Rust community, especially when combined with blocking calls like join.

Mixing Blocking and Concurrent Rust

I think of forte as a hybrid scheduler, because it's designed to run both blocking jobs and futures. Usually, making blocking calls inside of a future is a Bad Idea™. For example, calling rayon::scope in an async context will usually block our async executor. In this case, there's a critical safety requirement that prevents rayon from yielding back to the caller until all the work spawned within the scope is complete.

But with forte, there's an escape hatch: our scheduler is our executor, so we just continue executing async tasks while waiting for our blocking work. That means it's totally reasonable to mix async and blocking calls to a worker.

async fn concurrent_tasks(worker: &Worker) -> Result { // <------------------------ async
    // Load some data asynchronously
    let data = get_data().await;
    // Allocate a vec to hold the tasks for the jobs we spawn
    let mut tasks = Vec::new();
    // Create a scope that will let us borrow &data
    worker.scope(|scope| { // <------------------------------------------------- blocking
        // Iterate over the data
        for item in &data {
            // Spawn a task that processes each item
            let task = scope.spawn_async(|| async { // <--------------------------- async
                // Process each item in parallel
                // Notice how we are borrowing &item here
                let (head, tail) = forte::join( // <---------------------------- blocking
                    || item.head().process(),
                    || item.tail().process()
                );
                Ok((head?, tail?))
            });
            // Store the tasks
            tasks.push(task);
        }
    });
    // All the tasks are guaranteed to be complete
    futures::future::try_join_all(tasks).await // <-------------------------------- async
}

There are some dangers here. Forte cannot guarantee that your code will be free of deadlocks, and you do have to take care during your implementation.

Compatibility with rayon_core

If you've used rayon_core before, the forte API will probably feel familiar. But there are enough significant differences (especially the lack of a default thread pool) that it can't really be called a drop-in replacement. That also means that rayon can't run directly on forte.

Hmmm.... directly? Does that mean I can run rayon on forte indirectly?

Yes. I've written a crate called forte-rayon-compat that mocks the rayon_core API but redirects the calls to forte. Only about half the API works7, but it's the half that rayon uses for parallel iterators. Since there's a wide array of applications built using rayon, this crate gives us an easy (if slightly suspect) way to benchmark the performance of forte against rayon_core.

Benchmark Methodology

So how does forte perform? There's a variety of benchmarks we can use to assess that.

All of the performance numbers I'm about to give were either sampled with divan or aggregated across a large number of runs. It's still not super scientific8 but I am confident that the results are more than just noise.

Tree Traversal Benchmark

The chili crate sports a single benchmark, on which it performs very favorably compared to rayon. It's a simple traversal of a balanced binary tree using join, and is designed to assess overhead.
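
The benchmarked pattern is roughly the following (a sketch of the idea rather than the actual chili or forte benchmark code; the node layout is my own, and the join closure signatures follow the chunk_mut example above):

use forte::Worker;

struct Node {
    value: u64,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

// Sum a balanced binary tree, forking at every node with `join`.
fn sum(worker: &Worker, node: &Node) -> u64 {
    let (left, right) = worker.join(
        |worker| node.left.as_deref().map_or(0, |n| sum(worker, n)),
        |worker| node.right.as_deref().map_or(0, |n| sum(worker, n)),
    );
    node.value + left + right
}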

Nodes     | Serial    | Rayon-Core | Chili    | Forte
1023      | 1.24 μs   | 52.69 μs   | 3.01 μs  | 4.54 μs
16777215  | 27.91 ms  | 10.80 ms   | 6.15 ms  | 5.91 ms
134217727 | 318.50 ms | 97.47 ms   | 47.36 ms | 49.88 ms
Mean Time

Nodes     | Serial    | Rayon-Core | Chili    | Forte
1023      | 1.20 μs   | 27.87 μs   | 2.46 μs  | 3.45 μs
16777215  | 27.41 ms  | 10.46 ms   | 5.35 ms  | 5.67 ms
134217727 | 312.00 ms | 93.83 ms   | 40.51 ms | 47.7 ms
Fastest Time

Both chili and forte handle it well, with chili in the lead. It is worth pointing out that almost no real work is being done in this benchmark, and it's essentially just assessing latency. From the serial pass, we can estimate that the time-per-item is about 1.1 nanoseconds. Let’s increase that to about 10 nanoseconds (about 40 clock-cycles) and run the tests again.

Nodes     | Serial     | Rayon-Core | Chili     | Forte
1023      | 10.94 μs   | 57.61 μs   | 11.80 μs  | 12.36 μs
16777215  | 151.20 ms  | 21.23 ms   | 19.80 ms  | 17.55 ms
134217727 | 1230.00 ms | 167.80 ms  | 147.90 ms | 138.40 ms
Mean Time

Nodes     | Serial     | Rayon-Core | Chili     | Forte
1023      | 10.74 μs   | 29.87 μs   | 10.37 μs  | 10.41 μs
16777215  | 149.60 ms  | 20.72 ms   | 16.50 ms  | 17.11 ms
134217727 | 1214.00 ms | 165.70 ms  | 128.30 ms | 135.4 ms
Fastest Time

Surprisingly little changes: forte narrowly takes the lead, and though the gap closes a bit, rayon is firmly in last place. This does make it look like rayon::join has problems with both latency and throughput.

To be fair to rayon, this is still a bit of a pathological case. When left to its own devices, rayon spawns fewer jobs, in part to compensate for deficiencies it has in situations like this. Let's look at some of rayon's own benchmarks to see what effect this has.

Rayon's Benchmarks

The rayon repo has an app called rayon-demo that bundles together several different benchmarks. By replacing rayon-core with forte-rayon-compat (via a Cargo patch), it's also possible to run these benchmarks with forte.

Benchmark | Serial  | Rayon (Core) | Rayon (Forte)
Mergesort | 11.46 s | 1.08 s       | 1.27 s
Quicksort | 17.09 s | 2.20 s       | 2.39 s
Sieve     | 3.16 s  | 56.64 ms     | 58.47 ms
Matmul    | 0.84 s  | 52.10 ms     | 76.66 ms
Mean Time

Here we see rayon take the lead, as we would expect. Clearly, it is possible to get around the issue we saw in the last test. I think forte's performance here is reasonable, given the lack of work-stealing. Though it is a clear regression, especially on the Matmul example, it's still within the range that I would consider "usable".

When comparing to rayon, I am more interested in CPU utilization. Luckily, rayon has a "Game of Life" demo that reports this.

Backend               | Mean Time | CPU Usage
Serial                | 33.28 ms  | 8.5 %
Rayon (Core)          | 18.35 ms  | 38.4 %
Rayon (Core) Bridged  | 125.07 ms | 119.9 %
Rayon (Forte)         | 29.68 ms  | 7.7 %
Rayon (Forte) Bridged | 43.34 ms  | 8.2 %
Game of Life Benchmark

There's a lot to talk about here. Clearly, forte uses significantly less CPU. For the parallel iterators, performance is slightly worse, but interestingly, for bridged iterators (sequential iterators which rayon converts to parallel), performance actually improved!

One thing I should mention is that, because forte uses "Heartbeat Scheduling", the runtime has a tuning parameter called the Heartbeat Interval. This is, more or less, the rate at which individual workers distribute load to their peers. In my tests, I've been able to get good results with intervals as low as 5 µs (or 200 kHz). Varying this parameter seems to skew the performance distribution, but has a pretty small effect on the mean. It does, however, seem to have a very pronounced effect on CPU utilization. To produce the low CPU usage in this demo, I turned the heartbeat rate down to 2 kHz (an interval of 500 µs).

I suspect the reason forte performs so much better on bridged work is that rayon ends up spawning more individual tasks (and does less chunking). So this is where the performance deficit we saw in the tree-traversal benchmarks really starts to have an appreciable effect.

Now that we've set the scene with some artificial benchmarks, let’s look at how these different schedulers perform on something like a real-world application.

Pool-Racing Benchmark

The obvhs crate is a really excellent bounding-volume-hierarchy (BVH) library, written by @DGriffin91. Griffin also runs a repo called tray_tracing, with benchmarks of obvhs and other BVH implementations. Now he's been nice enough to put together pool_racing, for benchmarking how obvhs performs on different thread pools. We'll use this to compare the performance of forte, rayon, and chili when used to construct a BVH from scenes of varying complexity.

Phase | Serial | Rayon (Core) | Rayon (Forte) | Forte  | Chili
Init  | 50 μs  | 496 μs       |               | 20 μs  | 129 μs
Sort  | 54 μs  | 198 μs       |               | 122 μs | 109 μs
Build | 149 μs | 1.15 ms      |               | 485 μs | 644 μs
Basic Scene (Mean Time)
Phase | Serial | Rayon (Core) | Rayon (Forte) | Forte  | Chili
Init  | 80 μs  | 701 μs       |               | 37 μs  | 109 μs
Sort  | 35 μs  | 222 μs       |               | 41 μs  | 19 μs
Build | 159 μs | 1.52 ms      |               | 377 μs | 501 μs
Cornell Box Scene (Mean Time)
Phase | Serial    | Rayon (Core) | Rayon (Forte) | Forte    | Chili
Init  | 18.88 ms  | 9.69 ms      | 11.31 ms      | 12.21 ms | 17.13 ms
Sort  | 92.60 ms  | 14.45 ms     | 20.36 ms      | 37.39 ms | 86.20 ms
Build | 406.02 ms | 64.41 ms     | 66.31 ms      | 83.42 ms | 137.88 ms
Complex Scene (Mean Time)

Here again rayon decisively takes the lead, with the naive forte and chili implementations further behind. There are a few things to note here.

  1. The performance of rayon is similar when running on rayon-core or forte-rayon-compat.
  2. We still see forte win on small to midsize workloads.
  3. Across the board chili performed significantly worse.

To me, it looks like it's the internal batching magic happening within rayon that's the main factor in this test, rather than the thread-pool. These benchmarks also seem to bear out the idea that rayon is throughput-focused rather than latency-focused. Since forte is definitely latency-focused, I'm not at all upset with second place. I also suspect it will be possible to further improve forte's performance here... but that's a topic for another article.

So forte isn't quite as fast as rayon-core when used through rayon, but it is kinda close. That's nice, but all that really matters is how it stacks up against bevy_tasks.

Bevy-Tasks Iteration Benchmark

When looking at bevy_tasks, there are two things we should benchmark: parallel iteration, and async execution. The Bevy repo already has some parallel iteration benchmarks. The first one measures the overhead of iterating a list, in chunks of a hundred. Like the tree-traversal test, we do next to no work within the iterator.

Length     | Serial   | Bevy-Tasks | Rayon     | Forte
100        | 84.75 ns | 3.06 μs    | 2.03 μs   | 388.6 ns
1,000      | 773.6 ns | 6.06 μs    | 40.26 μs  | 1.37 μs
10,000     | 7.708 μs | 55.14 μs   | 49.53 μs  | 12.38 μs
100,000    | 76.59 μs | 870.60 μs  | 97.35 μs  | 75.71 μs
1,000,000  | 772.9 μs | 9.30 ms    | 170.10 μs | 326.60 μs
10,000,000 | 7.64 ms  | 101.20 ms  | 468.30 μs | 1.23 ms
Mean Time

And... wow, uh, bevy_tasks is really slow! Slower than serial iteration!! What's going on here?!

We're looking at iteration, so we should expect to see linear scaling, and indeed we do. In theory, going parallel just changes the scaling factor; the parallel runtime should be approximately proportional to the serial runtime divided by the number of cores. So, on a test like this, I'd expect all the parallel cases to scale more slowly than the serial implementation. Both forte and rayon do seem to be scaling at somewhere under half the rate of the serial version. Strangely, bevy_tasks appears to be scaling almost identically to the serial version.

Wait... since bevy_tasks also seems to have a higher initial cost, does that mean serial processing will always be faster? Has bevy_tasks been just totally useless this entire time?

No, it's not as bad as it looks. Clearly bevy_tasks has some astonishingly high per-item overhead. When the work is basically zero, that does mean it's always slower than serial iteration. But as the workload grows, the benefits of parallelization will still eventually become greater than the cost.

We can actually work out exactly when this happens. From the sequential case, it looks like visiting each item in the list costs ~0.7 ns. With bevy_tasks the per-item overhead is closer to 10 ns. Let's imagine adding ~100 ns of work while visiting each item, and let's assume we have ten million items. Dividing that work by the number of cores (in my case 13) and adding the overhead gives an average per-item completion time of around 17.7 ns when parallelized, and an expected total runtime of approximately 177 ms. This is far better than the expected serial runtime (more than a second), but far worse than the theoretical optimum of ~77 ms.
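
As a sanity check, here's that back-of-the-envelope arithmetic spelled out (the 13 cores and the per-item costs come from the measurements above; everything else is just multiplication):

fn main() {
    let items = 10_000_000_f64;
    let serial_cost_ns = 0.7; // measured serial cost per item
    let overhead_ns = 10.0;   // bevy_tasks per-item overhead
    let work_ns = 100.0;      // imagined extra work per item
    let cores = 13.0;

    // Serial: every item pays the work plus the serial visit cost.
    let serial_ms = (serial_cost_ns + work_ns) * items / 1e6;
    // bevy_tasks: the work is split across cores, but the overhead is paid per item.
    let bevy_tasks_ms = (work_ns / cores + overhead_ns) * items / 1e6;
    // Theoretical optimum: the whole serial runtime split across cores.
    let optimum_ms = (serial_cost_ns + work_ns) / cores * items / 1e6;

    println!("serial: {serial_ms:.0} ms");         // ~1007 ms
    println!("bevy_tasks: {bevy_tasks_ms:.0} ms"); // ~177 ms
    println!("optimum: {optimum_ms:.0} ms");       // ~77 ms
}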

On my machine, you need each task to take at least ~10 ns for bevy_tasks to beat serial iteration. We can double-check this by running the tests again and adding a 10 ns wait to each item.

Length     | Serial    | Bevy-Tasks | Rayon     | Forte
100        | 1.07 μs   | 4.84 μs    | 4.62 μs   | 1.421 μs
1,000      | 10.42 μs  | 14.06 μs   | 70.26 μs  | 12.92 μs
10,000     | 102.40 μs | 111.00 μs  | 82.26 μs  | 77.46 μs
100,000    | 1.02 ms   | 1.265 ms   | 200.80 μs | 318.80 μs
1,000,000  | 10.13 ms  | 13.44 ms   | 1.19 ms   | 1.298 ms
10,000,000 | 101.30 ms | 140.10 ms  | 10.65 ms  | 10.65 ms
Mean Time

It's close. Even at 10 ns we are not quite at the break-even point for bevy_tasks, but spend a bit longer on each item and we'd get there... eventually... probably. For the ten-million item test, neither forte nor rayon quite reach the theoretically optimal 7.7 ms. But they are not far off. And forte puts up a very good show here, actually improving relative to rayon compared to the previous test.

The cost of bevy_tasks is probably fine; after all, 10 ns is not huge even by CPU standards. But I'd be willing to bet that the vast majority of iterators won't take that long per item. And even if they did, that's just the break-even time. To actually see statistically significant performance benefits you probably need to be spending at least 25 ns on each item, and at that point we're talking about maybe a hundred clock-cycles.

It's patently obvious that this isn't what bevy_tasks was built for. After all, it's primarily an async runtime. If we want to evaluate it properly, we must also have a look at its async performance.

Bevy-Tasks Async Benchmark

TODO

Performance Summary

That was a lot of numbers, and I wouldn't blame you if you just skipped all of it to read the conclusions. I would summarize the results as follows:

I think these results are really promising for forte. It's not going to kill rayon_core any time soon, but depending on your needs I do think it presents a viable alternative.

Future Work