Async cancellation and sync blocks: good luck debugging this
When you cancel an async task that has spawned a blocking thread, the async mutex guard gets dropped but the thread keeps running unprotected. Always pass owned guards to spawned threads.
The Setup: Code That Worked for Years
Imagine you have code like this:
```rust
use rand::Rng;
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

// let's pretend it's a real prod db which manages its synchronization on its own
type Db = Arc<std::sync::Mutex<HashMap<u32, u64>>>;

async fn frob(db: Db, serial: Arc<Mutex<()>>, n: u32) -> u64 {
    /* take the async lock, then run heavy_compute on a blocking thread */
}
```
You have some function with a precondition: it must be called serially. The function takes a long time, eating your precious CPU time, so you spawn it on a separate thread to avoid blocking the runtime.
The code worked for years without issues; everybody was happy. This calculation was the core of your company.
The Change: Adding a Fast Path
One day you find that you can actually query some other resource to get this value, e.g. another cache server.
So your main loop now looks like this:
```rust
async fn compute_or_fetch(db: Db, serial: Arc<Mutex<()>>, n: u32) -> u64 {
    tokio::select! {
        v = fetch_cached(n) => v,     // fast path: the new cache server
        v = frob(db, serial, n) => v, // slow path: compute it ourselves
    }
}
```
Heads-up: when `tokio::select!` picks one branch, the other future is dropped immediately, releasing any resources it owned (e.g., async mutex guards).
The Mystery: Inconsistent Results
You run the tests (you have them, right?) and find that your precious calculations are incorrect: sometimes the calculation for n is right, sometimes it shows stale values. You blame your real prod db for stale caches or something.
`heavy_compute` was working for years; nobody has changed anything in it.
The Investigation: Adding Assertions
Desperate, you modify it to shed some light:
and find out that the data differs from what you read at the start of the computation. The assert fires about once an hour.
You call the author of `heavy_compute`, asking why it stopped working.
You check all the code; it looks correct. The mutex is held while `heavy_compute` is called.
Hopelessly, you change `heavy_compute` one more time:
Magically, the assert stops firing. All is green.
This feels like a fix, but it's really just a clue.
We've papered over the race condition by forcing all blocking tasks to run one at a time, but we haven't addressed the root cause: why do we even have `spawn_blocking` tasks running in parallel if we have a mutex?
The Root Cause
So, what's the problem? Cancellation. `frob` is cancelled when `fetch_cached` finishes faster than `heavy_compute`.
When `tokio::select!` cancels `frob`, it drops the future's state, including the mutex guard, but the spawned thread keeps running.
Timeline:
T0: Task A acquires async mutex for round 5
T1: Task A spawns blocking thread for round 5
T2: Task A gets cancelled (select! chooses the cache path)
T3: Async mutex is released (guard dropped)
T4: Task B acquires async mutex for round 6
T5: Task B spawns blocking thread for round 6
T6: Both threads are now running heavy_compute() simultaneously!
How Async Cancellation Works
Let's rewrite our async fn as a state machine.
```rust
use std::task::Poll;

// This function represents the simplified work a runtime's executor does:
// keep polling until the future reports Poll::Ready.
fn block_on<T>(mut poll: impl FnMut() -> Poll<T>) -> T {
    loop { if let Poll::Ready(v) = poll() { return v; } }
}
```
`tokio::select!` can be viewed as such a loop (ignoring contexts, wakers, and all that stuff):
```rust
let mut fut1 = FrobState::Start;    // hand-rolled state machine for frob
let mut fut2 = fetch_cached(n);
loop {
    if let Poll::Ready(v) = poll_frob(&mut fut1) { break v; }   // winner returns...
    if let Poll::Ready(v) = poll_cached(&mut fut2) { break v; } // ...loser is dropped
}
```
So when either fut1 or fut2 completes, the other future is dropped, including its lock. But nobody cancels the spawned thread!
It's still working in the background, not knowing that nobody is waiting for its result and, more importantly, not knowing that it's no longer protected by the lock!
The Solution: Owned Guards
Acquire an owned mutex guard instead, e.g. via `Mutex::lock_owned` on an `Arc`-wrapped tokio mutex.
Instead of being stored in the async fn's state (which can be dropped at any `.await` point during cancellation), ownership of the guard moves directly into the closure given to `spawn_blocking`. Now the lock's lifetime is tied to the blocking task itself: it is released only when the blocking thread finishes its work, regardless of whether the original `frob` task that spawned it was cancelled.
The same thing applies to `tokio::fs` functions. E.g. `fs::read(some_huge_file)` can keep running after its caller is cancelled.
Runnable code for the article here