| # Quiescent-State Based Reclamation (QSBR) |
| |
| ## Introduction |
| |
| When implementing lock-free data structures, a key challenge is determining |
| when it is safe to free memory that has been logically removed from a |
| structure. Freeing memory too early can lead to use-after-free bugs if another |
| thread is still accessing it. Freeing it too late results in excessive memory |
| consumption. |
| |
| Safe memory reclamation (SMR) schemes address this by delaying the free |
| operation until all concurrent read accesses are guaranteed to have completed. |
| Quiescent-State Based Reclamation (QSBR) is a SMR scheme used in Python's |
| free-threaded build to manage the lifecycle of shared memory. |
| |
| QSBR requires threads to periodically report that they are in a quiescent |
| state. A thread is in a quiescent state if it holds no references to shared |
| objects that might be reclaimed. Think of it as a checkpoint where a thread |
| signals, "I am not in the middle of any operation that relies on a shared |
| resource." In Python, the eval_breaker provides a natural and convenient place |
| for threads to report this state. |
| |
| |
| ## Use in Free-Threaded Python |
| |
| While CPython's memory management is dominated by reference counting and a |
| tracing garbage collector, these mechanisms are not suitable for all data |
| structures. For example, the backing array of a list object is not individually |
| reference-counted but may have a shorter lifetime than the `PyListObject` that |
| contains it. We could delay reclamation until the next GC run, but we want |
| reclamation to be prompt and to run the GC less frequently in the free-threaded |
| build, as it requires pausing all threads. |
| |
| Many operations in the free-threaded build are protected by locks. However, for |
| performance-critical code, we want to allow reads to happen concurrently with |
| updates. For instance, we want to avoid locking during most list read accesses. |
| If a list is resized while another thread is reading it, QSBR provides the |
| mechanism to determine when it is safe to free the list's old backing array. |
| |
| Specific use cases for QSBR include: |
| |
| * Dictionary keys (`PyDictKeysObject`) and list arrays (`_PyListArray`): When a |
| dictionary or list that may be shared between threads is resized, we use QSBR |
| to delay freeing the old keys or array until it's safe. For dicts and lists |
| that are not shared, their storage can be freed immediately upon resize. |
| |
| * Mimalloc `mi_page_t`: Non-locking dictionary and list accesses require |
| cooperation from the memory allocator. If an object is freed and its memory is |
| reused, we must ensure the new object's reference count field is at the same |
| memory location. In practice, this means when a mimalloc page (`mi_page_t`) |
| becomes empty, we don't immediately allow it to be reused for allocations of a |
| different size class. QSBR is used to determine when it's safe to repurpose the |
| page or return its memory to the OS. |
| |
| |
| ## Implementation Details |
| |
| |
| ### Core Implementation |
| |
| The proposal to add QSBR to Python is contained in |
| [Github issue 115103](https://github.com/python/cpython/issues/115103). |
| Many details of that proposal have been copied here, so they can be kept |
| up-to-date with the actual implementation. |
| |
| Python's QSBR implementation is based on FreeBSD's "Global Unbounded |
| Sequences." [^1][^2][^3]. It relies on a few key counters: |
| |
| * Global Write Sequence (`wr_seq`): A per-interpreter counter, `wr_seq`, is started |
| at 1 and incremented by 2 each time it is advanced. This ensures its value is |
| always odd, which can be used to distinguish it from other state values. When |
| an object needs to be reclaimed, `wr_seq` is advanced, and the object is tagged |
| with this new sequence number. |
| |
| * Per-Thread Read Sequence: Each thread has a local read sequence counter. When |
| a thread reaches a quiescent state (e.g., at the eval_breaker), it copies the |
| current global `wr_seq` to its local counter. |
| |
| * Global Read Sequence (`rd_seq`): This per-interpreter value stores the minimum |
| of all per-thread read sequence counters (excluding detached threads). It is |
| updated by a "polling" operation. |
| |
| To free an object, the following steps are taken: |
| |
| 1. Advance the global `wr_seq`. |
| |
| 2. Add the object's pointer to a deferred-free list, tagging it with the new |
| `wr_seq` value as its qsbr_goal. |
| |
| Periodically, a polling mechanism processes this deferred-free list: |
| |
| 1. The minimum read sequence value across all active threads is calculated and |
| stored as the global `rd_seq`. |
| |
| 2. For each item on the deferred-free list, if its qsbr_goal is less than or |
| equal to the new `rd_seq`, its memory is freed, and it is removed from the: |
| list. Otherwise, it remains on the list for a future attempt. |
| |
| |
| ### Deferred Advance Optimization |
| |
| To reduce memory contention from frequent updates to the global `wr_seq`, its |
| advancement is sometimes deferred. Instead of incrementing `wr_seq` on every |
| reclamation request, the object's qsbr_goal is set to `wr_seq` + 2 (the value |
| the counter *would* take on its next advance) without actually advancing the |
| global counter. This is safe because the goal still corresponds to a future |
| sequence value that no thread has yet observed as quiescent. |
| |
| Whether to actually advance `wr_seq` is decided per request, based on how |
| much memory and how many items the calling thread has already deferred since |
| its last advance: |
| |
| * For deferred object frees (`_PyMem_FreeDelayed`), the thread tracks both a |
| count (`deferred_count`) and an estimate of the held memory |
| (`deferred_memory`). The global `wr_seq` is advanced when the freed block |
| is larger than `QSBR_FREE_MEM_LIMIT` (1 MiB), when the accumulated deferred |
| memory exceeds that limit, or when the count exceeds `QSBR_DEFERRED_LIMIT` |
| (127, sized so a chunk of work items is processed before it overflows). |
| Crossing any of these thresholds also sets a per-thread `should_process` |
| flag, signalling that the deferred-free list should be drained. |
| |
| * For mimalloc pages held by QSBR, the thread tracks `deferred_page_memory` |
| and advances `wr_seq` when either the individual page or the accumulated |
| page memory exceeds `QSBR_PAGE_MEM_LIMIT` (4096 * 20 bytes). Advancing |
| promptly here matters because a held page cannot be reused for a different |
| size class or by a different thread. |
| |
| Processing of the deferred-free list normally happens from the eval breaker |
| (rather than from inside `_PyMem_FreeDelayed`), which gives the global |
| `rd_seq` a better chance to have advanced far enough that items can actually |
| be freed. `_PyMem_ProcessDelayed` is still called from the free path as a |
| safety valve when a work-item chunk fills up. |
| |
| This optimization improves runtime speed but may increase peak memory usage |
| by slightly delaying when memory can be reclaimed; the size-based thresholds |
| above bound that extra memory. |
| |
| |
| ## Limitations |
| |
| Determining the `rd_seq` requires scanning over all thread states. This operation |
| could become a bottleneck in applications with a very large number of threads |
| (e.g., >1,000). Future work may address this with more advanced mechanisms, |
| such as a tree-based structure or incremental scanning. For now, the |
| implementation prioritizes simplicity, with plans for refinement if |
| multi-threaded benchmarks reveal performance issues. |
| |
| |
| ## References |
| |
| [^1]: https://youtu.be/ZXUIFj4nRjk?t=694 |
| [^2]: https://people.kernel.org/joelfernandes/gus-vs-rcu |
| [^3]: http://bxr.su/FreeBSD/sys/kern/subr_smr.c#44 |