Await-Tree: A Panacea for Observability in Async Rust

Background

Async Rust provides developers with flexible and efficient asynchronous programming through a concise set of stackless-coroutine abstractions. However, its dynamic scheduling and execution model also makes debugging concurrency issues particularly challenging. In this article, we introduce Await-Tree, a debugging tool for Async Rust. Born out of the in-depth practice of Async Rust in RisingWave, a distributed streaming database, Await-Tree allows developers to export the execution status of async tasks in real time as a tree structure, enabling the analysis of asynchronous call chains within tasks and of the dependency-blocking relationships between tasks. It significantly improves system observability and debuggability with minimal runtime overhead.

This article is divided into three main parts:

  1. Review of the design and pain points of Async Rust.

  2. Introduction to the design principles and implementation of Await-Tree.

  3. Demonstration of use cases of Await-Tree in RisingWave.

Design and Pain Points of Async Rust

Review of Async Rust

Async Rust combines the Rust programming language with asynchronous programming. As an event-driven programming paradigm, it inherits the common advantages of asynchronous programming, such as:

  • The async/await keywords allow developers to write and understand asynchronous code with a synchronous programming mindset, while fully utilizing asynchronous interfaces to achieve high-performance I/O.

  • High-performance user-space cooperative scheduling enables developers to create a large number of coroutines in their applications with minimal performance overhead, thereby easily achieving high concurrency.

In addition, Async Rust inherits the distinctive features of the Rust programming language, such as high performance and safety, making it a preferred language for next-generation distributed systems, including databases. Its unique advantages here include:

  • Ownership and lifetimes, Send/Sync/Pin, and other language designs allow developers to ensure the safety and robustness of concurrent programs through compile-time type-system constraints.

  • Building upon these mechanisms, Async Rust adopts the design of stackless coroutines, avoiding the need to allocate separate stacks for each coroutine and perform context switching during scheduling. This further reduces the memory footprint and performance overhead of asynchronous programming.

Async Rust uses the "future" as the basic abstraction of its stackless coroutines. A future can be understood as a state machine, represented by an enum, whose states the runtime advances by repeatedly calling its poll function. Taking the diagram below as an example, the process of a future being scheduled and executed by the runtime can be simplified as follows (a minimal hand-written future illustrating this protocol is sketched after the list):

Rust's stackless coroutine abstraction: the Future.

  1. The state machine starts in the "init" state. When the runtime first calls the poll function, the future executes a block of synchronous code until it reaches the next state, such as "WaitForIO/WaitForRes" in the diagram, which indicates that the future is blocked on an I/O operation (network or disk) or on a resource (a lock or semaphore) and cannot proceed. At this point, the future saves the "waker" provided by the runtime as a callback for when the resource becomes available, and returns "pending" from the poll function.

  2. When the runtime learns that the future is in the pending state, it moves the future to a waiting queue and switches to the next task for scheduling.

  3. After some time, when the resource becomes available and the callback is invoked, the future is awakened. The runtime then puts the future back into the scheduling queue, where it waits for the next round of scheduling.

  4. When the future is scheduled by the runtime again, its poll function is called once more. The future restores its execution state from the state machine and repeats steps 1-3 until it enters the "done" state, indicating that it has completed execution.
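To make this protocol concrete, here is a minimal hand-written future, a sketch in the spirit of the classic timer example, where a plain thread stands in for a real I/O event source:

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{Arc, Mutex},
    task::{Context, Poll, Waker},
    thread,
    time::Duration,
};

// State shared between the future and the "resource" (a timer thread here).
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct Delay {
    shared: Arc<Mutex<Shared>>,
}

impl Delay {
    fn new(dur: Duration) -> Self {
        let shared = Arc::new(Mutex::new(Shared { done: false, waker: None }));
        let thread_shared = shared.clone();
        // Stand-in for network/disk I/O: a thread that completes after `dur`.
        thread::spawn(move || {
            thread::sleep(dur);
            let mut s = thread_shared.lock().unwrap();
            s.done = true;
            // Step 3 in the list: the callback wakes the future, putting its
            // task back into the runtime's scheduling queue.
            if let Some(waker) = s.waker.take() {
                waker.wake();
            }
        });
        Delay { shared }
    }
}

impl Future for Delay {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.shared.lock().unwrap();
        if s.done {
            Poll::Ready(()) // the "done" state
        } else {
            // Step 1 in the list: save the waker and return pending.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}
```

Awaiting Delay::new(..) from any async function exercises exactly the pending/wake/re-poll cycle of steps 1-4 above.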

In Async Rust, low-level futures like the one above, typically provided by the runtime, often need to be implemented manually by defining the behavior of the poll function. Fortunately, the async/await keywords let developers nest and compose these futures using synchronous programming techniques, with the compiler generating an anonymous future type for each async function. The composed futures ultimately form a "task", which executes on a single thread at a time and serves as the basic scheduling unit of the runtime.

Pain Points of Observability and Debugging in Async Rust

There's no such thing as a free lunch. While Async Rust brings excellent performance, safety, and flexibility, it sacrifices observability and debugging convenience.

  • Flexible composition of futures: In Async Rust, besides awaiting async functions as in synchronous code, developers can compose multiple futures to run concurrently within the same task using functions like join and select. These control flows require no extra language features: they are implemented by manually customizing the execution and scheduling logic in the poll function. From the CPU's perspective, however, the different futures still execute interleaved on a single thread; the "simultaneous execution" that developers understand at the business or logical level has no physical counterpart. This makes it difficult for thread-based observability and debugging tools to reflect the execution state of a specific task. For example, when printing a backtrace in an async context, we can only see the call stack of the future currently scheduled, and learn nothing about the other "concurrently executing" futures within the same task. The reason is that asynchronous execution logically forms a "call tree", rather than merely a "call stack" from the thread's perspective.

  • User-space scheduling and stackless coroutines: As mentioned earlier, stackful coroutines allocate a separate stack for each coroutine and switch between stacks during scheduling and execution. Stackless coroutines, by contrast, keep each coroutine's state in a state machine: the state is restored onto the thread's stack while the poll function executes, and saved back into the state machine when execution cannot proceed. Memory management for stackless coroutines poses certain challenges, such as holding references to stack-local variables, which Rust solves in its type system with the concept of Pin. But although memory safety is guaranteed, the fact that a stackless coroutine has no stack space while not executing makes it impossible for thread-based observability and debugging tools to recover the execution state of an async task before it was blocked. In synchronous programming, we can easily inspect a thread's call stack before it gets parked, for example by attaching a debugger. For async Rust, the same method only reveals the runtime's worker threads blocking synchronously on their task queues because no tasks are runnable; we have no way of knowing why a task is asynchronously blocked.

Evidently, the key to solving these pain points is a native observability and debugging tool designed specifically for tasks. Such a tool should hide the implementation details of a task's physical execution on a thread and reflect its logical execution state from the task's own perspective. For example, because of concurrent execution within a task, a task's execution state forms a tree rather than a stack; and when a task is about to be blocked, its logical execution state, the call tree, should be preserved for later observation and debugging.

Design Principles and Implementation of Await-Tree

Based on the ideas above, we built Await-Tree as a debugging tool for Async Rust. As the name suggests, Await-Tree traces the control flow of critical futures (i.e., await points) throughout their entire lifecycle at runtime, maintaining the logical execution state of each task as a tree in real time. The tree can be queried or exported at any moment to debug issues such as deadlocks encountered in Async Rust programs.

Let's use a simple Async Rust program to explain the usage and effect of Await-Tree:

Design principles and implementation of Await-Tree, part one: basic examples (1).

In this code snippet, we have some simple nesting of async functions and concurrent execution using join. Unlike usual code, we append .instrument_await to each critical future that we want to trace, giving it a name; the name can be a static string constant or carry additional runtime information. A minimal sketch of such code is shown below.
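A rough reconstruction of the snippet, assuming tokio as the runtime; the function names and durations are illustrative, and registering the task's root with the crate's registry (required for actually exporting the tree) is omitted:

```rust
use std::time::Duration;

use await_tree::InstrumentAwait;

async fn work(secs: u64) {
    // A leaf future blocked on a timer, standing in for real I/O.
    tokio::time::sleep(Duration::from_secs(secs))
        .instrument_await("sleep")
        .await;
}

async fn bar() {
    work(3).instrument_await("work in bar").await;
}

async fn foo() {
    // Two branches run concurrently within the same task;
    // each instrumented future becomes a node in the task's await-tree.
    tokio::join!(
        bar().instrument_await("bar"),
        work(2).instrument_await("work in foo"),
    );
}
```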

After running the foo function for 1 second, all branches of the task should be blocked, each waiting for a sleep to be awakened, and the task's entire "stack" is stored in a state machine inside the runtime. With Await-Tree, however, we can reconstruct the logical execution state of the task and see how long each future has been running.
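For the sketch above, the exported tree might render roughly as follows; this is a hypothetical rendering, and the figure shows the actual output:

```
foo [1.004s]
  bar [1.004s]
    work in bar [1.004s]
      sleep [1.004s]
  work in foo [1.004s]
    sleep [1.004s]
```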

Design principles and implementation of Await-Tree, part one: basic examples (2).

Continuing execution for 2 more seconds, some of the sleep futures are awakened and the task enters a new execution state. The exported Await-Tree reflects this updated state, and each future's duration is updated as well. Unlike the tree from 2 seconds ago, this tree contains a current pointer marking the position the CPU is currently executing: its presence indicates that the task is currently inside some future's poll function rather than being blocked.

Design Details of Await-Tree

After gaining an intuitive understanding of how Await-Tree is used, let's delve deeper into its design. Await-Tree's structure is maintained by following the control flow of futures' poll functions, so for Await-Tree to accurately reflect the execution state of an async task, it is crucial to account for every control flow a future may take throughout its lifecycle.

Let's use a slightly more complex but common Async Rust program to demonstrate how Await-Tree responds to the control flow of futures and maintains its structure. In this example, we set a timeout on the execution of the query function: when the timeout fires, we print a warning message and then continue waiting for query to complete. A sketch of the code follows the figure below.

Design principles and implementation of Await-Tree, part two: design details.
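Assuming tokio and its select! macro, the code in the figure might be sketched like this; the query body and the timeout duration are illustrative:

```rust
use std::{pin::pin, time::Duration};

use await_tree::InstrumentAwait;

async fn serve() {
    handle().instrument_await("handle").await;
}

async fn handle() {
    // Pin query on the stack so that select can poll it by reference,
    // and we can keep awaiting the same future after the timeout fires.
    let mut query = pin!(query().instrument_await("query"));
    let timeout = tokio::time::sleep(Duration::from_secs(10)).instrument_await("timeout");

    tokio::select! {
        _ = &mut query => return, // query finished in time
        _ = timeout => println!("warning: query is taking too long"),
    }

    // select only borrowed query, so it was not cancelled;
    // continue waiting for it to complete.
    query.await;
}

async fn query() {
    // Stand-in for a request blocked on network I/O.
    tokio::time::sleep(Duration::from_secs(20)).await;
}
```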

  • Construction of futures: Suppose we call the handle function from the serve function. When we first enter handle, we construct three futures: query, select, and timeout. Since the execution of futures is generally lazy, merely constructing them does not affect the Await-Tree: the current pointer still points to handle.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (1).

  • First poll of futures: Next, since we await select, its poll function is called for the first time. Await-Tree creates a new node as a child of the current node and moves the current pointer to it. Inside select's poll implementation, it polls each of the two inner futures in turn; likewise, Await-Tree creates a node for each and moves the current pointer.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (2).

  • Future returns pending: After executing some synchronous code inside query, suppose it blocks on network I/O and its poll function returns pending. Although we physically exit the poll function, the future is logically still in the middle of its execution, so we keep its node and move the current pointer up to the parent node, select. The same logic applies to the other arm of the select, timeout.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (3).

  • Task is pending: The poll function of select returns pending, causing its callers to recursively return pending, until the entire task returns pending. Since the whole task is now blocked, the current pointer no longer points to any node. The tree structure of the Await-Tree, however, is preserved across the task's suspension and can be accessed by other threads.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (4).

  • Polling the futures again: After some time, timeout is awakened first, and the runtime reschedules the task, recursively calling each future's poll function. Since these futures are already running, their nodes already exist in the Await-Tree, and the whole process only moves the current pointer.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (5).

  • Future returns ready: On its second poll, timeout immediately returns ready, and by the semantics of select, select returns ready as well. Since timeout has completed, its node can be removed from the Await-Tree. As for query: it has not finished executing, and only a reference to it was passed into select, so the return of select does not cancel it. Because its parent node has been removed, query now becomes detached from the Await-Tree, no longer connected to any other node.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (6).

  • Rebuilding the future call hierarchy: After printing the warning log, we continue to wait for the completion of query in the handle function. At this point, the query node in the Await-Tree notices that its parent has changed and re-establishes the parent-child call relationship with handle.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (7).

  • Future returns ready: After some time, query completes its execution and returns ready, causing handle to return ready in turn, and the Await-Tree reverts to its initial state. In addition, futures have one more control flow, "cancel", which behaves similarly to ready on the Await-Tree except that it does not need to manipulate the current pointer; we won't go into its details here.

Design principles and implementation of Await-Tree, part three: maintenance of the Await-Tree (8).

Implementation of Await-Tree

After gaining a deeper understanding of how Await-Tree should be maintained based on the different control flows of futures, we can now further discuss the programming implementation of Await-Tree.

Since an Await-Tree represents the execution state of a single task, and a task's physical execution is single-threaded at any moment, maintaining the tree requires no inter-thread synchronization. Additionally, to keep the interface simple, the Await-Tree data structure needs to live in a globally accessible context. Based on this, we maintain the Await-Tree in task-local storage, which can be understood as the coroutine counterpart of thread-local storage.
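The idea might look like the following sketch, using tokio's task_local! macro; TreeContext is an illustrative placeholder for the per-task state, not the crate's actual type:

```rust
use std::cell::RefCell;

// Illustrative placeholder for the per-task tree state:
// the arena of nodes plus the "current" pointer.
struct TreeContext {
    current_node: usize,
}

tokio::task_local! {
    // One tree context per task, reachable from every future the task
    // polls, with no cross-thread synchronization required.
    static AWAIT_TREE: RefCell<TreeContext>;
}

async fn run_traced(task: impl std::future::Future<Output = ()>) {
    // All futures polled within this scope can read and update the tree.
    AWAIT_TREE
        .scope(RefCell::new(TreeContext { current_node: 0 }), task)
        .await;
}
```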

Implementing a linked data structure in Rust is quite challenging and can introduce memory-safety issues through misuse of unsafe Rust. We therefore implement Await-Tree as an arena-based index tree: each future only holds an ID into the arena to access its corresponding node. This design keeps Await-Tree entirely free of unsafe code, as in the sketch after the figure below.

Implementation of Await-Tree.
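A minimal sketch of the arena idea, with illustrative types rather than the crate's actual node layout: nodes live in a flat Vec and refer to one another by plain indices, so no unsafe pointer manipulation is needed, and "detaching" a node is just a matter of its parent link:

```rust
/// A node of the tree, addressed by its index in the arena.
struct Node {
    span: String,          // the name given via instrument_await
    parent: Option<usize>, // None for the root or a detached node
    children: Vec<usize>,
}

/// An arena-based index tree: futures only hold a usize ID into `nodes`.
#[derive(Default)]
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn push_child(&mut self, parent: Option<usize>, span: String) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { span, parent, children: Vec::new() });
        if let Some(p) = parent {
            self.nodes[p].children.push(id);
        }
        id
    }

    /// Reattach a node under a new parent, as in the "rebuilding the
    /// call hierarchy" step of the walkthrough above.
    fn set_parent(&mut self, id: usize, new_parent: usize) {
        if let Some(p) = self.nodes[id].parent {
            self.nodes[p].children.retain(|&c| c != id);
        }
        self.nodes[id].parent = Some(new_parent);
        self.nodes[new_parent].children.push(id);
    }
}
```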

In terms of interface design, we adopt an approach similar to the future-adaptor design of futures::FutureExt. By implementing the InstrumentAwait trait for all futures, developers can directly call .instrument_await on any future to obtain a wrapped future that Await-Tree can track and maintain, roughly as in the following sketch.
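A minimal sketch of this adaptor pattern, not the crate's actual implementation; pin-project-lite is used for safe pin projection, and the tree bookkeeping is only indicated in comments:

```rust
use std::{
    future::Future,
    pin::Pin,
    task::{Context, Poll},
};

pin_project_lite::pin_project! {
    // The wrapped future; the real crate also stores the ID of this
    // future's node in the task-local await-tree.
    pub struct Instrumented<F> {
        #[pin]
        inner: F,
        span: &'static str,
    }
}

// Extension trait with a blanket impl, in the style of futures::FutureExt,
// so .instrument_await is available on every future.
pub trait InstrumentAwait: Future + Sized {
    fn instrument_await(self, span: &'static str) -> Instrumented<Self> {
        Instrumented { inner: self, span }
    }
}
impl<F: Future> InstrumentAwait for F {}

impl<F: Future> Future for Instrumented<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.project();
        // Real implementation: on the first poll, create a node named
        // `this.span` under the current node; on every poll, move the
        // current pointer here before polling the inner future.
        let res = this.inner.poll(cx);
        // Real implementation: on Ready, remove this node from the tree;
        // on Pending, move the current pointer back to the parent.
        res
    }
}
```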

In the poll logic of Instrumented, we implement a state machine for Await-Tree nodes following the design details described in the previous section. It is worth noting that even when the tracked future has manually customized poll logic (such as the common join/select), as long as its behavior is "structured", Await-Tree can almost always track its execution state correctly through its rigorous and comprehensive implementation, with no special awareness or handling required. This also makes Await-Tree naturally compatible with various future and runtime libraries.

Application of Await-Tree in RisingWave

Await-Tree was designed based on the in-depth practice of Async Rust in RisingWave. As a new-generation cloud-native streaming database based on SQL, RisingWave maintains real-time, incrementally updated materialized views through distributed stream-computing tasks, and manages streaming state and table data through a purpose-built shared storage engine on S3. Compared to OLAP systems, RisingWave's compute tasks run long-term and demand high stability. Moreover, RisingWave's stream operators have complex logic in which computation and state-store requests interleave, so they depend heavily on async. Await-Tree has therefore been deployed in RisingWave's production environment for a long time; it significantly improves the observability and debuggability of Async Rust at very low runtime overhead and has helped us resolve several tricky async deadlock issues.

Supplement to Backtrace

Await-Tree was designed natively for tasks in Async Rust and provides the logical call tree of a task, so it can directly serve as a supplement to thread backtraces. For example, when a panic occurs in RisingWave, the Await-Tree of the current task is exported and printed to the log. With the dynamically attached runtime information, developers can see clearly which executor hit the problem and which materialized view it belongs to, as well as the other concurrent futures within the same task, making the problem much easier to locate.

Supplement to Backtrace.

Debugging Async Stuck

RisingWave adopts an MPP-like execution engine design, in which data processing is divided into fragments according to data distribution. Each fragment is executed by multiple independent actors, each running on its own async task. Actors communicate with one another through local channels or remote gRPC, and inside each actor, multiple nested executors execute concurrently. Owing to the complexity of the engine model and the computation logic, RisingWave frequently ran into "async stuck" deadlock issues during early development. The introduction of Await-Tree has greatly simplified debugging such problems.

Case 1: Future detach

The backend of RisingWave's state storage engine is S3 object storage. To improve read and write performance, we developed a tiered cache layer that caches storage blocks in memory and on local disk. When a block request misses the cache and must issue a network request, a common optimization is to intercept subsequent requests for the same block so that they simply wait for the first network request to return. This technique is also known as "single flight".

After introducing this optimization into RisingWave, we often hit a problem in testing where a streaming job would get stuck and could not continue executing. By printing the Await-Trees of all actors, we found the following execution state: the network requests of the block cache were detached from the Await-Tree. This means that even when the actor is awakened by the completion of the network request, the poll function of the corresponding get future is never called again, so the wait futures on the current execution path can never be notified to complete, resulting in a deadlock.

Async Stuck: Future Detach.

After analyzing the code, we found that our join operator uses select over its two upstream executors to continuously pull data to process from both sides. This creates a latent problem: when one side has a batch of data ready for processing, the other side may be in an arbitrary state, such as waiting on a network request for a block. The join operator nevertheless takes over the task's execution right and starts processing the batch. If, during this processing, the join operator also needs to access the state store and happens to touch the very block the other side is fetching, the single-flight optimization makes it wait for the earlier request to complete, temporarily putting the actor into a pending state.

After some time, the earlier get request completes, the actor is awakened, and its poll function is called again. Unfortunately, since the join operator currently monopolizes the actor's execution right, the wakeup never reaches the poll of the detached get future, which therefore cannot notify the other wait futures to complete. From the perspective of the wait inside join, this wakeup looks spurious, so it returns pending again; as a result, the actor permanently loses its chance to be awakened and scheduled.

Once the problem is identified, fixing it is easy: the root cause of this async stuck is that the block cache's single-flight implementation assumes all requests have equal scheduling opportunities, while the nested executor calls within an actor violate this assumption. All we need to do is run the block cache's request and its completion notification on a newly spawned task, guaranteeing that they can always be scheduled independently, as in the sketch below.
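A sketch of the fixed pattern with illustrative types; fetch_from_network and the key/block types are hypothetical stand-ins for RisingWave's real block cache:

```rust
use std::{
    collections::HashMap,
    sync::{Arc, Mutex},
};

use tokio::sync::watch;

type Block = Arc<Vec<u8>>;

#[derive(Default)]
struct SingleFlight {
    // One watch channel per in-flight block request.
    inflight: Mutex<HashMap<u64, watch::Receiver<Option<Block>>>>,
}

impl SingleFlight {
    async fn get(self: Arc<Self>, key: u64) -> Block {
        let mut rx = {
            let mut inflight = self.inflight.lock().unwrap();
            if let Some(rx) = inflight.get(&key) {
                // Later callers just wait for the first request's result.
                rx.clone()
            } else {
                let (tx, rx) = watch::channel(None);
                inflight.insert(key, rx.clone());
                let this = self.clone();
                // The fix: run the fetch and the completion notification on
                // a spawned task, so they are always independently
                // schedulable, no matter which future of the calling actor
                // is currently being polled (or not polled at all).
                tokio::spawn(async move {
                    let block = fetch_from_network(key).await;
                    this.inflight.lock().unwrap().remove(&key);
                    let _ = tx.send(Some(block));
                });
                rx
            }
        };
        // Wait until the spawned task publishes the block.
        loop {
            if let Some(block) = rx.borrow().clone() {
                return block;
            }
            rx.changed().await.expect("fetch task aborted");
        }
    }
}

// Hypothetical stand-in for the real network request to S3.
async fn fetch_from_network(_key: u64) -> Block {
    Arc::new(vec![0u8; 4096])
}
```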

Case 2: Circular resource dependency

In addition to individual task-related stuck issues, thanks to the ability to embed runtime information, the Await-Tree of multiple tasks can be easily associated to debug dependency deadlocks and other problems.

In streaming computing scenarios, the workload is highly variable. To address this, RisingWave offers online scaling, giving compute tasks controlled latency and high resource utilization. When faced with a sudden surge in load, however, RisingWave prioritizes the stability of compute tasks through a back-pressure mechanism between actors.

Actors in RisingWave communicate through local channels or remote gRPC streaming. As actor parallelism grows, to avoid excessive consumption of system resources such as ports and file descriptors for remote communication, we introduced an optimization that reuses gRPC connections. After introducing it, however, we found in extreme testing scenarios that once actor parallelism reached 100+ in a single process, the entire streaming job would easily get stuck under high load and fail to make progress. Based on our previous experience with async stuck issues, we dumped the Await-Trees of all actors and observed the following phenomenon:

Async Stuck: Circular Resource Dependencies.

In the exported Await-Trees, the blocking reasons reported by different actors formed a cycle: downstream actors believed their upstream produced no data, so they could not continue processing, while upstream actors believed their downstream did not consume data in time, so back pressure kept them from continuing. This is impossible in a directed acyclic execution graph. Through comparative observation, we narrowed the problem down to the gRPC streaming used for remote communication between actors. Digging into the internals of the gRPC library tonic, we found that it sets a total flow-control window for a reused gRPC connection, shared by all gRPC streams on it: if one stream occupies a disproportionately large share of the window, other, unrelated streams may be back-pressured, violating the principle that back pressure should propagate in topological order along the directed acyclic execution graph.

Clearly, the cause of this async stuck issue is similar to Case 1: both introduce an incorrect assumption of equality among resources that in fact depend on each other, forming reverse dependencies and thus a circular resource dependency and deadlock. After identifying the problem, we chose to disable flow control at the gRPC connection level while retaining each gRPC stream's independent flow control, as sketched below. This preserves the back-pressure mechanism over gRPC while avoiding interference between the communication of different actors.
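With tonic, the client-side settings might look roughly like this; the address and window sizes are illustrative, and the exact knobs may differ across tonic versions:

```rust
use tonic::transport::Endpoint;

fn actor_exchange_endpoint() -> Endpoint {
    Endpoint::from_static("http://actor-peer:5688")
        // Effectively disable connection-level flow control by raising the
        // shared connection window to the HTTP/2 maximum (2^31 - 1)...
        .initial_connection_window_size(Some((1u32 << 31) - 1))
        // ...while keeping a bounded per-stream window, so back pressure
        // still applies independently to each actor-to-actor stream.
        .initial_stream_window_size(Some(2 * 1024 * 1024))
}
```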

Conclusion

In this article, we introduced Await-Tree, a powerful observability tool for Async Rust. Await-Tree is a backtrace tool designed natively for Async Rust: it allows developers to observe the execution state of every async task in real time and to analyze the dependency-blocking relationships between futures and tasks. Await-Tree has long been used in RisingWave's production environment and has repeatedly helped developers resolve challenging async stuck issues.

Await-Tree is open source on GitHub: github.com/risingwavelabs/await-tree. Readers are welcome to integrate Await-Tree into their own Async Rust systems.