how-randomized-sql-testing-can-help-detect-bugs

We chose to build our own SQLSmith in Rust. It wasn't a mere whim. We tested various alternatives, analyzing their capabilities and adaptability to our needs. While some solutions showed promise, SQLSmith in Rust consistently emerged as the right solution.

The Motivation Behind

SQLSmith is a tool used for automated SQL query generation and testing. It is designed to explore the capabilities and limitations of database systems by generating random valid SQL queries and executing them against a target database. When a bug is discovered, the query causes the database to crash or produce unexpected errors.

Before implementing our own SQLSmith, Risingwave also tried another approach powered by AFL++. It generates random binary data and checks it using Risingwave's frontend to ensure syntactic and semantic correctness. The frontend consists of the following components:

Lexer: It breaks the input into smaller units called tokens based on SQL specifications. Tokens can be keywords, identifiers, literals, operators, punctuation symbols, etc.
Parser: It restructures the tokens into an abstract syntax tree (AST) that represents the hierarchical structure according to SQL's grammar rules. The AST captures the relationships between tokens and their syntactic meaning. For example, a Join has two inputs: a table or a subquery.
Binder: It determines the actual meaning of queries. For example, in the query select count(user) from foo, foo is a table that contains multiple columns. The binder must verify the column user's existence by consulting the database catalog. It also helps detect the compatibility of data types and related operations. For instance, + 8 seconds cannot be applied to a non-timestamp data type.
Optimizer: Once we have a syntactically and semantically correct query, we can optimize it through various transformations, such as pushing filters down or enumerating different join orderings to find the best one.

However, random binary data generated by a general-purpose fuzzer is unlikely to pass the parser or even the lexer. This means that testing the binder, optimizer, and execution engine becomes difficult. Using a domain-specific fuzzy testing tool is much more efficient.

There are a few open-source SQLSmith implementations available:

The first SQLSmith implementation borrowed ideas from CSmith.
SQLSmith implementation from CockroachDB.
SQLSmith implementation from PingCAP.

However, some reasons prevent us from directly using them to test RisingWave:

They do not support all the syntax and features of Risingwave. Risingwave is a streaming database with domain-specific features for streaming workload, such as time-related operations like time windows, watermarks, and temporal filters.
They support different syntax or more features than Risingwave supports. For example, they may support a different SQL dialect other than PostgreSQL. Even if they are PostgreSQL-compatible like Risingwave, Risingwave hasn't achieved 100% compatibility, so we have to disable features and adapt implementations accordingly.
Risingwave supports both batch SQL and streaming SQL. The underlying implementations of batch and streaming engines are independent. Therefore, besides testing, if a query can pass RW's frontend and execute without errors, we can embed correctness testing into SQLSmith. This involves comparing the results of running the same query using batch and streaming engines. This technique is typically referred to as differential testing.

Therefore, we chose to build our own SQLSmith in Rust.

How We Use SQLSmith for Testing

We run SQLSmith frontend tests per pull request and a snapshot of generated queries from SQLSmith. This has helped us catch many bugs in the four components mentioned above: Risingwave's frontend, execution engine, and storage engine.

Pull Requests

Initially, we ran SQLSmith with a different random seed for each pull request, as we believed that generating a different set of queries each time maximizes code coverage. However, in the early stages of development, we encountered numerous important and unimportant bugs. Our CI/CD pipeline blocks any pull request that fails tests from being merged. Thus, developers were under pressure to fix even lower-priority bugs, i.e., queries that are too complex for users to write in real life.

Additionally, queries generated for each pull request were simply forgotten once the test passed. This is not ideal, as the same bugs detected by SQLSmith may reappear in the future. It would be better to collect all the failed test cases and periodically test them to avoid repeating the same mistakes.

Weekly Snapshot Generation

To address this issue, we generate a snapshot of SQL queries to maintain a stable test set.

The snapshot approach allows us to prune the query set. We can remove queries that trigger unimportant bugs or invalid queries that were previously not rejected by Risingwave. Once a bug is detected, it will fail our run pre-generated queries workflow, preventing the pull request from being merged.

If a bug is deemed unimportant or actually an invalid query, we can remove it from the active test set. However, we still store the failing set of queries so that we can test and reproduce these errors.

Meanwhile, we generate a fresh set of queries weekly to ensure comprehensive coverage.

SQLSmith Test Runner Internals

At a high level, SQLSmith generates the following queries:

Data Definition Language (DDL) Queries, e.g., CREATE TABLE, CREATE MATERIALIZED VIEW.
Data Manipulation Language (DML) Queries, e.g., INSERT, UPDATE, DELETE.
Batch queries, e.g., SELECT.
Streaming queries, e.g., CREATE MATERIALIZED VIEW.
Session variables that configure database behavior, e.g., SET RW_ENABLE_TWO_PHASE_AGG TO FALSE, to influence the optimizer's query plan output with/without certain optimizations.

DDL statements allow us to run batch queries or create new materialized views based on the tables we create, enabling us to test our batch and stream engines.

DML statements ensure that data is processed through batch and stream engines.

Session variables allow us to test the new behavior of the database after modifying these variables. These variables are usually set per query, as they often modify the behavior of query plan optimization.

SQLSmith Query Generation Internals

The query generation process follows a top-down recursive approach, closely following the SQL Abstract Syntax Tree. This process is applied to batch and stream queries, selectively disabling certain SQL features not supported by the stream engine.

Here is a general example of how a complex query is generated:

Generate a WITH clause.
Generate the set expression, i.e., the SELECT statement.
Generate the ORDER BY clause.

For the set expression, we generate it through the following steps:

Generate the FROM expression, including any necessary joins.
Generate the WHERE expression.
Generate other parts of the SELECT query.
Generate the list of SELECT items.

When generating the list of SELECT items, the process first chooses a type and then calls gen_expr with that type. This can result in the generation of:

A scalar value.
A column.
Functions.
Aggregates.
Casts.
Other types of expressions, such as "CASE ... WHEN" statements.

All of these help ensure that various types of SQL statements are well-tested and reliable.

If you're interested in further details, you can read our developer docs.

Bug Findings

SQLSmith has already found nearly 40 bugs in Risingwave across the frontend, execution engines, and storage engine. If interested, you can check out the details by searching on GitHub.

Here are some interesting bugs we found:

bug(frontend): wrong cast for NULL in places we require boolean

SQLSmith is useful for ensuring test coverage of different CASTs from one data type to another. There are many combinations of types, which are too cumbersome for people to write test cases manually. It managed to uncover this NULL-casting issue caused by implicit_cast not being enforced correctly in the binder.
bug: risingwave-streaming-actor’ panicked at ’mem-table operation inconsistent

SQLSmith discovered a bug with FULL JOIN when the primary key is null. It exposed a limitation in our optimizer when dealing with streaming queries.
bug: two phase stream agg panicked at Option::unwrap()

SQLSmith detected a bug with two_phase_stream_agg that occurs during plan generation.

💡 WHAT IS TWO-PHASE AGGREGATION?

For group-by aggregation, Risingwave’s streaming engine can choose between two query plans:

Directly shuffling the data by the group-by columns and then aggregating. Doing aggregation before shuffling to reduce the amount of data sent over the network and then performing a second phase of aggregation. The second approach favors low-cardinality data, i.e., columns with few unique values, as the reduction in data can be significant.

bug (stream): mv-on-mv: vnode 98 should not be accessed by this table

SQLSmith found a bug with the computation of the hash key due to an overflow of the interval type during execution.

Future Improvements

Differential Testing

Other implementations of SQLSmith typically need an oracle of truth to test correctness. However, in Risingwave, we can largely mitigate this inconvenience. Risingwave has both a streaming engine and a batch engine, each implemented by a different set of operators optimized for their respective use cases, resulting in drastically different computation logics:

Batch SQL has a finite amount of input data, allowing for optimizations that are impracticable in continuous stream processing. Each operator can wait for all data to be ready before processing and can store all intermediate results completely in memory without dumping them to S3.
Streaming SQL has infinite input data and must store intermediate results in the storage engine backed by a remote object store. Additionally, every new piece of data is incrementally computed and updated in the materialized view.

Therefore, we generate and test the same query using batch and streaming computation engines. Finally, we compare the results to deduce any incorrectness. While the results from both engines can be wrong, the chances of them being wrong in the same way are minimal.

Source Testing

SQLSmith can be further enhanced to test connectors, such as reading data in a specific format from upstream sources. Taking an example from Risingwave's documentation (https://www.risingwave.dev/docs/current/create-source-kafka/#examples), Risingwave supports ingesting data from Kafka in six different data formats: Avro, Upsert Avro, JSON, Upsert JSON, Protobuf, and CSV. Consequently, Risingwave has six different code paths for parsing each data format.

SQLSmith is supposed to generate randomized valid or invalid input data with different data types, such as numeric or jsonb, to ensure that the parser can correctly parse valid data and reject ill-formed data.

SQL Reducer

Often, after detecting a failed query, it takes a significant amount of time to locate the cause of the bug because the queries generated by SQLSmith are very complex and not designed to trigger bugs intentionally.

As a result, we have to apply a binary-search-like algorithm manually:

Attempt to remove or replace part of the query with simpler elements to see if the failure persists.
If it does, repeat the first step within the untouched part of the query.
If it doesn't, further investigate the removed or replaced part.

This is a time-consuming process that could be automated. Discussions about this can be found on Risingwave's GitHub.

CONCLUSION

Without SQLSmith, we would have to generate test cases manually, which is inefficient. Although we could incorporate existing test cases from other databases, they may not be compatible with Risingwave’s syntax, requiring manual adjustments. Moreover, both approaches explore a search space that is much smaller than what SQLSmith is capable of. It has become evident to us that investing time into SQLSmith uncovers many more bugs than manual labor alone. SQLSmith has significantly increased our confidence in delivering new features and making changes to code or SQL syntax.

How Randomized SQL Testing Can Help Detect Bugs?

Table of contents