Stream Processing Engines and Streaming Databases: Design, Use Cases, and the Future
In the fast-evolving field of real-time analytics, many stream processing engines have emerged over the past decade. Notable examples include Apache Storm, Apache Flink, and Apache Samza. These engines have become widely accepted in various enterprises, providing substantial support for real-time processing and analytical applications.
In the last few years, a new and intriguing innovation has surfaced: streaming databases. Starting with early solutions like PipelineDB, conceived as a PostgreSQL extension, followed by Confluent's ksqlDB designed for Kafka, and the more recent open-source RisingWave, a distributed SQL streaming database, these systems have steadily climbed the ladder of acceptance and popularity.
The question arises: can stream processing engines and streaming databases be used interchangeably? This article sheds light on their respective design principles and explores the distinctions, similarities, use cases, and potential future trajectories of these technologies.
Design Principles
Stream processing engines and streaming databases both serve critical functions in data stream processing, yet they differ notably in their design principles, particularly in their user interaction interfaces and data storage options. Before diving into their unique characteristics, it helps to understand the historical backdrop that gave rise to these technologies.
The Evolution of Stream Processing Engines
Database systems have been under exploration for over six decades, while computing engines—encompassing both batch and stream processing—are relatively recent innovations. The journey into modern stream processing began in 2004 with the release of Google's MapReduce paper.
MapReduce aimed to optimize resource utilization across networks of commodity machines striving for peak performance. This lofty goal led to the introduction of an elegant low-level programming interface with two principal functions: Map and Reduce. Such a design granted seasoned programmers direct access to these core functions, allowing them to implement specific business logic, control program parallelism, and manage other intricate details on their own.
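To make this concrete, here is a minimal word-count sketch in Python (not the original MapReduce API) showing the shape of the Map and Reduce functions; the shuffle step that a real framework distributes across many machines is simulated in a single process.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in the input document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return word, sum(counts)

def run_job(documents):
    # Shuffle phase: group intermediate values by key. A real framework
    # distributes this step (and the map/reduce calls) across many machines.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            grouped[word].append(count)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())

print(run_job(["to be or not to be", "to stream or to batch"]))
```

The appeal of the model is that users only write `map_fn` and `reduce_fn`; parallelism, data movement, and fault handling are the framework's concern.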
What set MapReduce apart was its exclusive focus on processing, relegating data storage to external systems like remote distributed file systems. This division resulted in a unique architecture that continues to influence today's stream processing engines and sets the stage for understanding the vital differences and synergies with streaming databases.
To successfully utilize MapReduce inside a company, three critical prerequisites were essential:
The company must possess a substantial volume of data and relevant business scenarios internally;
The company must have access to a sufficient number of commodity machines;
The company must employ a cadre of skilled software engineers.
These three conditions were a high bar for most companies when MapReduce was introduced in 2004. It wasn't until after 2010, with the explosion of social networks and the mobile internet, that the first condition began to be satisfied. This shift led major companies to turn their attention to stream processing technology, investing heavily to meet the second and third prerequisites. That investment marked the golden age of development for stream processing engines, which began around 2010. During this period, a slew of exceptional stream processing engines emerged, including Apache Storm, Apache Samza, and Apache Flink.
These emergent stream processing engines fully embraced the core principles of the MapReduce design pattern, specifically:
exposing low-level programming interfaces to users;
relinquishing control over data storage.
This design was aptly tailored to the big data era. Usually, technology companies requiring substantial data processing had their own data centers and specialized engineering teams. What these companies needed were improved performance and more adaptable programming paradigms. Coupled with their specialized engineering teams, they could typically deploy distributed file systems to handle vast data storage.
However, the landscape began to shift with the remarkable advancements in cloud computing technology after 2015. More companies, even those without extensive technical backgrounds, wanted access to stream processing technology. In response, stream processing engines embraced SQL, a concise and universally recognized programming language, opening the door for a broader audience to reap the benefits of stream processing. As a result, today's leading stream processing engines offer a tiered approach to user interaction, providing low-level programming interfaces such as Java and Scala as well as more accessible high-level SQL interfaces.
User Interaction Interfaces
Modern stream processing engines offer both low-level programming interfaces, such as Java and Scala, and higher-level interfaces, like SQL and Python. The low-level interfaces expose various runtime details, such as parallelism, allowing users to control the more nuanced aspects of their applications. By contrast, streaming databases primarily feature SQL interfaces, which hide these runtime complexities; some also provide User-Defined Functions (UDFs) in languages like Python and Java to enhance expressive capabilities. This divergence in user interaction interfaces leads to two principal balancing acts.
1. Balance Between Flexibility and Ease of Use
Proficient programmers benefit from the considerable expressive power offered by low-level interfaces such as Java and Scala. As Turing-complete languages, they theoretically allow users to express any logic. For companies with pre-existing Java and Scala libraries, this is particularly advantageous.
However, due to its complexity, this approach may deter those unfamiliar with Java or Scala. Mastering a system's custom API can be time-consuming and demanding, even for technical users.
The higher-level SQL interfaces of streaming databases address some of these usability challenges, but two main obstacles remain. The first is the completeness of SQL support, which involves more than simple DML (Data Manipulation Language) statements: users often require DDL (Data Definition Language) operations such as creating or deleting roles, users, and indexes. The second is support for the broader SQL ecosystem, which requires additional effort, such as compatibility with management tools like DBeaver or pgAdmin.
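To illustrate why this ecosystem compatibility matters, here is a hedged sketch that assumes a PostgreSQL-wire-compatible streaming database running locally; the connection parameters and the `clicks` table are hypothetical. Because such a database speaks the PostgreSQL protocol, the same connection settings also work from tools like DBeaver or pgAdmin.

```python
import psycopg2  # standard PostgreSQL driver

# Connection parameters are placeholders for a hypothetical local deployment;
# adjust them for your own setup.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    # DDL: beyond plain DML, users expect to manage objects such as tables,
    # indexes, users, and roles.
    cur.execute("CREATE TABLE IF NOT EXISTS clicks (user_id INT, url VARCHAR, ts TIMESTAMP)")
    # DML: insert and query exactly as with any PostgreSQL database.
    cur.execute("INSERT INTO clicks VALUES (1, '/home', now())")
    cur.execute("SELECT count(*) FROM clicks")
    print(cur.fetchone())
```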
Low-level interfaces give technically-focused users more flexibility, while SQL interfaces of streaming databases are tailored for business-focused users, emphasizing ease of use.
2. Balance Between Flexibility and Performance
The availability of low-level programming interfaces allows users to optimize system performance using prior knowledge. Those with deep programming expertise and business logic comprehension often achieve superb performance through low-level interfaces. The flip side is that this may require companies to invest more in building their engineering teams.
In contrast, stream processing engines with tiered interfaces may face performance drawbacks when users work through the high-level interfaces. Encapsulation introduces overhead, and the more encapsulation layers there are, the worse the performance. The intermediate layers can leave the underlying engine oblivious to the higher-level logic, causing further performance loss. In comparison, streaming databases that offer only SQL can optimize the entire query holistically and thus reach higher performance.
In conclusion, stream processing engines with low-level interfaces allow technically-focused users to exploit their expertise for performance benefits. Conversely, streaming databases that provide SQL interfaces are tailored to serve business-focused users, often achieving higher performance bounds without requiring intricate technical expertise.
Data Storage
Apart from user interaction interfaces, the most significant difference between stream processing engines and streaming databases is whether the system stores data. In stream processing, the presence or absence of data storage directly affects various aspects of the system, such as performance, fault recovery, scalability, and use cases.
In a stream processing engine, data input and output typically occur in external systems, such as remote distributed file systems like HDFS. In contrast, a streaming database not only possesses computational capabilities but also includes data storage capabilities. This means it can do at least two things: 1) store inputs, and 2) store outputs.
On the data input side, storing data yields significant performance benefits. Consider a scenario where a naive stream processing engine joins a data stream from a message queue (like Kafka) with a table stored in a remote database (like MySQL). If the engine lacks data storage, a significant problem arises: whenever a new data item enters the stream, the engine must fetch data from the remote database before performing calculations. This approach was adopted by the earliest distributed stream processing engines. Its advantage is that it simplifies the system architecture, but the drawback is a sharp decrease in performance: accessing data across different systems incurs high latency, and introducing high-latency operations into a latency-sensitive stream processing engine degrades performance.
Storing (or caching) data directly within the stream processing system helps to avoid cross-system data access, thereby improving performance. In a streaming database, which is a stream processing system with built-in storage capabilities, a user can directly replicate the table into the streaming database if they need to join a data stream from a message queue with a table from a remote database. This way, all access becomes internal operations, leading to efficient processing. Of course, this is a simplified example. In real-world scenarios, challenges such as large or dynamically changing tables in the remote database may arise, which we won't delve into here.
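The contrast can be sketched in a few lines of Python. This is a conceptual illustration, not any engine's real API, and the data, latency figure, and table contents are made up: the first join pays a remote round trip per event, while the second works against a locally replicated dimension table.

```python
import time

def lookup_user_remote(user_id):
    # Stand-in for a round trip to a remote database such as MySQL;
    # the sleep models network latency paid on every single lookup.
    time.sleep(0.005)
    return {"user_id": user_id, "plan": "free"}

def join_without_storage(click_events):
    # Naive engine without storage: one remote call per incoming event.
    for event in click_events:
        user = lookup_user_remote(event["user_id"])   # cross-system access
        yield {**event, **user}

# Streaming-database style: the dimension table is replicated locally once
# (and kept up to date via change data capture in a real system).
local_users = {1: {"user_id": 1, "plan": "free"}, 2: {"user_id": 2, "plan": "pro"}}

def join_with_storage(click_events):
    for event in click_events:
        user = local_users.get(event["user_id"], {})  # in-memory lookup, no network hop
        yield {**event, **user}

clicks = [{"user_id": 1, "url": "/home"}, {"user_id": 2, "url": "/pricing"}]
print(list(join_without_storage(clicks)))  # slow path: pays latency per event
print(list(join_with_storage(clicks)))     # fast path: local state only
```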
On the output side, storing output results can offer substantial benefits. To summarize, there are roughly four key advantages:
Simplify the data stack architecture: Instead of using separate stream processing engines for computation and separate storage systems for data storage and query responses, a single system like a streaming database can handle computation, storage, and query responses. This approach significantly simplifies the data stack and often results in cost savings.
Facilitate computation resource sharing: In stream processing engines, computation results are exported to external systems after consuming input data and performing calculations, which makes it hard to reuse these results directly within the system. A streaming database with built-in storage can save computation results as materialized views internally, enabling other computations to access and reuse them directly (see the sketch after this list).
Ensure data consistency: While stream processing engines can ensure exactly-once semantics for processing, they cannot guarantee consistency when results are accessed. Because the engines have no storage, calculation results must be exported into downstream storage systems, and the engine must also propagate version information about those results so that users see a consistent view downstream. This puts the burden of ensuring consistency on the user. In contrast, a streaming database with storage can manage computation progress and result versions within one system, using multi-version control to assure that users always see consistent results.
Enhance program interpretability: During program development, repeated modification and verification of correctness are common. A typical way to verify correctness is to obtain the inputs and outputs and manually check whether the calculation logic meets expectations. This is relatively straightforward in batch processing engines but much harder in stream processing engines, where inputs and outputs change dynamically and the engine itself has no storage: verification requires traversing three systems, namely the upstream message source, the downstream result store, and the stream processing engine, and users must also track computation progress. In a streaming database, by contrast, verification becomes relatively simple because the calculation results are stored in the database itself. Users only need to obtain inputs from the upstream system (typically by fetching offsets from message queues like Kafka) and validate the results within a single system, which significantly boosts development efficiency (the sketch below also illustrates this).
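To ground the second and fourth advantages, here is a hedged sketch against the same hypothetical PostgreSQL-compatible streaming database and `clicks` table as in the earlier example; all object names are illustrative. One materialized view's stored results feed a second view, and verifying the output is an ordinary query in the same system.

```python
import psycopg2

# Same hypothetical deployment as the earlier sketch; adjust parameters as needed.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    # A first materialized view whose per-URL counts the streaming database
    # maintains incrementally as new clicks arrive.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS clicks_per_url AS
        SELECT url, count(*) AS clicks
        FROM clicks
        GROUP BY url
    """)
    # A second view reuses the first view's stored results instead of
    # recomputing from the raw stream: the resource sharing described above.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS top_urls AS
        SELECT url, clicks FROM clicks_per_url WHERE clicks > 100
    """)
    # Verifying output is an ordinary query against the same system.
    cur.execute("SELECT * FROM top_urls ORDER BY clicks DESC LIMIT 10")
    print(cur.fetchall())
```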
Of course, there's no such thing as a free lunch, and software development never offers a silver bullet. Compared with stream processing engines, streaming databases with built-in storage bring many advantages in architecture, resource utilization, consistency, and user experience. But what do we sacrifice in return?
One significant sacrifice is the complexity of software design. Recall the design of MapReduce. MapReduce stood out at a time when giants like Oracle and IBM dominated the entire database market because it used a simple model to enable large-scale parallel computing for technology companies that owned many ordinary machines.
MapReduce directly subverted the databases' integrated storage-and-computation design, splitting computation out as an independent product and enabling users to scale it through programming. In other words, MapReduce achieved large-scale horizontal scalability through a simplified architecture, relying on the availability of highly skilled professional users.
However, when a system requires data storage, everything becomes more complex. To make such a system usable, developers must consider high availability, fault recovery, dynamic scaling, data consistency, and other challenges, each requiring intricate design and implementation. Fortunately, over the past decade, the theory and practice of large-scale databases and stream processing have significantly progressed. Hence, now is indeed a favorable time for the emergence of streaming databases.
Use Cases
While they share similar computational models, stream processing engines and streaming databases differ in ways that lead to distinct applications and functionality. The two technologies overlap significantly in their use cases but diverge in their end-user focus.
Stream processing engines are generally more aligned with machine-centric operations, lacking data storage capabilities and often requiring integration with downstream systems for computation result consumption. In contrast, streaming databases cater more to human interaction, offering storage and random query support, thus enabling direct human engagement.
The dichotomy between machines and humans is evident in their preferences—machines demand high-performance processing of hardcoded programs, while humans favor a more interactive and user-friendly experience. This inherent divergence leads to a differentiation in functionality, user experience, and other aspects between stream processing engines and streaming databases.
In sum, the commonalities in application scenarios between these technologies are nuanced by the specific focus on end users and interaction modes, resulting in distinct features within each system.
A Step Back in History or the Current Trend of the Era?
Databases and computing engines often reflect two divergent philosophies in design, evidenced by their distinct scholarly contributions and industry evolution.
In academia, computing engine papers typically find their home in systems conferences like OSDI, SOSP, and EuroSys, while database-centric work appears at SIGMOD and VLDB. This division was famously highlighted by the 2008 critique "MapReduce: A major step backwards" by David DeWitt and Michael Stonebraker, who argued that MapReduce was historically regressive and lacked innovation relative to databases.
In the realm of stream processing, the question arises: Which philosophy represents a historical regression, and which embodies the current era's trend? I assess that both stream processing engines and streaming databases will coexist and continue evolving for at least the next 3-5 years.
Stream processing engines, emphasizing extreme performance and flexibility, cater to technically proficient users. Streaming databases, on the other hand, balance an elegant user experience with high performance and a carefully engineered internal implementation. The mutual integration trend between these philosophies enhances both sides, with computing engines adopting SQL interfaces to improve user experience and databases leveraging UDF capabilities to increase programming adaptability.
Predicting the Future of Stream Processing
Looking back at history, we find that the concept of streaming databases was already proposed and implemented more than 20 years ago. For example, the Aurora system from the early 2000s was already a streaming database. However, streaming databases have not been as popular as stream processing systems. History moves in a spiral pattern.
The big data era witnessed the trend of separating stream processing from databases and overthrowing the monopoly of the three giants: Oracle, IBM, and Microsoft. More recently, in the cloud era that began around 2012, batch processing systems such as Redshift, Snowflake, and ClickHouse have been bringing the "computing engine" back into the "database".
Now it is time to bring stream processing back into databases, which is the core idea behind modern streaming databases such as RisingWave and Materialize. But why now? We analyze it in detail in this section.
With over two decades of development, stream processing remains in its infancy in terms of commercial adoption, particularly compared to its batch processing counterpart. However, the industry has reached a consensus on the direction of stream processing: both stream processing engines and streaming databases agree on the necessity of supporting both stream and batch processing. The primary differentiation lies in how to harmoniously integrate these two distinct computing models. Essentially, the notion of "unified stream and batch processing" has become a shared understanding within the field.
There are generally three methodologies for accomplishing this unification:
1. The Single Engine Approach
This strategy involves employing the same computing engine to manage both stream processing and batch processing functionalities. Its advantage is that it offers a relatively straightforward implementation plan and a more cohesive user experience. However, since stream processing and batch processing entail significant differences in implementation aspects like optimization and computation methods, distinct handling of each is often necessary to achieve peak performance.
2. The Dual Engine Approach
Under this approach, stream processing and batch processing are configured separately as two unique system engines. While it allows for specific optimization of each, it also necessitates considerable engineering resources, high levels of collaboration, and stringent priority management within the engineering team. The need to align and make both parts work seamlessly can be a considerable challenge.
3. The Sewn-Together System Approach
This third approach involves repurposing existing systems instead of constructing a new, ideal system to provide a unified user experience. Although it seems to be the least engineering-intensive method, delivering a seamless user experience can be highly challenging. Existing market systems vary in support interfaces, and even when they employ SQL, different dialects might be used. Shielding users from these inconsistencies becomes a core issue. Additionally, orchestrating multiple systems to function, interact, and remain cognizant of each other introduces complex engineering challenges.
Conclusion
Though stream processing engines and streaming databases may exhibit some overlap and differences in design and practical applications, both strive to align with corresponding batch processing systems. Selecting which approach to adopt ultimately depends on the user’s assessment, considering their particular scenarios and needs. This decision-making process emphasizes the importance of understanding each system’s unique characteristics and requirements and how they fit into the broader computing landscape.