The shift in data management towards data lakes is indeed transformative. Data lakes are fundamental in managing vast amounts of raw, unstructured, and semi-structured data. Their ability to store historical data as a single source of truth is crucial for organizations to maintain data consistency, integrity, and trustworthiness across different departments and teams.
By integrating a compute engine like Apache Spark, Trino, or ClickHouse, a data lake can be turned into a 'data lakehouse'. This not only helps in storing massive amounts of data, but also in processing it efficiently.
Apache Kafka, a widely used event streaming platform, has found its way into the technology stack of numerous data-driven corporations. Kafka has long been perceived as a “repository for recent data” in the modern data stack. Many data engineers use Kafka to hold recently ingested data, usually for a duration of 7 days to a month, before transferring this data into data lakes.
People have the impression that “event streaming platforms are for transient data, and data lakes are for historical data.” However, there is increasing evidence to suggest that Kafka is evolving into a new form of data lake.
This article will explain why this evolution is occurring.
What Is a Data Lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Unlike a data warehouse, which stores data in a structured and organized manner, a data lake retains data in its raw, native format, typically with a flat architecture.
There are three popular data lake management frameworks: Apache Iceberg, Apache Hudi, and Delta Lake. While each of these systems has its unique features and advantages, all three are widely adopted for storing and managing historical data at large scale. Their design and functionality make it easier to handle massive amounts of data, and their integration capabilities with popular compute engines like Apache Spark make them suitable for various big data applications and analytics use cases.
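For a sense of how this integration works in practice, here is a minimal, hedged PySpark sketch that configures an Iceberg catalog and appends a few rows to an Iceberg table. The catalog name, warehouse path, table schema, and package coordinate are illustrative; you would pick the iceberg-spark-runtime version matching your Spark build.

```python
# A minimal PySpark sketch of writing to an Iceberg table, assuming the
# iceberg-spark-runtime package matching your Spark/Scala version is available
# and that a local Hadoop-style catalog is acceptable for experimentation.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Illustrative package coordinate; pick the version matching your Spark build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create a table and append a few rows; Iceberg handles snapshots and metadata.
spark.sql(
    "CREATE TABLE IF NOT EXISTS local.db.events (user_id STRING, amount DOUBLE) USING iceberg"
)
spark.createDataFrame([("alice", 9.5), ("bob", 3.2)], ["user_id", "amount"]) \
     .writeTo("local.db.events").append()
```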
Kafka Has All the Data Lake Properties
Kafka is inherently well suited to serving as a data lake. But before debating whether Kafka is the new form of data lake, let's first examine whether it actually possesses all the necessary properties.
Kafka indeed has all the essential properties.
Database-like ACID Properties. As highlighted in Martin Kleppmann’s Kafka Summit San Francisco 2018 keynote “Is Kafka a Database?”, Kafka has evolved to include all database-like properties, specifically atomicity, consistency, isolation, and durability (ACID). While many people use Kafka to store only recent data, Kafka actually supports infinite retention, similar to modern data lakes; a topic's retention can simply be configured to be unlimited. This capability makes Kafka an attractive option for storing vast amounts of data.
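As a concrete example, here is a minimal sketch (using the confluent-kafka Python client) of creating a topic whose retention is unlimited. The topic name, partition count, and replication factor are illustrative.

```python
# A minimal sketch of creating a Kafka topic with unlimited retention, using the
# confluent-kafka Python client. retention.ms = -1 tells the broker to keep data
# forever; the topic name, partitions, and replication factor are illustrative.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "events",                 # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={
        "retention.ms": "-1",     # never expire messages by time
        "retention.bytes": "-1",  # never expire messages by size
    },
)

# create_topics() is asynchronous and returns a future per topic.
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if creation failed
    print(f"created topic {name} with infinite retention")
```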
Cost-Efficient Tiered Storage. One key reason people hesitate to use Kafka for storing long-lived data is the perception that Kafka is expensive. This used to be true. The classic design of Kafka required storing data on compute instances (like AWS EC2), which can be much more expensive than object storage such as AWS S3. However, this has changed. Recent versions of Kafka, including Confluent's distribution, along with other popular event streaming platforms like Redpanda and Apache Pulsar, have adopted tiered storage, which keeps cold data in a cheap object store, thereby reducing costs and making it feasible to retain data long term. This new design makes Kafka suitable for storing vast amounts of data at low cost without worrying about scalability.
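As a rough sketch of what this looks like with Apache Kafka's tiered storage (KIP-405), the snippet below creates a topic that offloads older segments to object storage while keeping only about a day of data on broker disks. The property names follow Apache Kafka; Confluent, Redpanda, and Pulsar expose equivalent settings under different names, and the broker must already have remote storage enabled.

```python
# An illustrative sketch of enabling tiered storage on a topic, assuming a broker
# that already has remote (object store) storage configured per KIP-405
# (remote.log.storage.system.enable=true on the broker side). Exact property names
# differ across vendors; these follow Apache Kafka.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

tiered_topic = NewTopic(
    "events-tiered",              # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                  # offload closed segments to object storage
        "retention.ms": "-1",                             # keep data indefinitely overall
        "local.retention.ms": str(24 * 60 * 60 * 1000),   # keep only ~1 day on broker disks
    },
)

for name, future in admin.create_topics([tiered_topic]).items():
    future.result()
```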
Storing Data of Different Types. Kafka can handle a wide variety of data types, from structured data like relational data, to semi-structured data like JSON and Avro, and even unstructured data like text documents, images, and videos (though uncommon). This versatility is crucial in today's diverse data landscape and allows Kafka to serve as a centralized repository for all of an organization's data, reducing the complexity and overhead of managing multiple storage solutions.
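The sketch below illustrates this: because Kafka treats values as opaque bytes, a JSON event and an arbitrary binary blob go through exactly the same producer API. Topic names are illustrative, and Avro/Schema Registry serialization is omitted for brevity.

```python
# A minimal sketch showing that Kafka stores values as opaque bytes, so JSON,
# Avro-encoded records, or raw binary blobs can all be published through the same
# API. Topic names are illustrative.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Semi-structured: a JSON event.
producer.produce(
    "clickstream",
    key="user-42",
    value=json.dumps({"page": "/pricing", "ms_on_page": 5300}).encode("utf-8"),
)

# Unstructured: arbitrary raw bytes (in practice this might be a small image or
# document, though storing large blobs in Kafka is uncommon).
producer.produce("raw-blobs", key="blob-001", value=bytes(range(16)))

producer.flush()  # block until all buffered messages are delivered
```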
Storing Real-Time Data. While many people use data lakes to store historical data, modern data lakes are evolving and becoming increasingly real-time. This evolution is natural, as modern applications and devices can generate huge amounts of data continuously. Hence, data lakes are implementing optimizations to allow ingesting data in real time. As an event streaming platform, Kafka inherently supports real-time data ingestion. Its architecture is well-suited for storing both fast-moving real-time data and slowly moving historical data.
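A small consumer sketch shows both access patterns against the same topic: it first replays the entire retained history and then keeps tailing new events as they arrive. The group id and topic name are illustrative.

```python
# A minimal sketch of a consumer that replays the full retained history of a topic
# and then keeps tailing new events -- the same log serves both historical and
# real-time reads. The group id and topic name are illustrative.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "lake-reader",
    "auto.offset.reset": "earliest",  # start from the oldest retained record
})
consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # caught up; waiting for new real-time events
        if msg.error():
            raise RuntimeError(msg.error())
        print(f"offset={msg.offset()} value={msg.value()!r}")
finally:
    consumer.close()
```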
Is Kafka Suitable for Being the New Data Lake?
Kafka has all the data lake properties. But does Kafka have the potential to serve as the new data lake in production? Several compelling reasons support this perspective.
It's the Data Source. Many organizations directly ingest data into Kafka before subsequently transferring it into data warehouses or other storage systems. If Kafka is employed as the data lake that permanently retains data, it negates the necessity of relocating data between different systems. The elimination of data movement not only reduces costs but also minimizes the potential for data inconsistency and loss.
Single Source of Truth. Leveraging Kafka as the data lake means it can serve as the real single source of truth for the entire organization. Data inconsistency typically arises when data is copied and transformed across multiple systems; if the data source is also the data destination, that class of inconsistency largely disappears. Moreover, this approach significantly simplifies the data architecture by reducing the number of systems that need to be maintained, synchronized, and integrated, thereby making the infrastructure more manageable, less prone to errors, and more cost-efficient.
Rich Ecosystem. Kafka boasts a very rich and robust ecosystem for ingesting data from a wide variety of data sources, and most compute engines can readily consume data from Kafka. This flexibility greatly facilitates the integration of Kafka into existing systems and workflows, thereby reducing the effort and complexity required to adopt Kafka as a data lake. Additionally, Kafka’s capabilities extend beyond data ingestion and storage: it also natively offers lightweight stream processing capabilities (through Kafka Streams), which means that data can be processed in real time as it is ingested. This is a significant advantage for organizations that require real-time analytics and decision-making capabilities.
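As a hedged illustration of that ingestion ecosystem, the sketch below registers a source connector through the Kafka Connect REST API. The Connect URL, file path, and topic name are placeholders, and the FileStreamSource connector is used only because it ships with Apache Kafka; real pipelines would typically use JDBC, Debezium, or S3 connectors. (Kafka Streams itself is a Java library, so it is not shown here.)

```python
# An illustrative sketch of registering a source connector through the Kafka Connect
# REST API, which is how much of Kafka's ingestion ecosystem is wired up.
import json
import requests

connector = {
    "name": "demo-file-source",
    "config": {
        # FileStreamSource ships with Apache Kafka and is handy for demos;
        # production setups would use JDBC, Debezium, S3, etc. connectors instead.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/var/log/app/events.log",   # placeholder path
        "topic": "events",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",      # placeholder Connect worker URL
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("connector registered:", resp.json()["name"])
```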
Will Kafka Replace Existing Data Lake Management Frameworks?
The straightforward answer is no, at least not in the immediate future. Although Kafka is capable of storing both real-time and historical data, that does not mean it will supplant widely used data lake management frameworks like Apache Iceberg, Apache Hudi, and Delta Lake.
These data lake management frameworks are optimized for large-scale data storage while maintaining ACID properties. Functionally, Kafka has yet to incorporate crucial features such as data type awareness for compression, support for query pushdown, and support for updates and deletes, which makes it less attractive for serving historical data.
A possible architecture in the near future is to use Kafka as the unified interface for reads and writes, keeping hot and warm data in Kafka while cold data is progressively offloaded to Iceberg/Hudi/Delta, transparently to the user. This approach leverages the strengths of both Kafka and existing data lakes: users continue to read and write data through the Kafka API, while the complexities of the underlying data movement, transformation, and storage formats are abstracted away from them, simplifying their interaction with the system.
Building a Streaming Data Lakehouse with Kafka
A data lakehouse is a powerful data platform that merges the best features of data lakes and data warehouses. It provides a unified platform that can handle vast amounts of both structured and unstructured data and support advanced analytics and machine learning. With the evolution of Kafka into a new data lake, we can essentially build a “streaming data lakehouse” that can store and process both real-time and historical data. There are at least two key components required to build a streaming data lakehouse on top of Kafka:
Stream Processing System. The first essential component is a stream processing system, such as RisingWave, Apache Flink, or ksqlDB. These systems are designed to process real-time data streams stored in Kafka, enabling businesses to make faster and more informed decisions by analyzing data as it is generated.
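As a hedged sketch of this layer, the PyFlink snippet below exposes a Kafka topic as a streaming table and maintains a running aggregate over it. It assumes the Flink Kafka SQL connector is on the classpath; the topic and schema are illustrative.

```python
# A hedged PyFlink sketch of the stream-processing layer: expose a Kafka topic as a
# streaming table and compute a continuously updating aggregate over it. Assumes the
# Flink Kafka SQL connector jar is available; topic and schema are illustrative.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-lakehouse',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# A running per-user total that updates as new events stream in from Kafka.
t_env.execute_sql(
    "SELECT user_id, SUM(amount) AS total_amount FROM events GROUP BY user_id"
).print()
```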
Real-Time Analytical Engine. The second crucial component is a real-time analytical engine, such as Apache Spark, Trino, or ClickHouse. These engines are designed to analyze the processed data, provide insights, and facilitate decision-making. They are capable of handling large volumes of data with low latency, making them ideal for a streaming data lakehouse architecture built on Kafka.
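To make this concrete, here is a hedged PySpark sketch that runs a batch analytical query over the full retained history of a Kafka topic; Trino and ClickHouse offer equivalent Kafka connectors. It assumes the spark-sql-kafka package matching your Spark version is available; the JSON schema and topic name are illustrative.

```python
# A hedged PySpark sketch of the analytical layer: run a batch query over the full
# retained history of a Kafka topic. Assumes the spark-sql-kafka package matching
# your Spark version is on the classpath; schema and topic name are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-lakehouse-analytics").getOrCreate()

schema = StructType().add("user_id", StringType()).add("amount", DoubleType())

raw = (
    spark.read.format("kafka")                        # batch read, not readStream
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "earliest")            # scan everything Kafka retains
    .load()
)

events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")
events.groupBy("user_id").sum("amount").show()
```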
By combining Kafka with a robust stream processing system and a powerful real-time analytical engine, businesses can create a streaming data lakehouse architecture that is capable of handling the demands of modern data processing and analytics. This architecture enables organizations to maximize the value of their data, providing real-time insights that can drive better decision-making and create a competitive advantage.
Wishlist for Kafka
While Kafka is incredibly powerful and versatile, there are areas that need improvement if it is to truly evolve into a data lake. Here are a few items on my wishlist for Kafka.
Data Type Awareness for Compression. Currently, Kafka treats data as a byte array and is unaware of the data's actual structure and type. This lack of awareness means that the compression Kafka performs is generic and not as efficient as it could be if it understood the data's structure. If Kafka could be aware of the data types it is handling, it could perform data compression more effectively. This improvement would reduce the storage requirements and optimize the performance of analytical queries by minimizing the amount of data that needs to be transferred and processed.
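For contrast, here is what compression looks like today: a generic, byte-level codec configured on the producer (or topic), applied without any knowledge of the payload's structure. The topic name and payload are illustrative.

```python
# Today, compression in Kafka is configured generically on opaque byte payloads,
# e.g. at the producer level, regardless of what the bytes actually contain. A
# type-aware engine could instead choose columnar encodings per field.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "compression.type": "zstd",   # generic byte-level compression: gzip, snappy, lz4, or zstd
    "linger.ms": 50,              # small batching delay so compression works on larger batches
})

producer.produce("events", value=b'{"user_id": "alice", "amount": 9.5}')
producer.flush()
```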
Support for Query Pushdown. Query pushdown is a technique that involves pushing down parts of a query (such as filters) to the storage layer, enabling more efficient data retrieval and processing. Currently, Kafka does not support query pushdown, which means that all the data needs to be loaded into memory and processed, even if only a small subset is needed. If Kafka could support query pushdown, it would enhance the performance of analytical queries by reducing the amount of data that needs to be loaded into memory and processed.
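The sketch below illustrates the current situation: a client that only needs a small slice of the data must still fetch every record and apply the predicate locally. The topic name and predicate are illustrative.

```python
# Because Kafka has no query pushdown, a client that only wants a small subset of
# the data still has to fetch every record and filter it locally, as sketched here.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "filter-scan",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

large_orders = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # no more data within the poll timeout; treat as end of the scan
    if msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("amount", 0) > 100:   # the "WHERE amount > 100" happens client-side
        large_orders.append(event)

consumer.close()
print(f"matched {len(large_orders)} records after scanning the whole topic")
```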
Support for Update and Delete. At present, Kafka is designed as an append-only log, and while there are workarounds to handle updates and deletes, they are not as straightforward and efficient as in traditional databases. If Kafka could natively support update and delete operations, it would make data maintenance easier and more efficient. It would also make Kafka a more complete and versatile data storage solution, increasing its suitability as a data lake. This addition would be a game-changer for many organizations, simplifying their data architectures and reducing the overhead associated with data maintenance.
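For reference, the usual workaround today is a log-compacted topic, sketched below: the latest record per key acts as an "update" and a null-valued tombstone acts as a "delete", with compaction eventually removing superseded values. Topic and key names are illustrative.

```python
# The common workaround today: a log-compacted topic in which the latest record per
# key acts as an "update" and a null-valued record (a tombstone) acts as a "delete".
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
for future in admin.create_topics([
    NewTopic("user-profiles", num_partitions=3, replication_factor=3,
             config={"cleanup.policy": "compact"})
]).values():
    future.result()  # wait until the compacted topic exists

producer = Producer({"bootstrap.servers": "localhost:9092"})

# "Insert" then "update": compaction eventually keeps only the latest value per key.
producer.produce("user-profiles", key="user-42", value=b'{"tier": "free"}')
producer.produce("user-profiles", key="user-42", value=b'{"tier": "pro"}')

# "Delete": a tombstone (null value) marks the key for removal during compaction.
producer.produce("user-profiles", key="user-42", value=None)

producer.flush()
```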
Conclusion
Embracing Kafka as the new data lake represents a fundamental shift in data management and analysis. Its advanced features, combined with a stream processing system and a real-time analytical engine, make it a solid foundation for building a lakehouse architecture. Moreover, its suitability for long-term data persistence, its ability to serve as a single source of truth, and its rich ecosystem further solidify its position as a viable data lake option. Let’s see how Kafka and other event streaming platforms evolve in the near future.