Spark Streaming in Action for Click-Stream
Ingest to HDFS from Kafka Topics with Spark Streaming
In this post, I am going to describe an end-to-end architecture for processing and analyzing the click-stream events of a website with Apache Kafka, Spark Streaming, Hadoop HDFS and Hive.
As a use case, the question we are trying to answer is: what is the sales rate of the end-users who signed up to our e-commerce site last week?
Click-stream events are the end-user behaviors on any website; in this particular case, they are the events of an e-commerce website. All the actions of the end-users are stored as they occur during the session lifetime of the user. Some of these events are signUp events for newcomers, orderSummary events for buyers, search events for a product search on the website and productView events for viewing a product's details. There may be many more events than the ones mentioned here, which is highly dependent on the e-commerce business and its needs.
Apache Kafka is a good option to start storing the click-stream events. With its high-performance and scalable infrastructure, Kafka is a very good candidate for storing streaming data and acting as a data hub for the streaming events. In this post we are interested in the signUp and orderSummary events (remember the problem we are trying to solve).
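As an illustration, a click-stream event might be published to Kafka along the lines of the sketch below. The broker address, the topic name `clickstream-events` and the JSON payload are assumptions made for this example, not details from the original setup.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickStreamProducer {
  def main(args: Array[String]): Unit = {
    // Broker address and serializers; adjust bootstrap.servers to your cluster.
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // A hypothetical signUp event serialized as JSON.
    val signUpEvent =
      """{"eventType":"signUp","userId":"u-123","eventTime":"2018-01-15T10:22:31Z"}"""

    // Keying by session id keeps all events of one session in the same partition.
    producer.send(new ProducerRecord[String, String]("clickstream-events", "session-42", signUpEvent))
    producer.close()
  }
}
```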
Since you definitely need a solution for archiving data to solve the sales-rate problem for the newly signed-up customers, the best archiving solution would be Apache Hadoop. Hadoop HDFS is a scalable, distributed and fault-tolerant filesystem which can manage big-data streams such as click-stream or IoT events as well.
Since Hadoop HDFS is highly scalable, increasing the storage capacity is not a burden. Simply add a new node to the Hadoop cluster or increase the DataNode filesystem on the existing data nodes of the cluster. The new storage becomes available as soon as it is added to the system.
To solve the case defined at the beginning of this post we need the new signUp and orderSummary events. Using Spark Streaming just for filtering some of the events stored in Kafka topics may seem over-engineered, but using Spark Streaming definitely gives the flexibility to add new features and transformations on top of filtering, such as data aggregation, data enhancement and data lineage options.
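A minimal sketch of such a filtering job, using the spark-streaming-kafka-0-10 direct stream API, could look like the following. The topic name, batch interval, consumer group, output path and the simple string-based event-type check are illustrative assumptions; a real job would parse the JSON payload properly before filtering.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ClickStreamIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClickStreamIngest")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-broker:9092",
      "key.deserializer"  -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"          -> "clickstream-ingest",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("clickstream-events"), kafkaParams))

    // Keep only the two event types needed for the sales-rate question.
    val filtered = stream.map(_.value())
      .filter(json => json.contains("\"signUp\"") || json.contains("\"orderSummary\""))

    // Each micro-batch is appended to HDFS (the partitioned layout is shown later in the post).
    filtered.foreachRDD { rdd =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///data/clickstream/batch-${System.currentTimeMillis()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```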
After submitting the Spark job to the Spark cluster you can monitor the streaming job and investigate its metrics. The monitoring endpoint of Spark jobs is quite handy and also makes Spark Streaming jobs better than some other solutions in terms of operability and manageability.
Since I want to give customers the ability to connect with a SQL client of their own choice and query the data with standard ANSI SQL commands, I preferred Hive to define a table on top of the HDFS filesystem.
After studying the JSON format of the event data, a Hive table can be created for the signUp event with the following create table command. The same applies to the orderSummary event as well.
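Since the exact schema depends on your events, the statement below is only a sketch: the column names, the HDFS location and the openx JSON SerDe class are assumptions, and the SerDe jar has to be available on the classpath. It is written here through a Hive-enabled SparkSession, but the same CREATE TABLE statement can equally be pasted into beeline or any other Hive SQL client.

```scala
import org.apache.spark.sql.SparkSession

// A SparkSession with Hive support registers the table in the Hive metastore.
val spark = SparkSession.builder()
  .appName("CreateSignUpTable")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS signup_events (
    userId    STRING,
    eventType STRING,
    eventTime STRING,
    email     STRING
  )
  PARTITIONED BY (year INT, month INT, day INT, hour INT)
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION 'hdfs:///data/clickstream/signUp'
""")
```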
Since I have the signUp and orderSummary tables as Hive tables, I can simply run a SQL query to find out how many orders were placed by the customers who signed up last week.
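A query along the following lines would answer the question, assuming an analogous ordersummary_events table and the illustrative column names used above. It can be run from any SQL client connected to Hive or, as here, through a Hive-enabled SparkSession.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Share of last week's sign-ups that placed at least one order (illustrative schema).
val salesRate = spark.sql("""
  SELECT COUNT(DISTINCT o.userId) / COUNT(DISTINCT s.userId) AS sales_rate
  FROM   signup_events s
  LEFT JOIN ordersummary_events o
         ON s.userId = o.userId
  WHERE  to_date(s.eventTime) >= date_sub(current_date(), 7)
""")

salesRate.show()
```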
The same result could be achieved with the stream-processing features of Kafka, but then the historical data would be lost because of the Kafka retention period.
With this infrastructure, the hit-rate of the newly signed-up users and the overall performance of the e-commerce site can be calculated over time.
Kafka is a very fast and durable distributed streaming platform and acts as a streaming data hub between different applications and/or microservices within a company. Kafka has a lot of capabilities such as mirroring, partitioning, and complex event processing with Kafka Streams (or, if you are using the Confluent Platform, with Kafka SQL). The downside of Apache Kafka is its retention period, which is 7 days by default. Of course you can increase this retention period according to your business needs, but if you increase it too much, this may lead to the misuse of a distributed streaming platform, because Kafka is not meant for data archiving.
The only missing thing is an integration solution between Apache Kafka and Hadoop HDFS. There are several solutions, such as Kafka Connect or Kafka Streams, but in this case I preferred to use Apache Spark, which has the Spark Streaming feature. Spark Streaming is a streaming data processing framework which has a good set of capabilities not only for extract and load but also in terms of transformation.
Apache Spark is one of my favorite data processing frameworks, which works in-memory and in a distributed fashion as well. Thanks to the RDD architecture and the framework itself, Spark handles the distribution of the data across the nodes, and the documentation is quite good for understanding the concepts and the usage of the framework.
Data is written to Hadoop HDFS partitioned on a YEAR, MONTH, DAY, HOUR basis to increase the performance of the tables defined on top of this HDFS folder, which will be discussed in the following section of this post.
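One way to produce this layout from the streaming job is sketched below; it continues the `filtered` DStream from the earlier snippet, and the base path is an assumption. Newly written partitions still have to be registered in Hive (for example with MSCK REPAIR TABLE) before they show up in queries, and Hive has to be able to read the per-batch sub-directories (or the files have to be compacted into the partition directories).

```scala
import java.time.{ZoneOffset, ZonedDateTime}

// Builds a path such as .../year=2018/month=01/day=15/hour=10 for the given timestamp.
// The base path is illustrative; it should match the LOCATION of the Hive table.
def partitionedPath(base: String, ts: ZonedDateTime): String =
  f"$base/year=${ts.getYear}%d/month=${ts.getMonthValue}%02d/day=${ts.getDayOfMonth}%02d/hour=${ts.getHour}%02d"

// `filtered` is the DStream produced by the earlier Spark Streaming sketch.
filtered.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val dir = partitionedPath("hdfs:///data/clickstream/signUp", ZonedDateTime.now(ZoneOffset.UTC))
    // Each micro-batch gets its own sub-directory so earlier batches are never overwritten.
    rdd.saveAsTextFile(s"$dir/batch-${System.currentTimeMillis()}")
  }
}
```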
Since the data pipeline is complete and the historical click-stream events are now flowing into our Hadoop filesystem, we can start working on our data. To work with the data in HDFS there are different options, such as Apache Hadoop MapReduce, Spark RDD, Spark SQL and Apache Hive.
A sample row in the HDFS filesystem for a signUp event is as follows. To define a table on this JSON-based data I have used a downloadable JSON SerDe.
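The field names below are only an illustrative assumption of what such a signUp event line could look like; the actual schema of your events is what drives the Hive column definitions.

```json
{"eventType": "signUp", "userId": "u-123", "email": "user@example.com", "eventTime": "2018-01-15T10:22:31Z"}
```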