Enhancing Your Click-Stream with a CRM Database
In summary, a click-stream captures every event generated by end users on an e-commerce website. Since every click, search, login, logout, order, listing view, and so on is a user action, the more active users an e-commerce site has, the greater the expected load on page processing. To avoid degrading website performance, the click-stream architecture must be very fast and lightweight. To keep pages rendering efficiently despite the extra load of click-stream processing, the architecture team aims to design the simplest possible asynchronous data structure, with minimal impact on bandwidth and page load times.
A simple and lightweight data architecture means stripping redundant information from the click-stream. Instead of storing the user's email and/or user name, only the user id should be stored. Likewise, instead of storing and moving the product name and all of its attributes, only the product id should be carried in the click-stream.
The problem is that a click-stream is a streaming data flow, and acting on that stream requires more detailed information to identify the user. For instance, if the e-commerce company needs to send an email to newly signed-up users, run a campaign based on product visits, or recommend alternative products to users, it clearly needs more data than just the user id or the product id.
This raises the question: how can we obtain a user's detailed information from only the user id while the data is in flight?
The answer given in this post is Spark Streaming. Spark is a distributed in-memory data processing platform and framework that also offers stream-processing capabilities through its powerful Spark Streaming module.
Spark Streaming can attach to Apache Kafka topics and consume the data flowing through them. It uses micro-batching to collect data from Kafka. In this post we will not use sliding time windows; instead we will retrieve data in 10-second batch intervals.
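To make the micro-batching idea concrete, here is a minimal, self-contained sketch in plain Python. A deque stands in for the Kafka topic that Spark Streaming would consume (in a real deployment this would be a direct stream created against a Kafka broker); the event names and batch size are illustrative assumptions:

```python
import json
from collections import deque

# Toy in-memory "Kafka topic" (a real job would consume from a broker).
topic = deque([
    json.dumps({"event": "productView", "userId": 42, "productId": 7}),
    json.dumps({"event": "login", "userId": 42}),
    json.dumps({"event": "productView", "userId": 9, "productId": 3}),
])

BATCH_INTERVAL_SECONDS = 10  # the interval Spark Streaming would be built with


def next_micro_batch(source, max_events=100):
    """Drain whatever arrived during the last interval into one micro-batch."""
    batch = []
    while source and len(batch) < max_events:
        batch.append(json.loads(source.popleft()))
    return batch


batch = next_micro_batch(topic)
print(len(batch))  # 3 events collected in this (simulated) 10-second batch
```

Each pass of the loop corresponds to one Spark Streaming micro-batch: everything that arrived on the topic since the previous interval is processed together.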
As we are working on a JSON representation of a click-stream event, we need to filter the productView events and map the userId and the productId together with the variantId of the product. Keeping the timestamp is also useful if we want to investigate timing at a finer granularity.
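The filter-and-map step can be sketched as follows. The JSON field names (event, userId, productId, variantId, timestamp) are assumptions about the event schema; in Spark this logic would run inside a filter/map over each micro-batch:

```python
import json

# Hypothetical click-stream events; the field names are assumed, not taken
# from a real schema.
raw_events = [
    '{"event": "productView", "userId": 42, "productId": 7, "variantId": 2, "timestamp": 1500000000}',
    '{"event": "login", "userId": 42, "timestamp": 1500000001}',
    '{"event": "productView", "userId": 9, "productId": 3, "variantId": 1, "timestamp": 1500000002}',
]


def to_pair(line):
    """Parse one JSON event; keep only productView events, keyed by userId."""
    e = json.loads(line)
    if e.get("event") != "productView":
        return None
    return (e["userId"], (e["productId"], e["variantId"], e["timestamp"]))


pairs = [p for p in (to_pair(line) for line in raw_events) if p is not None]
print(pairs)
# [(42, (7, 2, 1500000000)), (9, (3, 1, 1500000002))]
```

Keying the records by userId up front is what later allows a straightforward join against the CRM data, which is keyed the same way.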
Now that we have the streaming click-stream events carrying the ids of the users and the visited products, we need to JOIN them with the CRM database, which holds the detailed user information. Apache Spark can also connect to databases and read table data into its distributed memory structure, the RDD. The streaming data can then be joined with this static database data on the fly.
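The database side can be sketched with Python's standard-library sqlite3 standing in for the CRM database that Spark would read through JdbcRDD; the table and column names here are illustrative assumptions:

```python
import sqlite3

# In-memory stand-in for the CRM database Spark would read via JdbcRDD.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(42, "Alice", "alice@example.com"), (9, "Bob", "bob@example.com")],
)

# Load the static table into a dict keyed by user id -- the role a
# database-backed RDD keyed by user id would play in Spark.
crm = {
    row[0]: {"name": row[1], "email": row[2]}
    for row in conn.execute("SELECT user_id, name, email FROM users")
}
print(crm[42]["email"])  # alice@example.com
```

Because the CRM table changes slowly relative to the click-stream, it can be loaded once (or refreshed periodically) and reused across micro-batches.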
With both the click-stream and the CRM data in hand, we JOIN them and present the result.
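Putting the two sides together, here is a simplified in-memory version of the join: an inner join on user id, mirroring what joining two keyed RDDs would do in Spark. All field names are assumptions carried over from the sketches above:

```python
# Keyed stream records: (userId, (productId, variantId, timestamp)) -- assumed schema.
stream = [
    (42, (7, 2, 1500000000)),
    (9, (3, 1, 1500000002)),
    (5, (8, 1, 1500000003)),  # no CRM record: dropped by the inner join
]

# Static CRM lookup keyed by userId, as loaded from the database.
crm = {
    42: {"name": "Alice", "email": "alice@example.com"},
    9: {"name": "Bob", "email": "bob@example.com"},
}

# Inner join: keep only stream records whose userId exists in the CRM data,
# enriching each click with the user's details.
enriched = [
    {"userId": uid, "productId": pid, "variantId": vid, "timestamp": ts, **crm[uid]}
    for uid, (pid, vid, ts) in stream
    if uid in crm
]

for row in enriched:
    print(row["email"], "viewed product", row["productId"])
```

The enriched records now carry enough detail (name, email) to drive the actions mentioned earlier, such as campaign emails or product recommendations.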
Detailed code and information on consuming data from Apache Kafka topics with Spark Streaming can be found on .
The code block above is a small part of the Spark Streaming project. Since this code works on JSON input, we need to handle JSON keys and values; here I used a custom-developed json_parser class. To understand the json_parser class, you can check out the Post.
The complete code for Spark Streaming on Kafka can be found in the Post.
The complete code for JdbcRDD can be found in the Post.