Customer segmentation with BigQuery ML

Using standard SQL to perform customer clustering based on clickstream behavior

by Taís L. Pereira

 

Don’t get me wrong, I’m a big Python fan, both for unsupervised and supervised learning. However, since Kramp Hub’s ¹ data warehouse is based on Google Cloud Platform (GCP), an alternative solution got my attention this time: BigQuery ML.

BigQuery Machine Learning (ML) is a GCP feature that operationalizes ML algorithms directly within the BigQuery environment. Using only SQL, models can be developed, trained, and used for predictions. This democratizes ML for data analysts and other DWH users. Moreover, because the models live inside the DWH, development time and complexity are reduced.

BigQuery ML brings ML to the data
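To give you an idea of what that looks like before we get to the real queries, here is a minimal, hypothetical sketch of the three steps (create, evaluate, predict) using made-up names: my_dataset.my_features as an input table with two illustrative feature columns and my_dataset.my_model as the model. The actual statements used in this project follow in Part 3.

-- 1. Train: a K-means model defined entirely in SQL
CREATE OR REPLACE MODEL my_dataset.my_model
OPTIONS(MODEL_TYPE = 'KMEANS') AS
SELECT feature_1, feature_2 FROM my_dataset.my_features;

-- 2. Evaluate: clustering metrics for the trained model
SELECT * FROM ML.EVALUATE(MODEL my_dataset.my_model);

-- 3. Predict: assign each row to its nearest centroid
SELECT * FROM ML.PREDICT(MODEL my_dataset.my_model,
  (SELECT feature_1, feature_2 FROM my_dataset.my_features));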

It supports a wide range of models, such as linear regression, binary logistic regression, deep neural networks, XGBoost (boosted trees), and K-means clustering, among others. Considering that our topic is customer segmentation, can you guess the chosen one? Yes, you got it right, I’ll talk about K-means ².

But first, let’s dive into the business problem that we were trying to solve.

Part 1: customer segmentation

Customer segmentation is a fundamental part of customer analytics. My interest in it was consolidated after researching Peter Fader’s work on customer centricity. Author of the book Customer Centricity: Focus on the Right Customers for Strategic Advantage and professor at the Wharton School of the University of Pennsylvania, Fader proposes using customers’ behavioral data to determine customer value.

“with today’s ability to obtain better data and the computing technology to process that data, combined with a better understanding of analytics, we can gain a better understanding of customers than ever before.”

Given the large amount of clickstream e-commerce data available in our DWH (more than 64 million rows of session information), I decided to base the segmentation on that. Additionally, Kramp Groep operates in 24 different countries, with varied economic conditions that could introduce bias into our model. Therefore, instead of the traditional RFM (recency, frequency and monetary value) method, the RFE (recency, frequency and “engagement”) approach was used, in which:

  • Recency is how recently the last session was captured by our back-end stack;
  • Frequency is the number of repeated search events;
  • Engagement is the total duration of the interactions (sessions) in minutes.

Part 2: the shape of the data

In our original dataset, each row corresponds to a unique session. However, to serve as input to the K-means model, a different shape is needed: the data has to be summarized per user_id (customer_id). With this in mind, an RFE table was created:

Later, it was decided that recently acquired users (users with recency=0 or frequency=0) should be filtered out, as can be seen in the code in the next section. With that, our input data contained approximately 70,000 rows.

Part 3: BigQuery ML K-means clustering

Now comes the fun technical stuff! First, to generate a model on BigQuery, the CREATE OR REPLACE MODEL statement has to be used, with the model type specified in the OPTIONS clause. In our case, MODEL_TYPE='KMEANS' does the job. Additionally, KMEANS_INIT_METHOD defines the centroids’ initialization method; here KMEANS++ was chosen instead of RANDOM. According to GCP’s documentation, KMEANS++ generally trains a better model than RANDOM, which simply picks random data points as the starting centroids. A nice article with more technical information about those different methods can be seen here. Moreover, STANDARDIZE_FEATURES defines whether the model standardizes the numerical inputs, taking the value TRUE or FALSE.


CREATE OR REPLACE MODEL
  data_science_datasets.rfe_clusters
OPTIONS(
  MODEL_TYPE = 'KMEANS',
  KMEANS_INIT_METHOD = 'KMEANS++',
  STANDARDIZE_FEATURES = TRUE) AS
WITH
  rfe_table AS (
  SELECT
    user_id,
    country_code AS country,
    -- days between the user's first and last observed session
    DATE_DIFF(MAX(DATE(bq_metric_timestamp)), MIN(DATE(bq_metric_timestamp)), DAY) AS recency,
    -- number of repeated search events
    (SUM(search_event_count) - 1) AS frequency,
    -- total session duration in minutes
    ROUND((SUM(session_duration_ms) / 1000) / 60, 2) AS engagement
  FROM
    `your_project.dwh.session_metrics_generic`
  WHERE
    DATE(bq_metric_timestamp) >= "2019-07-01"
    AND DATE(bq_metric_timestamp) < "2020-01-01"
  GROUP BY
    user_id,
    country)
SELECT
  * EXCEPT(user_id, country)
FROM
  rfe_table
WHERE
  recency > 0
  AND frequency > 0
ORDER BY
  country ASC

You must have been wondering: what about the number of clusters? Yes, it was not defined in the code through the NUM_CLUSTERS parameter. When it is omitted, BigQuery ML defines it automatically, based on the size of the dataset. This can lead to an optimal number of clusters (in our case, 4), but that is not guaranteed. To find the optimal number, different values should be tested, and the one that yields the lowest Davies-Bouldin index (see the next section) is the winner. However, omitting this parameter leads to a reasonable solution and makes this step easier to automate, since it is possible to schedule the CREATE MODEL query to periodically generate a new model. It is also an improvement over a similar Python implementation, where it would be necessary to inspect an Elbow Method graph, see where the curve flattens, and then choose the number of clusters.

The model was created using six months of data from the previous year (2019).
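If you do want to compare fixed cluster counts yourself, one way is to train a second model with an explicit NUM_CLUSTERS and query ML.EVALUATE for both. Below is a minimal sketch, assuming a hypothetical second model data_science_datasets.rfe_clusters_5 trained with NUM_CLUSTERS=5; the candidate at the top of the result (lowest Davies-Bouldin index) wins.

-- Davies-Bouldin index and mean squared distance of the auto-sized model
SELECT 'auto' AS candidate, davies_bouldin_index, mean_squared_distance
FROM ML.EVALUATE(MODEL data_science_datasets.rfe_clusters)
UNION ALL
-- the same metrics for the hypothetical 5-cluster model
SELECT 'k=5' AS candidate, davies_bouldin_index, mean_squared_distance
FROM ML.EVALUATE(MODEL data_science_datasets.rfe_clusters_5)
ORDER BY davies_bouldin_index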

Model evaluation

After creating your model, it appears in the specified dataset (so change data_science_datasets.rfe_clusters in the code to whatever is applicable for you). Clicking on “Evaluation” will display the error value (the Davies-Bouldin index) and the numeric features for each cluster centroid:

As can be seen, the error with a number of clusters equal to 4 is pretty low (below 1), which can be interpreted as a solid result. For comparison, with NUM_CLUSTERS=5 the index was 0.7927.
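The centroid values discussed next can also be retrieved with SQL instead of the UI, via the ML.CENTROIDS table function. A minimal sketch, using the model created above:

-- value of each numerical feature (recency, frequency, engagement) per centroid
SELECT
  centroid_id,
  feature,
  ROUND(numerical_value, 2) AS value
FROM
  ML.CENTROIDS(MODEL data_science_datasets.rfe_clusters)
ORDER BY
  centroid_id, feature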

With the centroids defined, we can now attribute user-friendly names to each customer segment and predict the segments for data from the current year, using ML.PREDICT. Judging by the variables’ values for centroid 3, these are our most loyal customers, the “champions”: a long-standing relationship (high recency), a high frequency of searches, and high engagement. In contrast, customers in centroid 2 are relatively “new”, since their recency value is considerably lower. Moreover, centroid 4 shows long-standing users with low frequency and low engagement, so this is our “at-risk” group. Finally, centroid 1 accounts for the “regular” customers:


WITH
  RFE_segments AS (
  WITH
    rfe_table AS (
    SELECT
      user_id,
      ANY_VALUE(country_code) AS country,
      DATE_DIFF(MAX(DATE(bq_metric_timestamp)), MIN(DATE(bq_metric_timestamp)), DAY) AS recency,
      (SUM(search_event_count) - 1) AS frequency,
      ROUND((SUM(session_duration_ms) / 1000) / 60, 2) AS engagement
    FROM
      `your_project.dwh.session_metrics_generic`
    WHERE
      -- only sessions from the current year (2020)
      DATE(bq_metric_timestamp) >= "2020-01-01"
    GROUP BY
      user_id)
  SELECT
    user_id,
    country,
    recency,
    frequency,
    engagement
  FROM
    rfe_table
  WHERE
    recency > 0
    AND frequency > 0
  ORDER BY
    country ASC)
SELECT
  -- user-friendly names for the predicted centroids
  CASE
    WHEN CENTROID_ID = 1 THEN 'Regular'
    WHEN CENTROID_ID = 2 THEN 'New'
    WHEN CENTROID_ID = 3 THEN 'Champions'
    WHEN CENTROID_ID = 4 THEN 'At Risk'
  END AS segments,
  * EXCEPT (nearest_centroids_distance)
FROM
  ML.PREDICT(MODEL data_science_datasets.rfe_clusters,
    (
    SELECT
      *
    FROM
      RFE_segments))

As you can see in the code above, since we are interested in predicting clusters for the present year, the date condition was changed to DATE(bq_metric_timestamp) >= "2020-01-01". Therefore, every time I run the query above, it considers data from January 2020 up to the current date, with new sessions being captured continuously.
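To feed these predictions into a report without running the query by hand, one option is to wrap it in a CREATE OR REPLACE TABLE statement and run it as a scheduled query, so that the destination table always holds fresh segment labels. Below is a minimal sketch, assuming a hypothetical destination table data_science_datasets.rfe_segments_current; it is essentially the prediction query above, with the recency/frequency filter moved into a HAVING clause.

-- refresh a table with the latest segment labels (can be run as a scheduled query)
CREATE OR REPLACE TABLE data_science_datasets.rfe_segments_current AS
SELECT
  CASE
    WHEN CENTROID_ID = 1 THEN 'Regular'
    WHEN CENTROID_ID = 2 THEN 'New'
    WHEN CENTROID_ID = 3 THEN 'Champions'
    WHEN CENTROID_ID = 4 THEN 'At Risk'
  END AS segments,
  * EXCEPT (nearest_centroids_distance)
FROM
  ML.PREDICT(MODEL data_science_datasets.rfe_clusters,
    (
    SELECT
      user_id,
      ANY_VALUE(country_code) AS country,
      DATE_DIFF(MAX(DATE(bq_metric_timestamp)), MIN(DATE(bq_metric_timestamp)), DAY) AS recency,
      (SUM(search_event_count) - 1) AS frequency,
      ROUND((SUM(session_duration_ms) / 1000) / 60, 2) AS engagement
    FROM
      `your_project.dwh.session_metrics_generic`
    WHERE
      DATE(bq_metric_timestamp) >= "2020-01-01"
    GROUP BY
      user_id
    HAVING
      recency > 0
      AND frequency > 0))

A report can then read directly from this table instead of re-running ML.PREDICT on every refresh.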

Final part: visualization and the business value behind it

The segments can be displayed in a Google Data Studio report, where additional filters can be added, such as the country filter in this example. Our report shows the count of users in each segment, the average distribution of the variables, and how each segment is positioned in terms of pairs of variables:

From these graphs, we can think about the following implications:

  • A large number of “at-risk” users could prompt an investigation of the platform’s features and/or an analysis of whether the right group is being attracted to the website;
  • “Champions” can be given special attention to increase cross-selling or up-selling through improvements to the search features and new functionalities;
  • The customer segment label can be incorporated into a platform microservice as a metric and trigger a message to the sales/marketing team.

I hope you enjoyed this post and got inspired to start your own customer segmentation journey!

Special thanks to Jelmer Krauwer and Ernest Micklei for revising it.

[1] Kramp Hub is a corporate start-up and part of Kramp Group, Europe’s largest specialist in spare parts and accessories for the agricultural industry. Our technology and products enable both our current tenants, Kramp and Maykers, to stay future-proof with our e-commerce solutions.

[2] A detailed explanation of the K-means algorithm can be found in the Scikit-learn documentation, here.
