Google Store Customer Revenue Prediction

Oct 4, 2021

The objective of this blog is to explain Google Store customer revenue prediction, where we need to predict the revenue for the store given the attributes of a list of customers.

Introduction

We have data containing the total revenue generated for the Google Store by each customer who visits the store. Generally, most of the store's revenue comes from only about 30 percent of customers, the regulars, while the other 70 percent are just random buyers. The goal of the task is to identify the type of customers who will increase the company's revenue, so that the store management can target promos and offers at those users. This helps retain those customers and raises the probability that they shop again, which in turn increases the revenue of the store.

To summarize:
Identify valuable customers → Give promotions/offers to those customers → Customers purchase more and stay with the same store → Store revenue increases.

ML Formulation

Given predictors about the store customers, the objective is to predict the log of transaction revenue (the target label) by minimizing the Root Mean Squared Error (RMSE).
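To make the objective concrete, here is a minimal sketch of the target transform and the metric. It assumes the target is log(1 + revenue), so that zero-revenue rows stay finite; the revenue values are hypothetical and the `rmse` helper is written only for this illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical raw revenue values (not taken from the dataset).
revenue = np.array([0.0, 0.0, 25.0, 0.0, 310.0])

# Assumption: log1p(x) = log(1 + x) is used so that zero revenue maps to 0
# instead of minus infinity.
log_revenue = np.log1p(revenue)

# A naive baseline that always predicts zero revenue.
baseline_pred = np.zeros_like(log_revenue)
baseline_rmse = rmse(log_revenue, baseline_pred)
print(f"Baseline RMSE on log revenue: {baseline_rmse:.4f}")
```

Any model worth keeping should score below a naive baseline like this one on the same rows.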

The problem is a time series forecasting regression, where the model should learn the trend, seasonality, irregularity, and cyclic behaviour of the customer purchase pattern and predict revenue for a given pattern.

Business Constraints

There are no business constraints as such, other than minimizing the loss function. There is no latency requirement at prediction time: since the results are mostly used to understand the customer, the model can be deployed as a nightly job that updates each customer's label, e.g. Prime Customer / Not Prime Customer.

Feature Engineering

Dataset: https://www.kaggle.com/c/ga-customer-revenue-prediction/data

The above table summarizes the features present in the dataset, with the data type and sample data for each. A few columns hold JSON data, where each property of the JSON can be treated as a separate feature.
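One way to expand the JSON columns is sketched below with pandas. The helper name `flatten_json_columns`, the column name `device`, and the sample rows are assumptions for illustration; the real files may need a different set of columns.

```python
import json
import pandas as pd

def flatten_json_columns(df, json_cols):
    """Expand each JSON-string column into one column per property."""
    out = df.copy()
    for col in json_cols:
        # Parse every cell from a JSON string into a dict.
        parsed = out[col].apply(json.loads)
        expanded = pd.json_normalize(parsed.tolist())
        # Prefix with the parent column name to avoid name clashes.
        expanded.columns = [f"{col}.{sub}" for sub in expanded.columns]
        expanded.index = out.index
        out = out.drop(columns=[col]).join(expanded)
    return out

# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "visitId": [1, 2],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})
flat = flatten_json_columns(sample, ["device"])
print(flat.columns.tolist())
```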

Unique value columns

Many columns have only one value and so add no useful signal to the model; those columns are removed.
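Dropping single-valued columns could look like the sketch below. The helper `drop_constant_columns` and the sample column names are hypothetical, and missing values are counted as a value here, which is a choice rather than something the post specifies.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return df without columns that hold a single distinct value."""
    # dropna=False counts NaN as a value, so a column mixing one value
    # with missing entries is kept.
    const_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=const_cols), const_cols

# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome"],
    "socialEngagementType": ["Not Socially Engaged"] * 3,
    "pageviews": [1, 4, 2],
})
cleaned, removed = drop_constant_columns(sample)
print(removed)
```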

Channel Grouping

The Channel Grouping feature shows that most users landed on the website through a search engine, which in turn drives the total revenue. This shows how important search engine optimization is for increasing customer traffic to a website.

Device Browser

Most users use Chrome, and some use Safari and Firefox. The top 10 browsers can be used as they are and the rest can be marked as ‘Others’ to reduce the number of categories.
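The grouping into ‘Others’ can be sketched as follows. The helper `keep_top_categories` and the sample values are hypothetical, and `top_n=2` is used only to keep the example small.

```python
import pandas as pd

def keep_top_categories(series, top_n=10, other_label="Others"):
    """Keep the top_n most frequent values; replace the rest with other_label."""
    top = series.value_counts().nlargest(top_n).index
    # Values outside the top list (including missing ones) become other_label.
    return series.where(series.isin(top), other_label)

# Hypothetical sample values, not taken from the dataset.
browsers = pd.Series(
    ["Chrome", "Chrome", "Chrome", "Safari", "Safari", "Firefox", "Opera"]
)
grouped = keep_top_categories(browsers, top_n=2)
print(grouped.tolist())
```

The top list should be computed on the training data only and reused at prediction time, so that both see the same categories.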

With respect to development, priority should be given to these top 10 browsers, e.g. for bug fixes, unit testing, and device testing, as most users come from these sources.

Country

Most of the users are from the United States, followed by India and the U.K. To attract more customers, promo planning should prioritise the festival seasons, holidays, etc. of the top 7 countries.

Total Transactions w.r.t browser

We can observe that most of the transaction amount comes from Chrome, Firefox, and Safari users.

Log of Transactions Revenue

We can observe that, after taking the log of the transaction revenue, the distribution is close to normal, whereas without the log it is right-skewed. The raw revenue could also be modelled for regression, but taking the log compresses the range of the target, which simplifies the calculations done during gradient descent and helps it converge faster.

Research-Papers/Solutions/Architectures/Kernels

Below are a few references that discuss time series splitting, its advantages, and how components of time such as seasonality, trend, and noise are factored into the data.

1. Time Series as Primary Structure to data:

- Dr. Francesca Lazzeri on Machine Learning for Time Series Forecasting (podcast, www.infoq.com)

2. Components of Time Series:

- Marco Del Pra, Time Series Forecasting with Deep Learning and Attention Mechanism (towardsdatascience.com)

Takeaways from the Research Papers:

Based on the above references, temporal data carries additional hidden features in the form of seasonality and trends. Hence the data is split by sorting the dataframe on the “date” column, using the older transactions for training and the more recent ones for validation.
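A date-ordered split can be sketched as below. The helper `time_based_split`, the 80/20 fraction, and the sample rows are assumptions; the sketch also assumes the “date” column holds values like 20160902 that need parsing first.

```python
import pandas as pd

def time_based_split(df, date_col="date", train_frac=0.8):
    """Sort by date; return (older rows for training, newer rows for validation)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical sample rows, not taken from the dataset.
visits = pd.DataFrame({
    "date": [20170103, 20160902, 20170520, 20161115, 20170801],
    "revenue": [0.0, 12.5, 0.0, 40.0, 0.0],
})
# Assumption: dates are stored as YYYYMMDD integers.
visits["date"] = pd.to_datetime(visits["date"].astype(str), format="%Y%m%d")

train, valid = time_based_split(visits, train_frac=0.8)
print(len(train), len(valid))
```

Unlike a random split, this keeps every validation row later in time than every training row, so the validation score reflects how the model would behave on future visits.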

First Cut Approach

To start with, exploratory data analysis is done to understand the data through statistical and graphical means; the data is then split by date so as to utilize its temporal nature. Since this is a regression problem, a Linear Regression model can be trained as a baseline, both to fit the data and to understand feature importance from the weights the model assigns.
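As a rough sketch of reading feature importance from a linear baseline, the example below fits scikit-learn's LinearRegression on synthetic data. The data and the true weights (3.0, 0.5, 0) are made up for illustration. Scaling the features first is an added assumption: without it, coefficient sizes depend on each feature's units and are not comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize so that coefficient magnitudes can be compared across features.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Rank features by absolute weight, largest first.
ranking = np.argsort(-np.abs(model.coef_))
print("Features ordered by |weight|:", ranking.tolist())
```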

Data Encoding

A label encoder is used to vectorize the categorical data; no action is taken for numerical features, as the models are tree-based.
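The encoding step could be sketched with scikit-learn's LabelEncoder as below. The helper `label_encode`, the column names, and the sample rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode(df, cat_cols):
    """Replace each categorical column with integer codes; return the encoders too."""
    out = df.copy()
    encoders = {}
    for col in cat_cols:
        enc = LabelEncoder()
        # Cast to str so that mixed types and missing values encode consistently.
        out[col] = enc.fit_transform(out[col].astype(str))
        encoders[col] = enc
    return out, encoders

# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "channelGrouping": ["Organic Search", "Direct", "Organic Search"],
    "pageviews": [3, 1, 7],
})
encoded, encoders = label_encode(sample, ["channelGrouping"])
print(encoded["channelGrouping"].tolist())
```

The fitted encoders are kept so that the same mapping can be applied at prediction time. Note that LabelEncoder raises an error on categories it did not see during fitting, so unseen values need separate handling.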

Model Building

After trying different models such as Linear Regression, Random Forest, XGBoost, and LightGBM (LGB), the best model is XGBoost, which gave lower test and validation loss than the others at a low n_estimators.

Cross Validation

Below is the plot of n_estimators vs. MSE; the best value is taken as 50.
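A sweep over n_estimators can be sketched as follows. To keep the example dependent only on scikit-learn, it uses GradientBoostingRegressor as a stand-in for XGBoost; the synthetic data and the candidate values are assumptions and will not reproduce the numbers in this post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the prepared features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Earlier rows for training, later rows for validation (ordered split).
X_train, X_valid = X[:240], X[240:]
y_train, y_valid = y[:240], y[240:]

candidates = [10, 25, 50, 100]
valid_mse = {}
for n in candidates:
    model = GradientBoostingRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    valid_mse[n] = mean_squared_error(y_valid, model.predict(X_valid))

# Pick the candidate with the lowest validation MSE.
best_n = min(valid_mse, key=valid_mse.get)
print(valid_mse, "best:", best_n)
```

Plotting `valid_mse` against the candidate values gives the kind of n_estimators vs. MSE curve described above.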

Evaluation

The test loss and validation loss are 0.05 and 0.06 respectively for n_estimators = 50.

Metrics from Different Models

We can observe that the Random Forest MSE is lower than that of XGBoost, but it needs a higher n_estimators than XGBoost; hence XGBoost is chosen to model the data.

Github Repo

https://github.com/solomon-data-ml/google-store-revenue

Deployment

This is deployed on Heroku and is available at the URL below:

Gstore Revenue Prediction: gstore-predictor.herokuapp.com

Conclusions and Future Work

We can conclude that store data is useful for predicting revenue. The model performance can be improved further by adding features such as the category of the product being purchased, sub-categories, type of promotions, etc.