Avito Demand Prediction

Solomon
9 min read · Oct 12, 2021


The aim of this work is to help online sellers understand how much demand there will be for a given product, so that they can improve the way they describe it on the website. For instance, suppose a seller has two products, a) a headphone and b) a mobile phone. If the model predicts high demand for the headphone, it means the title, description, technical specifications and images for that product are good enough to attract customers; otherwise, the seller has to rework those and upload a better description and images to increase the demand for that product.

ML Formulation:
Given the numerical, categorical, text and image data for a product listing, predict the demand (deal) probability of that product. This is essentially a regression problem where we need to minimize the Root Mean Square Error (RMSE).

This involves both machine learning and deep learning techniques, since we have text and image data.

Business Constraints

There are no business constraints as such, other than minimizing the loss function. There are no latency requirements at prediction time: since the results are mostly used to understand how well the title, description, specifications and images of a product are written, the model can be deployed as a nightly job that updates each product with its predicted demand probability.

Dataset Column Analysis

Repo: https://www.kaggle.com/c/avito-demand-prediction/data

The table below summarizes the features present in the dataset, with the data type and a high-level description of each feature.

Excel Format : Avito_Demand_Prediction_Data_Columns

Research-Papers/Solutions/Architectures/Kernel

1. Conv1D for NLP

CNNs are generally used for image processing, but the same logic can be applied to text with one-dimensional convolutions: a kernel slides over a window of tokens to capture the essence of the sentence.

The Conv1D layer is available as tf.keras.layers.Conv1D, with the number of filters and the kernel size as positional parameters, and it can be used as follows.

import tensorflow as tf

input_shape = (4, 10, 128)   # (batch, steps, channels)
x = tf.random.normal(input_shape)
y = tf.keras.layers.Conv1D(32, 3, activation='relu', input_shape=input_shape[1:])(x)
print(y.shape)   # (4, 8, 32)

2. CatBoost

CatBoost is an open-source machine learning library built on gradient boosting over decision trees. The list of categorical features is specified through the cat_features parameter, and the rest of the encoding and modelling is handled on the fly by CatBoost. It often outperforms plain gradient boosting, and it can make use of the GPU via the task_type parameter. This can be tested on the Avito dataset to validate its performance.
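
As an illustration, here is a minimal sketch of how CatBoost could be wired up for this problem (the feature matrices, column indices and hyperparameter values are placeholders, not the exact competition setup):

from catboost import CatBoostRegressor, Pool

# X_train/X_valid are assumed to hold the tabular features, y_* the deal probabilities
cat_feature_indices = [0, 1, 2]   # hypothetical positions of the categorical columns
train_pool = Pool(X_train, y_train, cat_features=cat_feature_indices)
valid_pool = Pool(X_valid, y_valid, cat_features=cat_feature_indices)

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    loss_function='RMSE',   # matches the competition metric
    task_type='GPU',        # use the GPU if one is available
    verbose=100,
)
model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=50)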

3. Neural Networks for Predictive Monitoring of the Anaerobic Digestion Process

In this paper, the data has been modelled using both Conv1D and LSTM layers, for the reasons below.

  1. The bio-process has long and variable lag times between predictors and responses, so LSTM was chosen. The same applies to demand prediction, which is highly time dependent, so this architecture can be experimented with on the Avito dataset.
  2. 1D convolution has high potential to extract and consolidate information over windows, which can also be tried on the Avito dataset instead of connecting the inputs directly to the fully connected layer; a sketch of such a stack follows this list.
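
A rough Keras sketch of such a Conv1D + LSTM stack over a tokenized description (the vocabulary size, sequence length and layer widths are illustrative assumptions):

import tensorflow as tf

max_len, vocab_size = 100, 50000   # illustrative values

text_in = tf.keras.Input(shape=(max_len,), name='description_tokens')
x = tf.keras.layers.Embedding(vocab_size, 64)(text_in)
x = tf.keras.layers.Conv1D(64, 3, activation='relu')(x)   # local n-gram features
x = tf.keras.layers.MaxPooling1D(2)(x)                    # consolidate into windows
x = tf.keras.layers.LSTM(32)(x)                           # longer-range dependencies
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)   # deal probability in [0, 1]

model = tf.keras.Model(text_in, out)
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])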

4. Blur Detection using OpenCV

The amount of blur in an image can be estimated using OpenCV and the Laplace operator.

Typically, the following steps are used to measure the blurriness of an image:

  1. Compute the Fast Fourier Transform of the image.
  2. Examine the distribution of high and low frequencies.
  3. If there is only a small amount of high frequencies, mark the image as blurred.

The problem with this approach is that step 3 varies across object types: what counts as a "low amount of high frequencies" for one object will be different for another. One solution is the variance-of-the-Laplacian method by Pech-Pacheco et al. in their 2000 ICPR paper. The steps used there are:

a. Take a single channel of the image.

b. Convolve it with a 3×3 Laplacian kernel.

c. Calculate the variance of the result.

d. If the variance falls below a threshold, the image is considered blurry.

It is available in OpenCV as a single line of code:

cv2.Laplacian(image, cv2.CV_64F).var()
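
Putting the steps together, a minimal helper along these lines can score every Ad image (the default threshold of 100 is an assumption that has to be tuned on the actual images):

import cv2

def blur_score(image_path, threshold=100.0):
    """Return the variance of the Laplacian and a flag marking the image as blurry."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:                       # unreadable or missing image
        return 0.0, True
    variance = cv2.Laplacian(gray, cv2.CV_64F).var()
    return variance, variance < threshold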

So, the idea here is to incorporate the blurriness of an Ad's images as one of the features used to decide the demand for that Ad.

Exploratory Data Analysis

Translation using Google Translate:

The feature values are in Russian; to understand them, Google Translate is used via the GoogleTranslator class. For each original column, a corresponding 'eng_' column is created for exploratory data analysis.

from deep_translator import GoogleTranslator   # assuming the deep_translator package

translator = GoogleTranslator(source='ru', target='en')
translator.translate("Екатеринбург")   # e.g. "Ekaterinburg"

Region

Each region has contributed Ads around a mean of 40,000, and there is no long-tailed skew here. The region that contributes the most is 'Krasnodar Region'.

City

The top three cities in the dataset are 'Krasnodar', 'Ekaterinburg' and 'Novosibirsk'.

Distribution of Region

We can see that a huge percentage of Ads come from the Krasnodar region; the pie chart gives a good visualization of where Ads are mostly posted and where coverage is thin.

User Types Distribution

Almost 75% of Ads are created by private individuals rather than actual companies or shops. The reason might be that most companies spend their advertising budget on TV commercials and social media platforms.

Parent Category Names

From the chart below, we can see that about 50% of Ads belong to the Personal Belongings category, followed by Home and Garden. This is consistent with the previous chart: most Ads come from private individuals rather than companies, so most people post Ads mainly for the personal belongings they wish to sell.

Word Cloud of Title

From the word cloud below, we can see that most Ads are for real estate, specifically apartments.

Word Cloud of Description

From the word cloud below, the words 'in good', 'good condition', 'in excellent', 'excellent condition' and 'boo' are the most prominent, which is expected, since products are typically described as being in good or excellent condition.

Correlations

From the heatmap, we can see that Price has a good correlation with the deal probability, as does the image sequence number, so these are good predictors to start with.
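
For reference, a heatmap like this can be drawn directly from the numeric columns of the competition data (assuming it has been loaded into a pandas dataframe called train_df):

import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ['price', 'item_seq_number', 'image_top_1', 'deal_probability']
sns.heatmap(train_df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.show()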

Deal Probability w.r.t Region

Ads from the Krasnodar region are more likely to result in a deal, since this region has more Ads than the other regions.

Deal Probability w.r.t Parent Category Name

Comparing the earlier pie chart with the one below, we can conclude that the deal probability is aligned with the number of Ads per parent category, with "Personal Belongings" on top.

Deal Probability w.r.t Category Name

The "Service Offer" related categories and "Products for children and toys" have the highest probability of resulting in a deal among all the other categories.

Non zero Deal Probability

The density is concentrated around 0.2, which is below average, and near the ideal likelihood of 0.8, while moderate probabilities around 0.4 to 0.6 occur in very small counts. So there are Ads that lead to very few deals and Ads that lead to very good deals, with little in between, which makes this data very helpful for separating a good Ad from a bad one.

Feature Engineering

As part of feature engineering, the title length and description length are taken into consideration, because a very short title may not have much impact on viewers, and neither does a very long title or description. The day of the week is also considered, since whether an Ad is viewed on a weekday affects how customers react to it.
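
A quick sketch of these engineered features with pandas (using the title, description and activation_date columns from the dataset; train_df is assumed to hold the raw data):

import pandas as pd

train_df['title_len'] = train_df['title'].fillna('').str.len()
train_df['description_len'] = train_df['description'].fillna('').str.len()
train_df['activation_date'] = pd.to_datetime(train_df['activation_date'])
train_df['weekday'] = train_df['activation_date'].dt.weekday   # 0 = Monday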

The blur value of each image is calculated from the variance of the Laplacian, as described above. This matters because if an image is very blurred, nobody will be interested in interacting with the Ad, so this feature should add useful signal to the model.

Numerical feature Data type conversion

The numerical features have mixed data types, such as objects and strings; these need to be converted to numeric format for modelling.
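
For example, price-like columns can be coerced to numeric with pandas, with unparsable entries turned into NaN to be handled in the next step:

import pandas as pd

for col in ['price', 'image_top_1', 'item_seq_number']:
    train_df[col] = pd.to_numeric(train_df[col], errors='coerce')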

Handling Null Values

Null values in the categorical features are set to 'NA', nulls in the numerical features are set to zero, and only data points with a valid description are kept for modelling.
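
A minimal version of that imputation (the column lists are illustrative, not exhaustive):

cat_cols = ['param_1', 'param_2', 'param_3', 'city', 'region']
num_cols = ['price', 'image_top_1']

train_df[cat_cols] = train_df[cat_cols].fillna('NA')
train_df[num_cols] = train_df[num_cols].fillna(0)
train_df = train_df[train_df['description'].notna()]   # keep only rows with a valid description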

Split data into Test and Train

The data is temporal in nature. To respect that during modelling, the dataframe is first sorted by activation date and then split so that the first 80 percent goes to training and the rest to testing.
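
A sketch of that time-ordered 80/20 split:

train_df = train_df.sort_values('activation_date')
split_idx = int(len(train_df) * 0.8)
train_part = train_df.iloc[:split_idx]   # earliest 80% of Ads
test_part = train_df.iloc[split_idx:]    # most recent 20% of Ads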

First Cut Approach

Most of the features here are categorical, which suits tree-based models. So, for the first cut, a Random Forest model can be tried, incorporating the blurriness and other image-derived factors without using a neural network, and set as the baseline for subsequent models that explore Conv1D and time-based architectures such as LSTM.

Model 1 : Base Line - Random Forest Model

The baseline model gives an RMSE of 0.239, which rounds up to 0.24, as the benchmark for the upcoming models. n_estimators is set to 80 using the elbow method; although 400 and 600 estimators gave slightly better results, 80 was chosen to avoid overfitting and to improve runtime performance.
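
A rough outline of that baseline (X_train/X_test and y_train/y_test are assumed to be the encoded tabular features and deal probabilities from the split above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=80, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))   # baseline came out around 0.24
print(f'RMSE: {rmse:.3f}')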

Model 2 : CatBoost

In CatBoost, we only need to provide the indices of the categorical features, and the rest is taken care of by the CatBoost module itself. Another advantage is that it can use the GPU out of the box, decreasing the training time drastically.

We can also observe from the sorted feature importances that the main category of the product, as well as the price, plays a vital role in deciding the deal probability. So Ad creators should pay more attention to how they choose the parent category and at what price they plan to sell the product.
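
That ranking can be read straight off a trained CatBoost model, for example (model here refers to the CatBoostRegressor object sketched earlier):

importances = model.get_feature_importance(prettified=True)
print(importances.head(10))   # e.g. parent_category_name and price near the top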

Model 3: Model creation using GRU, Neural Networks and InceptionNet

I have also tried a neural network that uses InceptionNet to handle the images and GRU units for the NLP embeddings. There has been a good improvement over all the models above, and this model was finalized as it converged from the 5th epoch onwards.
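
A condensed, illustrative sketch of such a multi-input network; the layer sizes and the use of precomputed InceptionV3 features are assumptions rather than the exact architecture used:

import tensorflow as tf

max_len, vocab_size = 100, 50000   # illustrative text settings
img_feat_dim = 2048                # e.g. pooled InceptionV3 bottleneck features

# Text branch: embedding followed by a GRU
text_in = tf.keras.Input(shape=(max_len,), name='text')
t = tf.keras.layers.Embedding(vocab_size, 64)(text_in)
t = tf.keras.layers.GRU(64)(t)

# Image branch: dense layer over precomputed InceptionV3 features
img_in = tf.keras.Input(shape=(img_feat_dim,), name='image_features')
i = tf.keras.layers.Dense(128, activation='relu')(img_in)

# Tabular branch: engineered numeric features (price, lengths, blur score, ...)
num_in = tf.keras.Input(shape=(10,), name='numeric')
n = tf.keras.layers.Dense(32, activation='relu')(num_in)

merged = tf.keras.layers.concatenate([t, i, n])
merged = tf.keras.layers.Dense(64, activation='relu')(merged)
out = tf.keras.layers.Dense(1, activation='sigmoid')(merged)   # deal probability

model = tf.keras.Model([text_in, img_in, num_in], out)
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])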

The code can be accessed at https://github.com/solomon-data-ml/google-store-revenue

Observations and Conclusions

Techniques such as image imputation and engineered features such as blurriness and aggregated features added further improvement, bringing the RMSE down from 0.266 to 0.222. We can conclude that the listing data is genuinely useful for predicting demand, and performance could be improved further by adding features such as the category of product being purchased, sub-categories, type of promotions, etc.

Areas of improvement

We can improve this model further by tuning hyperparameters such as the activation function (e.g. tanh) and the number of GRU units. The images can also be leveraged further by extracting colour values, for instance whiteness and brightness.

References

  1. Appliedai.com and mentors
  2. NLP with CNN
  3. Catboost
  4. Blur Detection Using OpenCV
  5. Images : Photo from Pexels
