L1 Regularization and Sparsity

Solomon
3 min read · Mar 23, 2022


This article describes L1 regularization and why it creates sparse solutions.

Sparsity here means that many of the model's weights become (or tend toward) zero after training, which regularizes an overly complex model. To understand this better, in the figure above the blue curve represents a model without regularization, which fits all the data points and produces a complex model, whereas the green curve represents the regularized model, which removes that complexity (overfitting) to produce a simpler model.

Modification in Loss Function

We know that regularization is done to avoid overfitting. Regularization is nothing but adding another term to the loss function, i.e. penalizing the complexity of the model through the loss. If that term is the sum of the absolute values of all the model's weights multiplied by a small factor, it is called L1 regularization, and if the term is the sum of all squared weights, it is called L2 regularization. Basically we are adding a penalty so the model cannot fit the data exactly. Training on the regularized loss smooths the model function, so regularization can be seen as a function-smoothing process.

While optimizing, both the loss and the weights are taken into consideration.

L1 Regularized Loss = Loss + λ(|w1| + |w2| + . . . + |wn|)

L2 Regularized Loss = Loss + λ(w1² + w2² + . . . + wn²)
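
As a quick illustration, here is a minimal NumPy sketch of the two regularized losses above; the weight vector, the small factor λ and the base loss are made-up values used only for illustration.

import numpy as np

lam = 0.01                              # the small factor λ (illustrative value)
w = np.array([0.5, -1.2, 0.0, 3.0])     # example model weights
base_loss = 2.37                        # some unregularized loss value

l1_regularized_loss = base_loss + lam * np.sum(np.abs(w))  # Loss + λ Σ|wi|
l2_regularized_loss = base_loss + lam * np.sum(w ** 2)     # Loss + λ Σ wi²

print(l1_regularized_loss, l2_regularized_loss)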

The shape of the L2 penalty, w1², is a parabola.

The shape of the L1 penalty, |w1|, is a V-shaped function.

The derivative of the L2 penalty with respect to w1 is 2·w1.

The derivative of the L1 penalty is +1 for w1 > 0 and -1 for w1 < 0 (it is undefined at w1 = 0; in practice a subgradient of 0 is used there).

We can observe that for L2 the gradient 2·w shrinks as the weight approaches zero, so the updates become smaller and smaller and the weight rarely reaches exactly zero. For L1 the gradient stays constant at ±1, so the weight is pushed toward zero by the same amount at every iteration and can hit zero exactly. That is why we see far more zero weights (sparsity) with L1 than with L2.

Here is a neat demonstration, in about 4 lines of code, of how a weight shrinks to zero faster under L1 regularization than under L2.

https://repl.it/repls/CreepyMuffledFrontpage
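
In case the link is unavailable, here is a small sketch of the same idea (not the original snippet; the learning rate and starting weight are arbitrary values): run plain gradient descent on each penalty alone, starting from the same weight.

w_l1, w_l2, lr = 1.0, 1.0, 0.1            # same starting weight and learning rate (arbitrary)
for step in range(20):
    w_l1 = max(w_l1 - lr * 1.0, 0.0)      # gradient of |w| is +1 for w > 0: constant-size step, clamped at 0
    w_l2 = w_l2 - lr * 2.0 * w_l2         # gradient of w² is 2w: the step shrinks as w shrinks
    print(f"step {step:2d}  L1 weight: {w_l1:.4f}  L2 weight: {w_l2:.4f}")
# The L1-penalized weight reaches exactly 0 within 10 steps;
# the L2-penalized weight keeps shrinking but never hits 0.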

L1 as a Feature Selector

As L1 drives most of the weights to zero, it can be treated as a feature selector that nullifies useless features. This is similar in spirit to Dropout in neural network regularization, which removes links between neurons, although Dropout does so randomly during training rather than through the loss function. A small sketch of this effect follows below.
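
As a hedged illustration, here is a scikit-learn sketch on synthetic data where only a few of the input features are actually informative; the alpha value is arbitrary and not from the article.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1-regularized linear regression
ridge = Ridge(alpha=1.0).fit(X, y)   # L2-regularized linear regression

print("weights set exactly to zero by L1:", int(np.sum(lasso.coef_ == 0)))
print("weights set exactly to zero by L2:", int(np.sum(ridge.coef_ == 0)))
# L1 typically zeroes out most of the uninformative features; L2 only shrinks them.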

Source

Banner Image Source: By Nicoguaro — Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46259145
