This blog explains the effect of different regularization strengths on imbalanced data and on data that contains noise.
Support Vector Machine
The diagram below shows how SVM classifies the data for different imbalance ratios and five different regularization values.
Left to Right Plots → Imbalanced data with ratios 100:2, 100:20, 100:40, 100:80
Each row corresponds to Regularization ‘C’ value → 0.001, 1, 100, 1000, 2000
Model → SVM
When the C value is set to 2000, we can observe that even the heavily imbalanced data (1st column, last row) is classified perfectly.
For the other C values, the margin overlaps the positive and negative classes.
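To make the setup concrete, here is a minimal sketch of the SVM sweep. The data generation (make_blobs, the cluster centers, and the spread) is an assumption for illustration; only the imbalance ratios and C values come from the plots above.

```python
# Sketch of the SVM experiment: fit a linear SVM for each imbalance ratio
# and each C value, and inspect the resulting separating hyperplane.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

ratios = [(100, 2), (100, 20), (100, 40), (100, 80)]   # majority:minority counts
C_values = [0.001, 1, 100, 1000, 2000]

for n_neg, n_pos in ratios:
    # Assumed synthetic data: two Gaussian blobs with the given class counts.
    X, y = make_blobs(n_samples=[n_neg, n_pos], centers=[(-2, -2), (2, 2)],
                      cluster_std=1.5, random_state=42)
    for C in C_values:
        clf = SVC(kernel='linear', C=C).fit(X, y)
        # w·x + b = 0 is the separating hyperplane; a larger C penalizes
        # misclassified minority points more heavily, so the boundary is
        # pushed away from the minority class less.
        w, b = clf.coef_[0], clf.intercept_[0]
        print(f"ratio {n_neg}:{n_pos}, C={C} -> w={w.round(2)}, b={b:.2f}")
```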
Logistic Regression
Left to Right Plots → Imbalanced data with ratios 100:2, 100:20, 100:40, 100:80
Each row corresponds to Regularization ‘C’ value → 0.001, 1, 100, 1000, 2000
Model → Logistic Regression
We can observe the same effect for Logistic Regression as well.
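The same sweep can be sketched with LogisticRegression. Again, the synthetic data is an assumption; only the C values and the imbalance ratio come from the plots.

```python
# Sketch of the Logistic Regression experiment on one imbalanced dataset
# (ratio 100:20), sweeping the same C values.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=[100, 20], centers=[(-2, -2), (2, 2)],
                  cluster_std=1.5, random_state=42)

for C in [0.001, 1, 100, 1000, 2000]:
    clf = LogisticRegression(C=C).fit(X, y)
    # Small C (strong regularization) shrinks the weights, so the boundary
    # drifts toward the majority class; large C lets the model fit the
    # minority points and separate the classes more cleanly.
    minority_recall = clf.score(X[y == 1], y[y == 1])
    print(f"C={C} -> coef={clf.coef_[0].round(2)}, "
          f"minority recall={minority_recall:.2f}")
```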
Effect of Outliers vs Regularization
Data setup: For illustration, elliptical data is created and outliers are inserted at the positions mentioned below, and we observe how the model behaves in the presence of the outlier data.
Left to Right Plots → Outlier Positions — (0,2) (21,13) (-23,-15) (22,14) (23,14)
Each row corresponds to alpha value for SGD Regressor → 0.0001,1,100
Model → SGDRegressor
We can observe that the hyperplane changes when an outlier is present in the data for the alpha values 0.0001 and 1; specifically, the hyperplane inclines towards the outlier data.
For the alpha value 100, the outlier has little impact on the model: the hyperplane barely changes its parameters and the model behaves the same with or without the outlier data.
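A minimal sketch of this experiment is below. The elliptical cloud around the origin and the way each outlier is appended are assumptions; only the outlier positions and alpha values come from the description above.

```python
# Sketch of the outlier experiment: an elliptical point cloud plus one
# injected outlier, fit with SGDRegressor at each alpha value.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
# Assumed elliptical cloud: x stretched, y correlated with x but tighter.
x = rng.normal(0, 5, 200)
y = 0.5 * x + rng.normal(0, 1, 200)

outliers = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]

for alpha in [0.0001, 1, 100]:
    for ox, oy in outliers:
        X = np.append(x, ox).reshape(-1, 1)
        t = np.append(y, oy)
        reg = SGDRegressor(alpha=alpha, random_state=0, max_iter=1000).fit(X, t)
        # Small alpha lets the fitted line tilt toward the outlier;
        # alpha=100 shrinks the weights so the line barely moves.
        print(f"alpha={alpha}, outlier=({ox},{oy}) -> "
              f"slope={reg.coef_[0]:.3f}, intercept={reg.intercept_[0]:.3f}")
```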
Conclusion
To summarize, the regularization parameters have a huge impact on imbalanced datasets and on noisy data, both of which we encounter in real-world data; hence, importance should be given to fine-tuning the hyperparameters to get the best model.