This article describes the impact of outliers in Logistic Regression and a way to reduce their effect.
In Logistic Regression, the error can be calculated with a distance measure, i.e. how far the data points lie from the separating line/plane/hyperplane.
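One common way to write this (the exact formula is not spelled out here, so treat this as an assumption): for a point x_i with true label y_i in {+1, -1} and a separating plane w·x + b = 0, the signed distance is

d_i = y_i * (w·x_i + b) / ||w||

which is positive when the point lies on the correct side of the plane and negative otherwise.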
Let us assume we have data points as below that belong to a two-class classification problem: Positive and Negative.
We have created a logistic model, say M1, which produced a line/classifier π1 that separates/classifies the two sets of data points as below:
The error for the above model can be calculated by a simple distance measurement: if a data point lies on the correct side of the line, it gets a positive value, and if it lies on the incorrect side, it gets a negative value; summing these up, the total for Model M1 is -16.5.
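As a rough sketch in Python, the total can be computed as below. The individual distances here are hypothetical (the figure is not reproduced), chosen only so that they sum to -16.5:

# Hypothetical signed distances for Model M1's classifier π1:
# correctly classified points contribute small positive values,
# while the single outlier on the wrong side contributes -20.
m1_distances = [1.0, 0.5, 1.0, 0.5, 0.5, -20.0]
m1_total = sum(m1_distances)
print(m1_total)  # -16.5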
Let's say we have created another logistic regression model, M2, which produced a classifier π2 as below:
The total error for this classifier is zero, which on the face of it is exactly what everyone wants.
If we compare the two classifiers, we can easily see that M1 is far better than M2. But just because of one outlier contributing an error of -20, Model M1 ends up with a worse total than Model M2.
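The same computation for M2, again with hypothetical distances (chosen to cancel out to zero, matching the figure):

# Hypothetical signed distances for Model M2's classifier π2:
# several points are misclassified (negative values), but the
# positives and negatives happen to cancel, so the naive total is 0.
m2_distances = [2.0, -2.0, 3.0, -3.0, 1.5, -1.5]
m2_total = sum(m2_distances)
print(m2_total)  # 0.0
# Judged only by these raw totals, M2 (0.0) beats M1 (-16.5),
# even though M1 visually separates the data far better.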
So, to reduce the effect of the outlier, instead of summing the raw distances, each signed distance can be passed through the Sigmoid function, which squashes any given value to between 0 and 1, before summing.
Below is the sigmoid function, where we can clearly see that even for an arbitrarily large input, the output approaches 1 but never exceeds it.
Sigmoid function: y(x) = 1 / (1 + e^(-x))
E.g.:
y(1) = 0.7311
y(2) = 0.8808
y(10) = 0.99995
No matter how large or small the input is, the output always lies between 0 and 1.
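A minimal implementation reproduces the values above:

import math

def sigmoid(x):
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(1))   # 0.7310585786300049
print(sigmoid(2))   # 0.8807970779778823
print(sigmoid(10))  # 0.9999546021312976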
In Logistic Regression, this is one of the reasons for using the Sigmoid function: it bounds each point's contribution to the total loss, preventing the total from being dominated by extreme values and thereby reducing the effect of outliers.
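Putting it together, here is a sketch of how squashing each signed distance through the sigmoid bounds the outlier's influence, reusing the hypothetical distances from above:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

m1_distances = [1.0, 0.5, 1.0, 0.5, 0.5, -20.0]
m2_distances = [2.0, -2.0, 3.0, -3.0, 1.5, -1.5]

# Squash each signed distance before summing: the outlier's
# contribution is now bounded (sigmoid(-20) is nearly 0)
# instead of dragging the total down by 20.
m1_squashed = sum(sigmoid(d) for d in m1_distances)
m2_squashed = sum(sigmoid(d) for d in m2_distances)

print(round(m1_squashed, 3))  # 3.329
print(round(m2_squashed, 3))  # 3.0

With the squashed sums, M1 (≈3.33) now scores higher than M2 (3.0), matching what our eyes tell us about the two classifiers.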