What is the use of data standardization, and where do we use it in machine learning?
While learning machine learning you will encounter the terms standardization, column standardization, or mean centering plus scaling, but what is the purpose of importing StandardScaler as below?
from sklearn.preprocessing import StandardScaler
What exactly is standardization?
How do we standardize the data before fitting a machine learning model, and why do so?
What do we standardize in the first place?
In this post you will get clear insights into standardization, so let's get started.
Standardization comes in during the data preprocessing step. Before learning standardization, you also need to know about normalization.
Normalization:
In column normalization we take the column values and compress them into the range [0, 1] so as to get rid of the scale of each feature.
Ex: let's consider a dataset with features such as the height and weight of a person. These two features have values on different scales, such as height in cms and weight in kgs.
To get rid of such feature scales we use normalization. Similar to normalization, we have "standardization".
Note: In practice, column standardization is used more often than normalization.
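As a quick illustration, here is a minimal sketch of column normalization (min-max scaling) on a toy height/weight dataset, assuming numpy and scikit-learn are available; the values themselves are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: column 0 = height in cm, column 1 = weight in kg (illustrative values)
X = np.array([[150.0, 50.0],
              [160.0, 65.0],
              [175.0, 80.0],
              [190.0, 95.0]])

# Min-max normalization compresses each column into the range [0, 1]
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

print(X_norm.min(axis=0))  # [0. 0.]
print(X_norm.max(axis=0))  # [1. 1.]
```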
Suppose our dataset matrix looks like this:
where f1, f2, f3, …, fj, …, fd are the d features and 1, 2, 3, …, n are the rows (data points).
Consider one column: a1, a2, a3, …, an are the n values of feature fj.
What column standardization does is convert your column values (a1, a2, a3, …, an) into (a'1, a'2, a'3, …, a'n) such that the mean of the transformed data is zero and its standard deviation is one, whereas in column normalization we transform the values so that (a'1, a'2, a'3, …, a'n) all lie in the range [0, 1].
a1, a2, a3, …, an can follow any distribution before the transformation, but after the transformation (a'1, a'2, a'3, …, a'n) will have mean = 0 and std-dev = 1. Concretely, each value is transformed as a'i = (ai − mean) / std-dev.
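To make this concrete, here is a minimal numpy sketch of the transformation on an arbitrary column of values (the numbers are made up for the example):

```python
import numpy as np

# An arbitrary column of feature values (any distribution works)
a = np.array([2.0, 8.0, 15.0, 21.0, 30.0])

# Column standardization: subtract the mean, divide by the standard deviation
a_std = (a - a.mean()) / a.std()

print(a_std.mean())  # ~0.0 (up to floating point error)
print(a_std.std())   # 1.0
```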
But why is this transformation important at all?
Let's understand its importance from a geometric perspective.
Consider the dataset below, where 'x' is the height of the person and 'y' is the weight of the person. The mean lies in the middle of the dataset, as shown below, and the data has some spread (variance) around it.
Variance measures the spread of the data: the more the spread, the larger the variance.
By transforming the above dataset using column standardization, we move the mean to the origin, as shown below, and push the points closer together so that the spread/standard deviation = 1. If the standard deviation before the transformation is less than 1, then after the transformation the points are pulled farther apart from each other to achieve a standard deviation of 1.
Column standardization is also called mean centering plus scaling.
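Here is a minimal sketch of this mean centering plus scaling using sklearn's StandardScaler on a toy height/weight dataset (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 = height in cm, column 1 = weight in kg (illustrative values)
X = np.array([[150.0, 50.0],
              [160.0, 65.0],
              [175.0, 80.0],
              [190.0, 95.0]])

# Mean centering plus scaling: each column ends up with mean 0 and std-dev 1
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # ~[0. 0.] -> the mean is now at the origin
print(X_std.std(axis=0))   # [1. 1.]  -> the spread of each column is 1
```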
If we don't do standardization, does it affect anything?
If the features are on different scales, it becomes harder for the model to converge quickly, the training time increases, and the results are often worse.
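As a sketch of how this is handled in practice, standardization is often placed in a pipeline in front of a scale-sensitive model such as logistic regression; the dataset and parameters below are just placeholders for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardizing first helps the solver converge faster when features have very different scales
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```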
Column standardization is most effective when the underlying distribution is Gaussian, say with mean = mu and std-dev = sigma. This is because when you standardize such a feature, the new feature follows the standard normal distribution N(0, 1). This is an ideal distribution: many ML, statistics, and optimization techniques assume a Gaussian distribution to make their proofs work beautifully.
That doesn't mean standardization is not useful for non-Gaussian features. It still produces a new feature with a mean of 0 and a variance of 1, just not an N(0, 1) distribution. In practice, we perform standardization irrespective of the underlying feature distribution, but the mathematical proofs are best suited to the case where the distribution is Gaussian.
Summary:
- Data standardization is the process of rescaling the attributes so that they have a mean of 0 and a variance of 1.
- The ultimate goal of standardization is to bring all the features to a common scale without distorting the differences in the ranges of their values.
- In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.
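As a final sketch of that last point, StandardScaler learns a separate mean and standard deviation for each column, which you can inspect after fitting (the data below is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[150.0, 0.5],
              [160.0, 0.7],
              [175.0, 0.9],
              [190.0, 1.1]])

scaler = StandardScaler().fit(X)

# Each feature is centered and scaled independently
print(scaler.mean_)   # per-column means used for centering
print(scaler.scale_)  # per-column standard deviations used for scaling
```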