How to detect outliers with z-score (2024)

The Z-score, also called the standard score, is used to scale the features in a dataset before training a machine learning model. It can also be used to detect outliers. In this post, we will first see how to compute Z-scores and then use them to detect outliers.

How is Z-score used in machine learning?

Different variables/features in a dataset have different ranges of values.

For example, a feature like ‘Age’ may vary from 1 to 90, whereas ‘Income’ may go all the way into the tens of thousands. Looking at a particular value (or observation) of a given variable, it is difficult to say how far it is from the mean without actually computing the mean.

This is where Z-score helps.

It standardizes the variable so that, just by knowing the value of a particular observation, you get a sense of how far away it is from the mean.

More specifically, the Z-score tells how many standard deviations away a data point is from the mean.

The process of transforming a feature to its z-scores is called ‘Standardization’.

Z Score Formula

The formula for Z-score is as follows:

$$ Z = \frac{x - \text{mean}}{\text{standard deviation}} $$

If the absolute Z-score of a data point is more than 3, the point is quite different from the other data points. Such a data point can be an outlier.

Z-scores can be both positive and negative. The farther away from 0, the higher the chance of a given data point being an outlier. Typically, an absolute Z-score greater than 3 is considered extreme.
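As a quick sanity check, the formula can be applied by hand to a small toy sample (the numbers below are made up for illustration; `scores` is not from the dataset used later):

```python
import numpy as np

# Hypothetical exam scores; the last value is deliberately extreme
scores = np.array([52.0, 55.0, 53.0, 50.0, 54.0, 90.0])

mean = scores.mean()
std = scores.std()

# Z-score of every observation: (x - mean) / std
z = (scores - mean) / std
print(z.round(2))
```

The extreme value ends up with a much larger |z| than the rest, which is exactly what makes the Z-score useful for flagging outliers.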


For a normal distribution, it is estimated that:
a) 68% of the data points lie within +/- 1 standard deviation of the mean.
b) 95% of the data points lie within +/- 2 standard deviations.
c) 99.7% of the data points lie within +/- 3 standard deviations.
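These percentages (the 68-95-99.7 rule) can be checked empirically by sampling from a standard normal distribution with NumPy:

```python
import numpy as np

# Draw a large sample from a standard normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of points within 1, 2 and 3 standard deviations of the mean
within_1 = np.mean(np.abs(sample) < 1)
within_2 = np.mean(np.abs(sample) < 2)
within_3 = np.mean(np.abs(sample) < 3)
print(within_1, within_2, within_3)
```

The three printed fractions land very close to 0.68, 0.95 and 0.997 respectively.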

Examples of data that approximately follow a normal distribution:
1. Heights of people
2. Blood pressure
3. Measurement errors
4. Test scores

Load the dataset (don't run if you followed along previously)

# Import libraries

# Data Manipulation
import numpy as np
import pandas as pd
from pandas import DataFrame

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Maths
import math

# Set pandas options to show more rows and columns
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)

%matplotlib inline

Read the data

# Read data in the form of a csv file
df = pd.read_csv("../00_Datasets/Churn_Modelling_m.csv")

# First 5 rows of the dataset
df.head()

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave        619.0    France  Female  42.0       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill        608.0     Spain  Female  41.0       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio        502.0    France     NaN   NaN       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni        699.0    France     NaN  39.0       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell        850.0     Spain  Female  43.0       2        NaN              1          1               1         79084.10       0

Histogram

plt.hist(df.CreditScore, bins=20, rwidth=0.8)
plt.xlabel('CreditScore')
plt.ylabel('Count')
plt.title('Histogram - CreditScore')
plt.show()

Standard Deviation and Mean

np.nanstd(df.CreditScore.values.tolist())
96.6527190618191
np.nanmean(df.CreditScore.values)
650.5254525452546

Check for any infinity values.

np.isinf(df[['CreditScore']]).values.sum()
0

Let’s compute the Z-score. First, compute the mean and standard deviation.

# Compute Z Score
cr_mean = np.nanmean(df.CreditScore.values.tolist())
cr_std = np.nanstd(df.CreditScore.values.tolist())
print("Mean Credit Score is: ", cr_mean)
print("Std Credit Score is: ", cr_std)

Mean Credit Score is:  650.5254525452546
Std Credit Score is:  96.6527190618191

Calculate Z Score

From each observation, subtract the mean and divide by the standard deviation.

df['zscore_CreditScore'] = (df.CreditScore - cr_mean) / cr_std
df[["Surname", "CreditScore", "zscore_CreditScore"]].head()

    Surname  CreditScore  zscore_CreditScore
0  Hargrave        619.0           -0.326172
1      Hill        608.0           -0.439982
2      Onio        502.0           -1.536692
3      Boni        699.0            0.501533
4  Mitchell        850.0            2.063828
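As an aside, recent versions of SciPy ship a ready-made helper, `scipy.stats.zscore`, which can skip missing values via `nan_policy='omit'`. A minimal sketch on toy numbers (not the full churn dataset):

```python
import numpy as np
from scipy import stats

# A few illustrative values with one missing entry
credit = np.array([619.0, 608.0, 502.0, np.nan, 850.0])

# nan_policy='omit' computes the mean/std over non-missing values only;
# the missing position stays NaN in the output
z = stats.zscore(credit, nan_policy='omit')
print(z)
```

This mirrors the manual `(x - mean) / std` computation above, with `np.nanmean`/`np.nanstd` handled internally.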

Extract the outliers/extreme values based on Z-score

Generally, we consider the values outside of +3 and -3 standard deviations to be extreme values. Let’s extract them.

# Extreme values based on credit score.
df[(df.zscore_CreditScore < -3) | (df.zscore_CreditScore > 3)]

      RowNumber  CustomerId    Surname  CreditScore Geography  Gender   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited  zscore_CreditScore
1405       1406    15612494  Panicucci        359.0    France  Female  44.0       6  128747.69              1          1               0        146955.71       1           -3.016216
1631       1632    15685372   Azubuike        350.0     Spain    Male  54.0       1  152677.48              1          1               1        191973.49       1           -3.109333
1838       1839    15758813   Campbell        350.0   Germany    Male  39.0       0  109733.20              2          0               0        123602.11       1           -3.109333
1962       1963    15692416  Aikenhead        358.0     Spain  Female  52.0       8  143542.36              3          1               0        141959.11       1           -3.026562
2473       2474    15679249       Chou        351.0   Germany  Female  57.0       4  163146.46              1          1               0        169621.69       1           -3.098986
8723       8724    15803202  Onyekachi        350.0    France    Male  51.0       1       0.00              1          1               1        125823.79       1           -3.109333
8762       8763    15765173        Lin        350.0    France  Female  60.0       3       0.00              1          0               0        113796.15       1           -3.109333
9624       9625    15668309     Maslow        350.0    France  Female  40.0       0  111098.85              1          1               1        172321.21       1           -3.109333

Treat Outliers

Find the Credit score value corresponding to z = 3 and -3. These will be the upper and lower caps.

z_3 = (3 * cr_std) + cr_mean
print(z_3)
z_minus3 = cr_mean - (3 * cr_std)
print(z_minus3)

940.4836097307118
360.56729535979724

Replace the values by capping with the upper and lower limits.

## Cap Outliers
# df.loc[df.zscore_CreditScore < -3, 'CreditScore'] = z_minus3
# df.loc[df.zscore_CreditScore > 3, 'CreditScore'] = z_3
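Pandas also offers `Series.clip`, which applies both caps in a single call. A small sketch on made-up scores, reusing the z = +/-3 limits computed above:

```python
import pandas as pd

# Toy credit scores (made up); lower/upper caps taken from the
# z = -3 / +3 limits computed in this section
scores = pd.Series([350.0, 521.0, 650.0, 850.0, 960.0])
lower, upper = 360.56729535979724, 940.4836097307118

# clip() caps both tails at once: values below `lower` become `lower`,
# values above `upper` become `upper`, the rest are unchanged
capped = scores.clip(lower=lower, upper=upper)
print(capped.tolist())
```

On the real dataframe this would be `df['CreditScore'].clip(lower=z_minus3, upper=z_3)`.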

What are different ways to treat outliers?

It is not always necessary to ‘treat’ outliers. If the outliers are valid data points and you want the ML algorithm to model and predict them, then there is no need to treat them.

However, if you don’t want your model to make such extreme predictions, then you should go ahead and treat them.

There are different ways to treat the outliers:

  1. Remove the observations containing the outliers.
  2. Cap the extreme values based on quantiles or Z-scores. For example, all values greater than the 99th percentile can be replaced with the 99th-percentile value, or all values with a Z-score greater than 3 can be replaced with the value corresponding to Z = 3.
  3. Treat the outlier as a missing value and use any of the different [imputation methods].
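Option 3 can be sketched on a toy series: mark the extreme points as missing, then fill them with the median. The numbers below are illustrative only; note that enough ‘normal’ points are needed, because a single outlier in a tiny sample inflates the standard deviation and can keep its own |z| below 3:

```python
import pandas as pd

# Toy series: twelve "normal" values plus one extreme point (all made up)
s = pd.Series([50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 52, 48, 500], dtype=float)

# Mark |z| > 3 points as missing...
z = (s - s.mean()) / s.std()
s_treated = s.mask(z.abs() > 3)

# ...then impute them with the median of the remaining points
s_treated = s_treated.fillna(s_treated.median())
print(s_treated.tolist())
```

The extreme value is replaced by the median while every other point is left untouched.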

Remove the outlier observations

In the previous section, you computed the Z-score. All you have to do is remove the points whose Z-score is more than 3 or less than -3; equivalently, keep only the points whose Z-score lies between -3 and 3.

new_df = df[(df.zscore_CreditScore > -3) & (df.zscore_CreditScore < 3)]
new_df.head()

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited  zscore_CreditScore
0          1    15634602  Hargrave        619.0    France  Female  42.0       2       0.00              1          1               1        101348.88       1           -0.326172
1          2    15647311      Hill        608.0     Spain  Female  41.0       1   83807.86              1          0               1        112542.58       0           -0.439982
2          3    15619304      Onio        502.0    France     NaN   NaN       8  159660.80              3          1               0        113931.57       1           -1.536692
3          4    15701354      Boni        699.0    France     NaN  39.0       1       0.00              2          0               0         93826.63       0            0.501533
4          5    15737888  Mitchell        850.0     Spain  Female  43.0       2        NaN              1          1               1         79084.10       0            2.063828

Quantile based capping

Cap the outliers with quantile values. Generally, the 5th and 95th percentiles (or the 10th and 90th percentiles) are used; you can change these as per your requirement. Here we use the 10th and 90th.

# Computing 10th, 90th percentiles and replacing the outliers
lower_cap_percentile = np.nanpercentile(df['CreditScore'], 10)
upper_cap_percentile = np.nanpercentile(df['CreditScore'], 90)
print("10 percentile :", lower_cap_percentile)
print("90 percentile :", upper_cap_percentile)

10 percentile : 521.0
90 percentile : 778.0

Let’s print the original values for row numbers 1406, 1632 and 1839, which contain outliers. Later we will print them again, after the outlier treatment.

# original values
mask = df.RowNumber.isin([1406, 1632, 1839])
df.loc[mask, 'CreditScore']

1405    359.0
1631    350.0
1838    350.0
Name: CreditScore, dtype: float64

Do Outlier Capping

That is, values lower than lower_cap_percentile will be replaced with lower_cap_percentile.

# Outlier capping
new_col = np.where(df['CreditScore'] < lower_cap_percentile, lower_cap_percentile, df['CreditScore'])
df['CreditScore_capped'] = new_col
df.loc[mask, ['CreditScore', 'CreditScore_capped']]

      CreditScore  CreditScore_capped
1405        359.0               521.0
1631        350.0               521.0
1838        350.0               521.0

As you can see, the outlier values have now been capped at the lower limit.
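The `np.where` call above caps only the low side; `Series.clip` can apply both quantile caps at once. A minimal sketch on made-up scores, assuming 10th/90th-percentile caps as in this section:

```python
import pandas as pd

# Toy scores (made up); cap at the 10th and 90th percentiles
scores = pd.Series([300.0, 520.0, 600.0, 650.0, 700.0, 780.0, 900.0])
lo = scores.quantile(0.10)   # lower cap
hi = scores.quantile(0.90)   # upper cap

# clip() replaces values below lo with lo and values above hi with hi
capped = scores.clip(lower=lo, upper=hi)
print(capped.tolist())
```

Only the two extreme values change; everything between the caps is passed through untouched.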

Imputation based Approaches to treat outliers

There are more sophisticated methods of outlier treatment, especially when you think the recorded outlier value is an error and you want to replace it with what would have been an appropriate value.

You can use multivariate prediction approaches such as MICE and other methods. These have been discussed in detail in the missing value imputation methods (video) and MICE (video).
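As one illustration of the multivariate idea (not the exact MICE procedure from the videos), scikit-learn's experimental `IterativeImputer` models each feature from the others and predicts values that were set to missing. The two-column data below is made up for the sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy 2-column data where the second column is roughly 2x the first;
# the outlier in the last row has already been replaced by NaN
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.0],
    [5.0, np.nan],
])

# Each column with missing values is regressed on the other columns
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled[-1])
```

Because the imputer learns the near-linear relationship between the columns, the filled-in value lands close to twice the first column, rather than at a simple column mean.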



Author: Arline Emard IV
