Outliers are an important aspect of a dataset that must be handled during cleaning and pre-processing, often by using feature engineering approaches. Outliers are data points that do not fit in with the normal population: their values are extremely high or low compared with the other values in the dataset. Handling outliers is necessary and can be done through various methods. The most common methods for dealing with outliers in Python are the Z score method and the interquartile range (IQR) score method.
There are three different kinds of outliers.
- Point outlier: It is also known as a global outlier. As the name suggests, it is a single data point that lies far from the rest of the data. For example, the average life expectancy of a person is around 70 to 80 years, but one person has lived for 150 years.
- Contextual outlier: To identify this type of outlier, we must have domain knowledge of the given dataset. For example, is a temperature of 10 degrees Celsius an outlier in a country like Russia? To answer this, we must have some knowledge about the temperatures and seasons in Russia.
- Collective outlier: As the name suggests, it is a group of data points that differ in their pattern from the other samples in the population. For example, when the tech bubble burst, all the tech companies' stocks went down, whereas other sectors' stocks had only minor fluctuations.
In the diagram below, you can see the visual representation of all these kinds of outliers.
What are the main causes of outliers in the data?
There are three main causes of outliers in the data.
- An error was made when the data points were recorded. For example, a university is conducting a survey in a nearby village, collecting information such as name, profession, and age. The age of one of the people is given as 700, which is impossible. This is referred to as an experimental or data entry error. Such outliers can simply be removed from the dataset.
- The data is gathered from a variety of sources. For example, the average height of a man in the Netherlands is 6 feet, which is an outlier when compared with the average height of 5 feet 7 inches in China.
- Outliers can be caused by a variety of natural factors. For example, during the summer we expect a typical temperature of more than 35 degrees Celsius, but owing to bad weather, the temperature dips to 15 degrees Celsius. The reported temperature is then considered an outlier.
What are the impacts of outliers?
Outliers can have the following impacts:
- An outlier can distort the mean and standard deviation of the population.
- If the dataset contains outliers, it may violate the primary assumptions of ANOVA, regression, and other statistical models.
- Outliers can reduce the normality of the data if they are distributed non-randomly.
- It creates a bias in the data.
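The first impact in the list above can be seen with a few numbers. The sketch below uses made-up ages (an assumption for illustration, not the article's dataset) and adds the impossible age of 700 from the data entry example. The median is included only for comparison, as a statistic that reacts much less to a single extreme value.

```python
import numpy as np

# Hypothetical ages; 700 mimics the data entry error described earlier.
ages_clean = np.array([23, 25, 31, 35, 40, 42, 47, 52])
ages_with_outlier = np.append(ages_clean, 700)

mean_clean = ages_clean.mean()
mean_with_outlier = ages_with_outlier.mean()

# np.std computes the population standard deviation by default (ddof=0).
std_clean = ages_clean.std()
std_with_outlier = ages_with_outlier.std()

median_clean = np.median(ages_clean)
median_with_outlier = np.median(ages_with_outlier)

print(f"mean:   {mean_clean:.2f} -> {mean_with_outlier:.2f}")
print(f"std:    {std_clean:.2f} -> {std_with_outlier:.2f}")
print(f"median: {median_clean:.2f} -> {median_with_outlier:.2f}")
```

A single bad value pulls the mean and the standard deviation far away from the values that describe the rest of the sample, while the median barely moves.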
Outlier detection
There are various ways to detect outliers in a dataset.
Visual Method
By using a Histogram
A histogram shows how the data is distributed overall. Isolated bars at the extreme ends of the histogram, far from the bulk of the data, point to potential outliers.
By using a Scatter plot
It is a crucial method when we try to detect outliers in a multivariate setting (for example, by plotting two variables against each other). The plot below shows that the points that do not belong to the normal population are considered outliers.
By using a Box plot
This type of plot is often used for detecting outliers or anomalies in a dataset. It is also called a box-and-whisker plot, and it gives a compact summary of the distribution of the data.
The box plot displays five key values: the minimum, the maximum, and the quartile values Q1, Q2 (also known as the median), and Q3.
Besides the outliers, a box plot tells us a few other things: it gives a rough indication of whether the data is normally distributed, and it shows the skewness of the data.
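The three visual methods above can be sketched with Matplotlib as follows. The data is randomly generated with a few extreme values injected by hand; the variable names, the seed, and the output file name are assumptions chosen for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 "normal" points plus two injected extreme values.
x = np.append(rng.normal(50, 5, 200), [95, 100])
y = np.append(rng.normal(30, 3, 200), [5, 60])

fig, (ax_hist, ax_scatter, ax_box) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: outliers show up as isolated bars far from the bulk of the data.
ax_hist.hist(x, bins=30)
ax_hist.set_title("Histogram")

# Scatter plot: outliers lie away from the main cluster of points.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_title("Scatter plot")

# Box plot: by default, points beyond 1.5 * IQR from the box are drawn individually.
ax_box.boxplot(x)
ax_box.set_title("Box plot")

fig.tight_layout()
fig.savefig("outlier_plots.png")  # use plt.show() instead in an interactive session
```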
Statistical Method
Z score method
The Z score is one of the most commonly used tools for determining the outliers present in a dataset. It is also known as the standard score, and it tells us whether a data value is greater or smaller than the mean and how far it is from the mean. The z score is defined as the number of standard deviations by which a value lies above or below the mean. After this transformation, the data has a mean of 0 and a standard deviation of 1, as in the standard normal distribution.
Z score can be calculated using the formula:
Z score = (x – mean) / std. deviation
After calculating the z score of each data point, its absolute value is compared with a threshold. A data point is regarded as an outlier if its z score is farther from zero than expected; a commonly used threshold is 3, meaning the point lies more than three standard deviations from the mean. As a result, the standard deviation is one of the most important parameters to consider when calculating the z score, and it helps in differentiating the outliers present in the dataset.
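As a minimal sketch of the Z score method, the function below applies the formula above with NumPy and flags values whose absolute z score exceeds a threshold. The function name `zscore_outliers`, the threshold of 3 (a common convention, not a fixed rule), and the life expectancy values are assumptions made for this example.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (outliers, z_scores) for a 1-D collection of numbers."""
    values = np.asarray(values, dtype=float)
    # Z score = (x - mean) / std. deviation, using the population std (ddof=0).
    z_scores = (values - values.mean()) / values.std()
    outliers = values[np.abs(z_scores) > threshold]
    return outliers, z_scores

# Hypothetical life expectancies: twenty typical values and one of 150 years.
life_expectancy = [70, 72, 75, 78, 80, 71, 74, 76, 79, 73,
                   77, 69, 81, 75, 72, 78, 74, 76, 73, 77, 150]

outliers, z_scores = zscore_outliers(life_expectancy)
print(outliers)
```

Note that the method needs a reasonably large sample: in a very small dataset, a single extreme value inflates the standard deviation so much that its own z score can never reach 3.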
Interquartile range (IQR) score method
The IQR score method is another prominent and widely used way of finding the outliers present in a dataset. The interquartile range, or IQR, measures variability by dividing a dataset into quartiles. The data is first sorted in ascending order and then divided into four equal parts by three cut points, labeled Q1, Q2, and Q3, which are the first, second, and third quartiles, respectively. The first quartile, Q1, represents the 25th percentile of the data, Q2 represents the 50th percentile, and Q3 represents the 75th percentile.
Follow the steps below to check for outliers.
- Get the summary statistics of the data. With the help of this, you will get things like count, mean, standard deviation, minimum, maximum, and quartile values like Q3, Q1, and the median (which is also known as Q2).
- To calculate the interquartile range, use the formula below.
IQR = Q3 - Q1
- Next, compute the lower bound for outliers by using the formula below.
Q1 - 1.5 × IQR
- After that, compute the upper bound for outliers by using the formula below.
Q3 + 1.5 × IQR
Any value below the lower bound or above the upper bound is considered an outlier. By using the above steps, you can find the outliers that are present in the dataset.
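The IQR steps can be sketched in a few lines of NumPy. Here the lower bound is Q1 - 1.5 × IQR and the upper bound is Q3 + 1.5 × IQR; the helper name `iqr_bounds` and the summer temperature readings are assumptions made for this illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds based on the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical summer temperatures in degrees Celsius, with one unusually cold day.
temperatures = np.array([35, 36, 37, 38, 36, 35, 39, 37, 36, 38, 15])

lower, upper = iqr_bounds(temperatures)
outliers = temperatures[(temperatures < lower) | (temperatures > upper)]

print(f"lower bound: {lower}, upper bound: {upper}")
print(f"outliers: {outliers}")
```

For this sample, Q1 is 35.5 and Q3 is 37.5, so the IQR is 2, the bounds are 32.5 and 40.5, and only the 15-degree reading falls outside them.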
Coding implementation in Python
Dataset description- The data that we are going to use in this project is downloaded from the Kaggle website. It contains information about.
You can download the dataset from this link.
There are five steps involved in this process.
- Install and import all the necessary packages or libraries.
NumPy: This library is used to perform mathematical operations like mean and standard deviation, and to work with arrays and matrices.
Pandas: This library is used to read various types of files in formats like CSV, JSON, text files, etc. You can also perform data manipulation operations using this library.
Matplotlib and Seaborn: These libraries plot various types of charts and statistical graphs like correlation plots, histograms, pie charts, etc.
SciPy: Short for Scientific Python. We are using this library to import the zscore function (scipy.stats.zscore).
- Read the dataset and display its first rows with the help of the head() function.
- Perform basic data analysis and visualization on the dataset. There are two main parts to this.
describe() function: This function is used for displaying the statistical summary of the dataset. It gives information like count, mean, min, max, standard deviation, median, and quartile values.
For data visualization, three types of charts are used to find the outliers in the dataset: scatter plot, box plot, and histogram.
- Next, we will inspect the outliers with the help of statistical methods like the Z score, interquartile values, and the half-value method.
- At last, we remove the outliers from the dataset by dropping the values considered outliers.
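The steps above can be sketched end to end as follows. Because the dataset itself is not specified here, the example builds a small synthetic DataFrame in place of reading the real file; the column names `age` and `income`, the injected values, and the threshold of 3 are all assumptions for illustration. Only the Z score and IQR checks are shown, and the plotting step is omitted to keep the sketch short.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Step 2: load the data. A synthetic table stands in for pd.read_csv(...) here.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(20, 60, size=50).astype(float),
    "income": rng.normal(50_000, 8_000, size=50),
})
df.loc[0, "age"] = 700          # injected data entry error
df.loc[1, "income"] = 500_000   # injected extreme value
print(df.head())

# Step 3: statistical summary (count, mean, std, min, quartiles, max).
print(df.describe())

# Step 4a: Z score method, flagging rows where any column has |z| > 3.
z = pd.DataFrame(
    np.abs(zscore(df.to_numpy(), axis=0)),
    columns=df.columns,
    index=df.index,
)
z_mask = (z > 3).any(axis=1)

# Step 4b: IQR method, flagging rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_mask = ((df < lower) | (df > upper)).any(axis=1)

# Step 5: drop every row flagged by either method.
cleaned = df.drop(index=df.index[z_mask | iqr_mask])
print(f"rows before: {len(df)}, rows after: {len(cleaned)}")
```

Dropping rows flagged by either method is the stricter choice; depending on the data, you may prefer to rely on only one of the two checks, or to inspect the flagged rows before removing them.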
Coding implementation
How to detect and remove outliers in Python
Conclusion
In this blog, we have learned about the various methods for detecting and removing outliers in the dataset. We’ve also discussed the many types of outliers that can be found in data, as well as their causes and effects. Finally, the Python programming language was used to implement the code.
If you like this blog post, you can like it and comment. If you have any doubts regarding the implementation, mention them in the comments, and share the post with your friends and colleagues.
You can connect with me on social media profiles like LinkedIn, Twitter, and Instagram.
LinkedIn – https://www.linkedin.com/in/abhishek-kumar-singh-8a6326148
Twitter- https://twitter.com/Abhi007si
Instagram- www.instagram.com/dataspoof