In this blog post, we will learn how to deal with missing values and how to implement each technique in Python. We cover the problems that missing values cause, the types of missing values, and five ways to handle them.
Introduction
Missing data, or missing values, are the parts of a dataset that were not recorded when the data was collected. A single value may be missing, or a whole row of observations may be missing for a particular record. Missing data can occur in both categorical and continuous variables. It is very common in social science research, where participants may skip some of the questions. Missing values in a dataset are usually represented as NA or NaN. Survey data in particular tends to contain many missing values, so they must be handled before working with this type of dataset.
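Before handling missing values it helps to see where they are. The following is a minimal sketch using pandas; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; both None and np.nan are treated as missing by pandas.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Delhi"],
    "income": [50000, 62000, np.nan, np.nan],
})

# Number of missing values in each column
missing_count = df.isna().sum()

# Percentage of missing values in each column
missing_pct = df.isna().mean() * 100

print(missing_count)
print(missing_pct)
```

Here `income` has two missing values out of four, so its missing percentage is 50.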
Problems associated with missing values
Some of the problems associated with missing values are:
- Missing values can make the model biased if they are not handled carefully.
- Many algorithms cannot be trained on data that contains missing values. Support vector machines and naive Bayes are two examples.
- There is a loss of statistical power when we have missing values. Statistical power is the probability that a hypothesis test rejects the null hypothesis when it is false, and missing values reduce this probability because fewer observations are available.
Types of Missing values
There are three different types of missing values.
1. Missing completely at random (MCAR)
Missingness is independent of both the observed and the unobserved data. It means that there is no relationship between whether a value is missing and the data values themselves. The benefit of this type of missingness is that analysis of the remaining data is unbiased. As a rule of thumb, a feature with less than 5% missing values is sometimes treated as MCAR, although a low missing rate alone does not prove that the data is MCAR.
For example: a study is carried out to gather data on a particular demographic, and some people are left out of the study because of the random sampling procedure. In this instance, the missing data are unrelated to any traits of the people who were left out of the study.
2. Missing at random (MAR)
In this case the data is not missing completely at random: the missingness depends on other observed variables, so values are missing within subgroups defined by those variables. Because of this dependency, analyzing only the complete cases can give biased results.
For example: suppose a student does not take the quiz because the student is ill. If the student's health condition is recorded, for instance in medical records, the missing quiz score can be explained by an observed variable.
3. Missing not at random (MNAR)
In this case, the missingness of the data is related to events or factors that are not captured by the researchers. In simple terms, the value that is missing is connected with the reason it is missing [1]. Analysis of the observed data is generally biased, because the missingness depends on information that is not available in the dataset.
For example: a student does not attend the quiz because the student was drunk the night before. This type of missingness is sometimes called nonignorable nonresponse.
Another example is a person who does not disclose a mental health issue in a health survey conducted by doctors because the person is suffering from depression.
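The difference between the three types can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables `hours` and `score`, the 30% missing rate, and the logistic missingness probabilities are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: the quiz score depends on the hours studied.
hours = rng.normal(5, 2, n)
score = 50 + 5 * hours + rng.normal(0, 5, n)


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# MCAR: every score has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: a score is more likely to be missing when the observed variable `hours` is low.
mar = rng.random(n) < sigmoid(5 - hours)
# MNAR: a score is more likely to be missing when the score itself is low.
mnar = rng.random(n) < sigmoid((75 - score) / 5)

true_mean = score.mean()
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    # Difference between the mean of the observed scores and the true mean
    print(name, round(score[~mask].mean() - true_mean, 2))
```

Under MCAR the mean of the observed scores stays close to the true mean, while under MAR and MNAR it is shifted upward because low scores are missing more often. Under MAR the shift could be corrected using `hours`; under MNAR the information needed for the correction is itself missing.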
5 ways to handle missing values in Python
There are 5 ways in which we can handle missing values in a dataset. The subsections below describe each technique and the Python code to implement it.
1. Listwise or case deletion
In this technique, we simply delete the cases where missing values are present and analyze the remaining data. In other words, every row that contains a missing value is deleted.
Advantages of using listwise or case deletion
- It is easy to apply, and it also helps to reduce the data size.
- It gives unbiased results when the data is MCAR, because the missingness does not depend on any of the variables.
- Listwise or case deletion ensures consistency in the sample size and variables across the dataset. This makes it easier to compare results and draw valid conclusions.
Disadvantages of using listwise or case deletion
- Deleting a whole row will cause a loss of information.
- If the missing data is not MCAR, bias can be introduced by listwise or case deletion. For instance, deleting those observations will result in biased conclusions if the missing data is connected to a particular variable or group.
- Once case deletion has been performed, later imputation becomes more difficult, because the deleted rows also contained observed values of other variables that could have been used to estimate the missing ones.
Implementation of listwise or case deletion in Python

    import pandas as pd

    # load the dataset into a DataFrame
    df = pd.read_csv('my_data.csv')

    # remove observations with missing values
    df.dropna(inplace=True)

    # remove rows with missing values in a particular column
    df.dropna(subset=['age'], inplace=True)
2. Pairwise deletion
Another approach to dealing with missing data is called pairwise deletion, in which only the observations with missing values for a given variable are eliminated from the analysis of that variable. This method allows for the use of all the available data for each variable rather than removing the whole observation.
Implementation of pairwise deletion in Python

    import pandas as pd

    # load the dataset into a DataFrame
    df = pd.read_csv('my_data.csv')

    # drop missing values only for the variable being analyzed ('age');
    # the rows stay in df, so other variables keep all of their observed values
    age = df['age'].dropna()
    print(age.mean())
Pairwise deletion might be used when only a small percentage of the overall dataset is missing. Like listwise deletion, it gives unbiased results only when the data is Missing Completely at Random (MCAR).
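The difference between listwise and pairwise deletion is easiest to see on a small table. The sketch below uses a made-up DataFrame; it relies on the fact that pandas skips missing values per column in `mean()` and per pair of columns in `corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column, in different rows
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 22],
    "income": [50, 62, np.nan, 80, 45],
    "score":  [3.0, 4.0, 5.0, np.nan, 2.0],
})

# Listwise deletion: only the fully observed rows survive
listwise = df.dropna()
print(len(listwise))

# Pairwise deletion: each statistic uses every value available for it
n_per_column = df.count()   # observations available per column
column_means = df.mean()    # NaN is skipped column by column
pairwise_corr = df.corr()   # each correlation uses rows where both columns are observed

print(n_per_column)
print(column_means)
print(pairwise_corr)
```

Listwise deletion keeps 2 of the 5 rows here, while pairwise deletion still uses 4 values for each column mean and 3 rows for each correlation. One consequence is that different statistics are based on different subsets of rows.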
3. Imputation of missing values with mean, median, and mode
Imputation is a technique for dealing with missing data that involves substituting estimated values for the missing values. The mean, median, or mode of the non-missing values in the same variable can be used to impute missing values, which is one of the most popular methods.
Implementation of imputation with mean, median, or mode in Python

    import pandas as pd

    # load the dataset into a DataFrame
    df = pd.read_csv('my_data.csv')
    df.head()

    # choose ONE of the following strategies for a given column

    # replace missing values in the 'Age' column with the mean
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    # replace missing values in the 'Age' column with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())

    # replace missing values in the 'Age' column with the mode
    df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
- Mean imputation is sensitive to outliers.
- Median imputation is not affected by outliers, but it is not recommended when the data is not continuous.
- Mode imputation is helpful when the data is categorical.
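The effect of an outlier on mean and median imputation can be checked on a tiny example. The values below are invented for illustration; 95 plays the role of the outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one outlier (95) and one missing value
ages = pd.Series([22, 25, 27, np.nan, 95])
# Hypothetical categorical column with one missing value
colors = pd.Series(["red", "blue", "red", None])

mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())
mode_filled = colors.fillna(colors.mode()[0])

print(mean_filled[3])    # 42.25, pulled upward by the outlier
print(median_filled[3])  # 26.0, close to the typical values
print(mode_filled[3])    # red, the most frequent category
```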
4. Regression imputation
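Regression imputation predicts each missing value from the other variables, using a regression model fitted on the rows where that value is observed. Here is an example of how regression imputation can be implemented in Python using scikit-learn:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # required to import IterativeImputer
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LinearRegression

    # load the dataset into a numpy array
    data = np.genfromtxt('my_data.csv', delimiter=',', skip_header=1)

    # create an imputer that regresses each feature with missing values on the other features
    imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)

    # fit the imputer on the data
    imputer.fit(data)

    # use the imputer to transform the data
    data_imputed = imputer.transform(data)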
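To make the idea of regression imputation concrete, here is a minimal hand-written sketch for a single column. The columns `hours` and `score` are hypothetical, and the predictor column is assumed to be fully observed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'score' has two missing values, 'hours' is complete
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "score": [12.0, 14.0, np.nan, 18.0, np.nan, 22.0],
})

# Rows where the value to impute is observed
observed = df["score"].notna()

# Fit the regression on the observed rows only
model = LinearRegression()
model.fit(df.loc[observed, ["hours"]], df.loc[observed, "score"])

# Predict the missing scores from 'hours' and fill them in
df.loc[~observed, "score"] = model.predict(df.loc[~observed, ["hours"]])

print(df)
```

In this toy data the observed scores follow `score = 10 + 2 * hours` exactly, so the imputed values are about 16 and 20. Because the imputed values lie exactly on the regression line, this approach tends to understate the variability of the imputed column.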
A Random Forest is an ensemble of decision trees that can be used to solve both classification and regression problems. It can also be used to impute missing values. The rationale behind using a Random Forest for imputation is that it can learn how the variables that have missing values relate to the other variables in the dataset and use that relationship to estimate the missing values.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # load the dataset into a DataFrame
    df = pd.read_csv('my_data.csv')

    # 'target' is the column to impute; the other columns are assumed to be numeric and complete
    features = df.columns.drop('target')
    known = df['target'].notna()

    # create a Random Forest regressor and fit it on the rows where 'target' is observed
    rf = RandomForestRegressor(random_state=0)
    rf.fit(df.loc[known, features], df.loc[known, 'target'])

    # use the regressor to predict the missing values and fill them in
    df.loc[~known, 'target'] = rf.predict(df.loc[~known, features])
k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems. It can also be used to impute missing values by using the values of the k-nearest neighbors to estimate the missing value.
Here is an example of how KNN can be used to impute missing values in Python using the fancyimpute library:

    # import the libraries
    import pandas as pd
    from fancyimpute import KNN

    # load the dataset into a DataFrame (numeric columns only)
    df = pd.read_csv('my_data.csv')

    # create an instance of the KNN imputer
    imputer = KNN(k=5)

    # fancyimpute solvers fit and impute in a single fit_transform step
    imputed = imputer.fit_transform(df.to_numpy())
    df_imputed = pd.DataFrame(imputed, columns=df.columns, index=df.index)
It’s crucial to remember that Random Forest and k-nearest neighbors imputation both assume that the missing data is Missing at Random (MAR) and does not make up a large part of the overall dataset.
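As an alternative sketch, scikit-learn ships its own `KNNImputer`. Note that this is a different library from fancyimpute, and the small array below is invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows in which the feature is observed
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The two rows nearest to the incomplete row are the first and the third, so the missing value is filled with the mean of 2.0 and 6.0, which is 4.0. Since KNN relies on distances, features on very different scales are usually standardized first.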
5. Multiple imputation by chained equations
A technique for dealing with missing data known as multiple imputations by chained equations (MICE) involves the creation of numerous imputed datasets and the subsequent combination of the findings from each dataset to provide a final analysis. The purpose of MICE is to produce a large number of reasonable imputed values in order to account for the uncertainty brought on by the missing data.
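Implementation of MICE in Python

    # import the libraries
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # required to import IterativeImputer
    from sklearn.impute import IterativeImputer

    # load the dataset into a DataFrame (numeric columns only)
    df = pd.read_csv('my_data.csv')

    # create an instance of the MICE-style imputer
    imputer = IterativeImputer(max_iter=10, imputation_order='random', initial_strategy='mean', random_state=0)

    # fit the imputer on the data
    imputer.fit(df)

    # use the imputer to transform the data
    df_imputed = pd.DataFrame(imputer.transform(df), columns=df.columns, index=df.index)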
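The "multiple" part of MICE, creating several imputed datasets and combining the results, can be sketched as follows. This is a simplified illustration on synthetic data, not a full MICE implementation: the columns `x` and `y`, the number of imputations `m = 5`, and the choice of the mean of `y` as the quantity of interest are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: y depends on x, and about 20% of y is removed at random
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.5, size=200)})
df.loc[rng.random(200) < 0.2, "y"] = np.nan

# Create m imputed datasets; sample_posterior=True draws each imputed value
# from a predictive distribution, so the datasets differ from one another
m = 5
imputed_sets = []
for seed in range(m):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    imputed = imputer.fit_transform(df)
    imputed_sets.append(pd.DataFrame(imputed, columns=df.columns))

# Run the same analysis on every dataset and combine the results
estimates = [d["y"].mean() for d in imputed_sets]
pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))

print(pooled_mean, between_var)
```

The spread of the estimates across the imputed datasets (`between_var`) reflects the extra uncertainty caused by the missing values, which a single imputation would hide.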
Conclusion
In conclusion, there are several methods for handling missing data, including listwise or case deletion, pairwise deletion, and imputation. Each method has its own advantages and disadvantages and is suitable for different types of data and scenarios.
References
[1] https://en.wikipedia.org/wiki/Missing_data