How to deal with missing values in Python

In this blog post, we will learn how to deal with missing values and how to implement each technique in Python. The topics we are going to cover are listed below.

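Table of Contents

Introduction
Problems associated with missing values
Types of Missing values
1. Missing completely at random (MCAR)
2. Missing at random (MAR)
3. Missing not at random (MNAR)
5 ways to handle missing values in Python
1. Listwise or case deletion
2. Pairwise deletion
3. Imputation of missing values with mean, median, and mode
4. Regression imputation
5. Multiple imputation by chained equations
Conclusion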

Introduction

Missing data, or missing values, are the parts of a dataset that were not recorded when the data was collected. A single value may be missing, or a whole row of observations may be missing for a particular case. Missing data can occur in both categorical and continuous variables. It is very common in social science research, where participants may skip some of the questions. Missing values in a dataset are typically represented as NA or NaN. Survey data tends to contain a particularly large number of missing values, so they must be handled before any further work is done with this type of dataset.
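Before handling missing values, it helps to see where they are. The sketch below builds a small, made-up DataFrame (the column names and values are purely illustrative) and counts the missing entries with pandas.

```python
import numpy as np
import pandas as pd

# hypothetical survey data; None and np.nan both mark missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [52000, 61000, np.nan, 58000, 49000],
    "city": ["Pune", None, "Delhi", "Mumbai", "Delhi"],
})

# number of missing values in each column
missing_per_column = df.isna().sum()

# share of missing values in each column, as a percentage
missing_percent = df.isna().mean() * 100

# rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(missing_percent)
print(len(rows_with_missing), "rows contain at least one missing value")
```

Looking at the per-column percentages first is a cheap way to decide whether deletion or imputation is the more reasonable starting point.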

Problems associated with missing values

Some of the problems associated with missing values are:

Missing values can bias the model if they are not handled carefully.
Many algorithms cannot be trained on data that contains missing values. Support vector machines and naive Bayes, for example, typically require complete inputs.
Missing values reduce statistical power, that is, the probability that a hypothesis test rejects the null hypothesis when it is false.

Types of Missing values

There are three different types of missing values.

1. Missing completely at random (MCAR)

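Missingness is independent of both the observed and the unobserved data. In other words, there is no relationship between whether a value is missing and any of the data values. The benefit of MCAR data is that analyzing the remaining observations gives unbiased results. A rule of thumb sometimes used is to treat a feature with less than 5% missing values as MCAR, although a low missing rate alone does not prove that the values are missing completely at random.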

For example, suppose a study gathers data on a particular demographic, and some people are left out because of the random sampling procedure. In this instance, the missing data are unrelated to any traits of the people who were left out of the study.

2. Missing at random (MAR)

Here the data is not missing completely at random: the missingness varies across subgroups defined by other observed variables. Because the missingness depends on those variables, analyzing only the complete cases can give biased results unless the dependency is taken into account.

For example, suppose a student misses a quiz because the student is ill. If the health condition of the student is available from medical records, the missing quiz score can be explained by that observed information.

3. Missing not at random (MNAR)

In this case, the missingness is related to events or factors that the researchers did not capture. In simple terms, the value that is missing is connected to the reason it is missing [1]. Analyzing only the observed data gives biased results, because the missingness depends on the unobserved values themselves.

For example, a student misses a quiz because the student was drunk the night before, and that reason is not recorded anywhere. This type of missingness is sometimes called nonignorable nonresponse.

Another example is a person suffering from depression who, for that very reason, does not disclose his mental health issues in a health survey conducted by doctors.
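The difference between the three mechanisms is easier to see on simulated data. The sketch below is only an illustration under made-up assumptions: the variables (study hours and quiz score), the thresholds, and the missingness probabilities are all hypothetical. It deletes scores under each mechanism and compares the mean of the remaining scores with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# hypothetical variables: study hours (always observed) and quiz score
hours = rng.normal(5.0, 1.5, n)
score = 50 + 5 * hours + rng.normal(0, 5, n)

# MCAR: every score has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3

# MAR: the chance of a missing score depends only on the observed hours
mar_missing = rng.random(n) < np.where(hours < 5.0, 0.5, 0.1)

# MNAR: the chance of a missing score depends on the score itself
mnar_missing = rng.random(n) < np.where(score < 75.0, 0.5, 0.1)

true_mean = score.mean()
mcar_mean = score[~mcar_missing].mean()
mar_mean = score[~mar_missing].mean()
mnar_mean = score[~mnar_missing].mean()

print(f"true mean of all scores:   {true_mean:.2f}")
print(f"observed mean under MCAR:  {mcar_mean:.2f}")
print(f"observed mean under MAR:   {mar_mean:.2f}")
print(f"observed mean under MNAR:  {mnar_mean:.2f}")
```

In this setup only the MCAR mean stays close to the true mean. Under MAR and MNAR the low scores are removed more often, so the mean of the remaining scores is too high. The practical difference is that under MAR the variable driving the missingness (hours) is available to correct for it, while under MNAR it is not.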

5 ways to handle missing values in Python

There are 5 ways to handle missing values in a dataset. The subsections below describe each technique along with the Python code to implement it.

1. Listwise or case deletion

In this technique, we simply delete the cases that contain missing values and analyze the remaining data. In other words, every row of the dataset that contains a missing value is dropped.

Advantages of using listwise or case deletion

It is easy to apply, and it also reduces the size of the data.
It gives unbiased results when the data is MCAR, because the missingness does not depend on any of the variables.
Listwise or case deletion ensures consistency in the sample size and variables across the dataset. This makes it easier to compare results and draw valid conclusions.

Disadvantages of using listwise or case deletion

Deleting a whole row will cause a loss of information.
If the missing data is not MCAR, listwise or case deletion can introduce bias. For instance, if the missing data is concentrated in a particular variable or group, eliminating those observations will lead to biased conclusions.
Once cases have been deleted, later imputation becomes harder, because the information those rows carried about the other variables has been lost.

Implementation of listwise or case deletion in Python

import pandas as pd

# load the dataset into a DataFrame
df = pd.read_csv('my_data.csv')

# remove every observation that has a missing value in any column
df_complete = df.dropna()

# alternatively, remove only the observations that are missing 'age'
df_age_complete = df.dropna(subset=['age'])
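To get a feel for how much information listwise deletion can cost, the following sketch uses a small made-up DataFrame (all names and values are hypothetical) in which each incomplete row is missing just one value, and compares the share of missing cells with the share of rows that survive.

```python
import numpy as np
import pandas as pd

# hypothetical data: each incomplete row is missing exactly one value
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52, 38],
    "income": [52000, 61000, np.nan, 58000, 49000, 75000],
    "score": [0.7, 0.4, 0.9, np.nan, 0.6, 0.8],
})

# listwise deletion keeps only the fully observed rows
complete_cases = df.dropna()

rows_kept = len(complete_cases)
share_kept = rows_kept / len(df)
cells_missing = df.isna().to_numpy().mean()

# a softer variant: keep rows that have at least 2 observed values
mostly_complete = df.dropna(thresh=2)

print(f"cells missing: {cells_missing:.0%}")
print(f"rows kept by listwise deletion: {rows_kept} of {len(df)} ({share_kept:.0%})")
print(f"rows kept with thresh=2: {len(mostly_complete)} of {len(df)}")
```

Even a modest share of missing cells can remove a large share of rows when the gaps are spread across different columns, which is worth checking before committing to deletion.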

2. Pairwise deletion

Another approach to dealing with missing data is called pairwise deletion, in which only the observations with missing values for a given variable are eliminated from the analysis of that variable. This method allows for the use of all the available data for each variable rather than removing the whole observation.

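Implementation of pairwise deletion in Python

import pandas as pd

# load the dataset into a DataFrame
df = pd.read_csv('my_data.csv')

# analyze 'age' using only the observations where 'age' is present;
# the rest of the DataFrame is left untouched
age_observed = df['age'].dropna()
print(age_observed.mean())

Pairwise deletion is most defensible when only a small percentage of the overall dataset is missing and the values are missing completely at random (MCAR). When the data is only Missing at Random (MAR), it can still give biased results.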
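A common place where pairwise deletion shows up is a correlation matrix. As a rough sketch on made-up data (the columns and values are hypothetical), pandas computes each correlation from the rows where both columns of that pair are observed, so different pairs can be based on different rows.

```python
import numpy as np
import pandas as pd

# hypothetical data with gaps in different rows of each column
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52, 38],
    "income": [52000, 61000, np.nan, 58000, 49000, 75000],
    "score": [0.7, 0.4, 0.9, np.nan, 0.6, 0.8],
})

# DataFrame.corr() excludes missing values pair by pair
pairwise_corr = df.corr()

# number of rows actually available for each pair of columns
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

# listwise deletion, for comparison, uses the same rows for every pair
listwise_corr = df.dropna().corr()

print(pairwise_n)
print(pairwise_corr.round(2))
print(listwise_corr.round(2))
```

Because each entry of the pairwise matrix rests on a different subset of rows, the sample size differs from pair to pair. That is the price paid for using more of the available data than listwise deletion does.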

3. Imputation of missing values with mean, median, and mode

Imputation is a technique for dealing with missing data that involves substituting estimated values for the missing values. The mean, median, or mode of the non-missing values in the same variable can be used to impute missing values, which is one of the most popular methods.

Implementation of imputation with mean, median, or mode in Python

import pandas as pd

# load the dataset into a DataFrame
df = pd.read_csv('my_data.csv')
df.head()

# pick ONE of the following strategies for the 'Age' column

# replace missing values in the 'Age' column with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# or: replace missing values in the 'Age' column with the median
# df['Age'] = df['Age'].fillna(df['Age'].median())

# or: replace missing values in the 'Age' column with the mode
# df['Age'] = df['Age'].fillna(df['Age'].mode()[0])

Mean imputation is sensitive to outliers.
Median imputation is not affected by outliers, but it is not recommended when the data is not continuous.
Mode imputation is helpful when the data is categorical.
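The effect of outliers on the fill value can be seen in a tiny example. The numbers below are invented for illustration: one extreme income drags the mean far away from the typical values, while the median stays put. The mode is shown on a hypothetical categorical column.

```python
import numpy as np
import pandas as pd

# hypothetical incomes with one extreme value and two missing entries
income = pd.Series([30_000, 32_000, 35_000, 38_000, 1_000_000, np.nan, np.nan])

# the mean is pulled upward by the single extreme value
mean_filled = income.fillna(income.mean())

# the median reflects a typical value and ignores the extreme one
median_filled = income.fillna(income.median())

# hypothetical categorical column: only the mode is meaningful here
city = pd.Series(["Delhi", "Pune", "Delhi", None, "Mumbai"])
mode_filled = city.fillna(city.mode()[0])

print("fill value with mean:  ", income.mean())
print("fill value with median:", income.median())
print("fill value with mode:  ", city.mode()[0])
```

When a column is skewed like this, the median is usually the safer single-value fill.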

4. Regression imputation

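Regression imputation fits a regression model on the observations where a variable is present and uses the other variables to predict the values where it is missing. Here is an example of how regression imputation can be implemented in Python using scikit-learn. As a simplifying assumption, it imputes only the first column and expects the remaining columns to be complete:

import numpy as np
from sklearn.linear_model import LinearRegression

# load the dataset into a numpy array (all columns must be numeric)
data = np.genfromtxt("my_data.csv", delimiter=",", skip_header=1)

# column to impute and the predictor columns
target = data[:, 0]
predictors = data[:, 1:]
missing = np.isnan(target)

# fit the regression on the rows where the target column is observed
model = LinearRegression()
model.fit(predictors[~missing], target[~missing])

# use the model to predict the missing entries and fill them in
data_imputed = data.copy()
if missing.any():
    data_imputed[missing, 0] = model.predict(predictors[missing])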

Random Forest, an ensemble of decision trees, can be used to solve both classification and regression problems. It can also be used to impute missing values. The rationale is that a Random Forest can learn how the variable with missing values relates to the other variables in the dataset and use that relationship to estimate the missing values.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# load the dataset into a DataFrame
df = pd.read_csv('my_data.csv')

# 'target' is the column to impute; the remaining columns are the predictors
# and are assumed here to be numeric and fully observed
predictors = df.drop('target', axis=1)
missing = df['target'].isna()

# create a Random Forest regressor
rf = RandomForestRegressor(random_state=0)

# fit the regressor on the rows where 'target' is observed
rf.fit(predictors[~missing], df.loc[~missing, 'target'])

# use the regressor to predict the missing values and fill them in
if missing.any():
    df.loc[missing, 'target'] = rf.predict(predictors[missing])

k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems. It can also be used to impute missing values: each missing value is estimated from the values of the k nearest neighbors of that observation.

Here is an example of how KNN can be used to impute missing values in Python using the fancyimpute library:

# import the libraries
import pandas as pd
from fancyimpute import KNN

# load the dataset into a DataFrame (all columns must be numeric)
df = pd.read_csv('my_data.csv')

# create an instance of the KNN imputer
imputer = KNN(k=5)

# fit the imputer and fill in the missing values in one step;
# the result is a numpy array
imputed = imputer.fit_transform(df.to_numpy())

# put the imputed values back into a DataFrame
df_imputed = pd.DataFrame(imputed, columns=df.columns, index=df.index)

It is crucial to remember that Random Forest and k-nearest neighbors imputation assume that the data is Missing at Random (MAR) and that the missing values do not make up a large share of the overall dataset.
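If fancyimpute is not available, scikit-learn ships its own KNN-based imputer, KNNImputer. The sketch below is one possible alternative, not the only way to do it; the data and column names are made up so that the result is easy to follow by eye.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# hypothetical data: one weight is missing for a tall person
df = pd.DataFrame({
    "height": [150.0, 152.0, 170.0, 172.0, 171.0],
    "weight": [50.0, 52.0, 70.0, np.nan, 71.0],
})

# each missing value is replaced by the average of that column
# over the 2 nearest rows that have the value observed
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df)

# fit_transform returns a numpy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(imputed, columns=df.columns, index=df.index)

print(df_imputed)
```

Because the neighbors are found by distance, columns on very different scales can dominate the result. Scaling the features before imputing is a common precaution.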

5. Multiple imputation by chained equations

Multiple imputation by chained equations (MICE) is a technique for dealing with missing data that creates several imputed datasets, analyzes each one, and then combines the results into a final analysis. The purpose of MICE is to produce several plausible sets of imputed values in order to account for the uncertainty introduced by the missing data.

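Implementation of MICE in Python

# import the libraries
import pandas as pd
# scikit-learn's IterativeImputer implements chained-equation imputation;
# it is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# load the dataset into a DataFrame (all columns must be numeric)
df = pd.read_csv('my_data.csv')

# create an instance of the MICE-style imputer
imputer = IterativeImputer(max_iter=10, imputation_order='random', initial_strategy='mean')

# fit the imputer on the data
imputer.fit(df)

# use the imputer to transform the data (returns a numpy array)
df_imputed = imputer.transform(df)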
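A single fit and transform yields one completed dataset, whereas the idea behind multiple imputation is to create several and combine the results. The sketch below shows one way this could be done with scikit-learn's IterativeImputer, by setting sample_posterior=True and varying the random seed. The simulated data, the number of imputations (m = 5), and the choice of the column mean as the quantity of interest are all assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# hypothetical data: three correlated numeric columns
n = 500
x = rng.normal(size=n)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.5, size=n),
    -x + rng.normal(scale=0.5, size=n),
])

# remove about 20% of the second column completely at random
data_missing = data.copy()
mask = rng.random(n) < 0.2
data_missing[mask, 1] = np.nan

# create m imputed datasets; sample_posterior=True draws imputed values
# from a predictive distribution, so each seed gives a different dataset
m = 5
estimates = []
for seed in range(m):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(data_missing)
    # analysis step: here simply the mean of the imputed column
    estimates.append(completed[:, 1].mean())

# pooling step: average the per-dataset estimates
pooled_mean = float(np.mean(estimates))

# spread of the estimates across imputations reflects imputation uncertainty
between_var = float(np.var(estimates, ddof=1))

print(f"estimates from each imputed dataset: {np.round(estimates, 3)}")
print(f"pooled estimate: {pooled_mean:.3f}")
print(f"between-imputation variance: {between_var:.6f}")
```

Only the pooled point estimate and the between-imputation variance are computed here. A full analysis would also combine the within-imputation variance of each estimate, following what are usually called Rubin's rules.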

Conclusion

In conclusion, there are several methods for handling missing data, including listwise or case deletion, pairwise deletion, and imputation. Each method has its own advantages and disadvantages and is suitable for different types of data and scenarios.


References

[1] https://en.wikipedia.org/wiki/Missing_data
