Classification is one of the most widely used tasks in machine learning; a large share of problems in data science are classification problems.
There are two broad categories of classification:
- Binary classification (two-class problems)
- Multinomial classification (more than two classes in the target variable)
Logistic regression is the most basic algorithm for solving two-class classification problems. Common examples include churn prediction, spam detection, and many more. Previously in our 42-day PyTorch series, we covered how to perform linear regression in PyTorch. In this blog, we will learn how to implement logistic regression in PyTorch.
Definition of Logistic Regression
Logistic Regression is a supervised algorithm in machine learning that is used to predict the probability of a categorical response variable. In logistic regression, the predicted variable is a binary variable that contains data encoded as 1 (True) or 0 (False). In other words, the logistic regression model predicts P(Y=1) as a function of X.
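In equation form (the notation here is illustrative, with b0 the intercept and b1, ..., bk the coefficients of the k predictor variables):
P(Y=1 | X) = 1 / (1 + e^(-(b0 + b1*x1 + ... + bk*xk)))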
Properties of Logistic Regression:
- The predicted variable in logistic regression follows a Bernoulli distribution.
- Logistic regression uses the maximum likelihood method for parameter estimation (sketched just after this list).
- We don't compute R-squared; model fit is instead assessed through measures such as concordance and the KS statistic.
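As a reference sketch of what maximum likelihood means here: the coefficients are chosen to maximize the log-likelihood of the observed 0/1 labels, where p_i is the predicted probability for observation i:
log L = Σ_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]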
Linear Regression vs. Logistic Regression
In short, linear regression predicts a continuous numeric value, while logistic regression predicts the probability of a categorical outcome.
How logistic regression works
In logistic regression, the algorithm is based on the logistic function, 1/(1+e^(-x)), which outputs a probability between 0 and 1. If we plot it, the resulting curve is S-shaped.
As the S-shape suggests, the output of the logistic function is always a probability between 0 and 1.
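As a quick illustration, here is a minimal plotting sketch of the logistic function (using numpy and matplotlib, both also imported later in this post):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10, 10, 200)
y = 1 / (1 + np.exp(-x))  # the logistic (sigmoid) function
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.title('The S-shaped logistic curve')
plt.show()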
Let’s take an example to make it more precise & clear.
Suppose that the x-axis denotes the number of goals scored by Lionel Messi and the y-axis denotes the probability of Barcelona winning the match, with x-axis values ranging from 0 to 5. According to the S-curve, there is a high probability of Barcelona winning the match if Lionel Messi scores more than 3 goals, and a high probability of Barcelona losing if he scores fewer than 2 goals.
Types of Logistic Regression
- Binary Logistic Regression: The predicted variable has only two possible outcomes such as Cat or Dog, Positive or Negative.
- Multinomial Logistic Regression: The predicted variable has three or more nominal categories such as predicting the type of dog’s breed.
- Ordinal Logistic Regression: The predicted variable has three or more ordered categories, such as education level ("High school", "Graduation", "Post-graduation", "Ph.D.").
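As a minimal sketch of how the multinomial case looks in PyTorch (the feature count of 4 and class count of 3 are illustrative assumptions, not taken from this post's dataset):
import torch
import torch.nn as nn
multi_model = nn.Linear(4, 3)       # one linear layer: 4 features -> 3 class scores
criterion = nn.CrossEntropyLoss()   # applies log-softmax + negative log-likelihood internally
x = torch.randn(8, 4)               # a batch of 8 random illustrative samples
labels = torch.randint(0, 3, (8,))  # illustrative integer class labels
loss = criterion(multi_model(x), labels)
print(loss.item())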
Where to use logistic regression
- We have a binary or dichotomous target variable.
- We have predictor X-variables that we think are related to the Y-variable.
Use Cases of logistic regression
Financial sector
In the financial industry, this algorithm is used to predict loan defaulters, build credit scores, guide loan distribution, and more. Many large firms, such as Morgan Stanley, use these methods.
Medical sector
In the medical industry, this algorithm is used to predict whether a patient has diabetes or not. There are many other applications, such as breast cancer prediction and tumor detection.
Telecommunication sector
In this sector, the algorithm is used to predict customer churn, so providers can offer better plans before customers leave.
Network security
In network security, logistic regression is used to predict whether a network packet has been successfully delivered or not.
Implementation of logistic regression in PyTorch
The dataset comes from the UCI Machine Learning Repository; it is the Adult (census income) dataset. The classification goal is to predict whether an individual's annual income exceeds $50K (the target takes the values <=50K and >50K). You can download the dataset from here.
Importing required libraries
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
The dataset provides information about individuals. It includes 48,842 records and 15 columns.
Input variables
Variable | Type | Description |
---|---|---|
age | Numeric | Age of the individual |
workclass | Categorical | Type of workclass: “Private”, “Self-emp-not-inc”, “Local-gov”, “?”, “State-gov”, “Self-emp-inc”, “Federal-gov”, “Without-pay”, “Never-worked” |
fnlwgt | Numeric | Final weight of the individual in sampling |
education | Categorical | Educational level: “HS-grad”, “Some-college”, “Bachelors”, “Masters”, “Assoc-voc”, “11th”, “Assoc-acdm”, “10th”, “7th-8th”, “Prof-school”, “9th”, “12th”, “Doctorate”, “5th-6th”, “1st-4th”, “Preschool” |
educational-num | Numeric | Numeric representation of education |
marital-status | Categorical | Marital status: “Married-civ-spouse”, “Never-married”, “Divorced”, “Separated”, “Widowed”, “Married-spouse-absent”, “Married-AF-spouse” |
occupation | Categorical | Occupation type: “Prof-specialty”, “Craft-repair”, “Exec-managerial”, “Adm-clerical”, “Sales”, “Other-service”, “Machine-op-inspct”, “?”, “Transport-moving”, “Handlers-cleaners”, “Farming-fishing”, “Tech-support”, “Protective-serv”, “Priv-house-serv”, “Armed-Forces” |
relationship | Categorical | Relationship type: “Husband”, “Not-in-family”, “Own-child”, “Unmarried”, “Wife”, “Other-relative” |
race | Categorical | Race of the individual: “White”, “Black”, “Asian-Pac-Islander”, “Amer-Indian-Eskimo”, “Other” |
gender | Categorical | Gender: “Male”, “Female” |
capital-gain | Numeric | Capital gains |
capital-loss | Numeric | Capital losses |
hours-per-week | Numeric | Number of hours worked per week |
native-country | Categorical | Native country: “United-States”, “Mexico”, “?”, “Philippines”, “Germany”, “Puerto-Rico”, “Canada”, “El-Salvador”, “India”, “Cuba”, “England”, “China”, “South”, “Jamaica”, “Italy”, “Dominican-Republic”, “Japan”, “Guatemala”, “Poland”, “Vietnam”, “Columbia”, “Haiti”, “Portugal”, “Taiwan”, “Iran”, “Greece”, “Nicaragua”, “Peru”, “Ecuador”, “France”, “Ireland”, “Hong”, “Thailand”, “Cambodia”, “Trinadad&Tobago”, “Outlying-US(Guam-USVI-etc)”, “Laos”, “Yugoslavia”, “Scotland”, “Honduras”, “Hungary”, “Holand-Netherlands” |
income | Categorical | Income level: “<=50K”, “>50K” |
Data exploration
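Before exploring, we load the CSV into a pandas DataFrame. (The filename adult.csv is an assumption; point it at wherever you saved the download.)
data = pd.read_csv('adult.csv')  # filename is an assumption; adjust to your local path
print(data.shape)  # expect (48842, 15)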
Now we write a function named plotPerColumnDistribution that plots the distribution of each (reasonably low-cardinality) column.
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    # For display purposes, keep columns with between 2 and 49 unique values
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow  # integer division for the subplot grid
    plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80, facecolor='w', edgecolor='k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if not np.issubdtype(type(columnDf.iloc[0]), np.number):
            # Categorical column: bar chart of value counts
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            # Numeric column: histogram
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation=90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)
    plt.show()
plotPerColumnDistribution(data, 10, 3)
Data Cleaning
- Check for missing values in the columns.
- Encode all the categorical columns in the dataset.
- Define the features and the target variable, then split the dataset into training and test sets.
# No NaN values in the dataset (note: some categorical columns use '?' as a placeholder)
data.isnull().sum()
# Encode all the categorical columns
for col_name in data.columns:
    if data[col_name].dtype == 'object':
        data[col_name] = data[col_name].astype('category')
        data[col_name] = data[col_name].cat.codes
# Define features and target variables in the dataset
features = data.loc[:, data.columns != 'income']
target = data['income']
nc = len(data.columns) - 1  # number of feature columns (income is the last column)
traindf, testdf = train_test_split(data, test_size=0.2)
x_data = torch.tensor(traindf.iloc[:, 0:nc].values, dtype=torch.float32)   # training inputs
y_data = torch.tensor(traindf.iloc[:, nc:].values, dtype=torch.float32)    # training targets
xt_data = torch.tensor(testdf.iloc[:, 0:nc].values, dtype=torch.float32)   # test inputs
yt_data = torch.tensor(testdf.iloc[:, nc:].values, dtype=torch.float32)    # test targets
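As a quick optional sanity check, we can confirm the tensor shapes before training:
print(x_data.shape, y_data.shape)    # roughly 80% of the 48,842 rows; 14 features, 1 target
print(xt_data.shape, yt_data.shape)  # the remaining ~20% held out for testing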
Implement logistic regression
Next, we define the model class, which implements logistic regression.
class Model(torch.nn.Module):
    def __init__(self, input_size, num_classes):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(input_size, num_classes)  # a single linear layer

    def forward(self, x):
        y_pred = torch.sigmoid(self.linear(x))  # squash scores into (0, 1) probabilities
        return y_pred

model = Model(14, 1)  # 14 input features, 1 output probability
criterion = torch.nn.BCELoss(reduction='mean')  # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
In the __init__ constructor we instantiate a single nn.Linear module. In the forward function, we accept a tensor of input data and return a tensor of output probabilities; we can use modules defined in the constructor as well as arbitrary tensor operations. We then construct our loss function (binary cross-entropy) and an optimizer (plain SGD).
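To verify the wiring (an optional check), we can push a random batch through the untrained model and confirm the outputs lie between 0 and 1:
with torch.no_grad():
    demo = model(torch.randn(4, 14))  # 4 random illustrative samples, 14 features each
print(demo)  # every value lies in (0, 1) thanks to the sigmoid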
Training and prediction
Let us start the training loop
# Training loop
for epoch in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x_data)
    # Compute and print loss
    loss = criterion(y_pred, y_data)
    print('epoch {}, loss {}'.format(epoch, loss.item()))
    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

#### Test the Model ####
predicted = model(xt_data).detach()  # detach: we only need the values for evaluation
# Threshold at 1.25x the mean predicted probability
# (a data-dependent cutoff used here in place of the usual fixed 0.5)
meanz = 1.25 * predicted.mean()
for cnt in range(len(predicted)):
    if predicted[cnt] > meanz:
        predicted[cnt] = 1
    else:
        predicted[cnt] = 0

total = yt_data.size(0)
correct = (predicted == yt_data).sum()

# Count correct predictions on the negative (income <= 50K) class
TNcorrect = 0
for cnt in range(len(predicted)):
    if (predicted[cnt] == yt_data[cnt]) and (predicted[cnt] == 0):
        TNcorrect = TNcorrect + 1

print('Accuracy of the model %d %%' % (100 * correct // total))
We run 500 iterations; in each, we compute the predicted y by passing x to the model, calculate the loss, zero the gradients, perform a backward pass, and update the weights.
Then we test the model and make predictions. The accuracy we get is about 76%.
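Beyond raw accuracy, inspecting a confusion matrix shows how the errors split between the two classes; here is a minimal sketch with scikit-learn, using the predicted and yt_data tensors from above:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(yt_data.numpy().ravel(), predicted.numpy().ravel())
print(cm)  # rows = actual class, columns = predicted class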
Advantages and disadvantages of logistic regression
Advantages
- The logistic regression model is fast to train.
- It works well on simple datasets.
- We can also use a logistic regression model to predict multiple classes.
- It does not rely on the ordinary least squares assumptions.
- It can handle polytomous data (more than two distinct categories).
Disadvantages
- It cannot model non-linear effects directly; these require manual feature engineering.
- Sometimes it fails to capture complex relationships between variables.
- It requires fairly large datasets for stable results.
- Parameter estimation becomes unstable when the two groups' distributions have little overlap (near-complete separation).
Wrap up the Session
Finally, we have made it to the end of the tutorial. By now you should know:
- what logistic regression is,
- the properties of logistic regression,
- the differences between linear regression and logistic regression,
- the types of logistic regression,
- where to use logistic regression and its assumptions,
- the use cases of logistic regression,
- how to implement logistic regression in PyTorch.
You’ve come a long way in understanding one of the most important areas of machine learning! If you have questions or comments, then please put them in the comments section below.
You can also join our telegram channel to get free cheatsheets, projects, ebooks, study material related to machine learning, deep learning, data science, natural language processing, python programming, r programming, big data, and many more.