Making discrete predictions in Data Science

Image Courtesy – pxhere.com
Algorithms – using the Python language
Much of the professional work we do (creating software for businesses) involves attributes/properties that take a value from a finite universe of possible values: the country of a customer, the payment options that can be offered, the subscription options offered by a website, the status of an exam or test, the number of road accidents in a locality, the number of siblings an individual has, and so on. These are known as categorical variables.
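For readers following along in Python, pandas even has a dedicated dtype for such variables; a minimal sketch (the values below are made up purely for illustration) –

import pandas as pd

# a categorical variable: values drawn from a finite universe
payment = pd.Series(['VISA', 'AMEX', 'VISA', 'Maestro'], dtype='category')
print(payment.cat.categories)   # the finite set of possible values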
In the last dispatch we had a tête-à-tête with the regression algorithm. Regression algorithms are good for continuous variables, which are abundantly observed in nature and sometimes in the professional work we do, e.g. the earnings of a business. In this dispatch we will look at an algorithm that does the same for discrete variables.
Logistic regression is about estimating the parameters of a simple logistic model. A logistic model is used to model binary classes, i.e. binary dependent variables. It can be visualized in the form of a graph as below –

Figure 1 – Visualization of Logistic (Regression) Model
The x-axis represents the independent variable and the y-axis represents the dependent variable, i.e. the probability of occurrence of the event of interest. It isn't always necessary for the x-axis to have both positive and negative values.
Though logistic regression is about finding the probability of a dependent variable, the algorithm by itself does not classify. An implementation of the algorithm can, however, be used for classification by interpreting the probability values it emits.
The binary classes are read off either side of the 0.5 line in this graph: probability values above 0.5 indicate class B, and values below 0.5 indicate class A.
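For readers who want to reproduce a curve like Figure 1 themselves, here is a minimal sketch using matplotlib (an assumed choice of plotting library) –

import numpy as np
import matplotlib.pyplot as plt

# sample the standard logistic function p = 1 / (1 + e^(-x))
x = np.linspace(-6, 6, 200)
p = 1 / (1 + np.exp(-x))

plt.plot(x, p)
plt.axhline(0.5, linestyle='--')   # the 0.5 class boundary
plt.xlabel('independent variable')
plt.ylabel('probability of the event')
plt.show()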
The interesting question that comes up now is what happens when we apply this to use cases with more than two classes. From the examples above, the payment options could be VISA/MasterCard, Maestro, Diners and AMEX cards. So, if we want to determine which of the given options a customer will opt for, depending on the total value of the cart, we still determine the preferred card the same way, but instead of plain logistic regression we apply multinomial logistic regression.
Multinomial logistic regression is an extension of logistic regression. It can be realized either as a single linear model that produces a weighted sum of scores, one per class, or as a set of independent logistic regressions, one fitted per class, whose values indicate how applicable each class is.
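As a quick hedged sketch of how this looks in scikit-learn (the cart values and card labels below are invented purely for illustration) –

from sklearn.linear_model import LogisticRegression

# cart totals as the independent variable, preferred card as the class
X = [[120], [45], [300], [80], [500], [60]]
y = ['VISA', 'Maestro', 'AMEX', 'VISA', 'Diners', 'Maestro']

# scikit-learn fits the multinomial formulation by default for multi-class y
# (older versions exposed the choice explicitly via a multi_class parameter)
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[250]]))   # predicted card for a cart worth 250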
Before we progress further, let us remove one prickly issue which bothers us when we put this in perspective with linear regression. Linear regression is about deriving the numerical parameters of the equation of a line that explains the observed relationship between the independent and dependent variables.
Logistic regression, though, does not seem to describe any such equation between the dependent and independent variables. So the question stands its ground – is it not fairer to call it something other than a regression algorithm?
In defence of still calling it a regression algorithm – the logistic algorithm does determine the relationship between the independent and dependent variables. The relationship, instead of being visualized geometrically as a line, is expressed as a probability via the formula –
l = \log_b \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2
The regression here is involved in translating this equation to the form below and estimating the \beta_i for the relationship –
p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}
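To make the mapping from log-odds to probability concrete, here is a tiny worked sketch with assumed (not fitted) coefficients and the conventional choice of base b = e –

import math

beta0, beta1, beta2 = -1.0, 0.5, 0.25   # assumed for illustration, not fitted
x1, x2 = 2.0, 4.0

log_odds = beta0 + beta1 * x1 + beta2 * x2   # l = log_b(p / (1 - p))
p = 1 / (1 + math.exp(-log_odds))            # the probability form above
print(log_odds, p)                           # 1.0 0.7310...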
Let us now take a sample problem and apply this math (i.e. the logistic algorithm) to experience its capabilities.
Implementation in Python
Let us pick a scenario where we want to determine whether a candidate aspiring to management studies is likely to be picked for an interview at a prestigious overseas university. The independent variables we have for this modelling are – GMAT score, GPA and prior work experience of the candidate.
A peek into the data we have looks like this –

Figure 2 – Sample data of top 6 records
This is just a peek; assume the entire dataset is made available as an Excel file on disk. Proceeding to the next step, we load the relevant Python packages for this modelling. We will require –
pandas – for modelling the dataset as data frames
sklearn – for the actual logistic regression algorithm implementation
seaborn – for visualizing the accuracy of the modelling activity
The following import statements will be required for us to move forward –

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn

We load the data from the Excel file using the following line –

labelledStudentInterview = pd.read_excel(r'C:\dataScience\project\studentInterviewInvitew\labelledDate.xlsx')

We have to prefix the path with 'r' (a raw string) so that special characters like '\' in the text are not treated as escape sequences. Note that read_excel already returns a data frame; the step below simply narrows it down to the columns we need to work with from now on –

dfLabelledStudentInterview = pd.DataFrame(labelledStudentInterview, columns=['gmat', 'gpa', 'work_exp', 'invited'])

Let us segregate the independent and dependent variables

x = dfLabelledStudentInterview[['gmat', 'gpa', 'work_exp']]
Y = dfLabelledStudentInterview['invited']

Next, we will split the data into training and test sets by doing this –

x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.25, random_state=0)

Here we have split the dataset 75/25, where 75% of the data is used for training and 25% for testing.
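A quick sanity check of those proportions (a minimal sketch) –

# confirm the 75/25 split
print(len(x_train), len(x_test))
print(len(x_train) / (len(x_train) + len(x_test)))   # ≈ 0.75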
Now the stage is set for training the logistic regression on the training dataset –

algorithm = LogisticRegression()
algorithm.fit(x_train, Y_train)
Y_pred = algorithm.predict(x_test)
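
Since logistic regression is ultimately about probabilities, it is worth noting that predict merely thresholds the model's probability estimates at 0.5; the raw values can be inspected directly (a minimal sketch) –

# each row holds [P(invited = 0), P(invited = 1)] for one test candidate
probabilities = algorithm.predict_proba(x_test)
print(probabilities[:5])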

Next, we evaluate the correctness of the trained algorithm using the code below –

confusion_matrix = pd.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
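
One caveat: in a plain Python script (as opposed to a notebook) the heatmap may not render until matplotlib is asked to show it, so the following may additionally be needed –

import matplotlib.pyplot as plt

plt.show()   # render the seaborn heatmap window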

This yields something like the following on screen –

Figure 3 – Confusion matrix for algorithm
Accuracy of the algorithm is computed as follows –
Accuracy = (True Positives + True Negatives) / Total
This yields a value of 0.8, i.e. 80% accuracy based on the current training. The number can be derived as –

print('accuracy = ', metrics.accuracy_score(Y_test, Y_pred))
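
Equivalently, the same number can be derived by hand from the confusion matrix, which makes the formula above concrete; a sketch using scikit-learn's metrics –

# rows of the 2x2 matrix are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(Y_test, Y_pred)
tn, fp, fn, tp = cm.ravel()
print('accuracy =', (tp + tn) / (tp + tn + fp + fn))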

For a moment let us take flight a little further ahead and see how we could use the current training to predict the propensity for an invite. First, we persist the trained model to disk –

import pickle

modelFileName = 'studentInvitePrediction.pck'

pickle.dump(algorithm, open(modelFileName, 'wb'))
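
A small design note: the more idiomatic form uses a context manager, which guarantees the file handle is closed even if the dump fails –

with open(modelFileName, 'wb') as f:
    pickle.dump(algorithm, f)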

We can use this file some time later to predict the propensity of an invite based on the scores, as follows –

import pickle
import pandas as pd

modelFileName = 'studentInvitePrediction.pck'
model = pickle.load(open(modelFileName, 'rb'))
# a single-row data frame holding the new candidate's scores
candidate = pd.DataFrame({'gmat': [590], 'gpa': [2.0], 'work_exp': [3]})
studentInvitePossibility = model.predict(candidate)
print(studentInvitePossibility)

The above code prints the predicted class as – 0, i.e. this candidate is not likely to receive an invite.
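If we want the actual propensity (a probability) rather than the hard class label, the fitted model also exposes predict_proba; a minimal sketch reusing the candidate data frame built above –

propensityOfInvite = model.predict_proba(candidate)
print(propensityOfInvite)   # [[P(no invite), P(invite)]] for this candidate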