First Algorithm for Data Science

 

Using the Python language

We have looked at the Python ecosystem and the reasons why we tend to use Python for data science projects. In a data science project, one will easily find oneself picking up tasks from any of the bins depicted in the diagram below. In the true spirit of data science, we have pictured the tasks as bins of activities which are executed over time.

1 – Typical Tasks of data science projects

The first two bins have a dominant domain flavour, i.e. they require an understanding of the domain where data science is practised. For example, if we are attempting to solve a problem related to insurance, the topics in those two bins, such as rate determination, risk quality and, in a larger sense, underwriting and claim practices, must be well understood. We might take a stab at demonstrating those activities in a future dispatch. For this dispatch, however, we focus on Algorithm Development and DevSecOps, the activities where developers will find a natural home.
Most of the algorithms are already implemented and well tuned for production usage in libraries such as SciPy and NumPy. In a typical day at work you will end up calling those library functions. But here is the challenge with that approach – you never see why a particular algorithm is applied, why certain parameters are expected and, more importantly, how those parameters influence the data. To discover that, you end up scouring wiki articles on the algorithm and the documentation supplied by the library publishers.
There is another way to get good at algorithms: implement them yourself as a hobby. This helps you tweak them for purpose, i.e. your particular purpose in using the algorithm.
We start with the simplest of them all in algorithm development – Simple Linear Regression.
Simple Linear Regression
In its simplest form, linear regression is about determining the explained / dependent variable based on the explanatory / independent variable. This algorithm is best suited for quantitative data, which may be discrete or continuous: for example, height, income of a sample populace, or the number of tickets sold during an IPL season. To get started, simple linear regression works with one dependent variable and one independent variable. This kind of relationship is visualised using a scatter plot resembling the following graph, drawn here from randomly generated numbers.

2 – Plot linear relationship between independent and dependent variable
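
A plot like the one above can be reproduced with a quick sketch such as the following. The slope, intercept and noise level used to synthesise the data are arbitrary choices for illustration, not values from any real dataset.

```python
import random

# Synthesise (x, y) pairs with a roughly linear relationship.
# Slope 2.0, intercept 1.0 and the noise level are arbitrary choices.
random.seed(42)
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1.5) for x in xs]
```

Plotting these pairs, for instance with matplotlib's `pyplot.scatter(xs, ys)`, yields a cloud of points hugging a straight line, much like the graph above.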

There are more complex regression scenarios where multiple independent variables yield a single predicted / dependent variable's value; such algorithms are known as multiple linear regression. Regression need not always be linear either: polynomial regression exists, where the relationship discovered is not linear. For now, simple linear regression is about determining y in the following equation –
y=mx+c
m and c are the parts which we anticipate the regression algorithm will help us determine. By convention, x is the independent variable (represented on the horizontal axis) whereas y is the dependent / predicted variable (represented on the vertical axis).
Armed with this mathematical requirement we set to build a program which will help us plot a graph like the one depicted above.
Implementation in Python
Though we have understood the premise of Simple Linear Regression, we have not yet looked at the statistical rigour needed to determine the parts of the equation we intend to estimate. Let us first take a simple approach, known and practised widely – taking the average. To demonstrate, let us take an openly available dataset – the Swedish auto insurance dataset.
Let us convert the text file from the sample dataset to a comma-separated value (CSV) file using Excel. The resulting CSV file can be parsed using the following piece of code in Python –

from csv import reader

def load_csv(csvFile):
    insurance_claims = list()
    with open(csvFile, 'r') as fileInstance:
        data = reader(fileInstance)
        for data_row in data:
            # Skip any empty rows in the file
            if not data_row:
                continue
            insurance_claims.append([int(data_row[0]), float(data_row[1])])
    return insurance_claims
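
As a quick check of the loader, here is a hypothetical usage with a tiny CSV written on the fly; the file name and rows are stand-ins for the sketch, not the full dataset.

```python
from csv import reader

# Write a tiny stand-in CSV so the sketch is self-contained.
with open('sample_claims.csv', 'w') as f:
    f.write('108,392.5\n19,46.2\n13,15.7\n')

def load_csv(csvFile):
    insurance_claims = list()
    with open(csvFile, 'r') as fileInstance:
        data = reader(fileInstance)
        for data_row in data:
            if not data_row:
                continue
            insurance_claims.append([int(data_row[0]), float(data_row[1])])
    return insurance_claims

rows = load_csv('sample_claims.csv')
# rows == [[108, 392.5], [19, 46.2], [13, 15.7]]
```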

Let us do the following to take the average of the claim payments –

claims_data = load_csv('swedish_data.csv')
simple_average = list(map(sum, zip(*claims_data)))[1] / len(claims_data)

The value hovers around 98.187. This value is of no use until we devise a measurement criterion for the algorithm. The measurement criterion for Simple Linear Regression is the Root Mean Square Error (RMSE). Without delving into the mathematics behind RMSE, the following function calculates it from a list of actual values and a list of predicted values –

from math import sqrt

def root_mean_square_error(actual, predicted):
    error_sum = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        error_sum += (prediction_error ** 2)
    error_mean = error_sum / float(len(actual))
    return sqrt(error_mean)
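
As a quick sanity check, the RMSE of a handful of toy numbers (made up here, not from the dataset) can be worked out by hand: square each error, average the squares, then take the square root.

```python
from math import sqrt

actual = [3.0, -0.5, 2.0]
predicted = [2.5, 0.0, 2.1]

# Mean of squared errors, then square root.
squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
rmse = sqrt(sum(squared_errors) / len(actual))
# squared errors are 0.25, 0.25, 0.01, so rmse == sqrt(0.17) ≈ 0.412
```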

By calling this function with the actual data loaded from the CSV and the mean derived above as the prediction, we get an RMSE of about 81. That is a large error, but it is the simplest to explain and calculate. Let us now apply Simple Linear Regression to this situation. To do that we will need the following determination functions –
Mean of the independent and dependent variable values
Variance of independent variable values
Covariance between the independent and dependent values
Estimation of m and c values
We already have the function for determining the error in the prediction of the dependent variable. Let us go through the rest one by one. Mean is simply the average, similar to the evaluation we did earlier. We can convert it to a function as listed here.

def mean(actual):
    return sum(actual) / float(len(actual))

Variance here is the sum of the squared distances of each value in the list from the mean of all values. Let us implement this in Python as follows –

def variance(actual, mean):
    return sum([(i - mean) ** 2 for i in actual])
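
Note that this returns the sum of squared deviations rather than the averaged variance; the missing division by the number of values cancels out later when covariance is divided by variance. A toy check with made-up numbers:

```python
def variance(actual, mean):
    return sum([(i - mean) ** 2 for i in actual])

values = [1, 2, 3, 4, 5]
m = sum(values) / float(len(values))   # mean is 3.0
v = variance(values, m)                # 4 + 1 + 0 + 1 + 4 = 10.0
```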

Covariance describes the relationship between two variables, i.e. how the two variables change together.

def covariance(ind_values, mean_ind, dep_values, mean_dep):
    covar = 0.0
    for i in range(len(ind_values)):
        covar += (ind_values[i] - mean_ind) * (dep_values[i] - mean_dep)
    return covar
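
As with variance, the division by the number of values is omitted here and cancels in the ratio. A toy check with made-up lists: for a perfectly linear pair, covariance divided by the variance of x recovers the slope.

```python
def covariance(ind_values, mean_ind, dep_values, mean_dep):
    covar = 0.0
    for i in range(len(ind_values)):
        covar += (ind_values[i] - mean_ind) * (dep_values[i] - mean_dep)
    return covar

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                       # exactly y = 2x
x_mean = sum(x) / float(len(x))            # 3.0
y_mean = sum(y) / float(len(y))            # 6.0
cov = covariance(x, x_mean, y, y_mean)     # 8 + 2 + 0 + 2 + 8 = 20.0
var_x = sum((i - x_mean) ** 2 for i in x)  # 10.0
# cov / var_x == 2.0, the slope of y = 2x
```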

To estimate the values of m and c, the mathematical literature dictates –
m = covariance(x, y) / variance(x) and c = mean(y) - m × mean(x)
Let us implement this as a function in Python –

def eval_coefficients(actual):
    x = [row[0] for row in actual]
    y = [row[1] for row in actual]
    x_mean, y_mean = mean(x), mean(y)
    m = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    c = y_mean - m * x_mean
    return [m, c]
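
Here is a self-contained sanity check of the estimation step, re-stating the helper functions with the standard least-squares formulas (slope m = covariance / variance, intercept c = mean(y) - m × mean(x)). A made-up dataset that is exactly y = 2x + 1 should give back m = 2 and c = 1.

```python
def mean(values):
    return sum(values) / float(len(values))

def variance(values, m):
    return sum([(v - m) ** 2 for v in values])

def covariance(x, x_mean, y, y_mean):
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

def eval_coefficients(rows):
    x = [row[0] for row in rows]
    y = [row[1] for row in rows]
    x_mean, y_mean = mean(x), mean(y)
    m = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    c = y_mean - m * x_mean
    return [m, c]

data = [[1, 3], [2, 5], [3, 7], [4, 9]]   # exactly y = 2x + 1
m, c = eval_coefficients(data)            # m == 2.0, c == 1.0
```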

Now we have all the ingredients to make predictions. To determine the values of the dependent variable, we will follow the standard practice of splitting the data into train and test sets. We will use the training set to determine the coefficients and the test set to calculate the RMSE, which tells us how well the algorithm performs on data it has not seen.
We split the records for train and test as follows –

def split_records(actual):
    train = list()
    test = list()
    train_ratio = 0.6
    train_last_index = int(len(actual) * train_ratio)
    for row in actual[:train_last_index]:
        train.append(list(row))
    for row in actual[train_last_index:]:
        test.append(list(row))
    return [train, test]
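
A quick check of the split on made-up rows, using a condensed version of the same logic: with ten rows and a 0.6 ratio, six go to train and four to test.

```python
def split_records(actual):
    train_ratio = 0.6   # 60:40 split, as described in the text
    train_last_index = int(len(actual) * train_ratio)
    return [actual[:train_last_index], actual[train_last_index:]]

data = [[i, i * 2.0] for i in range(10)]
train, test = split_records(data)
# len(train) == 6, len(test) == 4
```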

Now we have a function which splits the data into train and test lists in a 60:40 ratio. We intend to determine m and c using the train list and then apply them to the test list to determine the RMSE for the algorithm. We do that by calling the functions we have written so far –

data = load_csv('swedish_data.csv')
algo_data = split_records(data)
coeff = eval_coefficients(algo_data[0])
predictions = list()
for row in algo_data[1]:
    pred_y = coeff[0] * row[0] + coeff[1]
    predictions.append(pred_y)

actual = [row[1] for row in algo_data[1]]
rmse_val = root_mean_square_error(actual, predictions)

When you run this, the RMSE for the Swedish auto insurance dataset comes to about 33.6. That is far better than the RMSE of 81. The Simple Linear Regression algorithm has outperformed the simplest of approaches.
This is the first step towards building more complex algorithms. Happy predicting and inferring on data.