In this post I will share my visuals ,learning and understanding of a dataset and using various algorithms of Machine Learning to analyse it.

Problem :

A Finance company wants to automate the process of loan sanction based on customer details which were entered online by the customer. The dataset and full problem statement can be found here.

1.Reading the Dataset

I have used Python based implementation for this project and I have used python libraries like Pandas and NumPy heavily to solve the problem. I have used Matplotlib for visualisation.
I am using jupyter notebook as my python development environment.

Open your jupyter notebook with the following command on the linux shell

$ jupyter notebook

As you connect with the jupyter notebook open a new .ipnb

import pandas as pd
import numpy as np
import matplotlib as plt
%matplotlib inline
df = pd.read_csv('path_of_file')

This will help you load in all the datasets and ready to use.

2. Meta Knowledge

First of all we need to go through all the headers and features of data which can be done through the use of following commands

df.head(n) ===> It prints out n rows present in dataset
df.sample(n) ===> Shows 'n' random roes present in dataset
df.info() ===> Describe data types present, number of rows, number of columns, memory usage of dataset
df.describe() ===> Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution ,excluding NaN values.
df.isnull().sum() ===> Shows the number of NaN values present in each column in Dataframe
df.column_name.value_counts() ===> Shows the number of unique values present in the column in the DataFrame

All of these techniques help significantly to understand the dataset without actually seeing each cell and value present inside it manually.

Let's have a look at 10 entries of the dataset by using the command:

df.head(10)

Now let us look at the summary of dataset by using the function:

<div>
<br class="Apple-interchange-newline">df.describe()</div>

df.describe()
​

Out[3]:

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	614.000000	614.000000	592.000000	600.00000	564.000000
mean	5403.459283	1621.245798	146.412162	342.00000	0.842199
std	6109.041673	2926.248369	85.587325	65.12041	0.364878
min	150.000000	0.000000	9.000000	12.00000	0.000000
25%	2877.500000	0.000000	100.000000	360.00000	1.000000
50%	3812.500000	1188.500000	128.000000	360.00000	1.000000
75%	5795.000000	2297.250000	168.000000	360.00000	1.000000
max	81000.000000	41667.000000	700.000000	480.00000	1.000000

As we can make out from this statistical data:
There are (614-592) 22 missing values in LoanAmount
There are (614-600) 14 missing values in Loan_Amount_Term
There are (614-564) 50 missing values in Credit_History

3. Visualization of the dataset

In this Loan Prediction Dataset we have 5 columns also called as Dataframes namely
LoanAmount , Loan_Amount_Term, ApplicantIncome, CoapplicantIncome, Credit_History

Using the following commands we can get the histograms and boxplots of ApplicantIncome

df['ApplicantIncome'].hist(bins=50)

df.boxplot(column = 'ApplicantIncome')

As we can see from the visuals there are lot of extreme values present in this dataset which need to be kept in mind.

Now let's check the dependency of ApplicantIncome on education:

df.boxplot(column='ApplicantIncome', by='Education')

Now let's analyse the Credit_History against the Loan status. It is an analysis similar to what we do on a pivot table in MS Excel, against the index of credit history we will check the average value of Loan_Status by considering all 'Y' as 1 and 'N' as 0.

<div>
<br class="Apple-interchange-newline">temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x:x.map({'Y':1,'N':0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
<br></div>









temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x:x.map({'Y':1,'N':0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
​

Frequency Table for Credit History:
0.0     89
1.0    475
Name: Credit_History, dtype: int64

Probability of getting loan for each Credit History class:
                Loan_Status
Credit_History             
0.0                0.078652
1.0                0.795789

This shows that with 0 Credit_History the percentage of getting Loan is 8% whereas with 1 Credit_History the percentage of getting Loan is about 80 %. That means a person with Credit_History deserves Loan Sanction. As can be seen from the below bar graphs.

<div>
<br class="Apple-interchange-newline">import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')

ax2 =fig.add_subplot(122)

ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
temp2.plot(kind='bar')</div>








import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')
​
ax2 =fig.add_subplot(122)
​
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
temp2.plot(kind='bar')

We can also plot the Loan Status based on two Categories viz Credit_History and Gender by using the following Code:

temp4 = pd.crosstab([df['Credit_History'],df['Gender']],df['Loan_Status'])
temp4.plot(kind='bar',stacked=True,color=['red','blue'],grid=False)

We can infer from the graph the ratio of Male with Credit History getting Loan is quite high.

4. Filling out the Missing Data

As we saw previously there are many missing values in our dataset. Let's get a summary of the total number of missing values by using the following function:

df.apply(lambda x: sum(x.isnull()),axis=0)

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
Avg loan status       0
dtype: int64

These missing values can be filled as mean of the values in the column

Consideration of extreme values in distribution of LoanAmount and ApplicantIncome?

As we saw earlier there are certain amount of extreme values which are possible as per the needs or higher education of the person. We can nullify their effect:

5. Building a Predictive Model in Python

Scikit-Learn(sklearn) is the commonly used library in Python for data modelling. Scikit requires all input to be numeric so we need to convert all categorical values into numeric by the following code:

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['Married'].fillna(df['Married'].mode()[0], inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)

from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
df[i] = le.fit_transform(df[i])
df.dtypes

Next we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation score.

#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and accessing performance:
def classification_model(model, data, predictors, outcome):
#Fit the model:
model.fit(data[predictors],data[outcome])

#Make predictions on training set:
predictions = model.predict(data[predictors])

#Print accuracy
accuracy = metrics.accuracy_score(predictions,data[outcome])
print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

#Perform k-fold cross-validation with 5 folds
kf = KFold(data.shape[0], n_folds=5)
error = []
for train, test in kf:
# Filter training data
train_predictors = (data[predictors].iloc[train,:])

# The target we're using to train the algorithm.
train_target = data[outcome].iloc[train]

# Training the algorithm using the predictors and target.
model.fit(train_predictors, train_target)

#Record error from each cross-validation run
error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

#Fit the model again so that it can be refered outside the function:
model.fit(data[predictors],data[outcome])

Logistic Regression

Let's make our first Logistic Regression model. We can take all the variables to build the model but that will make the model complex for understanding specific relations to data and will not provide a general solution.

The chances of getting loan will be higher for:

Applicants having credit history
Applicants with higher applicant-income and co-applicant income
Applicants with higher education
Properties in urban areas with higher growth perspectives

So let's make our first model with Credit_History

outcome_var = 'Loan_Status'

model = LogisticRegression()

predictor_var = ['Credit_History']

classification_model(model, df,predictor_var,outcome_var)

We get output as :

Accuracy : 80.945%
Cross-Validation Score : 80.946%

Decision Tree

Another method for making predictive model which is more accurate than regression model is Decision Trees. As the name suggest it uses a tree-like model of decisions and their possible consequences.

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 81.930%
Cross-Validation Score : 76.656%

Here we can see the accuracy went up but Cross-validation score went down because of over-fitting model.
So we can draw a conclusion that using more sophisticated model does not guarantee better results. Its more important to understand the underlying concepts and make the features better suit the model.

My Coding Journey

Sunday, 3 March 2019

Loan Prediction Analysis

Loan Prediction Analysis