Sunday 3 March 2019

Loan Prediction Analysis

In this post I will share my visuals, learnings and understanding of a dataset, and use various machine learning algorithms to analyse it.

Problem :

A finance company wants to automate loan sanctioning based on the details that customers enter online. The dataset and full problem statement can be found here.


1. Reading the Dataset

I used Python for this project, relying heavily on the Pandas and NumPy libraries to solve the problem and on Matplotlib for visualisation.
I am using Jupyter Notebook as my Python development environment.

Open Jupyter Notebook with the following command in the Linux shell:

$ jupyter notebook

Once you are connected to Jupyter Notebook, open a new .ipynb file:

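import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   # plotting functions live in the pyplot submodule
%matplotlib inline
df = pd.read_csv('path_of_file')  # replace with the path to the downloaded CSV file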

This loads the dataset into the DataFrame df, ready to use.
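As a quick check that loading worked, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The two rows and the Loan_ID values are invented for illustration; only the column names follow the loan dataset.

```python
import io

import pandas as pd

# Stand-in for the real file: two invented rows using the dataset's column names.
csv_text = """Loan_ID,Gender,ApplicantIncome,LoanAmount,Loan_Status
LP000001,Male,5000,130,Y
LP000002,Female,3000,,N
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # the empty LoanAmount cell is read as NaN
```

With the real dataset, the same two print calls on df give a first sanity check of its size and column types.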

2. Meta Knowledge

First, we need to go through the headers and features of the data, which can be done with the following commands:

df.head(n) ===> Shows the first n rows of the dataset
df.sample(n) ===> Shows n random rows of the dataset
df.info() ===> Describes the data types, number of rows, number of columns and memory usage of the dataset
df.describe() ===> Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values
df.isnull().sum() ===> Shows the number of NaN values in each column of the DataFrame
df.column_name.value_counts() ===> Shows how many times each unique value occurs in the column


All of these techniques help significantly in understanding the dataset without manually inspecting every cell and value in it.
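To make the commands above concrete, here is a minimal sketch that runs a few of them on a small made-up frame. The column names mirror the loan dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, only the column names match the loan dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y", "N"],
})

first_rows = toy.head(2)                      # first 2 rows
missing_per_column = toy.isnull().sum()       # number of missing values per column
gender_counts = toy["Gender"].value_counts()  # rows per distinct value, missing excluded

print(first_rows)
print(missing_per_column)
print(gender_counts)
```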

Let's have a look at the first 10 entries of the dataset by using the command:

df.head(10)












Now let us look at the summary of the dataset by using the df.describe() function:


Out[3]:
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000

As we can make out from this statistical data:
There are (614-592) 22 missing values in LoanAmount
There are (614-600) 14 missing values in Loan_Amount_Term
There are (614-564) 50 missing values in Credit_History
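The subtraction used above (total rows minus the count row) can also be done in code. This is a small sketch on invented values, showing that it agrees with isnull().sum().

```python
import numpy as np
import pandas as pd

# Invented values; only the column names follow the loan dataset.
toy = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 0.0, np.nan, np.nan],
})

# count() skips NaN, so total rows minus count gives the missing entries.
missing_from_count = len(toy) - toy.count()

# Counting the NaN values directly gives the same numbers.
missing_direct = toy.isnull().sum()

print(missing_from_count)
```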






3. Visualization of the dataset

In this Loan Prediction dataset the summary above covers 5 numeric columns (each column of a DataFrame is a pandas Series), namely
LoanAmount, Loan_Amount_Term, ApplicantIncome, CoapplicantIncome, Credit_History

Using the following commands we can get a histogram and a box plot of ApplicantIncome:

df['ApplicantIncome'].hist(bins=50)





df.boxplot(column = 'ApplicantIncome')

















As we can see from the visuals, there are a lot of extreme values in this dataset, which need to be kept in mind.
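The points a default box plot draws as outliers are those lying more than 1.5 times the interquartile range beyond the quartiles. A rough sketch of that rule on invented incomes (not the real data) looks like this:

```python
import pandas as pd

# Invented incomes: most are moderate, one is extreme.
income = pd.Series([2500, 3000, 3800, 4200, 5800, 81000], name="ApplicantIncome")

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values above this fence are the ones a default box plot marks as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = income[income > upper_fence]

print(upper_fence)
print(outliers)
```

Applying the same lines to df['ApplicantIncome'] would give a count of the extreme values instead of only a visual impression.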

Now let's check the dependency of ApplicantIncome on education:

df.boxplot(column='ApplicantIncome', by='Education')













Now let's analyse Credit_History against the loan status. This is similar to building a pivot table in MS Excel: for each value of Credit_History we compute the average of Loan_Status, treating every 'Y' as 1 and every 'N' as 0.


Frequency Table for Credit History:
0.0     89
1.0    475
Name: Credit_History, dtype: int64

Probability of getting loan for each Credit History class:
                Loan_Status
Credit_History             
0.0                0.078652
1.0                0.795789
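The post does not show the code behind these two tables. One way to produce tables of this shape, sketched here on invented rows and not necessarily what was run for the output above, is value_counts for the frequencies and pivot_table for the per-class average:

```python
import pandas as pd

# Invented rows; the real frame comes from pd.read_csv.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# Frequency of each credit-history class.
freq = toy["Credit_History"].value_counts(ascending=True)

# Code Y as 1 and N as 0, then average per class: the mean is the approval rate.
coded = toy.assign(Loan_Status=toy["Loan_Status"].map({"Y": 1, "N": 0}))
rate = coded.pivot_table(values="Loan_Status", index="Credit_History", aggfunc="mean")

print("Frequency Table for Credit History:")
print(freq)
print("Probability of getting loan for each Credit History class:")
print(rate)
```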

This shows that with a Credit_History of 0 the chance of getting a loan is about 8%, whereas with a Credit_History of 1 it is about 80%. That means an applicant with a credit history is far more likely to have a loan sanctioned, as can be seen from the bar graphs below.

















We can also plot the loan status against two categories, viz. Credit_History and Gender, by using the following code:

temp4 = pd.crosstab([df['Credit_History'],df['Gender']],df['Loan_Status'])
temp4.plot(kind='bar',stacked=True,color=['red','blue'],grid=False)




We can infer from the graph that the proportion of males with a credit history who get a loan is quite high.
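A stacked bar chart shows counts, so ratios have to be judged by eye. As a sketch on invented rows, passing normalize='index' to the same crosstab call turns each row into shares that sum to 1, which makes the ratio explicit:

```python
import pandas as pd

# Invented rows; only the column names follow the loan dataset.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Gender": ["Male", "Male", "Male", "Female", "Male", "Female", "Female", "Male"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "N", "N", "Y"],
})

# Share of N and Y within each (Credit_History, Gender) group.
shares = pd.crosstab(
    [toy["Credit_History"], toy["Gender"]],
    toy["Loan_Status"],
    normalize="index",
)

print(shares)
```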

4. Filling out the Missing Data

As we saw previously there are many missing values in our dataset. Let's get a summary of the total number of missing values by using the following function:

df.apply(lambda x: sum(x.isnull()),axis=0)


Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
Avg loan status       0
dtype: int64


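The missing values in numeric columns can be filled with the mean of the values in the column; for categorical columns the most frequent value (the mode) can be used instead.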
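A minimal sketch of mean imputation for a numeric column, using invented loan amounts:

```python
import numpy as np
import pandas as pd

# Invented loan amounts with two gaps.
toy = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, np.nan, 180.0]})

# mean() skips NaN, so this is (100 + 140 + 180) / 3.
mean_amount = toy["LoanAmount"].mean()

# Assign the filled column back to the frame.
toy["LoanAmount"] = toy["LoanAmount"].fillna(mean_amount)

print(toy)
```

On the real data the same two lines applied to df['LoanAmount'] would fill its 22 gaps.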



How should we treat the extreme values in the distribution of LoanAmount and ApplicantIncome?

As we saw earlier, there are a number of extreme values. They are plausible, since some applicants need larger loans or earn more because of higher education, so instead of treating them as errors we can dampen their effect.
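The post does not say which technique is used for this. One common option, assumed here purely for illustration, is a log transformation, which compresses large values far more than small ones. The values and the two new column names (LoanAmount_log, TotalIncome_log) are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented values with one extreme applicant.
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 3800, 5800, 81000],
    "CoapplicantIncome": [0, 1200, 2300, 0],
    "LoanAmount": [100.0, 128.0, 168.0, 700.0],
})

# Hypothetical helper columns, not part of the original dataset.
toy["LoanAmount_log"] = np.log(toy["LoanAmount"])
toy["TotalIncome_log"] = np.log(toy["ApplicantIncome"] + toy["CoapplicantIncome"])

print(toy[["LoanAmount", "LoanAmount_log", "TotalIncome_log"]])
```

The ordering of the rows is preserved, but the largest loan is no longer several times the size of the others on the transformed scale.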

5. Building a Predictive Model in Python

Scikit-Learn (sklearn) is a commonly used Python library for data modelling. It requires all inputs to be numeric, so we first fill the missing values with the most frequent value of each column and then convert all categorical values into numbers with the following code:

# Fill missing values with the most frequent value (mode) of each column.
# Self_Employed is included because it is label-encoded below and also has gaps.
# Assigning the result back avoids chained inplace calls, which newer pandas versions warn about.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])


from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes 
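To see what the encoder does to a single column, here is a small sketch on a handful of Property_Area-style labels (the list itself is invented):

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of category labels.
areas = ["Urban", "Rural", "Semiurban", "Urban", "Rural"]

le = LabelEncoder()
codes = le.fit_transform(areas)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = dict(zip(le.classes_, range(len(le.classes_))))

print(codes)
print(mapping)
```

The codes follow the sorted order of the labels, so the numbers identify categories but say nothing about their size or rank.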


Next we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation score. 




#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])

  #Make predictions on training set:
  predictions = model.predict(data[predictors])

  #Print accuracy on the training set
  accuracy = metrics.accuracy_score(data[outcome],predictions)
  print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(n_splits=5)
  scores = []
  for train, test in kf.split(data[predictors]):
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])

    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]

    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)

    #Record accuracy on the held-out fold from each cross-validation run
    scores.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

  print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))

  #Fit the model again so that it can be referred to outside the function:
  model.fit(data[predictors],data[outcome])
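The manual loop makes every step of cross-validation visible. For comparison, scikit-learn's cross_val_score helper performs the same split, fit and score cycle in one call. The sketch below uses synthetic data from make_classification as a stand-in for the loan data, so the printed score is not comparable to the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the loan data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression()
folds = KFold(n_splits=5)   # unshuffled 5-fold split, as in the manual loop
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("Cross-Validation Score : {0:.3%}".format(scores.mean()))
```

Note that cross_val_score works on clones of the model, so the model object itself is left unfitted, which is why the hand-written function fits it once more at the end.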


Logistic Regression

Let's make our first Logistic Regression model. We could use all the variables to build the model, but that would make it harder to understand the specific relations in the data, and the model might overfit instead of providing a general solution.

The chances of getting a loan will be higher for:


  1. Applicants having a credit history
  2. Applicants with higher applicant income and co-applicant income
  3. Applicants with higher education
  4. Properties in urban areas with higher growth prospects

So let's make our first model with Credit_History:

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)

We get the following output:
Accuracy : 80.945%
Cross-Validation Score : 80.946%


Decision Tree

Another method for building a predictive model is the decision tree, which can capture interactions between features that a logistic regression model misses. As the name suggests, it uses a tree-like model of decisions and their possible consequences.

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)


Accuracy : 81.930%
Cross-Validation Score : 76.656%


Here we can see that the accuracy went up but the cross-validation score went down, because the model is over-fitting.
So we can conclude that using a more sophisticated model does not guarantee better results. It's more important to understand the underlying concepts and make the features better suit the model.
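The gap between training accuracy and cross-validation score is easy to reproduce. The sketch below uses synthetic, deliberately noisy data (not the loan data): an unrestricted decision tree memorises the training set, while its score on held-out folds is clearly lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which about 20% of the labels are assigned at random.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)   # no depth limit

# Accuracy on the same rows the tree was trained on.
train_acc = tree.fit(X, y).score(X, y)

# Mean accuracy over 5 held-out folds.
cv_acc = cross_val_score(tree, X, y, cv=5).mean()

print("Accuracy : {0:.3%}".format(train_acc))
print("Cross-Validation Score : {0:.3%}".format(cv_acc))
```

Limiting the tree, for example with the max_depth parameter, usually narrows this gap.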


























