In this post I will share my visuals, learnings, and understanding of a dataset, and show how to analyse it using various Machine Learning algorithms.
Problem :
A finance company wants to automate the process of loan sanctioning based on customer details entered online by the customer. The dataset and full problem statement can be found here.
1. Reading the Dataset
I have used a Python-based implementation for this project, relying heavily on the Pandas and NumPy libraries, with Matplotlib for visualisation.
I am using Jupyter Notebook as my Python development environment.
Open Jupyter Notebook with the following command on the Linux shell:
$ jupyter notebook
Once connected to the Jupyter server, open a new .ipynb file and run:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('path_of_file')
This loads the dataset into a DataFrame, ready to use.
2. Meta Knowledge
First of all we need to go through the headers and features of the data, which can be done with the following commands:
df.head(n) ===> Prints the first n rows of the dataset
df.sample(n) ===> Shows n random rows of the dataset
df.info() ===> Describes the data types present, the number of rows and columns, and the memory usage of the dataset
df.describe() ===> Generates descriptive statistics that summarize the central tendency, dispersion and shape of the dataset's distribution, excluding NaN values
df.isnull().sum() ===> Shows the number of NaN values present in each column of the DataFrame
df.column_name.value_counts() ===> Shows the count of each unique value present in that column of the DataFrame
All of these techniques help significantly in understanding the dataset without having to inspect every cell and value manually.
Let's have a look at the first 10 entries of the dataset by using the command:
df.head(10)
Now let us look at the summary of the dataset by using the function:
df.describe()
As we can make out from this statistical data:
There are (614 - 592) = 22 missing values in LoanAmount
There are (614 - 600) = 14 missing values in Loan_Amount_Term
There are (614 - 564) = 50 missing values in Credit_History
3. Visualization of the dataset
In this Loan Prediction dataset we have 5 numerical columns, namely:
LoanAmount, Loan_Amount_Term, ApplicantIncome, CoapplicantIncome, Credit_History
Using the following commands we can get the histogram and boxplot of ApplicantIncome:
df['ApplicantIncome'].hist(bins=50)
df.boxplot(column = 'ApplicantIncome')
As we can see from the visuals, there are a lot of extreme values in this dataset, which need to be kept in mind.
Now let's check the dependency of ApplicantIncome on education:
df.boxplot(column='ApplicantIncome', by='Education')
Now let's analyse Credit_History against Loan_Status. This is an analysis similar to what we do with a pivot table in MS Excel: for each value of Credit_History we check the average value of Loan_Status, treating every 'Y' as 1 and every 'N' as 0.
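Here is a minimal sketch of that pivot-table style calculation (the helper column Loan_Status_Num and the variable temp1 are names I am introducing purely for illustration; the original notebook may have done this slightly differently):
#Encode Loan_Status as 1/0 and average it for each value of Credit_History
df['Loan_Status_Num'] = df['Loan_Status'].map({'Y': 1, 'N': 0})
temp1 = df.pivot_table(values='Loan_Status_Num', index=['Credit_History'], aggfunc='mean')
print(temp1)
temp1.plot(kind='bar')  #bar graph of the average approval rate per Credit_History value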
This shows that with a Credit_History of 0 the percentage of applicants getting a loan is about 8%, whereas with a Credit_History of 1 it is about 80%. In other words, an applicant with a credit history is far more likely to get a loan sanctioned, as the bar graphs show.
We can also plot Loan_Status split by two categories, Credit_History and Gender, using the following code:
temp4 = pd.crosstab([df['Credit_History'],df['Gender']],df['Loan_Status'])
temp4.plot(kind='bar',stacked=True,color=['red','blue'],grid=False)
We can infer from the graph that the proportion of male applicants with a credit history who get a loan is quite high.
4. Filling out the Missing Data
As we saw previously, there are many missing values in our dataset. Let's get a summary of the total number of missing values using the following function:
df.apply(lambda x: sum(x.isnull()),axis=0)
Loan_ID              0
Gender              13
Married              3
Dependents          15
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
Property_Area        0
Loan_Status          0
Avg loan status      0
dtype: int64
These missing values can be filled with the mean of the values in the respective column.
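For example, here is a sketch of filling the numerical LoanAmount column with its mean (this particular step is my assumption of a reasonable strategy; the original post does not show it explicitly):
#Fill missing LoanAmount values with the column mean
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)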
What about the extreme values in the distributions of LoanAmount and ApplicantIncome?
As we saw earlier, there are a certain number of extreme values, which are plausible given the needs or higher education of the applicant. We can dampen their effect:
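A common way to do this is a log transformation; here is a minimal sketch under that assumption (the derived columns LoanAmount_log, TotalIncome and TotalIncome_log are illustrative names introduced here):
#Log-transform LoanAmount so extreme values have less influence
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
#Combine applicant and co-applicant income, then log-transform the total as well
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])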
5. Building a Predictive Model in Python
Scikit-learn (sklearn) is the most commonly used Python library for data modelling. Scikit-learn requires all inputs to be numeric, so we first fill the remaining missing values and then convert all categorical values into numeric ones with the following code:
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['Married'].fillna(df['Married'].mode()[0], inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True)
df['Self_Employed'].fillna(df['Self_Employed'].mode()[0], inplace=True)  #needed before label-encoding, since Self_Employed has 32 missing values
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes
Next we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation score.
#Import models from the scikit-learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  #For K-fold cross-validation (older versions used sklearn.cross_validation)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors], data[outcome])

    #Make predictions on the training set:
    predictions = model.predict(data[predictors])

    #Print accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    #Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        #Filter the training data
        train_predictors = data[predictors].iloc[train, :]
        #The target we're using to train the algorithm
        train_target = data[outcome].iloc[train]
        #Train the algorithm using the predictors and target
        model.fit(train_predictors, train_target)
        #Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    #Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
Logistic Regression
Let's make our first Logistic Regression model. We could take all the variables to build the model, but that would make the model harder to interpret and would not necessarily generalise well.
The chances of getting a loan should be higher for:
- Applicants having credit history
- Applicants with higher applicant-income and co-applicant income
- Applicants with higher education
- Properties in urban areas with higher growth perspectives
So let's make our first model with Credit_History:
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)
We get the output:
Accuracy : 80.945%
Cross-Validation Score : 80.946%
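As a further illustration (this particular combination of variables is my own, not from the original walkthrough), the same helper function can be reused with more of the variables listed above:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df, predictor_var, outcome_var)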
Decision Tree
Another method for building a predictive model, which typically fits the training data more closely than a regression model, is the decision tree. As the name suggests, it uses a tree-like model of decisions and their possible consequences.
model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)
Accuracy : 81.930%
Cross-Validation Score : 76.656%
Here we can see that the accuracy went up but the cross-validation score went down, because the model is over-fitting.
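One way to rein in that over-fitting (a sketch of my own, not a step from the original post) is to constrain the tree, for example by limiting its depth and the minimum number of samples per leaf:
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=25)
classification_model(model, df, predictor_var, outcome_var)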
So we can conclude that using a more sophisticated model does not guarantee better results. It is more important to understand the underlying concepts and to engineer features that better suit the model.