Introduction¶

Apart from applying the techniques in the EDA module,I will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.nce the tendency to default.

Business Understanding¶

The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it to their advantage by becoming a defaulter. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.

When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,

All other cases: All other cases when the payment is paid on time.

When a client applies for a loan, there are four types of decisions that could be taken by the client/company):

Approved: The Company has approved loan Application

Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client, he received worse pricing which he did not want.

Refused: The company had rejected the loan (because the client does not meet their requirements etc.).

Unused offer: Loan has been cancelled by the client but at different stages of the process.

In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.

Business Objectives¶

This case study aims to identify patterns which indicate if a client has difficulty paying their instalments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

To develop your understanding of the domain, you are advised to independently research a little about risk analytics - understanding the types of variables and their significance should be enough.

This dataset has 3 files as explained below:

  1. 'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.

  2. 'previous_application.csv' contains information about the client’s previous loan data. It contains the data on whether the previous application had been Approved, Cancelled, Refused or Unused offer.

  3. 'columns_description.csv' is data dictionary which describes the meaning of the variables.

Table Of Content¶

  1. Reading a Files
    • Application Data
    • Previous Application Data
  2. Data Cleaning On Application Data
    • Checking Application Data Null Values
    • Handling Outlier For Application Data
  3. Data Visualization
    • UNIVARIATE ANALYSIS
      • Plotting Categorical Columns
      • Insights From Categorical column
      • Plotting Numerical Columns
    • Bivariate & Multivariate Analysis
      • Insights From Analysis
  4. Data Cleaning On Previous Application Data
  5. Data Visualization on Previous Application Data
    • Insights On Previous Application
  6. Merging the application Data and previous application data
  7. Conclusion
In [6]:
# Importing important libraries
import pandas as pd , numpy as np
import matplotlib.pyplot as plt
import seaborn as sns ; sns.set(color_codes = True)
import warnings 
warnings.filterwarnings("ignore")

Reading a Files¶

Application Data¶

In [9]:
# Reading Application Dataset
application_data = pd.read_csv("application_data.csv")
In [10]:
# To Read All Rows And Columns
pd.set_option("display.max_columns",None)
#pd.set_option("display.max_rows",None)
In [11]:
# Fetching Firat Five Columns
application_data.head()
Out[11]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
In [12]:
#shape of the data
application_data.shape
Out[12]:
(307511, 122)
In [13]:
# the info of all the columns
application_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB

Data Cleaning On Application Data¶

Checking Application Data Null Values¶

lets check the missing value one by one

In [16]:
# shows how much percentage of data are null
(application_data.isnull().sum()/application_data.shape[0]*100).sort_values(ascending = False)
Out[16]:
COMMONAREA_MEDI             69.872297
COMMONAREA_AVG              69.872297
COMMONAREA_MODE             69.872297
NONLIVINGAPARTMENTS_MODE    69.432963
NONLIVINGAPARTMENTS_AVG     69.432963
                              ...    
NAME_HOUSING_TYPE            0.000000
NAME_FAMILY_STATUS           0.000000
NAME_EDUCATION_TYPE          0.000000
NAME_INCOME_TYPE             0.000000
SK_ID_CURR                   0.000000
Length: 122, dtype: float64
In [17]:
# Lets See the columns who has more than 40 % of data with Null values 
cut_off=40
cols_to_drop=list(application_data.columns[100*application_data.isnull().mean()>cut_off])
print(cols_to_drop)
['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
In [18]:
len(cols_to_drop)
Out[18]:
49
As we can see there are 49 columns in application dataset which has more than 40% of data has NULL values, so we need to drop those columns are not required for analysis¶
In [20]:
application_data.drop(columns =cols_to_drop, axis=1, inplace = True )
In [21]:
application_data.shape
Out[21]:
(307511, 73)
In [22]:
application_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 73 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   SK_ID_CURR                   307511 non-null  int64  
 1   TARGET                       307511 non-null  int64  
 2   NAME_CONTRACT_TYPE           307511 non-null  object 
 3   CODE_GENDER                  307511 non-null  object 
 4   FLAG_OWN_CAR                 307511 non-null  object 
 5   FLAG_OWN_REALTY              307511 non-null  object 
 6   CNT_CHILDREN                 307511 non-null  int64  
 7   AMT_INCOME_TOTAL             307511 non-null  float64
 8   AMT_CREDIT                   307511 non-null  float64
 9   AMT_ANNUITY                  307499 non-null  float64
 10  AMT_GOODS_PRICE              307233 non-null  float64
 11  NAME_TYPE_SUITE              306219 non-null  object 
 12  NAME_INCOME_TYPE             307511 non-null  object 
 13  NAME_EDUCATION_TYPE          307511 non-null  object 
 14  NAME_FAMILY_STATUS           307511 non-null  object 
 15  NAME_HOUSING_TYPE            307511 non-null  object 
 16  REGION_POPULATION_RELATIVE   307511 non-null  float64
 17  DAYS_BIRTH                   307511 non-null  int64  
 18  DAYS_EMPLOYED                307511 non-null  int64  
 19  DAYS_REGISTRATION            307511 non-null  float64
 20  DAYS_ID_PUBLISH              307511 non-null  int64  
 21  FLAG_MOBIL                   307511 non-null  int64  
 22  FLAG_EMP_PHONE               307511 non-null  int64  
 23  FLAG_WORK_PHONE              307511 non-null  int64  
 24  FLAG_CONT_MOBILE             307511 non-null  int64  
 25  FLAG_PHONE                   307511 non-null  int64  
 26  FLAG_EMAIL                   307511 non-null  int64  
 27  OCCUPATION_TYPE              211120 non-null  object 
 28  CNT_FAM_MEMBERS              307509 non-null  float64
 29  REGION_RATING_CLIENT         307511 non-null  int64  
 30  REGION_RATING_CLIENT_W_CITY  307511 non-null  int64  
 31  WEEKDAY_APPR_PROCESS_START   307511 non-null  object 
 32  HOUR_APPR_PROCESS_START      307511 non-null  int64  
 33  REG_REGION_NOT_LIVE_REGION   307511 non-null  int64  
 34  REG_REGION_NOT_WORK_REGION   307511 non-null  int64  
 35  LIVE_REGION_NOT_WORK_REGION  307511 non-null  int64  
 36  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64  
 37  REG_CITY_NOT_WORK_CITY       307511 non-null  int64  
 38  LIVE_CITY_NOT_WORK_CITY      307511 non-null  int64  
 39  ORGANIZATION_TYPE            307511 non-null  object 
 40  EXT_SOURCE_2                 306851 non-null  float64
 41  EXT_SOURCE_3                 246546 non-null  float64
 42  OBS_30_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 43  DEF_30_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 44  OBS_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 45  DEF_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 46  DAYS_LAST_PHONE_CHANGE       307510 non-null  float64
 47  FLAG_DOCUMENT_2              307511 non-null  int64  
 48  FLAG_DOCUMENT_3              307511 non-null  int64  
 49  FLAG_DOCUMENT_4              307511 non-null  int64  
 50  FLAG_DOCUMENT_5              307511 non-null  int64  
 51  FLAG_DOCUMENT_6              307511 non-null  int64  
 52  FLAG_DOCUMENT_7              307511 non-null  int64  
 53  FLAG_DOCUMENT_8              307511 non-null  int64  
 54  FLAG_DOCUMENT_9              307511 non-null  int64  
 55  FLAG_DOCUMENT_10             307511 non-null  int64  
 56  FLAG_DOCUMENT_11             307511 non-null  int64  
 57  FLAG_DOCUMENT_12             307511 non-null  int64  
 58  FLAG_DOCUMENT_13             307511 non-null  int64  
 59  FLAG_DOCUMENT_14             307511 non-null  int64  
 60  FLAG_DOCUMENT_15             307511 non-null  int64  
 61  FLAG_DOCUMENT_16             307511 non-null  int64  
 62  FLAG_DOCUMENT_17             307511 non-null  int64  
 63  FLAG_DOCUMENT_18             307511 non-null  int64  
 64  FLAG_DOCUMENT_19             307511 non-null  int64  
 65  FLAG_DOCUMENT_20             307511 non-null  int64  
 66  FLAG_DOCUMENT_21             307511 non-null  int64  
 67  AMT_REQ_CREDIT_BUREAU_HOUR   265992 non-null  float64
 68  AMT_REQ_CREDIT_BUREAU_DAY    265992 non-null  float64
 69  AMT_REQ_CREDIT_BUREAU_WEEK   265992 non-null  float64
 70  AMT_REQ_CREDIT_BUREAU_MON    265992 non-null  float64
 71  AMT_REQ_CREDIT_BUREAU_QRT    265992 non-null  float64
 72  AMT_REQ_CREDIT_BUREAU_YEAR   265992 non-null  float64
dtypes: float64(20), int64(41), object(12)
memory usage: 171.3+ MB
In [23]:
application_data.head()
Out[23]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 NaN 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
In [24]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[24]:
DAYS_LAST_PHONE_CHANGE         0.000325
CNT_FAM_MEMBERS                0.000650
AMT_ANNUITY                    0.003902
AMT_GOODS_PRICE                0.090403
EXT_SOURCE_2                   0.214626
DEF_60_CNT_SOCIAL_CIRCLE       0.332021
OBS_60_CNT_SOCIAL_CIRCLE       0.332021
DEF_30_CNT_SOCIAL_CIRCLE       0.332021
OBS_30_CNT_SOCIAL_CIRCLE       0.332021
NAME_TYPE_SUITE                0.420148
AMT_REQ_CREDIT_BUREAU_QRT     13.501631
AMT_REQ_CREDIT_BUREAU_HOUR    13.501631
AMT_REQ_CREDIT_BUREAU_DAY     13.501631
AMT_REQ_CREDIT_BUREAU_WEEK    13.501631
AMT_REQ_CREDIT_BUREAU_MON     13.501631
AMT_REQ_CREDIT_BUREAU_YEAR    13.501631
EXT_SOURCE_3                  19.825307
OCCUPATION_TYPE               31.345545
dtype: float64

Handling Null Columns¶

1. Columns analysis for handling null values For credit Bureau Enquiry¶

  • AMT_REQ_CREDIT_BUREAU_HOUR
  • AMT_REQ_CREDIT_BUREAU_DAY
  • AMT_REQ_CREDIT_BUREAU_WEEK
  • AMT_REQ_CREDIT_BUREAU_MON
  • AMT_REQ_CREDIT_BUREAU_QRT
  • AMT_REQ_CREDIT_BUREAU_YEAR
In [27]:
#Number of enquiries to Credit Bureau about the client before submitting the application 
application_data[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']]
Out[27]:
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
307506 NaN NaN NaN NaN NaN NaN
307507 NaN NaN NaN NaN NaN NaN
307508 1.0 0.0 0.0 1.0 0.0 1.0
307509 0.0 0.0 0.0 0.0 0.0 0.0
307510 0.0 0.0 0.0 2.0 0.0 1.0

307511 rows × 6 columns

In [28]:
application_data['AMT_REQ_CREDIT_BUREAU_HOUR'].describe()
Out[28]:
count    265992.000000
mean          0.006402
std           0.083849
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           4.000000
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
In [29]:
application_data[['AMT_REQ_CREDIT_BUREAU_HOUR']].aggregate(['mean','median'])
Out[29]:
AMT_REQ_CREDIT_BUREAU_HOUR
mean 0.006402
median 0.000000

Going deep down on the credit enquiry I get to know why credit equiries has been done applying for the loan application

What is Credit Enquiry ? (research has been done from google)¶

A credit inquiry is a request for an individual's credit report information from a credit bureau. credit enquiry can be made by financial institutes, individuals, and Company. A credit inquiry is a credit check. Inquiries happen when there is a legally permitted request to see your credit report from a company or person.

The columns are inportant for the analysis as we can get to know already inquiry has been done from the credit bureau or not we can replace the NAN value with median in all these 6 Columns

In [31]:
fill_na = application_data[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']]
In [32]:
for col in fill_na:
    med = application_data[col].median()
    application_data[col] = application_data[col].fillna(med)
In [33]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[33]:
DAYS_LAST_PHONE_CHANGE       0.000325
CNT_FAM_MEMBERS              0.000650
AMT_ANNUITY                  0.003902
AMT_GOODS_PRICE              0.090403
EXT_SOURCE_2                 0.214626
OBS_30_CNT_SOCIAL_CIRCLE     0.332021
DEF_30_CNT_SOCIAL_CIRCLE     0.332021
OBS_60_CNT_SOCIAL_CIRCLE     0.332021
DEF_60_CNT_SOCIAL_CIRCLE     0.332021
NAME_TYPE_SUITE              0.420148
EXT_SOURCE_3                19.825307
OCCUPATION_TYPE             31.345545
dtype: float64

2. Handling Null Values For Days Past Dus¶

  • OBS_30_CNT_SOCIAL_CIRCLE
  • DEF_30_CNT_SOCIAL_CIRCLE
  • OBS_60_CNT_SOCIAL_CIRCLE
  • DEF_60_CNT_SOCIAL_CIRCLECLE

Reserach is done from google

DPD is a significant indicator of the financial behaviour of the borrower. Late payment of the dues can affect the CIBIL report in more ways than one. DPD indicates the number of days by which the payment is delayed. In case the applicant has missed any payments in the past, the DPD section reports the details of the delay. It depicts how many days the delay was made and in which month did it happen.

In [36]:
application_data.loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
Out[36]:
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE
0 2.0 2.0 2.0 2.0
1 1.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0
3 2.0 0.0 2.0 0.0
4 0.0 0.0 0.0 0.0
... ... ... ... ...
307506 0.0 0.0 0.0 0.0
307507 0.0 0.0 0.0 0.0
307508 6.0 0.0 6.0 0.0
307509 0.0 0.0 0.0 0.0
307510 0.0 0.0 0.0 0.0

307511 rows × 4 columns

In [37]:
application_data[application_data['OBS_30_CNT_SOCIAL_CIRCLE'] == 30.0].loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
Out[37]:
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE
169517 30.0 0.0 30.0 0.0
280641 30.0 0.0 29.0 0.0
In [38]:
#pd.set_option("display.max_rows",None)
application_data[application_data['OBS_30_CNT_SOCIAL_CIRCLE'].isnull()].loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
Out[38]:
OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE
68 NaN NaN NaN NaN
394 NaN NaN NaN NaN
397 NaN NaN NaN NaN
457 NaN NaN NaN NaN
1042 NaN NaN NaN NaN
... ... ... ... ...
305526 NaN NaN NaN NaN
305623 NaN NaN NaN NaN
305641 NaN NaN NaN NaN
305642 NaN NaN NaN NaN
307402 NaN NaN NaN NaN

1021 rows × 4 columns

As we can observe from the above two table

  • First one shows that there are people who has taken the loan and have difficult to repay the loan amount with 30 past due days
  • seconed one shows thta the nan is for all the four columns, means there no due days for any of the rows
    so observing the data we can replace the NaN values with 0 , assuming the client has not done any default or observed in these 30 and 60 DPD time Frame
In [40]:
fill_na_2 = application_data[['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE']]
In [41]:
for col in fill_na_2:
    application_data[col].fillna(0,inplace=True)
In [42]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[42]:
DAYS_LAST_PHONE_CHANGE     0.000325
CNT_FAM_MEMBERS            0.000650
AMT_ANNUITY                0.003902
AMT_GOODS_PRICE            0.090403
EXT_SOURCE_2               0.214626
NAME_TYPE_SUITE            0.420148
EXT_SOURCE_3              19.825307
OCCUPATION_TYPE           31.345545
dtype: float64

3. DAYS_LAST_PHONE_CHANGE¶

By seeing the data set in DAYS_LAST_PHONE_CHANGE contains negative values which is not correct need to change in absolute and days cannot be negative fill Nan valus with mode

In [44]:
application_data[['DAYS_LAST_PHONE_CHANGE']]
Out[44]:
DAYS_LAST_PHONE_CHANGE
0 -1134.0
1 -828.0
2 -815.0
3 -617.0
4 -1106.0
... ...
307506 -273.0
307507 0.0
307508 -1909.0
307509 -322.0
307510 -787.0

307511 rows × 1 columns

In [45]:
# converting from days to month first converting to abs value to make all number to positive
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'].abs()
In [46]:
application_data['DAYS_LAST_PHONE_CHANGE'].fillna(application_data['DAYS_LAST_PHONE_CHANGE'].mode()[0],inplace=True)

Lets Standardise DAYS_LAST_PHONE_CHANGE Change Days to year and change the data tyope float to int

In [48]:
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'].astype(int)
In [49]:
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'] //365
In [50]:
application_data['YEAR_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE']
In [51]:
application_data.drop(columns = 'DAYS_LAST_PHONE_CHANGE',axis=1, inplace=True)
In [52]:
application_data.shape
Out[52]:
(307511, 73)
In [53]:
application_data.head()
Out[53]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 3
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 NaN 1.0 0.0 1.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 NaN 2.0 0.0 2.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 NaN 0.0 0.0 0.0 0.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 3

4. CNT_FAM_MEMBERS¶

First of all count of family numbers is in float while seeing the data for count of family members where shoukd be whle numbers data type should be integer Fill the NaN values with median

In [55]:
application_data[['CNT_FAM_MEMBERS']]
Out[55]:
CNT_FAM_MEMBERS
0 1.0
1 2.0
2 1.0
3 2.0
4 1.0
... ...
307506 1.0
307507 1.0
307508 1.0
307509 2.0
307510 2.0

307511 rows × 1 columns

In [56]:
#fillinf na with family members
application_data["CNT_FAM_MEMBERS"] =application_data["CNT_FAM_MEMBERS"].replace(np.NaN, application_data["CNT_FAM_MEMBERS"].median())
In [57]:
application_data['CNT_FAM_MEMBERS'].isnull().sum()
Out[57]:
0

changing datatype from float to int CNT_FAM_MEMBERS

In [59]:
application_data['CNT_FAM_MEMBERS'] =  application_data['CNT_FAM_MEMBERS'].astype(int)
In [60]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[60]:
AMT_ANNUITY         0.003902
AMT_GOODS_PRICE     0.090403
EXT_SOURCE_2        0.214626
NAME_TYPE_SUITE     0.420148
EXT_SOURCE_3       19.825307
OCCUPATION_TYPE    31.345545
dtype: float64

5. EXT_SOURCE_2 and EXT_SOURCE_3¶

By seeing data of both the columns seems it can be an additional source for the credit score and can valuable for the analysis further. Dont know extacly what it is but we can remove the NaN values with mean or median

In [62]:
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']]
Out[62]:
EXT_SOURCE_2 EXT_SOURCE_3
0 0.262949 0.139376
1 0.622246 NaN
2 0.555912 0.729567
3 0.650442 NaN
4 0.322738 NaN
... ... ...
307506 0.681632 NaN
307507 0.115992 NaN
307508 0.535722 0.218859
307509 0.514163 0.661024
307510 0.708569 0.113922

307511 rows × 2 columns

In [63]:
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']].aggregate(['mean','median'])
Out[63]:
EXT_SOURCE_2 EXT_SOURCE_3
mean 0.514393 0.510853
median 0.565961 0.535276
In [64]:
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']].describe()
Out[64]:
EXT_SOURCE_2 EXT_SOURCE_3
count 3.068510e+05 246546.000000
mean 5.143927e-01 0.510853
std 1.910602e-01 0.194844
min 8.173617e-08 0.000527
25% 3.924574e-01 0.370650
50% 5.659614e-01 0.535276
75% 6.636171e-01 0.669057
max 8.549997e-01 0.896010

As we can see the description above there is no major difference between mean and max we can either go for mean or we can for median. it is better to change the NaN value with median

In [66]:
application_data['EXT_SOURCE_2'].fillna(application_data['EXT_SOURCE_2'].median(), inplace=True)
In [67]:
application_data['EXT_SOURCE_3'].fillna(application_data['EXT_SOURCE_3'].median(), inplace=True)
In [68]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[68]:
AMT_ANNUITY         0.003902
AMT_GOODS_PRICE     0.090403
NAME_TYPE_SUITE     0.420148
OCCUPATION_TYPE    31.345545
dtype: float64

5. AMT_ANNUITY -- Numerical Variable¶

In [70]:
application_data[['AMT_ANNUITY']]
Out[70]:
AMT_ANNUITY
0 24700.5
1 35698.5
2 6750.0
3 29686.5
4 21865.5
... ...
307506 27558.0
307507 12001.5
307508 29979.0
307509 20205.0
307510 49117.5

307511 rows × 1 columns

In [71]:
application_data[['AMT_ANNUITY']].describe()
Out[71]:
AMT_ANNUITY
count 307499.000000
mean 27108.573909
std 14493.737315
min 1615.500000
25% 16524.000000
50% 24903.000000
75% 34596.000000
max 258025.500000
In [72]:
sns.boxplot(application_data['AMT_ANNUITY'])
Out[72]:
<Axes: ylabel='AMT_ANNUITY'>
No description has been provided for this image

As we can see both description and boxplot visualization there are high number of outliers , so we can fill the Nan values with median

In [74]:
application_data['AMT_ANNUITY'].fillna(application_data['AMT_ANNUITY'].median(),inplace=True)
In [75]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[75]:
AMT_GOODS_PRICE     0.090403
NAME_TYPE_SUITE     0.420148
OCCUPATION_TYPE    31.345545
dtype: float64

6. AMT_GOODS_PRICE -- Numerical Variable¶

In [77]:
application_data[['AMT_GOODS_PRICE']]
Out[77]:
AMT_GOODS_PRICE
0 351000.0
1 1129500.0
2 135000.0
3 297000.0
4 513000.0
... ...
307506 225000.0
307507 225000.0
307508 585000.0
307509 319500.0
307510 675000.0

307511 rows × 1 columns

In [78]:
application_data['AMT_GOODS_PRICE'].describe()
Out[78]:
count    3.072330e+05
mean     5.383962e+05
std      3.694465e+05
min      4.050000e+04
25%      2.385000e+05
50%      4.500000e+05
75%      6.795000e+05
max      4.050000e+06
Name: AMT_GOODS_PRICE, dtype: float64
In [79]:
sns.boxplot(application_data['AMT_GOODS_PRICE'])
Out[79]:
<Axes: ylabel='AMT_GOODS_PRICE'>
No description has been provided for this image

As we can see both description and boxplot visualization there are high number of outliers , so we can fill the Nan values with median

In [81]:
application_data['AMT_GOODS_PRICE'].fillna(application_data['AMT_GOODS_PRICE'].median(),inplace=True)
In [82]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[82]:
NAME_TYPE_SUITE     0.420148
OCCUPATION_TYPE    31.345545
dtype: float64

7. NAME_TYPE_SUITE -- Categorical Variable¶

In [84]:
application_data[['NAME_TYPE_SUITE']]
Out[84]:
NAME_TYPE_SUITE
0 Unaccompanied
1 Family
2 Unaccompanied
3 Unaccompanied
4 Unaccompanied
... ...
307506 Unaccompanied
307507 Unaccompanied
307508 Unaccompanied
307509 Unaccompanied
307510 Unaccompanied

307511 rows × 1 columns

In [85]:
application_data['NAME_TYPE_SUITE'].value_counts()
Out[85]:
NAME_TYPE_SUITE
Unaccompanied      248526
Family              40149
Spouse, partner     11370
Children             3267
Other_B              1770
Other_A               866
Group of people       271
Name: count, dtype: int64
In [86]:
# Most occcured 
application_data['NAME_TYPE_SUITE'].mode()[0]
Out[86]:
'Unaccompanied'
In [87]:
# since ita an categorical variable flling NaN values with mode
application_data['NAME_TYPE_SUITE'].fillna(application_data['NAME_TYPE_SUITE'].mode()[0],inplace=True)

8. OCCUPATION_TYPE -- Categorical Variable¶

In [89]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[89]:
OCCUPATION_TYPE    31.345545
dtype: float64
In [90]:
application_data['OCCUPATION_TYPE'].value_counts()
Out[90]:
OCCUPATION_TYPE
Laborers                 55186
Sales staff              32102
Core staff               27570
Managers                 21371
Drivers                  18603
High skill tech staff    11380
Accountants               9813
Medicine staff            8537
Security staff            6721
Cooking staff             5946
Cleaning staff            4653
Private service staff     2652
Low-skill Laborers        2093
Waiters/barmen staff      1348
Secretaries               1305
Realty agents              751
HR staff                   563
IT staff                   526
Name: count, dtype: int64
In [91]:
# As we can see OCCUPATION_TYPE has 31 % and most of them are laboerers it will not be good approach to fill missing values with mode of that column 
sns.countplot(application_data['OCCUPATION_TYPE'])
Out[91]:
<Axes: xlabel='count', ylabel='OCCUPATION_TYPE'>
No description has been provided for this image
In [92]:
# Lets fill the missing values with others as it will good for the analysis
application_data['OCCUPATION_TYPE'].fillna('Others',inplace=True)
In [93]:
application_data.head()
Out[93]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 3
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 0.535276 1.0 0.0 1.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 0.535276 2.0 0.0 2.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 0.535276 0.0 0.0 0.0 0.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 3

There are some more anomalies in data i.e. Days columns seems to be negative values and days cannot be in negative to need to change it into abs then we can standardise day into Year

In [95]:
for col in application_data:
    if col.startswith("DAYS"):
        print(col)
    
DAYS_BIRTH
DAYS_EMPLOYED
DAYS_REGISTRATION
DAYS_ID_PUBLISH
In [96]:
# now leats chage data type of DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE float to integer
application_data['YEAR_REGISTRATION'] = application_data['DAYS_REGISTRATION'].astype(int)
In [97]:
application_data['DAYS_BIRTH'] = application_data['DAYS_BIRTH'].abs()
application_data['DAYS_EMPLOYED'] = application_data['DAYS_EMPLOYED'].abs()
application_data['YEAR_REGISTRATION'] = application_data['YEAR_REGISTRATION'].abs()
application_data['DAYS_ID_PUBLISH'] = application_data['DAYS_ID_PUBLISH'].abs()

Lets add new column to change the day_birth as Age and DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH to Number of Years

In [99]:
application_data['AGE'] = application_data['DAYS_BIRTH']//365
application_data['YEAR_EMPLOYED'] = application_data['DAYS_EMPLOYED']//365
application_data['YEAR_REGISTRATION'] = application_data['YEAR_REGISTRATION']//365
application_data['YEAR_ID_PUBLISH'] = application_data['DAYS_ID_PUBLISH']//365

Note: All days column are now considered as number of years¶

Lets Drop all Days Columns

In [101]:
application_data.drop(columns = "DAYS_BIRTH",axis=1,inplace=True)
application_data.drop(columns = "DAYS_EMPLOYED",axis=1,inplace=True)
application_data.drop(columns = "DAYS_REGISTRATION",axis=1,inplace=True)
application_data.drop(columns = "DAYS_ID_PUBLISH",axis=1,inplace=True)
In [102]:
application_data.shape
Out[102]:
(307511, 73)
In [103]:
application_data['NAME_EDUCATION_TYPE'].unique()
Out[103]:
array(['Secondary / secondary special', 'Higher education',
       'Incomplete higher', 'Lower secondary', 'Academic degree'],
      dtype=object)
In [104]:
application_data['NAME_FAMILY_STATUS'].unique()
Out[104]:
array(['Single / not married', 'Married', 'Civil marriage', 'Widow',
       'Separated', 'Unknown'], dtype=object)
In [105]:
application_data['NAME_HOUSING_TYPE'].unique()
Out[105]:
array(['House / apartment', 'Rented apartment', 'With parents',
       'Municipal apartment', 'Office apartment', 'Co-op apartment'],
      dtype=object)
In [106]:
application_data['ORGANIZATION_TYPE'].unique()
Out[106]:
array(['Business Entity Type 3', 'School', 'Government', 'Religion',
       'Other', 'XNA', 'Electricity', 'Medicine',
       'Business Entity Type 2', 'Self-employed', 'Transport: type 2',
       'Construction', 'Housing', 'Kindergarten', 'Trade: type 7',
       'Industry: type 11', 'Military', 'Services', 'Security Ministries',
       'Transport: type 4', 'Industry: type 1', 'Emergency', 'Security',
       'Trade: type 2', 'University', 'Transport: type 3', 'Police',
       'Business Entity Type 1', 'Postal', 'Industry: type 4',
       'Agriculture', 'Restaurant', 'Culture', 'Hotel',
       'Industry: type 7', 'Trade: type 3', 'Industry: type 3', 'Bank',
       'Industry: type 9', 'Insurance', 'Trade: type 6',
       'Industry: type 2', 'Transport: type 1', 'Industry: type 12',
       'Mobile', 'Trade: type 1', 'Industry: type 5', 'Industry: type 10',
       'Legal Services', 'Advertising', 'Trade: type 5', 'Cleaning',
       'Industry: type 13', 'Trade: type 4', 'Telecom',
       'Industry: type 8', 'Realtor', 'Industry: type 6'], dtype=object)

As we can see above NAME_EDUCATION_TYPE has education Secondary / secondary special we can give it to one number Secondary , NAME_FAMILY_STATUS has Single / not married we can give Single and NAME_HOUSING_TYPE has House / apartment we can one name as House

In [108]:
application_data['NAME_EDUCATION_TYPE'].replace("Secondary / secondary special","Secondary",inplace=True)
In [109]:
application_data[['NAME_EDUCATION_TYPE']]
Out[109]:
NAME_EDUCATION_TYPE
0 Secondary
1 Higher education
2 Secondary
3 Secondary
4 Secondary
... ...
307506 Secondary
307507 Secondary
307508 Higher education
307509 Secondary
307510 Higher education

307511 rows × 1 columns

In [110]:
application_data['NAME_FAMILY_STATUS'].replace("Single / not married","Single",inplace=True)
application_data['NAME_HOUSING_TYPE'].replace("House / apartment","House",inplace=True)
In [111]:
application_data.head()
Out[111]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary Single House 0.018801 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 3 9 25 1 5
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 0.535276 1.0 0.0 1.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary Single House 0.010032 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2 11 52 0 6
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary Civil marriage House 0.008019 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 0.535276 2.0 0.0 2.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1 26 52 8 6
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary Single House 0.028663 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 0.535276 0.0 0.0 0.0 0.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 3 11 54 8 9
In [112]:
application_data[application_data['ORGANIZATION_TYPE'] == 'XNA']
Out[112]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH
8 100011 0 Cash loans F N Y 0 112500.000 1019610.0 33826.5 913500.0 Children Pensioner Secondary Married House 0.018634 1 0 0 1 0 0 Others 2 2 2 WEDNESDAY 14 0 0 0 0 0 0 XNA 0.205747 0.751724 1.0 0.0 1.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 0 20 55 1000 9
11 100015 0 Cash loans F N Y 0 38419.155 148365.0 10678.5 135000.0 Children Pensioner Secondary Married House 0.015221 1 0 0 1 1 0 Others 2 2 2 FRIDAY 7 0 0 0 0 0 0 XNA 0.555183 0.652897 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 6 14 55 1000 6
23 100027 0 Cash loans F N Y 0 83250.000 239850.0 23850.0 225000.0 Unaccompanied Pensioner Secondary Married House 0.006296 1 0 0 1 1 0 Others 2 3 3 FRIDAY 12 0 0 0 0 0 0 XNA 0.624305 0.669057 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 2 24 68 1000 10
38 100045 0 Cash loans F N Y 0 99000.000 247275.0 17338.5 225000.0 Unaccompanied Pensioner Secondary Married House 0.006207 1 0 0 1 1 0 Others 2 2 2 FRIDAY 11 0 0 0 0 0 0 XNA 0.650765 0.751724 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 0 26 65 1000 13
43 100050 0 Cash loans F N Y 0 108000.000 746280.0 42970.5 675000.0 Unaccompanied Pensioner Higher education Single House 0.010966 1 0 0 1 0 0 Others 1 2 2 WEDNESDAY 9 0 0 0 0 0 0 XNA 0.766138 0.684828 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 1 15 64 1000 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307469 456209 0 Cash loans F N Y 0 202500.000 703728.0 29943.0 607500.0 Unaccompanied Pensioner Secondary Single House 0.031329 1 0 0 1 0 0 Others 1 2 2 MONDAY 15 0 0 0 0 0 0 XNA 0.195625 0.360613 11.0 2.0 11.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 0.0 0.0 4 32 61 1000 11
307483 456227 0 Cash loans F N Y 0 99000.000 247275.0 16479.0 225000.0 Unaccompanied Pensioner Secondary Separated House 0.007330 1 0 0 1 0 0 Others 1 2 2 MONDAY 12 0 0 0 0 0 0 XNA 0.589906 0.520898 1.0 1.0 1.0 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 5.0 4 30 68 1000 11
307487 456231 0 Cash loans M N Y 0 117000.000 1071909.0 31473.0 936000.0 Unaccompanied Pensioner Secondary Married House 0.010147 1 0 0 1 0 0 Others 2 2 2 WEDNESDAY 10 0 0 0 0 0 0 XNA 0.307082 0.255332 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 8.0 2 15 63 1000 11
307505 456249 0 Cash loans F N Y 0 112500.000 225000.0 22050.0 225000.0 Unaccompanied Pensioner Secondary Single House 0.022800 1 0 0 1 1 0 Others 1 2 2 MONDAY 15 0 0 0 0 0 0 XNA 0.346391 0.742182 0.0 0.0 0.0 0.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 0.0 0 20 66 1000 6
307507 456252 0 Cash loans F N Y 0 72000.000 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary Widow House 0.025164 1 0 0 1 1 0 Others 1 2 2 MONDAY 8 0 0 0 0 0 0 XNA 0.115992 0.535276 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 0 12 56 1000 11

55374 rows × 73 columns

As seen above data I have seen that NAME_INCOME_TYPE who are pensioner there ORGANIZATION_TYEP is XNA lets make another organization type as Retired as well

In [114]:
application_data['ORGANIZATION_TYPE'].replace("XNA",'Retired', inplace=True)
In [115]:
application_data['ORGANIZATION_TYPE'].isnull().sum()
Out[115]:
0

There in Flag Documents Columns Which tells Us how many Documnets client submitted lets sum up all the columns and put it in a one columns i.e. All_Flag_document and we drop all the flag_document columns

In [117]:
for col in application_data:
    if col.startswith("FLAG_DOCUMENT"):
        print(col)
    
FLAG_DOCUMENT_2
FLAG_DOCUMENT_3
FLAG_DOCUMENT_4
FLAG_DOCUMENT_5
FLAG_DOCUMENT_6
FLAG_DOCUMENT_7
FLAG_DOCUMENT_8
FLAG_DOCUMENT_9
FLAG_DOCUMENT_10
FLAG_DOCUMENT_11
FLAG_DOCUMENT_12
FLAG_DOCUMENT_13
FLAG_DOCUMENT_14
FLAG_DOCUMENT_15
FLAG_DOCUMENT_16
FLAG_DOCUMENT_17
FLAG_DOCUMENT_18
FLAG_DOCUMENT_19
FLAG_DOCUMENT_20
FLAG_DOCUMENT_21
In [118]:
All_Document = ["FLAG_DOCUMENT_2", "FLAG_DOCUMENT_3","FLAG_DOCUMENT_4","FLAG_DOCUMENT_5","FLAG_DOCUMENT_6","FLAG_DOCUMENT_7","FLAG_DOCUMENT_8","FLAG_DOCUMENT_9","FLAG_DOCUMENT_10", 
"FLAG_DOCUMENT_11","FLAG_DOCUMENT_12","FLAG_DOCUMENT_13","FLAG_DOCUMENT_14","FLAG_DOCUMENT_15","FLAG_DOCUMENT_16","FLAG_DOCUMENT_17","FLAG_DOCUMENT_18",
"FLAG_DOCUMENT_19","FLAG_DOCUMENT_20","FLAG_DOCUMENT_21"]
In [119]:
application_data['All_Flag_document'] = application_data[All_Document].sum(axis= 1)
In [120]:
application_data.drop(All_Document, axis=1,inplace = True)
In [121]:
application_data.shape
Out[121]:
(307511, 54)
In [122]:
application_data.head()
Out[122]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH All_Flag_document
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary Single House 0.018801 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3 9 25 1 5 1
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 0.535276 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary Single House 0.010032 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 11 52 0 6 0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary Civil marriage House 0.008019 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 0.535276 2.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 26 52 8 6 1
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary Single House 0.028663 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 0.535276 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 11 54 8 9 1

Segragatting Numericqal and Categorical Columns¶

In [124]:
Cat_col = list(set(application_data.columns) - set(application_data.describe().columns))
Num_col = application_data.describe().columns
In [125]:
application_data[Cat_col].head()
Out[125]:
OCCUPATION_TYPE NAME_EDUCATION_TYPE CODE_GENDER NAME_TYPE_SUITE FLAG_OWN_REALTY NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE NAME_INCOME_TYPE NAME_FAMILY_STATUS FLAG_OWN_CAR NAME_CONTRACT_TYPE
0 Laborers Secondary M Unaccompanied Y House WEDNESDAY Business Entity Type 3 Working Single N Cash loans
1 Core staff Higher education F Family N House MONDAY School State servant Married N Cash loans
2 Laborers Secondary M Unaccompanied Y House MONDAY Government Working Single Y Revolving loans
3 Laborers Secondary F Unaccompanied Y House WEDNESDAY Business Entity Type 3 Working Civil marriage N Cash loans
4 Core staff Secondary M Unaccompanied Y House THURSDAY Religion Working Single N Cash loans
In [126]:
application_data[Num_col].head()
Out[126]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH All_Flag_document
0 100002 1 0 202500.0 406597.5 24700.5 351000.0 0.018801 1 1 0 1 1 0 1 2 2 10 0 0 0 0 0 0 0.262949 0.139376 2.0 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3 9 25 1 5 1
1 100003 0 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 1 1 0 1 1 0 2 1 1 11 0 0 0 0 0 0 0.622246 0.535276 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1
2 100004 0 0 67500.0 135000.0 6750.0 135000.0 0.010032 1 1 1 1 1 0 1 2 2 9 0 0 0 0 0 0 0.555912 0.729567 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 11 52 0 6 0
3 100006 0 0 135000.0 312682.5 29686.5 297000.0 0.008019 1 1 0 1 0 0 2 2 2 17 0 0 0 0 0 0 0.650442 0.535276 2.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 26 52 8 6 1
4 100007 0 0 121500.0 513000.0 21865.5 513000.0 0.028663 1 1 0 1 0 0 1 2 2 11 0 0 0 0 1 1 0.322738 0.535276 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 11 54 8 9 1

Handling Outlier For Application Data¶

  • AMT_INCOME_TOTAL
  • AMT_CREDIT
  • AMT_ANNUITY
  • AMT_GOODS_PRICE
  • EXT_SOURCE_2
  • EXT_SOURCE_3
  • AGE

1. AMT_INCOME_TOTAL¶

In [130]:
application_data['AMT_INCOME_TOTAL'].describe()
Out[130]:
count    3.075110e+05
mean     1.687979e+05
std      2.371231e+05
min      2.565000e+04
25%      1.125000e+05
50%      1.471500e+05
75%      2.025000e+05
max      1.170000e+08
Name: AMT_INCOME_TOTAL, dtype: float64
In [131]:
sns.boxplot(application_data.AMT_INCOME_TOTAL)
Out[131]:
<Axes: ylabel='AMT_INCOME_TOTAL'>
No description has been provided for this image
In [132]:
#print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of Amt Income Total variable
application_data.AMT_INCOME_TOTAL.quantile([0.5,0.7,0.95,0.99])
Out[132]:
0.50    147150.0
0.70    180000.0
0.95    337500.0
0.99    472500.0
Name: AMT_INCOME_TOTAL, dtype: float64
In [133]:
application_data['AMT_INCOME_TOTAL'].max()
Out[133]:
117000000.0
In [134]:
application_data[application_data['AMT_INCOME_TOTAL'] == 117000000.0]
Out[134]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH All_Flag_document
12840 114967 1 Cash loans F N Y 1 117000000.0 562491.0 26194.5 454500.0 Unaccompanied Working Secondary Married House 0.010643 1 1 0 1 0 0 Laborers 3 2 2 TUESDAY 14 0 0 0 0 0 0 Business Entity Type 3 0.113161 0.145543 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 18 34 2 9 1

As we can see above the person income is 117000000.0 and have 2 years of employed where age is 34 this i think wont be possible where occuoation type is laboers definately this is an outlier. so we need to remove an outlier for amt_income_total

In [136]:
Q1 = application_data['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = application_data['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['AMT_INCOME_TOTAL'] > lower_range) & (application_data['AMT_INCOME_TOTAL'] < higher_range)]
In [137]:
application_data.shape
Out[137]:
(291686, 54)
In [138]:
application_data.shape
Out[138]:
(291686, 54)

2. AMT_CREDIT¶

In [140]:
sns.boxplot(application_data.AMT_CREDIT)
Out[140]:
<Axes: ylabel='AMT_CREDIT'>
No description has been provided for this image
In [141]:
Q1 = application_data['AMT_CREDIT'].quantile(0.25)
Q3 = application_data['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['AMT_CREDIT'] > lower_range) & (application_data['AMT_CREDIT'] < higher_range)]
In [142]:
application_data.shape
Out[142]:
(286024, 54)

2. AMT_ANNUITY¶

In [144]:
sns.boxplot(application_data['AMT_ANNUITY'])
Out[144]:
<Axes: ylabel='AMT_ANNUITY'>
No description has been provided for this image

There are outliers in AMT_ANNUITY but we cannot drop these columns because amt annuity is the payment made by the client on the regular basis

In [146]:
sns.boxplot(application_data['AMT_GOODS_PRICE'])
Out[146]:
<Axes: ylabel='AMT_GOODS_PRICE'>
No description has been provided for this image

There are outliers in AMT_GOODS_PRICE but we cannot drop these columns because AMT_GOODS_PRICE is the price of goods for which client has taken the loan

3. EXT_SOURCE_2¶

In [149]:
sns.boxplot(application_data['EXT_SOURCE_2'])
Out[149]:
<Axes: ylabel='EXT_SOURCE_2'>
No description has been provided for this image

There are not single putlier found in EXT_SOURCE_2 column

In [151]:
sns.boxplot(application_data['EXT_SOURCE_3'])
Out[151]:
<Axes: ylabel='EXT_SOURCE_3'>
No description has been provided for this image
In [152]:
Q1 = application_data['EXT_SOURCE_3'].quantile(0.25)
Q3 = application_data['EXT_SOURCE_3'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['EXT_SOURCE_3'] > lower_range) & (application_data['EXT_SOURCE_3'] < higher_range)]
In [153]:
sns.boxplot(application_data['EXT_SOURCE_3'])
Out[153]:
<Axes: ylabel='EXT_SOURCE_3'>
No description has been provided for this image
In [154]:
application_data.head()
Out[154]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH All_Flag_document
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary Single House 0.018801 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.262949 0.139376 2.0 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3 9 25 1 5 1
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 0.535276 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary Single House 0.010032 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.555912 0.729567 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 11 52 0 6 0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary Civil marriage House 0.008019 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 0.650442 0.535276 2.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 26 52 8 6 1
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary Single House 0.028663 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion 0.322738 0.535276 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 11 54 8 9 1
In [155]:
application_data.shape
Out[155]:
(281692, 54)

Data Visualization¶

Target column Analysis¶

As we can see below only 8.0 % of the people are the defaulters as compare to repayor's that is 92.0% this is an imbalanced data where we can see there only few defaulters¶

  • 1 represents the defaulters who get difficulty to repay the loan
  • 0 represents the repayer who was able to replay the loan
In [157]:
plt.pie(application_data['TARGET'].value_counts(), labels= application_data['TARGET'].value_counts().index, autopct='%1.1f%%')
plt.title("Distribustion of Target Variable")
plt.show()
No description has been provided for this image
In [158]:
Ratio = (application_data['TARGET']==0).sum() /(application_data['TARGET']== 1).sum()
Ratio
Out[158]:
11.540266215554468

Binning Some columns¶

1. Age column¶

In [161]:
application_data['Age'] = pd.cut(application_data.AGE,[0,10,20,30,40,50,60,70,80,90,100], labels=['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100'])
#application_data.drop(columns='AGE',inplace=True,axis= 1)
In [162]:
application_data.Age.value_counts()
Out[162]:
Age
30-40     75076
40-50     66883
50-60     62511
20-30     49094
60-70     28127
10-20         1
0-10          0
70-80         0
80-90         0
90-100        0
Name: count, dtype: int64

2. Amt Income¶

In [164]:
application_data['AMT_INCOME_TOTAL'].describe()
Out[164]:
count    281692.000000
mean     151664.187576
std       61891.551436
min       25650.000000
25%      112500.000000
50%      135000.000000
75%      185863.500000
max      337050.000000
Name: AMT_INCOME_TOTAL, dtype: float64
In [165]:
application_data['AMT_INCOME_GROUP'] = pd.cut(application_data['AMT_INCOME_TOTAL'],[25000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 200000, 300000, 400000], labels=['25k - 30k', '30k - 40k', '40k - 50k', '50k - 60k', '60k - 70k', '70k - 80k', '80k - 90k', '90k - 100k', '100k - 200k', '200k - 300k', '300k - 400k' ])
#application_data.drop(columns='AMT_INCOME_TOTAL', inplace=True, axis=1)
In [166]:
application_data.AMT_INCOME_GROUP.value_counts()
Out[166]:
AMT_INCOME_GROUP
100k - 200k    151617
200k - 300k     61073
80k - 90k       30147
60k - 70k       12733
300k - 400k      6133
70k - 80k        5841
90k - 100k       5674
50k - 60k        3995
40k - 50k        3522
30k - 40k         818
25k - 30k         139
Name: count, dtype: int64

2. Amt Credit¶

In [168]:
application_data['AMT_CREDIT'].describe()
Out[168]:
count    2.816920e+05
mean     5.536835e+05
std      3.418958e+05
min      4.500000e+04
25%      2.700000e+05
50%      4.950000e+05
75%      7.702920e+05
max      1.562076e+06
Name: AMT_CREDIT, dtype: float64
In [169]:
application_data['AMT_CREDIT_GROUP'] = pd.cut(application_data['AMT_CREDIT'], [45000, 270000, 513531, 808650, 1200000, 1600000, 2000000, 2500000, 4050000], labels=['45k-270k', '270k-513.5k', '513.5k-808.65k', '808.65k-1.2M', '1.2M-1.6M', '1.6M-2M', '2M-2.5M', '2.5M-4.05M'])
#application_data.drop(columns= 'AMT_CREDIT', inplace=True, axis= 1)
In [170]:
application_data.AMT_CREDIT_GROUP.value_counts()
Out[170]:
AMT_CREDIT_GROUP
45k-270k          75968
513.5k-808.65k    73002
270k-513.5k       72055
808.65k-1.2M      43273
1.2M-1.6M         17170
1.6M-2M               0
2M-2.5M               0
2.5M-4.05M            0
Name: count, dtype: int64

Done with the Data cleaning now analyzing the columns¶

UNIVARIATE ANALYSIS¶

In [173]:
def plotting(data, column):
    fig = plt.figure(figsize=(16,6))

    ax1 = plt.subplot(1,2,1)
    sns.countplot(data = data, palette='Set1',x = column, ax = ax1)
    plt.title('Plotting data for the column: '+ column)
    plt.xticks(rotation=90)
    ax1.set_yscale("log")

    
    ax2 = plt.subplot(1,2,2)
    sns.countplot(data = data, palette='Set1',x = column, hue= 'TARGET', ax = ax2) # hue is used to differentiate the target variable
    plt.title('Plotting data for the column: '+ column)
    plt.xticks(rotation=90)
    ax2.set_yscale("log")

    plt.tight_layout() # Or equivalently,  "plt.tight_layout()"

    plt.show()
In [174]:
cat_col = list(set(application_data.columns) - set(application_data.describe().columns))
num_col = application_data.describe().columns
In [175]:
cat_col
Out[175]:
['OCCUPATION_TYPE',
 'NAME_EDUCATION_TYPE',
 'Age',
 'CODE_GENDER',
 'NAME_TYPE_SUITE',
 'FLAG_OWN_REALTY',
 'NAME_HOUSING_TYPE',
 'AMT_CREDIT_GROUP',
 'AMT_INCOME_GROUP',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE',
 'NAME_INCOME_TYPE',
 'NAME_FAMILY_STATUS',
 'FLAG_OWN_CAR',
 'NAME_CONTRACT_TYPE']

Plotting Categorical Columns¶

In [177]:
for column in cat_col:
    plotting(application_data, column)
    print('----------------------------------------------------------------------------------------------------------------------------------------------------------')
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
----------------------------------------------------------------------------------------------------------------------------------------------------------

Insights From Categorical column¶

1 NAME_CONTRACT_TYPE¶

client owns a house or flat¶

  1. The majority of the client own a house.flat thta is just a double who dont owe a flat
  2. As compare to target variable we see that there is no much difference between the owning the flat and loan defaulters it can seen that the people who dont owe a flat also considered as a defaulter

2. OCCUPATION TYPE¶

Client Occupation and loan defaulters¶

Count of the people taken loan¶

  1. The majority of people who has tajken a loan comes under "laborers" , "Others" (These are the clients whose occupation is not mentioned in the data) ,"Core staff", "Sales Staff"
  2. It and Hr Staff comes under small group who are taen the loan

Loan Defaulter¶

  1. Majority of the people who has taken the loan are comes under the loan defaulter
  2. We can also see to highest cateogry is "Loberers" and "Others" and "Drivers" ,"Sales Staff", "Cleaning Staff" who are not high skilled to there education

3. AGE¶

Count of the people taken loan¶

  1. We can see all the age group has taken loan
  2. 10 - 20 age group are very less who has taken the loan because majority of the people are students or and who has taken the loan has taken education loan who are above 18 years of the age group

Loan Defaulter¶

  1. Majority of the people who are under 30- 40 has done loan default
  2. As the age increses there are less chance to do the default in loan > 70 year of age

4. WEEkDAY_APP_PROCESS_START¶

Count of the people taken loan¶

  1. As we can see mostly all weeks clients has taken the loan except sunday because mostly sunday are holidays

Loan Defaulter¶

  1. As I cann see the plot with the default loan repay there nothing much we can see in the weekday_app_process_start column I can also drop this column from my analysis

5. AMT_CREDIT_GROUP¶

Count of the people taken loan¶

  1. As we can see mostly of the people taken loan is between 45 k - 270k

Loan Defaulter¶

  1. People taken loan between 1.6 M - 4.05 M are less deaulters

6. Name_Housing_Type¶

Count of the people taken loan¶

  1. the Majority of the people has taken loan who owns house, and leaving with parents

Loan Defaulter¶

  1. People leaving in house and with parents are highst defaulters
  2. The people leaving in co appartment rate are less dfaulters

7. Name_Contract_Type¶

Count of the people taken loan¶

  1. Cash Loan are just double then the revolving loan

Loan Defaulter¶

  1. We can see cliet taken revolving loan is more defaulter as revolving loan percentage is less than cash loan as well as cash loan ans revolving loan are both comes under defulater.

8. Name_Family_Status¶

Count of the people taken loan¶

  1. Most of the people who has taken loan are married followed by single and civil marriage

Loan Defaulter¶

  1. if we see civil marriage has the highest percent of not repayment with Widow the lowest.

9. NAME_INCOME_TYPE¶

Count of the people taken loan¶

  1. Most of client for loans have income type as Working, followed by Commercial associate, Pensioner and State servant

Loan Defaulter¶

  1. if we see the client who are on maternity leave has the highest possibility not to repay the loan, followed by unemployed.
  2. Students and businessman are not done in default ratio these are group which safer to provide the loan.

10. FLAG_OWN_CAR¶

Count of the people taken loan¶

  1. Clients who own a car are half in number of the clients who dont own a car.

Loan Defaulter¶

  1. But based on the percentage of deault, there is no correlation between owning a car and loan repayment as in both cases the default percentage is almost same.

11. AMT_INCOME_GROUP¶

Count of the people taken loan¶

  1. the mostly client income comes under 100k - 200k followed by 200k - 300k

Loan Defaulter¶

  1. The client whoes income is between 100k to 300k is likely to do more defaults
  2. the client whoes income is less than 30k is likely to do less defaults.

12. NAME_EDUCATION_TYPE¶

Count of the people taken loan¶

  1. The most of the people are secondary educated followed by higher education very less number of academic degree holder

Loan Defaulter¶

  1. lower secondary, secondar and incomplete higher are highest number of not repaying the loan

13. CODE_GENDER¶

Count of the people taken loan¶

  1. As compare to male female has more number loans taken

Loan Defaulter¶

  1. As we can see there male as comapare to female who has not repayed the loan

14. NAME_TYPE_SUTE¶

Count of the people taken loan¶

  1. Majority client are unaccompanied while applying for the loan and the clients were with their families and less amount of them were accompanied by their spouse.
  2. .

Loan Defaulter¶

  1. ALL of them have the same default rate, therefore the people who accompany the client while applying for the loan does not play a big role in deciding whether he/she defaults the loan.n

15. FLAG_OWN_REALTY¶

Count of the people taken loan¶

The majority of clients in the dataset own real estate. Clients who own real estate are more than double in number compared to those who don’t ow### n

Loan Defaulter

Based on the data, there doesn’t appear to be a significant correlation between owning real estate and loan defalting.

Plotting Numerical Columns¶

In [212]:
num_col
Out[212]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE', 'YEAR_EMPLOYED',
       'YEAR_ID_PUBLISH', 'All_Flag_document'],
      dtype='object')
In [213]:
correlation = application_data[num_col].corr()
In [214]:
correlation.TARGET.sort_values(ascending=True)
Out[214]:
EXT_SOURCE_2                  -0.156071
EXT_SOURCE_3                  -0.140068
AGE                           -0.077633
YEAR_LAST_PHONE_CHANGE        -0.056615
YEAR_ID_PUBLISH               -0.050140
YEAR_EMPLOYED                 -0.048522
YEAR_REGISTRATION             -0.043160
REGION_POPULATION_RELATIVE    -0.032299
AMT_GOODS_PRICE               -0.026982
FLAG_PHONE                    -0.024581
HOUR_APPR_PROCESS_START       -0.023639
AMT_CREDIT                    -0.016544
AMT_REQ_CREDIT_BUREAU_MON     -0.014527
AMT_INCOME_TOTAL              -0.012079
AMT_REQ_CREDIT_BUREAU_QRT     -0.008763
AMT_REQ_CREDIT_BUREAU_WEEK    -0.002383
SK_ID_CURR                    -0.001632
FLAG_EMAIL                    -0.001489
AMT_REQ_CREDIT_BUREAU_HOUR    -0.000324
FLAG_CONT_MOBILE              -0.000271
FLAG_MOBIL                     0.000555
AMT_REQ_CREDIT_BUREAU_DAY      0.002053
AMT_ANNUITY                    0.002669
LIVE_REGION_NOT_WORK_REGION    0.005222
REG_REGION_NOT_LIVE_REGION     0.005793
OBS_60_CNT_SOCIAL_CIRCLE       0.008020
OBS_30_CNT_SOCIAL_CIRCLE       0.008181
REG_REGION_NOT_WORK_REGION     0.009107
AMT_REQ_CREDIT_BUREAU_YEAR     0.010842
CNT_FAM_MEMBERS                0.011666
All_Flag_document              0.018902
CNT_CHILDREN                   0.021147
FLAG_WORK_PHONE                0.027574
DEF_60_CNT_SOCIAL_CIRCLE       0.031207
DEF_30_CNT_SOCIAL_CIRCLE       0.032278
LIVE_CITY_NOT_WORK_CITY        0.032832
REG_CITY_NOT_LIVE_CITY         0.044125
FLAG_EMP_PHONE                 0.047510
REG_CITY_NOT_WORK_CITY         0.051325
REGION_RATING_CLIENT           0.055849
REGION_RATING_CLIENT_W_CITY    0.057660
TARGET                         1.000000
Name: TARGET, dtype: float64

Lets first Drop irrelevant columns from the numerical data

In [216]:
flag_own_drop = []
for col in application_data:
    if col.startswith("FLAG"):
        flag_own_drop.append(col)
print(flag_own_drop)
['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL']
  1. We can see EXT_SOURCE_2 and EXT_SOURCE_3 are negative correlation with target variable we can drop these two columns
  2. Need to Flag Columns mentioned above need to remove it
In [218]:
drop_columns = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL','EXT_SOURCE_2','EXT_SOURCE_3']
application_data.drop(columns=drop_columns, axis =1 , inplace= True)
In [219]:
application_data.shape
Out[219]:
(281692, 47)
In [220]:
num_col = application_data.describe().columns
In [221]:
num_col
Out[221]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE', 'YEAR_EMPLOYED',
       'YEAR_ID_PUBLISH', 'All_Flag_document'],
      dtype='object')

Lets do some univariate analysis on amt columns and cnt Family Members columns¶

In [223]:
num_col_anlysis = application_data[[ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE','CNT_FAM_MEMBERS']]

Segregating the dataset on Target=0 and Target=1¶

In [225]:
train_0 = application_data.loc[application_data['TARGET'] == 0] # 0 - Loan was repaid
train_1 = application_data.loc[application_data['TARGET'] == 1] # 1 - Loan was not repaid
In [226]:
for column in num_col_anlysis:
    title = "Plot of "+column
    print(title)
    plt.hist(train_0[column], alpha=0.5, label='0') # plotting a histogram for target 0 i.e. loan was repaid
    plt.hist(train_1[column], alpha=0.5, label='1') # plotting a histogram for target 1 i.e. loan was not repaid
    plt.legend()
    plt.show()
    
    sns.distplot(train_0[column], label='0') # plotting a distribution plot for target 0 i.e. loan was repaid
    sns.distplot(train_1[column],  label='1') # plotting a distribution plot for target 1 i.e. loan was not repaid
    plt.legend()
    plt.show()
    print("------------------------------------------------------------------------")
Plot of AMT_INCOME_TOTAL
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_CREDIT
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_ANNUITY
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_GOODS_PRICE
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of CNT_FAM_MEMBERS
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------

Insights¶

I have considered dist plot to give the insights

AMT_INCOME_TOTAL¶

  • The income type for both targeted type lied around 1000000 to 150000

AMT_CREDIT¶

  • Most of the credit amount 25k to 50 k

AMT_ANNUITY¶

  • Most client pay an annuity below 60K for their credit loans.

AMT_GOODS_PRICE¶

  • The majority of loans are granted for goods prices below 1M.

CNT_FAM_MEMBERS¶

  • the mojority of family members have size < 3 have difficult to repay the loan

Bivariate & Multivariate Analysis¶

In [230]:
# Create a pairplot with hue as 'TARGET'
sns.pairplot(data = application_data, vars = [ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE'], hue = 'TARGET', diag_kind='kde')
Out[230]:
<seaborn.axisgrid.PairGrid at 0x14f578efe00>
No description has been provided for this image

Insights From Analysis¶

AMT_CREDIT and AMT_GOODS_PRICEare highly corelated to each other as based on the scatterplot where most of the data are consolidated in form of a line There are very less defaulters for AMT_CREDIT > 1M 3There are very few defaulters for loan amounts exceeding 1M. This suggests that clients who borrow larger amount are less likely to default on their loans.ion

Previous Application ¶

In [234]:
previous_data = pd.read_csv("previous_application.csv")
In [235]:
previous_data.head()
Out[235]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 Y 1 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
In [236]:
previous_data.shape
Out[236]:
(1670214, 37)
In [237]:
previous_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
In [238]:
# shows how much percentage of data are null
null_col = (previous_data.isnull().sum()/previous_data.shape[0]*100).sort_values(ascending = False)
null_col[null_col>0]
Out[238]:
RATE_INTEREST_PRIVILEGED     99.643698
RATE_INTEREST_PRIMARY        99.643698
AMT_DOWN_PAYMENT             53.636480
RATE_DOWN_PAYMENT            53.636480
NAME_TYPE_SUITE              49.119754
NFLAG_INSURED_ON_APPROVAL    40.298129
DAYS_TERMINATION             40.298129
DAYS_LAST_DUE                40.298129
DAYS_LAST_DUE_1ST_VERSION    40.298129
DAYS_FIRST_DUE               40.298129
DAYS_FIRST_DRAWING           40.298129
AMT_GOODS_PRICE              23.081773
AMT_ANNUITY                  22.286665
CNT_PAYMENT                  22.286366
PRODUCT_COMBINATION           0.020716
AMT_CREDIT                    0.000060
dtype: float64
In [239]:
# Lets See the columns who has more than 40 % of data with Null values 
cut_off=40
cols_to_drop=list(previous_data.columns[100*previous_data.isnull().mean()>cut_off])
print(cols_to_drop)
['AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'NAME_TYPE_SUITE', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL']
In [240]:
len(cols_to_drop)
Out[240]:
11
In [241]:
previous_data.drop(columns= cols_to_drop, axis= 1, inplace= True)
In [242]:
previous_data.shape
Out[242]:
(1670214, 26)
In [243]:
# shows how much percentage of data are null
null_col = (previous_data.isnull().sum()/previous_data.shape[0]*100).sort_values(ascending = False)
null_col[null_col>0]
Out[243]:
AMT_GOODS_PRICE        23.081773
AMT_ANNUITY            22.286665
CNT_PAYMENT            22.286366
PRODUCT_COMBINATION     0.020716
AMT_CREDIT              0.000060
dtype: float64
In [244]:
previous_data[['AMT_GOODS_PRICE','AMT_ANNUITY','CNT_PAYMENT','PRODUCT_COMBINATION','AMT_CREDIT']]
Out[244]:
AMT_GOODS_PRICE AMT_ANNUITY CNT_PAYMENT PRODUCT_COMBINATION AMT_CREDIT
0 17145.0 1730.430 12.0 POS mobile with interest 17145.0
1 607500.0 25188.615 36.0 Cash X-Sell: low 679671.0
2 112500.0 15060.735 12.0 Cash X-Sell: high 136444.5
3 450000.0 47041.335 12.0 Cash X-Sell: middle 470790.0
4 337500.0 31924.395 24.0 Cash Street: high 404055.0
... ... ... ... ... ...
1670209 267295.5 14704.290 30.0 POS industry with interest 311400.0
1670210 87750.0 6622.020 12.0 POS industry with interest 64291.5
1670211 105237.0 11520.855 10.0 POS household with interest 102523.5
1670212 180000.0 18821.520 12.0 Cash X-Sell: low 191880.0
1670213 360000.0 16431.300 48.0 Cash X-Sell: middle 360000.0

1670214 rows × 5 columns

Data Cleaning On Previous Application Data¶

In [246]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[246]:
AMT_CREDIT              0.000060
PRODUCT_COMBINATION     0.020716
CNT_PAYMENT            22.286366
AMT_ANNUITY            22.286665
AMT_GOODS_PRICE        23.081773
dtype: float64

1. PRODUCT_COMBINATION¶

In [248]:
previous_data['PRODUCT_COMBINATION'].value_counts()
Out[248]:
PRODUCT_COMBINATION
Cash                              285990
POS household with interest       263622
POS mobile with interest          220670
Cash X-Sell: middle               143883
Cash X-Sell: low                  130248
Card Street                       112582
POS industry with interest         98833
POS household without interest     82908
Card X-Sell                        80582
Cash Street: high                  59639
Cash X-Sell: high                  59301
Cash Street: middle                34658
Cash Street: low                   33834
POS mobile without interest        24082
POS other with interest            23879
POS industry without interest      12602
POS others without interest         2555
Name: count, dtype: int64

We change the production combination null values wih mode as percentage of null values are very less we can replace it with a mode

In [250]:
previous_data['PRODUCT_COMBINATION'].fillna(previous_data['PRODUCT_COMBINATION'].mode()[0],  inplace= True)
In [251]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[251]:
AMT_CREDIT          0.000060
CNT_PAYMENT        22.286366
AMT_ANNUITY        22.286665
AMT_GOODS_PRICE    23.081773
dtype: float64
In [252]:
Null_columns = previous_data[['AMT_GOODS_PRICE','AMT_ANNUITY','CNT_PAYMENT','AMT_CREDIT']]

Lets see the histogram fro all null columns and then fill the null values with approprimate values

In [254]:
for col in Null_columns:
    sns.histplot(previous_data, x = col)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Seams all of them are right skewed its better to fill all null values with median

In [256]:
for col in Null_columns:
    med = previous_data[col].median()
    previous_data[col] = previous_data[col].fillna(med)
In [257]:
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0 
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Out[257]:
Series([], dtype: float64)

Previous Application Data¶

In [259]:
previous_data.head()
Out[259]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 17145.0 SATURDAY 15 Y 1 XAP Approved -73 Cash through the bank XAP Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 607500.0 THURSDAY 11 Y 1 XNA Approved -164 XNA XAP Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 112500.0 TUESDAY 11 Y 1 XNA Approved -301 Cash through the bank XAP Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 450000.0 MONDAY 7 Y 1 XNA Approved -512 Cash through the bank XAP Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 337500.0 THURSDAY 9 Y 1 Repairs Refused -781 Cash through the bank HC Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high
In [260]:
previous_data['NAME_CONTRACT_TYPE'].value_counts()
Out[260]:
NAME_CONTRACT_TYPE
Cash loans         747553
Consumer loans     729151
Revolving loans    193164
XNA                   346
Name: count, dtype: int64
In [261]:
previous_data['NAME_CONTRACT_STATUS'].value_counts()
Out[261]:
NAME_CONTRACT_STATUS
Approved        1036781
Canceled         316319
Refused          290678
Unused offer      26436
Name: count, dtype: int64
In [262]:
previous_data['NAME_CASH_LOAN_PURPOSE'].value_counts()
Out[262]:
NAME_CASH_LOAN_PURPOSE
XAP                                 922661
XNA                                 677918
Repairs                              23765
Other                                15608
Urgent needs                          8412
Buying a used car                     2888
Building a house or an annex          2693
Everyday expenses                     2416
Medicine                              2174
Payments on other loans               1931
Education                             1573
Journey                               1239
Purchase of electronic equipment      1061
Buying a new car                      1012
Wedding / gift / holiday               962
Buying a home                          865
Car repairs                            797
Furniture                              749
Buying a holiday home / land           533
Business development                   426
Gasification / water supply            300
Buying a garage                        136
Hobby                                   55
Money for a third person                25
Refusal to name the goal                15
Name: count, dtype: int64

lets change days to year in DAYS_DECISION columns beacuse cant be negative and standardise into year

In [264]:
previous_data['YEAR_DECISION'] =previous_data['DAYS_DECISION'].abs()// 365 
In [265]:
previous_data.head()
Out[265]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION YEAR_DECISION
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 17145.0 SATURDAY 15 Y 1 XAP Approved -73 Cash through the bank XAP Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 607500.0 THURSDAY 11 Y 1 XNA Approved -164 XNA XAP Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 112500.0 TUESDAY 11 Y 1 XNA Approved -301 Cash through the bank XAP Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 450000.0 MONDAY 7 Y 1 XNA Approved -512 Cash through the bank XAP Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 1
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 337500.0 THURSDAY 9 Y 1 Repairs Refused -781 Cash through the bank HC Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high 2
In [266]:
previous_data.drop(columns= ['DAYS_DECISION'], axis =1 , inplace = True)
In [267]:
previous_data['NAME_YIELD_GROUP'].value_counts()
Out[267]:
NAME_YIELD_GROUP
XNA           517215
middle        385532
high          353331
low_normal    322095
low_action     92041
Name: count, dtype: int64

Data Visualization on Previous Application Data¶

In [268]:
preanalysis_col = ['NAME_CONTRACT_TYPE','NAME_CONTRACT_STATUS','NAME_CLIENT_TYPE']
In [269]:
def prev_app_analysis(data, columns):
    plt.figure(figsize=(8, 6))
    ax1 = plt.subplot(1,2,2)
    previous_data[columns].value_counts().plot.pie(autopct = "%1.0f%%",ax = ax1)
    plt.xlabel(columns)
    plt.axis('equal')
    plt.title('Distribution of' +" "+ columns)
    plt.xticks(rotation=90)
    plt.show()
In [270]:
for col in preanalysis_col:
    prev_app_analysis(preanalysis_col, col)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Insights On Previous Application¶

1. Name Contract Type¶

  1. Cash loans: The most common contract type, accounting for 45% of all contracts.
  2. Consumer loans: A significant portion of contracts, making up 44% of the total.
  3. Revolving loans: Less common, representing 12% of contracts. Typically associated with credit cards or lines of credit.
  4. There is 0 % of XNA category

2. Name Contract Status¶

  1. The majority of loan applications were successfully approved ie 62% approved loan..
  2. Approximately 19% of applicants canceled their loan requests
  3. About 17% of applications werrefuseded, highlighting the importance of assessing rejection criteria and applicant profile
  4. A small percentage received offers but didn’t proceed, warranting investigation into factors influencing this decisiosks.

3. Name Client Type¶

  1. Repeater: 74% of clients fall into this category.
  2. New: Approximately 10% of clients are categorized as “New.
  3. Refreshed: About.1% of clients are classified as “Refreshed
  4. There is 0 % of XNA percentageA.”

Merging the application Data and previous application data¶

In [276]:
data_final = application_data.merge(previous_data, left_on='SK_ID_CURR', 
                                         right_on='SK_ID_CURR', how='inner')
In [277]:
data_final.head()
Out[277]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE_x CODE_GENDER CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT_x AMT_ANNUITY_x AMT_GOODS_PRICE_x NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START_x HOUR_APPR_PROCESS_START_x REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR YEAR_LAST_PHONE_CHANGE YEAR_REGISTRATION AGE YEAR_EMPLOYED YEAR_ID_PUBLISH All_Flag_document Age AMT_INCOME_GROUP AMT_CREDIT_GROUP SK_ID_PREV NAME_CONTRACT_TYPE_y AMT_ANNUITY_y AMT_APPLICATION AMT_CREDIT_y AMT_GOODS_PRICE_y WEEKDAY_APPR_PROCESS_START_y HOUR_APPR_PROCESS_START_y FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION YEAR_DECISION
0 100002 1 Cash loans M 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary Single House 0.018801 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 2.0 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3 9 25 1 5 1 20-30 200k - 300k 270k-513.5k 1038818 Consumer loans 9251.775 179055.0 179055.0 179055.0 SATURDAY 9 Y 1 XAP Approved XNA XAP New Vehicles POS XNA Stone 500 Auto technology 24.0 low_normal POS other with interest 1
1 100003 0 Cash loans F 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1 40-50 200k - 300k 1.2M-1.6M 1810518 Cash loans 98356.995 900000.0 1035882.0 900000.0 FRIDAY 12 Y 1 XNA Approved XNA XAP Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 low_normal Cash X-Sell: low 2
2 100003 0 Cash loans F 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1 40-50 200k - 300k 1.2M-1.6M 2636178 Consumer loans 64567.665 337500.0 348637.5 337500.0 SUNDAY 17 Y 1 XAP Approved Cash through the bank XAP Refreshed Furniture POS XNA Stone 1400 Furniture 6.0 middle POS industry with interest 2
3 100003 0 Cash loans F 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House 0.003541 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 3 45 3 0 1 40-50 200k - 300k 1.2M-1.6M 2396755 Consumer loans 6737.310 68809.5 68053.5 68809.5 SATURDAY 15 Y 1 XAP Approved Cash through the bank XAP Refreshed Consumer Electronics POS XNA Country-wide 200 Consumer electronics 12.0 middle POS household with interest 6
4 100004 0 Revolving loans M 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary Single House 0.010032 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 11 52 0 6 0 50-60 60k - 70k 45k-270k 1564014 Consumer loans 5357.250 24282.0 20106.0 24282.0 FRIDAY 5 Y 1 XAP Approved Cash through the bank XAP New Mobile POS XNA Regional / Local 30 Connectivity 4.0 middle POS mobile without interest 2
In [278]:
data_final.shape
Out[278]:
(1292479, 72)
In [279]:
data_final.columns.values
Out[279]:
array(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER',
       'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT_x',
       'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START_x',
       'HOUR_APPR_PROCESS_START_x', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
       'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE',
       'YEAR_EMPLOYED', 'YEAR_ID_PUBLISH', 'All_Flag_document', 'Age',
       'AMT_INCOME_GROUP', 'AMT_CREDIT_GROUP', 'SK_ID_PREV',
       'NAME_CONTRACT_TYPE_y', 'AMT_ANNUITY_y', 'AMT_APPLICATION',
       'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y',
       'WEEKDAY_APPR_PROCESS_START_y', 'HOUR_APPR_PROCESS_START_y',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
       'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS',
       'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_CLIENT_TYPE',
       'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
       'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
       'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
       'YEAR_DECISION'], dtype=object)
In [280]:
train_0 = application_data.loc[application_data['TARGET'] == 0] # train_0 - It contains the data where the loan was repaid
train_1 = application_data.loc[application_data['TARGET'] == 1] # train_1 - It contains the data where the loan was not repaid
In [281]:
ptrain_0 = data_final.loc[data_final['TARGET'] == 0]  # ptrain_0 - It contains the data where the loan was repaid
ptrain_1 = data_final.loc[data_final['TARGET'] == 1] # ptrain_1 - It contains the data where the loan was not repaid
In [282]:
def plotting(column, hue):
    col = column
    hue = hue
    fig = plt.figure(figsize=(13,10))

    ax1 = plt.subplot(221)
    application_data[col].value_counts().plot.pie(autopct = "%1.0f%%", ax=ax1) # autopct is used to show the percentage on the pie chart
    plt.title('Plotting data for the column: '+ column)

    ax2 = plt.subplot(222)
    df = pd.DataFrame()
    df['0']= ((train_0[col].value_counts())/len(train_0))
    df['1']= ((train_1[col].value_counts())/len(train_1))
    df.plot.bar(ax=ax2) # plotting the bar chart
    plt.title('Plotting data for target in terms of total count')


    ax3 = plt.subplot(223)
    sns.countplot(x=col, hue=hue, data=ptrain_0, ax = ax3) # plotting the countplot
    plt.xticks(rotation=90)
    plt.title('Plotting data for Target=0 in terms of percentage')

    ax4 = plt.subplot(224)
    sns.countplot(x=col, hue=hue, data=ptrain_1, ax = ax4) # plotting the countplot
    plt.xticks(rotation=90)
    plt.title('Plotting data for Target=1 in terms of percentage')



    fig.tight_layout() # Or equivalently,  "plt.tight_layout()"

    plt.show()
    
In [283]:
plotting('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')    
No description has been provided for this image
In [284]:
plotting('Age','NAME_EDUCATION_TYPE')    
No description has been provided for this image

INSIGHTS¶

NAME_EDUCATION_TYPE VS NAME_CONTRACT_STATUS¶

As we can see higher education and secondary education gives more approved loans and as able repay the loan as well There is less defaulter percentage if we give a loan to higher education people

Age VS NAME_EDUCATION_TYPE¶

There is a mjority of perople who are under age of 30- 40 year of age who has repay the loan we can target the people under the age of 30 to 60 year of age groups who are safe to give the loan and higher chances they repay the loan as well with this particular data We have to take care of clients who has not done higher education or whoes education are incomplete the can be defaulter

In [288]:
def pointplotdata(column1, column2):
    col1 = column1
    col2 = column2
    
    fig = plt.figure(figsize=(13,10))

    ax1 = plt.subplot(221)
    sns.pointplot(data=data_final, x=col1, y=col2, hue='TARGET') 
    plt.xticks(rotation=90)
    plt.title('Plotting data for the TARGET column vs : '+ column2 + " " +'By'+ " "+ column1)
In [289]:
pointplotdata('NAME_CONTRACT_STATUS','AMT_INCOME_TOTAL')
No description has been provided for this image

INSIGHTS¶

we can see here the people whoes income is higher than others have not used the offer we can target these kind of people as well

In [291]:
pointplotdata('NAME_CONTRACT_STATUS','DEF_60_CNT_SOCIAL_CIRCLE')
No description has been provided for this image

INSIGHTS¶

Clients who have average of 0.13 or higher DEF_60_CNT_SOCIAL_CIRCLE score tend to default more and hence client's social circle has to be analysed before providing the loan.

In [293]:
pointplotdata('NAME_HOUSING_TYPE','AMT_CREDIT_x')
No description has been provided for this image

INSIGHTS¶

the People who are leaving in co-op apartment and office apartment has higher chances to be defaulter

Conclusion¶

Analysing the whole dataset we can consider the follwing thing that would help bank to provide the loan who can repay the or not

Successful applicant will be Repayer:¶

  1. AGE : - Loan applicants above the age of 60 has a lower tendency to default.
  2. NAME_INCOME_TYPE :- Student and Businessmen have no defaults.
  3. AMT_INCOME_TOTAL :- Applicants with incomes exceeding 10 Millions experience a lower likelihood of default.
  4. NAME_HOUSING_TYPE :- Applicant leaving in there own house, or parents , or rental apartment are lower likelihood of default
  5. CNT_CHILDREN:- People with zero to two children tend to repay the loans.

applicant will be Defaulter:¶

  1. NAME_EDUCATION_TYPE: People with Lower Secondary & Secondary education has higher defaulter
  2. CNT_CHILDREN: Client who have children equal to or more than 9 default 100% and hence their applications are to be rejected.
  3. CODE_GENDER: Men are at relatively higher default rate
  4. NAME_FAMILY_STATUS : People who have civil marriage or who are single default a lot.
  5. OCCUPATION_TYPE: Avoid Low-skill Laborers, Drivers and Waiters staff, Security staff, Laborers and Cooking staff has the higher default rate .
  6. DAYS_BIRTH: Avoid young people who are in age group of 20-30 as they have higher probability of defaulting
  7. AMT_GOODS_PRICE: When the credit amount goes beyond 3M, there is an increase in defaulters.