Big Data generally refers to very large collections of data. It is enormous
in volume, diverse in variety, and arrives at high velocity. Such data is of
little use unless it is examined to uncover new correlations, customer
experiences, and other information that can help an organization make more
informed business decisions. Big data is widely applied in sectors such as
healthcare, insurance, and finance.
The insurance sector is one of the most promising areas for big data. The
traditional marketing system for insurance is an offline sales business:
insurers generally sell policies by calling and visiting customers. This fixed
marketing system achieved good results in the past. Recently, however, many
new private insurance companies have entered the marketplace, which has
increased competition. At the same time, people's willingness to pay for
insurance services has also increased. Therefore, understanding the needs and
purchase plans of clients is essential for insurance companies that want to
raise their sales volume.
Big data technology supports the transformation of insurance companies. The
lack of principle and innovation in traditional marketing, badly structured
insurance data, and unclear customer purchasing characteristics lead to
imbalanced data, which makes user classification and insurance product
recommendation difficult.
Decision making is difficult with an imbalanced data distribution. To solve
this problem, resampling methods are commonly used to construct balanced
training datasets, which improves the performance of the predictive model.
The main purpose of this paper is to identify potential customers with the
help of big data technology. The paper not only provides a strategy for
identifying potential clients but also acts as a reference for classification
problems.
We propose a supervised learning algorithm called ensemble
decision tree to analyze the potential customers and their major
This paper is organized as follows. Section II introduces the current research
status of machine learning. Section III puts forward the classification model
and intelligent recommendation algorithm based on the XGBoost algorithm for
insurance business data and analyzes its efficiency. Section IV presents the
experiment and results. Section V analyzes the results. Section VI gives the
conclusion and future work.
The classification problem for the US bank insurance business data has an
imbalanced data distribution. That is, the ratio between the positive and
negative classes is extremely unbalanced (for example, 100:1), so prediction
models generated directly by supervised learning algorithms such as SVM and
Logistic Regression are biased toward the larger class. Such a model is of
little practical help, and the imbalanced distribution will affect the
performance of the classification. Thus, some technique should be applied to
deal with this problem.
One approach to handling an unbalanced class distribution is sampling, which
rebalances the dataset. Sampling techniques are broadly classified into two
types: under sampling and over sampling. Under sampling reduces the majority
class (e.g. Random Under Sampling, RUS), while over sampling adds samples to
the minority class (e.g. Random Over Sampling, ROS). The drawback of ROS is
redundancy in the dataset, which again leads to a classification problem: the
classifier may not recognize the minority class well. To overcome this
problem, SMOTE (Synthetic Minority Over-sampling Technique) is used. It
rebalances the dataset by creating additional synthetic samples that lie close
to existing minority-class samples and their near neighbors, which are found
with K-Nearest Neighbors (KNN) [2].
Sampling methods are also divided into non-heuristic and heuristic methods. A
non-heuristic method randomly removes samples from the majority class in order
to reduce the degree of imbalance. A heuristic method distinguishes samples
based on a nearest neighbor algorithm [7].
Another difficulty in classification is data quality, in particular the
existence of missing data. Frequent occurrence of missing data gives biased
results. Dataset attributes are mostly dependent on each other, so the
correlation between attributes can be used to estimate the missing values.
Replacing missing values with probable values is called imputation [6].
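To make the resampling idea concrete, the interpolation step at the heart of SMOTE can be sketched as follows. This is a minimal illustration in Python using only the standard library (the paper's own experiments were carried out in R); the function name `smote`, the parameter values, and the four sample points are assumptions made for the example, not the implementation used in the paper.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical two-feature minority class
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points are similar to the originals without being exact duplicates, which is the redundancy problem of ROS that SMOTE is meant to avoid.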
One of the challenges in big data is data quality. Data quality must be
ensured; otherwise poor data can lead to wrong predictions. One significant
data quality problem is missing data.
Imputation is a method for handling missing data: it replaces the missing
values with estimated ones. Imputation has the advantage of handling missing
data independently of the learning algorithm, which also allows the researcher
to select the imputation method most suitable for a particular
circumstance [3].
There are many imputation methods for missing value treatment; widely used
ones include case substitution, mean and mode imputation, and predictive
models. In this paper we built a predictive model for missing value treatment.
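One predictive approach to imputation is nearest-neighbour imputation. The sketch below is a minimal Python illustration (standard library only; the paper itself used R) and assumes purely numeric attributes. The function name `knn_impute` and the two hypothetical columns (age and account balance) are invented for the example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest
    complete rows, measuring distance on the columns the row does have."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]

        def distance(other):
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in known))

        nearest = sorted(complete, key=distance)[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in nearest) / len(nearest)
            for i, v in enumerate(row)
        ])
    return filled

# Hypothetical (age, balance) records; the last balance is missing
rows = [
    [25, 1000.0],
    [27, 1200.0],
    [60, 5000.0],
    [26, None],
]
imputed = knn_impute(rows, k=2)
```

Here the record with age 26 borrows its balance from the two clients closest in age, so the estimate reflects similar clients rather than the overall column mean.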
Machine learning algorithms can solve both classification and regression
problems. Machine learning is the process of designing systems that learn
automatically and improve from experience without being explicitly programmed.
Machine learning algorithms are classified into three types: supervised
learning, unsupervised learning, and reinforcement learning. In this paper, we
use supervised machine learning algorithms to build the model. Supervised
learning algorithms include Regression, Decision Tree, Random Forest, KNN, and
Logistic Regression [8].
A decision tree can be used for both classification and regression. In
decision analysis, a decision tree can be used to represent decisions and
decision making visually and explicitly. The tree has two important entities:
decision nodes and leaves. The decision nodes are where the data is split, and
the leaves are the decisions or final outcomes. A classification tree is a
type of decision tree in which the outcome is a categorical variable such as
‘fit’ or ‘unfit’.
Random forest is one of the most widely used ensemble methods. It is used for
both classification and regression [5]. A random forest is a collection of
many decision trees, each grown to its full depth, and it has advantages such
as automatic feature selection [4].
Gradient boosting builds models sequentially, with each new model reducing the
error left by the previous ones, until one final model is produced. The aim of
most machine learning algorithms is to build the strongest predictive model
while also accounting for computational efficiency. This is where the XGBoost
algorithm plays an important role. XGBoost (eXtreme Gradient Boosting) is a
direct application of gradient boosting to decision trees. It uses a more
regularized model formalization to control over-fitting, which gives better
performance [8].
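The sequential error reduction described above can be illustrated with a small sketch. This is plain gradient boosting for squared error with one-split regression trees (stumps), written in Python with the standard library only; it is not XGBoost, it omits XGBoost's regularization terms, and all names and data values are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a shrunken copy to the ensemble."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Hypothetical 1-D training data
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
mse_start = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_end = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Comparing `mse_start` (the error of the constant initial model) with `mse_end` shows how the training error shrinks as stumps are added; regularized variants such as XGBoost additionally penalize tree complexity to limit over-fitting.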
The traditional sales approach for insurance products is an offline process,
and it has the following disadvantages: (1) there is no customer evaluation
system, so the characteristics of potential customers and their influence
weights are unknown; (2) the data accumulated in this way usually has serious
defects, which indirectly affect the accuracy of the classification model [4].
For many classification models, the class distribution and the correlation
between features affect the forecast results. The imbalanced classes and the
independent attributes of the insurance dataset can cause serious deviation in
the results of a classification model. Problems of this kind can be handled
with sampling methods and supervised learning algorithms.
In this article, we use an over sampling approach with supervised learning
algorithms on a car insurance dataset to build the best predictive model. The
imbalanced classification problem is resolved with over sampling, and the
model is then built with supervised learning algorithms using the training
dataset. Finally, the predictive model is validated with the test dataset, and
the performance of the algorithms is evaluated using a confusion matrix.
Precision, recall, and F-measure are also calculated as performance metrics.
The taxonomy of the proposed classification model is given below:
1. Taxonomy of proposed methodology
The key objective of this paper is to understand the needs and purchase plans
of clients and to identify who will buy the car insurance service during the
campaign. To this end, we apply different classification algorithms in R to
large-scale insurance data to improve predictive performance. We collected the
data from Kaggle datasets.
This is a dataset from one bank in the United States. Besides its usual
services, the bank also provides car insurance services. The bank organizes
regular campaigns to attract new clients. It has data on potential customers,
and its employees call them to advertise the available car insurance options.
The dataset provides general information about clients (age, job, etc.) as
well as more specific information about the current insurance sales campaign
(communication, last contact day) and previous campaigns (attributes such as
previous attempts and outcome). It covers about 4000 customers who were
contacted during the last campaign and for whom the result of the campaign
(whether the customer bought insurance or not) is known.
Data is usually collected for unspecified applications. Data quality is one of
the major issues that need to be addressed in big data analytics. Problems
that affect data quality include (1) noise and outliers, (2) missing values,
and (3) duplicate data. Preprocessing makes the data more appropriate for
analysis.
Data cleaning handles the missing data values. We used predictive model
imputation to predict the missing values from the non-missing data.
Specifically, we used the KNN algorithm, which estimates each missing value
from the values of its nearest neighbors.
Data transformation is a preprocessing step that normalizes the data.
Normalization rescales the attributes of the dataset to a common range. Here,
we used Min-Max normalization, which scales the data to the range between 0
and 1:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where x is the vector to be normalized, and min(x) and max(x) are the minimum
and maximum values in x. Once the dataset is pre-processed, it is ready for
data partitioning.
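As a small worked example of Min-Max normalization, the following Python sketch rescales one numeric attribute to the range 0 to 1. The function name and the sample ages are assumptions for illustration, and mapping a constant column to 0.0 is a choice made here to avoid division by zero, not something stated in the paper.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0
    and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to 0.0 (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical client ages
ages = [18, 30, 42, 66]
scaled = min_max_normalize(ages)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling matters for distance-based steps such as KNN imputation, because otherwise attributes with large numeric ranges dominate the distance.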
In this step, we split the data into separate training (learn) and test sets
rather than working with all of the data at once. The training data contains
the input data together with the expected output. About 60% of the original
dataset constitutes the training dataset, and the other 40% is used as the
testing dataset. The test data validates the model and checks its accuracy.
Here, we partitioned the original insurance dataset into training and test
sets with a 60/40 split.
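A 60/40 partition of this kind can be sketched as below. This is an illustrative Python version using a shuffled split with a fixed seed; the function name, the seed, and the use of row indices in place of real records are assumptions, and the paper's R code may have partitioned the data differently.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the roughly 4000 customer records: just row indices
rows = list(range(4000))
train, test = train_test_split(rows)
```

Every record ends up in exactly one of the two sets, so the test set stays unseen during training.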
Supervised learning is a machine learning technique that infers a function
from training data without being explicitly programmed. Learning is said to be
supervised when the desired outcome is already known.
After partitioning, the next step is to build the model with the training
sample set. The target variable is chosen first. We selected car insurance as
the target variable, and the other attributes in the dataset are taken as
predictors to develop the predictive model. We want a model that predicts who
will buy the car insurance service during the campaign. In this problem, we
need to segregate the clients who buy car insurance in the campaign based on
highly significant input variables.
In this paper, we use the Random Forest and Extreme Gradient Boosting
algorithms to build the predictive model, and we compare the two algorithms.
Before we move to random forest, it is necessary to look at the decision tree.
What is Decision Tree?
A decision tree can be used for both classification and regression, and it is
mostly used in classification problems. Unlike linear models, tree-based
predictive models can capture non-linear relationships, which often gives high
accuracy. The tree segregates the clients based on the values of the predictor
variables and identifies the variable that creates the most homogeneous sets
of clients. Here, our decision variable is categorical.
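The search for the split that creates the most homogeneous sets can be sketched for a single numeric predictor. The paper does not state which homogeneity criterion was used; Gini impurity is assumed here as a common choice, and the function names and the small age/purchase sample are hypothetical. The sketch is in Python with the standard library only.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # equals 1 - p^2 - (1 - p)^2 for two classes

def best_threshold(values, labels):
    """Threshold on one predictor with the lowest weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical sample: client age and whether car insurance was bought
ages = [22, 25, 31, 40, 52, 60]
bought = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(ages, bought)  # (31, 0.0)
```

A full tree repeats this search over all predictors at every decision node and keeps the variable and threshold with the lowest impurity.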
Why Random Forest?
Random forest is a frequently used predictive modeling and machine learning
technique. With a single decision tree, one tree is built to identify the
potential client; in the random forest algorithm, a number of decision trees
are built for the same purpose. A vote from each of the decision trees is
considered in deciding the final class of an object.
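The voting step can be illustrated in a few lines of Python. The three "trees" below are hand-written stand-in rules rather than trained trees, and the field names `age` and `previous_attempts` are hypothetical; the sketch only shows how individual votes are combined into a final class.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class that receives the most votes from the trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Stand-in trees: each maps a client record to class 1 (buys) or 0 (does not)
trees = [
    lambda c: 1 if c["age"] > 35 else 0,
    lambda c: 1 if c["previous_attempts"] > 0 else 0,
    lambda c: 1 if c["age"] > 50 else 0,
]

client = {"age": 40, "previous_attempts": 2}
prediction = forest_predict(trees, client)  # votes are 1, 1, 0 -> class 1
```

In a real random forest each tree is trained on a bootstrap sample with a random subset of features, which is what makes the individual votes differ.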
Sampling is a preprocessing method that selects a subset of the original
samples. It is mainly used to balance the class distribution. In our model, we
used an under sampling approach to balance the data. It reduces the majority
class to bring its frequency closer to that of the rarest class. The original
insurance data is balanced with under sampling, and we then use this sample in
the Random Forest algorithm to build the model. The algorithm randomly
generates n trees to build an effective model.
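Random under sampling as described here can be sketched as follows. This Python illustration (standard library only) reduces every class to the size of the rarest class; the function name, the seed, and the toy data are assumptions, and the paper's R implementation may differ in detail.

```python
import random

def random_under_sample(rows, labels, seed=0):
    """Randomly drop samples so every class has as many rows as the
    rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 8 negative and 2 positive records (rows are just ids here)
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_under_sample(rows, labels)
```

The trade-off is that discarded majority-class rows may carry useful information, which is why under sampling is usually applied only when the majority class is large.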
Extreme Gradient Boosting:
The other classifier is extreme gradient boosting (XGBoost). XGBoost has very
high predictive power and is reported to run about ten times faster than
earlier gradient boosting implementations. It can be used for regression,
classification, and ranking.
XGBoost is also known as a regularized boosting technique, which helps to
reduce overfitting. Over-fitting occurs when the learning model fits the
training data so tightly that it becomes inaccurate in predicting outcomes for
unseen data.
In our model, we first used over sampling to balance the classes. Sampling
techniques can improve prediction performance on imbalanced classes, and we
applied them using R and caret. Over sampling randomly duplicates samples from
the class with few instances. Here, we applied over sampling to the training
set to improve the performance of the model. The balanced samples are then
passed to XGBoost as the training set to build a binary classification model
on the insurance data. After this, the model is validated with the test set.
This produces much better prediction performance than the random forest
algorithm.
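The over sampling step applied to the training set can be sketched in the same spirit. This is an illustrative Python version (standard library only), not the caret routine used in the paper; the function name, seed, and toy data are assumptions.

```python
import random

def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate samples of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy training set: 8 negative and 2 positive records (rows are ids)
train_rows = list(range(10))
train_labels = [0] * 8 + [1] * 2
over_rows, over_labels = random_over_sample(train_rows, train_labels)
```

Only the training set is resampled; leaving the test set at its original class ratio keeps the evaluation representative of the real campaign data.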
Performance analysis of a classification problem is based on confusion-matrix
analysis of the predicted results. In this paper, we used the following
metrics to evaluate the performance of the classification algorithms:
precision, recall, and F1-measure.
Precision is the fraction of instances predicted as positive that are actually
positive. It is also called the positive predictive value. Recall is the
fraction of actual positive instances that are correctly retrieved. F1-measure
is the harmonic mean of precision and recall and summarizes both in a single
overall score.
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is true positive, FP is false positive, TN is true negative, and FN
is false negative.
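These metrics can be computed directly from the confusion-matrix counts. The Python sketch below uses the standard definitions of precision, recall, and F1; the function names and the eight example predictions are invented for illustration and are not results from the paper.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; each falls back to 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = bought car insurance, 0 = did not
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

On imbalanced data these metrics are more informative than plain accuracy, because a model that always predicts the majority class can score high accuracy while having zero recall.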
Table 1: Confusion Matrix
EXPERIMENT AND RESULT
We used KNN for missing data treatment, and after all preprocessing we built
predictive models with XGBoost and Random Forest for the business case. The
comparison of the two models is given below:
Table 2: Performance comparison of XGBoost and random forest.
The result above shows that the XGBoost algorithm outperformed random forest.
ANALYSIS OF THE RESULT
Figure 1: Effect of Missing values before Imputation
Figure 2: Important Features that impact on target variable
using Random Forest Algorithm.
Figure 3: AUC curve for Random Forest Algorithm
Figure 4: AUC curve for Extreme
Gradient Boosting Algorithm
Figure 5: Important Features that impact on target variable
using Gradient Boosting Algorithm.
Figure 6: Model Analysis for Random Forest
Figure 7: Model Analysis for Extreme Gradient Boosting
This paper analyzed the imbalanced distribution of insurance business data,
reviewed preprocessing algorithms for imbalanced datasets, and applied
ensemble decision tree algorithms in R to the large-scale imbalanced
classification of insurance business data. The experimental results showed
that these algorithms are suitable for identifying which customers will buy
the insurance product in a campaign, and that XGBoost outperformed the other
decision tree ensemble, Random Forest. Our future work includes combining the
proposed algorithm with deep learning.
1. E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, “SMOTE-RSB: A hybrid
preprocessing approach based on oversampling and undersampling for high
imbalanced data-sets using SMOTE and rough sets theory,” Knowl. Inf. Syst.,
vol. 33, no. 2, pp. 245-265, 2012.
2. Maryam Farajzadeh-Zanjani, Roozbeh Razavi-Far, and Mehrdad Saif, “Efficient
Sampling Techniques for Ensemble Learning and Diagnosing Bearing Defects under
Class Imbalanced Condition.”
3. Gustavo E. A. P. A. Batista and Maria Carolina Monard, “An Analysis of Four
Missing Data Treatment Methods for Supervised Learning.”
4. Weiwei Lin, Ziming Wu, Longxin Lin, Angzhan Wen, and Jin Li, “An Ensemble
Random Forest Algorithm for Insurance Big Data Analysis,” 2017.
5. Eesha Goel and Er. Abhilasha, “Random Forest: A
Of Data Preprocessing And Partitioning Procedure For Machine Learning. Available: http://www.academia.edu/9517738/conception_of_data_preprocessing_and_partitioning_procedure_for_machine_learning_algorithm
7. Down-Sampling Using Random Forests. Available: https://www.r-bloggers.com/down-sampling-using-random-forests/
8. Boosting in Machine Learning and the Implementation of XGBoost. Available:
https://
9. Tianqi Chen and Tong He, “xgboost: eXtreme Gradient Boosting,” January 4,
2017.