Algorithm: Random Forest, ensemble model
Published: 2019-03-04


Ensemble Model

For classification problems, ensemble models are very effective, e.g. in situations like image recognition via deep learning (a black box).

For a scoring system, we use GBDT, XGBoost, etc.

In engineering, interpretability is very important, since it lets us locate the problem when an issue arises.

 

How to build an ensemble model? Bagging and Boosting

Bagging: Random forest

Boosting: GBDT, XGBoost

We average the predictions of all the models to obtain the final prediction.

We use the variance / standard deviation of those predictions to evaluate the model's stability.

 

As the sketch below illustrates, averaging makes the ensemble more stable: the variance of the averaged prediction shrinks as more models are added.
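A minimal sketch of this effect on synthetic data (the noise level and model count here are illustrative assumptions, not values from the original example):

# Sketch: averaging many models' predictions reduces variance (synthetic, illustrative)
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
n_trials, n_models = 10000, 25

# each "model" predicts the true value plus independent noise
single_preds = true_value + rng.normal(0, 1.0, size=n_trials)
# an ensemble averages the predictions of n_models such models
ensemble_preds = (true_value + rng.normal(0, 1.0, size=(n_trials, n_models))).mean(axis=1)

print("single model std: %.3f" % single_preds.std())    # about 1.0
print("ensemble std:     %.3f" % ensemble_preds.std())  # about 1.0 / sqrt(25) = 0.2

When the models' errors are independent, the standard deviation of the averaged prediction drops roughly by a factor of sqrt(n_models), which is exactly why diversity matters (see below).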

 

Random Forest

Bagging is a framework for building ensemble models.

A random forest uses multiple decision trees and combines their votes for the final prediction.

It can also be used for regression problems, where the final prediction is the mean of the trees' outputs; see the sketch below.
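A short sketch of the regression case (the dataset and hyperparameters here are illustrative choices, not from the original post):

# Sketch: random forest regression; the forest's prediction is the mean over its trees
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# verify: the forest's output equals the mean of the individual trees' outputs
tree_preds = np.stack([t.predict(X_test) for t in reg.estimators_])
print(np.allclose(reg.predict(X_test), tree_preds.mean(axis=0)))  # True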

Build the random forest

If we train decision trees that are highly correlated with each other, the resulting random forest will not perform well.

Diversity is the most important property of a random forest. There are two sources of randomness:

1) Randomize the training samples: each decision tree in the forest is trained on a different subset of the training data, sampled with replacement (bootstrap sampling).

2) Randomize the features: for example, if we have 100 features, we randomly choose 10 of them and build the decision tree on those 10 features only. A minimal sketch of both steps follows.
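The sketch below is illustrative only; note that sklearn's RandomForestClassifier actually re-draws the max_features candidates at every split, whereas this code fixes one feature subset per tree, as described above:

# Sketch: the two sources of randomness when building one tree of the forest
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_randomized_tree(X, y, n_sub_features, rng):
    n_samples, n_features = X.shape
    # 1) bootstrap: sample the training rows with replacement
    rows = rng.integers(0, n_samples, size=n_samples)
    # 2) feature randomization: pick a random subset of the columns
    cols = rng.choice(n_features, size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    return tree, cols  # keep the column indices for prediction time

# usage (X, y as numpy arrays):
# tree, cols = train_randomized_tree(X, y, 10, np.random.default_rng(0))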

Overfitting of a random forest

Hyperparameters of a random forest:

n_estimators: the number of decision trees used. The more trees, the longer the training time.

criterion: the measure of split quality used when choosing the feature for the current node; 'gini' or 'entropy'.

max_depth: the maximum depth of each decision tree.

min_samples_split, min_samples_leaf: the minimum number of samples required to split an internal node, and the minimum number of samples required at a leaf node, respectively.

max_features: the number of features to consider when looking for the best split.

An example:

# import the data set
from sklearn.datasets import load_digits
# import random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# import data
digits = load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create the random forest classifier
clf = RandomForestClassifier(n_estimators=400, criterion='entropy',
                             max_depth=5, min_samples_split=3,
                             max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)
print("Accuracy in train data set is: %.2f, in the test data set is %.2f"
      % (clf.score(X_train, y_train), clf.score(X_test, y_test)))

output:

Accuracy in train data set is: 0.98, in the test data set is 0.95

Another Demo:

Predicting the employee turnover rate

# Turnover rate demo
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
#%matplotlib inline
from sklearn.model_selection import train_test_split

# read data as a pandas dataframe
df = pd.read_csv('HR_comma_sep.csv', index_col=None)
# check whether any data is missing
print(df.isnull().any(), '\n\n')
# print some data
print(df.head(), "\n\n")
   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157
1                0.80             0.86               5                   262
2                0.11             0.88               7                   272
3                0.72             0.87               5                   223
4                0.37             0.52               2                   159

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales
1                   6              0     1                      0  sales
2                   4              0     1                      0  sales
3                   5              0     1                      0  sales
4                   3              0     1                      0  sales

   salary
0     low
1  medium
2  medium
3     low
4     low
# rename the columns
df = df.rename(columns={'satisfaction_level': 'satisfaction',
                        'last_evaluation': 'evaluation',
                        'number_project': 'projectCount',
                        'average_montly_hours': 'averageMonthlyHours',
                        'time_spend_company': 'yearsAtCompany',
                        'Work_accident': 'workAccident',
                        'promotion_last_5years': 'promotion',
                        'sales': 'department',
                        'left': 'turnover'
                        })

# move the label to the first column
front = df['turnover']
df.drop(labels=['turnover'], axis=1, inplace=True)
df.insert(0, 'turnover', front)
#df.head()

# calculate the turnover rate
turnover_rate = df.turnover.value_counts() / len(df)
print("the turnover rate is: %.2f\n\n" % turnover_rate[1])

# print the describe() info
print(df.describe(), "\n\n")
           turnover  satisfaction    evaluation  projectCount  \
count  12504.000000  12504.000000  12504.000000  12504.000000
mean       0.200256      0.621834      0.716446      3.803503
std        0.400208      0.245010      0.169745      1.196592
min        0.000000      0.090000      0.360000      2.000000
25%        0.000000      0.450000      0.560000      3.000000
50%        0.000000      0.650000      0.720000      4.000000
75%        0.000000      0.820000      0.870000      5.000000
max        1.000000      1.000000      1.000000      7.000000

       averageMonthlyHours  yearsAtCompany  workAccident     promotion
count         12504.000000    12504.000000  12504.000000  12504.000000
mean            200.721769        3.385717      0.149472      0.016555
std              49.341169        1.321437      0.356568      0.127601
min              96.000000        2.000000      0.000000      0.000000
25%             157.000000        3.000000      0.000000      0.000000
50%             200.000000        3.000000      0.000000      0.000000
75%             244.000000        4.000000      0.000000      0.000000
max             310.000000       10.000000      1.000000      1.000000
# convert the string values into integer codes
df['department'] = df['department'].astype('category').cat.codes
df['salary'] = df['salary'].astype('category').cat.codes

# split the train / test data set
target_name = 'turnover'
X = df.drop('turnover', axis=1)
y = df[target_name]
# stratify=y keeps the turnover rate in each split equal to the rate in the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=123, stratify=y)

# now, time to train
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# train the decision tree
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    #max_depth=3,  # constrain the tree depth to prevent overfitting
    min_weight_fraction_leaf=0.01  # each leaf must hold at least 1% of the sample weight
    )
dtree = dtree.fit(X_train, y_train)
print("\n\n ---Decision Tree---")
print(classification_report(y_test, dtree.predict(X_test)))
 ---Decision Tree---
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1500
           1       0.93      0.89      0.91       376

    accuracy                           0.96      1876
   macro avg       0.95      0.94      0.94      1876
weighted avg       0.96      0.96      0.96      1876

 

# train the random forest
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000,
    max_depth=None,  # None means no depth limit
    min_samples_split=10,  # minimum number of samples required to split a node
    #min_weight_fraction_leaf=0.02  # minimum fraction of samples per leaf, to prevent overfitting
    )
rf.fit(X_train, y_train)
print("\n\n ---Random Forest---")
print(classification_report(y_test, rf.predict(X_test)))
 ---Random Forest---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1500
           1       0.99      0.90      0.94       376

    accuracy                           0.98      1876
   macro avg       0.98      0.95      0.96      1876
weighted avg       0.98      0.98      0.98      1876
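Since interpretability was highlighted at the beginning, a natural follow-up (a short sketch reusing the rf model and the X DataFrame from above) is to rank the features by the forest's impurity-based importance:

# Sketch: which features drive the turnover prediction?
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))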

 

Reposted from: http://hpbg.baihongyu.com/
