forked from WinVector/Examples
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathExample.Rmd
More file actions
274 lines (237 loc) · 12.6 KB
/
Example.Rmd
File metadata and controls
274 lines (237 loc) · 12.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
---
title: "EncodingExample"
author: "Win-Vector LLC"
date: "4/15/2017"
output:
md_document:
variant: markdown_github
---
[`R`](https://cran.r-project.org) has "one-hot" encoding hidden in most of its modeling paths. Asking an `R`
user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere.
For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
```{r modelenc}
dTrain <- data.frame(x= c('a','b','b', 'c'),
y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
```
Much of the encoding in `R` is essentially based on "contrasts" implemented in
`stats::model.matrix()` Note: do not use `base::data.matrix()` or use [hashing](http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/) before
modeling- you might get away with them (especially with tree based methods), but they are [not in general
good technique](http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/) as we show below:
```{r datamatrix}
data.matrix(dTrain)
```
`stats::model.matrix()` does not store its one-hot plan in a convenient manner (it
can be inferred by pulling the "`contrasts`" attribute plus examining
the column names of the first encoding, but the levels identified are not conveniently
represented).
When directly applying `stats::model.matrix()` you can not
safely assume the same formula applied to two different data sets (say train
and application or test) are using the same encoding! We demonstrate this below:
```{r modelmatrixexample}
dTrain <- data.frame(x= c('a','b','c'),
stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~x, dTrain)
print(encTrain)
dTest <- data.frame(x= c('b','c'),
stringsAsFactors = FALSE)
stats::model.matrix(~x, dTest)
```
The above mal-coding can be a critical flaw when you are building a model and then later using the model on new
data (be it cross-validation data, test data, or future application data). Many
`R` users are not familiar with the above issue as encoding is hidden in model training,
and how to encode new data is stored as part of the model. `Python` `scikit-learn` users coming
to `R` often ask "where is the one-hot encoder" (as it isn't discussed as much in `R` as it
is in `scikit-learn`) and even supply a number of (low quality) one-off packages "porting one-hot encoding to `R`."
The main place an `R` user needs a proper encoder (and that is an encoder that stores its encoding
plan in a conveniently re-usable form, which many of the "one-off ported from `Python`" packages actually fail to do) is
when using a machine learning implementation that isn't completely `R`-centric. One such system is
[`xgboost`](https://github.com/dmlc/xgboost) which requires (as is typical of machine learning in `scikit-learn`)
data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a `data.frame`).
This requires explicit conversion on the part of the `R` user, and many `R` users get it wrong (fail to store the
encoding plan somewhere). To make this concrete let's work a simple example.
Let's try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms)- just plugging the obvious variable into `xgboost`. As we said: `xgboost` requires a numeric matrix for its input, so unlike many `R` modeling methods we must manage the data encoding ourselves (instead of leaving that to `R` which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).
We bring in our data:
```{r modelingexample}
# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)
summary(titanic_train)
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
```
And design our cross-validated modeling experiment:
```{r setup}
library("xgboost")
library("sigr")
library("WVPlots")
library("vtreat")
set.seed(4623762)
crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain),
10,
dTrain,
dTrain[[outcome]])
evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) {
preds <- rep(NA_real_, nrow(xMatrix))
for(ci in crossValPlan) {
nrounds <- 1000
cv <- xgb.cv(data= xMatrix[ci$train, ],
label= outcomeV[ci$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0,
nfold= 5)
#nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression
nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification
model <- xgboost(data= xMatrix[ci$train, ],
label= outcomeV[ci$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0)
preds[ci$app] <- predict(model, xMatrix[ci$app, ])
}
preds
}
```
Our preferred way to encode data is to use the [`vtreat`](https://CRAN.R-project.org/package=vtreat) package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).
```{r vtreatZ}
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
minFraction= 0,
verbose=FALSE)
# restrict to common varaibles types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
varRestriction = newvars))
print(dim(trainVtreat))
print(colnames(trainVtreat))
dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predVtreatZ',
outcome, target)
WVPlots::ROCPlot(dTrain,
'predVtreatZ',
outcome, target,
'vtreat encoder performance')
```
Model matrix can perform similar encoding *when* we only have a single data set.
```{r modelmatrix}
set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model matrix skips rows with NAs by default,
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action='na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f),
dTrain)
# note model.matrix does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training
# data.
options(na.action=oldOpt)
print(dim(trainModelMatrix))
print(colnames(trainModelMatrix))
dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predModelMatrix',
outcome, target)
WVPlots::ROCPlot(dTrain,
'predModelMatrix',
outcome, target,
'model.matrix encoder performance')
```
The `caret` package also supplies an encoding functionality properly split between training (`caret::dummyVars()`) and application (called `predict()`).
```{r caret}
library("caret")
set.seed(4623762)
f <- paste('~', paste(vars, collapse = ' + '))
encoder <- caret::dummyVars(as.formula(f), dTrain)
trainCaret <- predict(encoder, dTrain)
print(dim(trainCaret))
print(colnames(trainCaret))
dTrain$predCaret <- evaluateModelingProcedure(trainCaret,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predCaret',
outcome, target)
WVPlots::ROCPlot(dTrain,
'predCaret',
outcome, target,
'caret encoder performance')
```
We usually forget to teach `vtreat::designTreatmentsZ()` as it is often dominated by the
more powerful y-aware methods `vtreat` supplies (though not for this simple example).
`vtreat::designTreatmentsZ` has a number of useful properties:
* Does not look at the outcome values, so does not require extra care in cross-validation.
* Saves its encoding, so can be used correctly on new data.
The above two properties are shared with `caret::dummyVars()`. Additional features
of `vtreat::designTreatmentsZ` (that differ from `caret::dummyVars()`'s choices) include:
* No `NA` values are passed through by `vtreat::prepare()`.
* `NA` presence is added as an additional informative column.
* A few derived columns (such as pooling of rare levels are made available).
* Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.
* Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by `vtreat::prepare()` (`caret::dummyVars()` considers this an error).
The `vtreat` y-aware methods include proper nested modeling and y-aware dimension reduction.
`vtreat` is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high cardinality categorical variables can have a huge positive impact in modeling performance. In many cases `vtreat` works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach `vtreat` sore of "bimodally" in both a ["fire and forget" mode](http://www.win-vector.com/blog/2017/03/vtreat-prepare-data/) and a ["all the details on deck" mode](https://arxiv.org/abs/1611.09477) (suitable for formal citation). Either way `vtreat` can make your modeling procedures stronger, more reliable, and easier.
##################
You could also try y-aware encoding, but it isn't adding anything positive in this situation as
we have not introduced any high-cardinality categorical variables into this modeling example
(the main place y-aware encoding helps).
```{r vtreatC}
set.seed(4623762)
# for y aware evaluation must cross-validate whole procedure, designing
# on data you intend to score on can leak information.
preds <- rep(NA_real_, nrow(dTrain))
for(ci in crossValPlan) {
cfe <- vtreat::mkCrossFrameCExperiment(dTrain[ci$train, , drop=FALSE],
minFraction= 0,
vars,
outcome, target)
tplan <- cfe$treatments
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$sig < 1/nrow(sf)]
trainVtreat <- cfe$crossFrame[ , c(newvars, outcome), drop=FALSE]
nrounds <- 1000
cv <- xgb.cv(data= as.matrix(trainVtreat[, newvars, drop=FALSE]),
label= trainVtreat[[outcome]]==target,
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0,
nfold= 5)
#nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression
nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification
model <- xgboost(data= as.matrix(trainVtreat[, newvars, drop=FALSE]),
label= trainVtreat[[outcome]]==target,
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0)
appVtreat <- vtreat::prepare(tplan,
dTrain[ci$app, , drop=FALSE],
varRestriction = newvars)
preds[ci$app] <- predict(model,
as.matrix(appVtreat[, newvars, drop=FALSE]))
}
dTrain$predVtreatC <- preds
sigr::permTestAUC(dTrain,
'predVtreatC',
outcome, target)
WVPlots::ROCPlot(dTrain,
'predVtreatC',
outcome, target,
'vtreat y-aware encoder performance')
```