Examples/Encoding/Example.Rmd at master · srutis/Examples

History

274 lines (237 loc) · 12.6 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

---

title: "EncodingExample"

author: "Win-Vector LLC"

date: "4/15/2017"

output:

md_document:

variant: markdown_github

---

[`R`](https://cran.r-project.org) has "one-hot" encoding hidden in most of its modeling paths. Asking an `R`

user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

```{r modelenc}

dTrain <- data.frame(x= c('a','b','b', 'c'),

y= c(1, 2, 1, 2))

summary(lm(y~x, data= dTrain))

```

Much of the encoding in `R` is essentially based on "contrasts" implemented in

`stats::model.matrix()` Note: do not use `base::data.matrix()` or use [hashing](http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/) before

modeling- you might get away with them (especially with tree based methods), but they are [not in general

good technique](http://www.win-vector.com/blog/2014/12/a-comment-on-preparing-data-for-classifiers/) as we show below:

```{r datamatrix}

data.matrix(dTrain)

```

`stats::model.matrix()` does not store its one-hot plan in a convenient manner (it

can be inferred by pulling the "`contrasts`" attribute plus examining

the column names of the first encoding, but the levels identified are not conveniently

represented).

When directly applying `stats::model.matrix()` you can not

safely assume the same formula applied to two different data sets (say train

and application or test) are using the same encoding! We demonstrate this below:

```{r modelmatrixexample}

dTrain <- data.frame(x= c('a','b','c'),

stringsAsFactors = FALSE)

encTrain <- stats::model.matrix(~x, dTrain)

print(encTrain)

dTest <- data.frame(x= c('b','c'),

stringsAsFactors = FALSE)

stats::model.matrix(~x, dTest)

```

The above mal-coding can be a critical flaw when you are building a model and then later using the model on new

data (be it cross-validation data, test data, or future application data). Many

`R` users are not familiar with the above issue as encoding is hidden in model training,

and how to encode new data is stored as part of the model. `Python` `scikit-learn` users coming

to `R` often ask "where is the one-hot encoder" (as it isn't discussed as much in `R` as it

is in `scikit-learn`) and even supply a number of (low quality) one-off packages "porting one-hot encoding to `R`."

The main place an `R` user needs a proper encoder (and that is an encoder that stores its encoding

plan in a conveniently re-usable form, which many of the "one-off ported from `Python`" packages actually fail to do) is

when using a machine learning implementation that isn't completely `R`-centric. One such system is

[`xgboost`](https://github.com/dmlc/xgboost) which requires (as is typical of machine learning in `scikit-learn`)

data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a `data.frame`).

This requires explicit conversion on the part of the `R` user, and many `R` users get it wrong (fail to store the

encoding plan somewhere). To make this concrete let's work a simple example.

Let's try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms)- just plugging the obvious variable into `xgboost`. As we said: `xgboost` requires a numeric matrix for its input, so unlike many `R` modeling methods we must manage the data encoding ourselves (instead of leaving that to `R` which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).

We bring in our data:

```{r modelingexample}

# set up example data set

library("titanic")

data(titanic_train)

str(titanic_train)

summary(titanic_train)

outcome <- 'Survived'

target <- 1

shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')

for(v in shouldBeCategorical) {

titanic_train[[v]] <- as.factor(titanic_train[[v]])

}

tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")

vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))

dTrain <- titanic_train

```

And design our cross-validated modeling experiment:

```{r setup}

library("xgboost")

library("sigr")

library("WVPlots")

library("vtreat")

set.seed(4623762)

crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain),

10,

dTrain,

dTrain[[outcome]])

evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) {

preds <- rep(NA_real_, nrow(xMatrix))

for(ci in crossValPlan) {

nrounds <- 1000

cv <- xgb.cv(data= xMatrix[ci$train, ],

label= outcomeV[ci$train],

objective= 'binary:logistic',

nrounds= nrounds,

verbose= 0,

nfold= 5)

#nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression

nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification

model <- xgboost(data= xMatrix[ci$train, ],

label= outcomeV[ci$train],

objective= 'binary:logistic',

nrounds= nrounds,

verbose= 0)

preds[ci$app] <- predict(model, xMatrix[ci$app, ])

}

preds

}

```

Our preferred way to encode data is to use the [`vtreat`](https://CRAN.R-project.org/package=vtreat) package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).

```{r vtreatZ}

set.seed(4623762)

tplan <- vtreat::designTreatmentsZ(dTrain, vars,

minFraction= 0,

verbose=FALSE)

# restrict to common varaibles types

# see vignette('vtreatVariableTypes', package = 'vtreat') for details

sf <- tplan$scoreFrame

newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]

trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,

varRestriction = newvars))

print(dim(trainVtreat))

print(colnames(trainVtreat))

dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,

dTrain[[outcome]]==target,

crossValPlan)

sigr::permTestAUC(dTrain,

'predVtreatZ',

outcome, target)

WVPlots::ROCPlot(dTrain,

'predVtreatZ',

outcome, target,

'vtreat encoder performance')

```

Model matrix can perform similar encoding *when* we only have a single data set.

```{r modelmatrix}

set.seed(4623762)

f <- paste('~ 0 + ', paste(vars, collapse = ' + '))

# model matrix skips rows with NAs by default,

# get control of this through an option

oldOpt <- getOption('na.action')

options(na.action='na.pass')

trainModelMatrix <- stats::model.matrix(as.formula(f),

dTrain)

# note model.matrix does not conveniently store the encoding

# plan, so you may run into difficulty if you were to encode

# new data which didn't have all the levels seen in the training

# data.

options(na.action=oldOpt)

print(dim(trainModelMatrix))

print(colnames(trainModelMatrix))

dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,

dTrain[[outcome]]==target,

crossValPlan)

sigr::permTestAUC(dTrain,

'predModelMatrix',

outcome, target)

WVPlots::ROCPlot(dTrain,

'predModelMatrix',

outcome, target,

'model.matrix encoder performance')

```

The `caret` package also supplies an encoding functionality properly split between training (`caret::dummyVars()`) and application (called `predict()`).

```{r caret}

library("caret")

set.seed(4623762)

f <- paste('~', paste(vars, collapse = ' + '))

encoder <- caret::dummyVars(as.formula(f), dTrain)

trainCaret <- predict(encoder, dTrain)

print(dim(trainCaret))

print(colnames(trainCaret))

dTrain$predCaret <- evaluateModelingProcedure(trainCaret,

dTrain[[outcome]]==target,

crossValPlan)

sigr::permTestAUC(dTrain,

'predCaret',

outcome, target)

WVPlots::ROCPlot(dTrain,

'predCaret',

outcome, target,

'caret encoder performance')

```

We usually forget to teach `vtreat::designTreatmentsZ()` as it is often dominated by the

more powerful y-aware methods `vtreat` supplies (though not for this simple example).

`vtreat::designTreatmentsZ` has a number of useful properties:

* Does not look at the outcome values, so does not require extra care in cross-validation.

* Saves its encoding, so can be used correctly on new data.

The above two properties are shared with `caret::dummyVars()`. Additional features

of `vtreat::designTreatmentsZ` (that differ from `caret::dummyVars()`'s choices) include:

* No `NA` values are passed through by `vtreat::prepare()`.

* `NA` presence is added as an additional informative column.

* A few derived columns (such as pooling of rare levels are made available).

* Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.

* Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by `vtreat::prepare()` (`caret::dummyVars()` considers this an error).

The `vtreat` y-aware methods include proper nested modeling and y-aware dimension reduction.

`vtreat` is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high cardinality categorical variables can have a huge positive impact in modeling performance. In many cases `vtreat` works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach `vtreat` sore of "bimodally" in both a ["fire and forget" mode](http://www.win-vector.com/blog/2017/03/vtreat-prepare-data/) and a ["all the details on deck" mode](https://arxiv.org/abs/1611.09477) (suitable for formal citation). Either way `vtreat` can make your modeling procedures stronger, more reliable, and easier.

##################

You could also try y-aware encoding, but it isn't adding anything positive in this situation as

we have not introduced any high-cardinality categorical variables into this modeling example

(the main place y-aware encoding helps).

```{r vtreatC}

set.seed(4623762)

# for y aware evaluation must cross-validate whole procedure, designing

# on data you intend to score on can leak information.

preds <- rep(NA_real_, nrow(dTrain))

for(ci in crossValPlan) {

cfe <- vtreat::mkCrossFrameCExperiment(dTrain[ci$train, , drop=FALSE],

minFraction= 0,

vars,

outcome, target)

tplan <- cfe$treatments

sf <- tplan$scoreFrame

newvars <- sf$varName[sf$sig < 1/nrow(sf)]

trainVtreat <- cfe$crossFrame[ , c(newvars, outcome), drop=FALSE]

nrounds <- 1000

cv <- xgb.cv(data= as.matrix(trainVtreat[, newvars, drop=FALSE]),

label= trainVtreat[[outcome]]==target,

objective= 'binary:logistic',

nrounds= nrounds,

verbose= 0,

nfold= 5)

#nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression

nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification

model <- xgboost(data= as.matrix(trainVtreat[, newvars, drop=FALSE]),

label= trainVtreat[[outcome]]==target,

objective= 'binary:logistic',

nrounds= nrounds,

verbose= 0)

appVtreat <- vtreat::prepare(tplan,

dTrain[ci$app, , drop=FALSE],

varRestriction = newvars)

preds[ci$app] <- predict(model,

as.matrix(appVtreat[, newvars, drop=FALSE]))

}

dTrain$predVtreatC <- preds

sigr::permTestAUC(dTrain,

'predVtreatC',

outcome, target)

WVPlots::ROCPlot(dTrain,

'predVtreatC',

outcome, target,

'vtreat y-aware encoder performance')

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand file tree

Search code, repositories, users, issues, pull requests...

FilesExpand file tree

Example.Rmd

Latest commit

History

Example.Rmd

File metadata and controls

Expand file tree