Variable Selection using Python — Vote based approach
Variable selection is one of the key steps in the predictive modeling process. It is an art. To put it in simple terms, variable selection is like picking a soccer team to win the World Cup: you need the best player in each position, and you don't want two or more players who play the same position.
In Python, we have several techniques to select variables, including recursive feature elimination, tree-based selection, and L1-based feature selection. The scikit-learn documentation provided below walks through these techniques.
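As a taste of one such technique, here is a minimal sketch of recursive feature elimination with scikit-learn. The breast cancer dataset and the logistic regression estimator are illustrative choices, not part of the original article:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative dataset: 30 numeric features, binary target.
data = load_breast_cancer()
X, y = data.data, data.target

# Keep the 5 strongest features; RFE refits the model repeatedly,
# dropping the weakest feature on each round.
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)

selected = [name for name, keep in zip(data.feature_names, selector.support_)
            if keep]
print(selected)
```

The same `fit`/`get_support` pattern applies to most scikit-learn selectors, which is what makes it easy to combine several of them.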
Vote based approach for variable selection
The idea here is to apply a variety of techniques to select variables. Whenever an algorithm picks a variable, that variable gets a vote. At the end, we tally the total votes for each variable and pick the best ones based on their vote counts. This way, we end up with the best variables with minimal effort in the variable selection process.
Github Code
The following steps happen during variable selection:
- Information Value using Weight of evidence
- Variable Importance using Random Forest
- Recursive Feature Elimination
- Variable Importance using Extra trees classifier
- Chi Square best variables
- L1 based feature selection
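The voting idea behind most of the steps above can be sketched as follows. This is a simplified illustration, not the Github code itself: it covers the random forest, extra trees, RFE, chi-square, and L1-based selectors on an illustrative dataset, and the top-K cutoff of 5 is an arbitrary choice (the weight-of-evidence step is omitted here since it has no scikit-learn built-in):

```python
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names
TOP_K = 5                     # how many features each selector votes for
votes = Counter()

# Random Forest / Extra Trees: vote for the K highest-importance features.
for model in (RandomForestClassifier(random_state=0),
              ExtraTreesClassifier(random_state=0)):
    model.fit(X, y)
    votes.update(names[np.argsort(model.feature_importances_)[-TOP_K:]])

# Recursive Feature Elimination on a logistic regression.
rfe = RFE(LogisticRegression(max_iter=5000),
          n_features_to_select=TOP_K).fit(X, y)
votes.update(names[rfe.support_])

# Chi-square scores (the features here are non-negative, as chi2 requires).
chi = SelectKBest(chi2, k=TOP_K).fit(X, y)
votes.update(names[chi.get_support()])

# L1-based selection: vote for features with non-zero L1 coefficients.
l1 = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)
votes.update(names[l1.get_support()])

# Rank features by the number of votes received.
for name, count in votes.most_common(TOP_K):
    print(f"{name}: {count} votes")
```

Because every selector exposes either `get_support()`, `support_`, or `feature_importances_`, adding another technique is just one more `votes.update(...)` call.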
Once these steps are completed, we take the best variables selected by each algorithm; each selection counts as a vote. We tally the total number of votes per variable and then perform a multicollinearity check on the variables selected. The process can be extended with other variable selection techniques before counting the votes.
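One common way to run the multicollinearity check is via variance inflation factors (VIF). The sketch below computes VIF directly with NumPy on synthetic data, where `x3` is deliberately made collinear with `x1` to show how a redundant variable gets flagged; the variable names and the VIF > 5 rule of thumb are illustrative assumptions:

```python
import numpy as np

# Synthetic stand-in for a voted-in shortlist: x3 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.95 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

def vif(X, i):
    """VIF_i = 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

for i, name in enumerate(names):
    print(f"{name}: VIF = {vif(X, i):.1f}")
# A common rule of thumb drops variables whose VIF exceeds 5-10;
# here x1 and x3 get large VIFs, so one of the pair would be dropped.
```

This is exactly the "two players in the same position" problem from the soccer analogy: two highly collinear variables carry the same information, so keeping both adds noise rather than signal.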
Have fun!
Also, please look at my other article, which uses this code in an end-to-end Python modeling framework.
I released a Python package that does variable selection using the vote-based approach. If you are interested in using the package version, read the article below.