Description
Describe the workflow you want to enable
I want to introduce support for orthogonal polynomial features via QR decomposition in PolynomialFeatures, closely mirroring the behavior of R's poly() function.
In regression modeling, using orthogonal polynomials can often lead to improved numerical stability and reduced multicollinearity among polynomial terms.
As an example of what the difference looks like in R:
    # fits raw polynomial data without an orthogonal basis
    model_raw <- lm(y ~ I(x) + I(x^2) + I(x^3), data = data)
    # model_raw <- lm(y ~ poly(x, 3, raw = TRUE), data = data)

    # fits the same degree-3 polynomial using an orthogonal basis
    model_poly <- lm(y ~ poly(x, 3), data = data)
This behavior cannot currently be replicated with scikit-learn's PolynomialFeatures, which only produces the raw monomial terms. As a result, transitioning from R to Python often leads to discrepancies in model behavior and performance.
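To make the collinearity concern concrete, here is a minimal sketch of what PolynomialFeatures returns today. The input grid (50 evenly spaced points on [1, 10]) is an arbitrary assumption chosen for illustration, not data from the pipelines mentioned in this issue.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Illustrative input only: 50 evenly spaced points on [1, 10].
x = np.linspace(1.0, 10.0, 50).reshape(-1, 1)

# Current behavior: raw monomial columns x, x^2, x^3.
raw = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)

# Pairwise correlations between the monomial columns.
corr = np.corrcoef(raw, rowvar=False)

# Condition number of the raw design matrix (larger means worse conditioned).
cond_raw = np.linalg.cond(raw)

print(corr.round(3))
print(cond_raw)
```

On a grid like this the monomial columns are strongly correlated with one another, which is the situation an orthogonal basis is meant to avoid.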
Describe your proposed solution
I propose extending PolynomialFeatures with a new parameter:
PolynomialFeatures(..., method="raw")
Accepted values:
- "raw" (default): retains existing behavior, returning standard raw terms.
- "qr": applies QR decomposition to each feature to generate orthogonal polynomial features.
Because R's poly() only operates on 1D input vectors, my thought was to apply QR decomposition feature by feature when the input is multi-dimensional. Each column is processed independently, mirroring R's approach.
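The feature-by-feature idea can be sketched roughly as follows. This is not the drafted implementation referred to in this issue. The helper names orthogonal_poly_1d and orthogonal_poly_features are hypothetical, and the sign convention (making the diagonal of R positive) is an assumption intended to approximate R's output, not something verified against R here.

```python
import numpy as np


def orthogonal_poly_1d(x, degree):
    """Hypothetical helper: orthonormal polynomial columns for one feature.

    Assumes x has more than `degree` distinct values, so the QR
    factorization is full rank.
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # Columns: 1, x, x^2, ..., x^degree (of the centered feature).
    vander = np.vander(centered, N=degree + 1, increasing=True)
    # Reduced QR: q has shape (n_samples, degree + 1).
    q, r = np.linalg.qr(vander)
    # Assumed sign convention: flip columns so diag(R) is positive.
    q = q * np.sign(np.diag(r))
    # Drop the constant column; keep degrees 1..degree.
    return q[:, 1:]


def orthogonal_poly_features(X, degree):
    """Hypothetical helper: apply the 1D transform to each column of X."""
    X = np.asarray(X, dtype=float)
    blocks = [orthogonal_poly_1d(X[:, j], degree) for j in range(X.shape[1])]
    return np.hstack(blocks)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Z = orthogonal_poly_features(X, degree=3)
print(Z.shape)
```

Within each feature's block the columns are orthonormal and sum to zero. Columns from different features are not orthogonal to one another, and this sketch only transforms the data it is given; a real transformer would also need to store enough state during fit to apply the same basis to new data.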
This feature would interact with other parameters as follows:
- include_bias: When method="qr", the orthogonal polynomial basis inherently includes a transformed first column. However, this column is not a plain column of ones. Therefore, the concept of include_bias=True (which adds a column of ones) becomes redundant or misleading in this context. One option is to always set include_bias=False if method="qr" and return the orthogonal columns only; another is to raise a warning.
- interaction_only: This would be incompatible with method="qr", since the QR-based transformation does not naturally support selective inclusion of interaction terms.
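The include_bias point can be checked numerically. The sketch below uses an arbitrary 20-point grid as an assumed input and looks at the first column of Q when the leading column of the Vandermonde matrix is all ones.

```python
import numpy as np

# Illustrative input only: 20 evenly spaced points on [0, 1].
x = np.linspace(0.0, 1.0, 20)
n = x.shape[0]

# Columns: 1, x, x^2, x^3 (of the centered feature).
vander = np.vander(x - x.mean(), N=4, increasing=True)
q, _ = np.linalg.qr(vander)

first = q[:, 0]
# The first column is constant, but scaled to unit norm rather than equal to 1.
is_constant = np.allclose(first, first[0])
magnitude = abs(first[0])

print(is_constant, magnitude, 1.0 / np.sqrt(n))
```

The first column is constant with magnitude 1/sqrt(n), so it spans the same direction as an intercept column without being a column of ones, which is why combining it with include_bias=True would be redundant.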
Describe alternatives you've considered, if relevant
Currently, users must implement QR decomposition manually when orthogonal polynomials are needed. This is a common pattern in statistical workflows but lacks "off the shelf" support in any major Python library. This feature would eliminate the need to do this decomposition manually and would improve workflows for researchers who are used to R's statistical tools.
Additional context
This idea stemmed from a broader effort to convert statistical modeling pipelines from R to Python, where discrepancies in regression results were traced to the lack of orthogonal polynomial support in PolynomialFeatures.
I have drafted and tested a 1D implementation of this feature but wanted feedback on whether this idea aligns with scikit-learn's scope before moving on. In particular, I'd appreciate input on:
- Acceptability of feature-wise orthogonalization for multi-feature input.
- Preferred parameter naming (e.g., method="qr" vs. orthogonal=True).
- Compatibility decisions around parameters like include_bias and interaction_only.