As I promised, I have produced a blog post by the end of July. Get ready for regular blog postings that may be documentation for using our upcoming cloud product scheduled for release in a couple of months, or these blog posts just might be my own ranting about whatever happens to be on my mind.

In this inaugural blog posting, I would just like to review why I developed Reduced Error Logistic Regression (RELR) over the past twenty years of my career. When I was on the faculty at USC in the early 1990s, I saw social scientists such as psychologists, political scientists, economists, and sociologists build logistic regression models and regard these models as reflecting real relationships that predicted outcomes related to health, poverty, crime, divorce, etc. The problem was that they hand-picked the predictor features that were used. What I realized was that the regression coefficients and corresponding measures like odds ratios or relative risks could change markedly with different hand-picked variables. So, the relationships obtained from their logistic regression models were largely a reflection of how they thought that the world worked, as opposed to how the world actually worked.
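
To make the point about hand-picked variables concrete, here is a small sketch in Python. The counts are invented purely for illustration, and the names (strata, odds_ratio, Z) are my own placeholders rather than anything from a real study. With a single binary predictor, the logistic regression coefficient is just the log of the odds ratio from the two-by-two table, so the effect of leaving a second variable Z out of the model can be seen with simple arithmetic.

```python
import math

# Hypothetical counts, invented for illustration only.
# Each entry is (cases, non_cases) for the exposed and unexposed groups
# within one level of a second variable Z.
strata = {
    "Z=0": {"exposed": (10, 90), "unexposed": (40, 360)},
    "Z=1": {"exposed": (200, 200), "unexposed": (50, 50)},
}

def odds_ratio(exposed, unexposed):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    (a, b), (c, d) = exposed, unexposed
    return (a * d) / (b * c)

# Model 1: the analyst leaves Z out, which pools the two strata together.
pooled_exposed = tuple(sum(s["exposed"][i] for s in strata.values()) for i in range(2))
pooled_unexposed = tuple(sum(s["unexposed"][i] for s in strata.values()) for i in range(2))
crude_or = odds_ratio(pooled_exposed, pooled_unexposed)

# Model 2: the analyst includes Z, so the comparison is made within each stratum.
stratum_ors = {name: odds_ratio(s["exposed"], s["unexposed"]) for name, s in strata.items()}

print(f"odds ratio without Z: {crude_or:.2f} (coefficient {math.log(crude_or):.2f})")
for name, value in stratum_ors.items():
    print(f"odds ratio within {name}: {value:.2f} (coefficient {math.log(value):.2f})")
```

With these made-up counts, the model that omits Z reports an odds ratio of about 3.3, while the odds ratio within each level of Z is exactly 1. The same data support two very different stories depending only on which variables the modeler chose to include.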

Many social scientists, data scientists, and statisticians in business still use logistic regression in a similar manner to describe relative risks and/or odds ratios based upon manually built logistic regression models. Logistic regression not only suffers from the problem of arbitrary predictor selection, but also gives very unstable regression coefficients with large standard errors when there is multicollinearity among the selected predictors. The effect of such instability is that quite different regression coefficients, possibly even with opposite signs, may be found across independent random samples of observations from the same population. Multicollinearity is a quite general problem and will likely be found in most data with more than a few predictor variables, especially if interaction and/or nonlinear effects are considered. Even with very large sample sizes, error related to multicollinearity is usually still a problem when a model must be selected from a high-dimensional set of candidate predictors.
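
The instability described above can be sketched with a small simulation. This is only an illustration under assumptions of my own choosing: two normally distributed predictors, true coefficients of 0.5 each, 200 observations per sample, and an ordinary maximum likelihood logistic regression fitted by Newton-Raphson in plain Python. All function names (simulate, fit_logistic, slope_spread) are hypothetical helpers, and none of this is RELR.

```python
import math
import random
import statistics

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(a, b):
    """Solve the small linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(rows, ys, iters=15):
    """Maximum likelihood logistic regression via Newton-Raphson steps."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def simulate(rho, n, rng):
    """Draw n observations with two predictors whose correlation is rho."""
    rows, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        p = sigmoid(0.5 * x1 + 0.5 * x2)  # assumed true model
        rows.append([1.0, x1, x2])        # leading 1.0 is the intercept term
        ys.append(1 if rng.random() < p else 0)
    return rows, ys

def slope_spread(rho, samples=20, n=200, seed=1):
    """Fit independent samples; return the spread and negative count of the x1 slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(samples):
        rows, ys = simulate(rho, n, rng)
        slopes.append(fit_logistic(rows, ys)[1])
    return statistics.stdev(slopes), sum(s < 0 for s in slopes)

sd_indep, neg_indep = slope_spread(rho=0.0)
sd_corr, neg_corr = slope_spread(rho=0.99)
print(f"rho=0.00: slope std dev {sd_indep:.2f}, negative slopes {neg_indep} of 20")
print(f"rho=0.99: slope std dev {sd_corr:.2f}, negative slopes {neg_corr} of 20")
```

The true slope for the first predictor is positive in both settings. Under these assumptions, one would expect the estimated slope to vary far more from sample to sample when the two predictors are correlated at 0.99, and any negative estimates in that setting are sign flips of the kind described above.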

Substantial bias and error problems are certainly not unique to logistic regression, but instead are seen in all widely used, traditional regression and predictive analytic methods. For example, even ensemble methods such as Random Forests, which use bootstrap sampling and were designed to overcome bias and error problems, are now known to have significant bias problems if the original sample used in the bootstrap is biased at all. This is because the effect of the bootstrapping is simply to exacerbate the biases in the original sample. All samples have some biases, so Random Forests may not have good external validity when they exacerbate these biases. This Random Forests problem is especially noticeable with correlated predictor variables, categorical predictor variables with more than a few categories, and problems with a mix of both categorical and continuous predictor variables. The only known fix to the Random Forests problem is to stop using bootstrap samples and instead use independent samples, as demonstrated in the simulation work of Strobl. However, this may not be a practical solution in most applications because there might not be enough independent samples to build a reliable ensemble model.
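
A narrow part of this argument can be sketched in a few lines of Python: bootstrap resamples are drawn from the original sample, so they inherit whatever bias that sample has. The sketch below is not a Random Forest and does not reproduce Strobl's simulations. It assumes a standard normal population and a made-up selection rule in which values below -0.5 are never observed, and it tracks only the sample mean.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

POPULATION_MEAN = 0.0  # the population is standard normal by construction

# Biased original sample: values at or below -0.5 are never observed
# (a hypothetical selection rule chosen only for illustration).
sample = []
while len(sample) < 200:
    value = rng.gauss(POPULATION_MEAN, 1.0)
    if value > -0.5:
        sample.append(value)

sample_mean = sum(sample) / len(sample)

# Bootstrap: resample the biased sample with replacement and
# record the mean of each resample.
boot_means = []
for _ in range(500):
    resample = rng.choices(sample, k=len(sample))
    boot_means.append(sum(resample) / len(resample))

boot_average = sum(boot_means) / len(boot_means)

print(f"population mean:        {POPULATION_MEAN:.2f}")
print(f"biased sample mean:     {sample_mean:.2f}")
print(f"average bootstrap mean: {boot_average:.2f}")
```

The bootstrap means cluster around the mean of the biased sample rather than around the population mean. However many resamples are averaged, the resampling step by itself never sees the observations that the original sample left out.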

An ensemble method that is based upon averaging hundreds or more models built by different algorithms and modelers may overcome these bias and error problems. This is the stacked ensemble methodology utilized in IBM Watson’s Jeopardy performance. Yet, stacked ensemble modeling is a very difficult challenge because so many models must be built, and they must differ from one another so that the ensemble average does not exhibit biases unique to individual modelers or algorithms. For this reason, stacked ensemble models are not widely used today, as they cannot really be automated and represent an enormous investment to build and update.

RELR avoids all of these problems. That is, RELR represents a completely automated machine learning methodology which naturally removes the traditional predictor selection bias, along with all error problems related to sampling, multicollinearity, overfitting, and high dimension data.

I will begin to get into the specifics of how RELR avoids these problems in my next blog posting. In the meantime, if you have any comments you would like to share privately with us, please let us know with the form below.