One of the most troubling aspects of widely used machine learning, data mining, and predictive modeling is that results may replicate in only a minority of cases. This has been highlighted by Stanford Professor John Ioannidis, whose work analyzing medical studies shows that most larger randomized trials do produce predictions that replicate, whereas most predictions that are not based upon larger randomized experimental trials do not replicate very well, as reviewed by a Notre Dame professor in this NY Times article.
Predictive modeling, machine learning, and data mining all refer to the most uncontrolled, non-randomized methods for extracting predictions from data. So we would expect these “machine learning” methods to have far worse replication rates than the 20% rate that Ioannidis reports overall for non-randomized methods. This is because some non-randomized methods, such as quasi-experimental and repeated measures methods, are not data mining/data dredging, so they at least afford some degree of control for confounds related to human and data bias and error. Thus, data mining, predictive modeling and machine learning based purely upon convenience samples in medicine, business and social science may have replication rates far worse than this 20% average.
Unfortunately, we do not know what the replication rates of today’s standard machine learning and predictive modeling applications in medicine, business, and social science might be, because replication in the sense of independent model builders generating similar predictions at the individual level is not performed as a component of standard model validation. Outside of medicine, attempts at such replication are seldom reported. Yet the studies that have looked at this question lead us to expect that many if not most widely applied machine learning and predictive modeling methods do not replicate well in terms of two or more independent modelers generating similar predictions at the individual observation level.
I review some of this evidence for very poor replication in my book Calculus of Thought (Elsevier, 2014). For example, some high-profile evidence for this lack of replication in business and medical machine learning/predictive modeling at the individual observation level was offered by NIH Director Dr. Francis Collins, who appeared in a PBS Nova documentary in 2012. He showed that three different commercially available genomic predictive tests for major diseases gave largely different and even contradictory predictions. Based upon such evidence for predictions that are not replicated, the FDA has now regulated this industry more heavily and ordered that such tests not be sold unless better reliability can be shown, as described in this link. However, in isolated instances where good reliability and causal interpretation are shown, such as in some rare diseases, the FDA does allow such information to be sold, as in the example reported here.
Long before I knew about the work of Ioannidis or the many other examples of non-replication like the one provided by Collins, I had been interested in machine learning methods that produce results that replicate substantially better than the widely used standard predictive modeling methods. This interest is what led me to develop the Reduced Error Logistic Regression (RELR) methodology. RELR models can be expected to generate predictions that replicate better than other machine learning methods because RELR does not have any arbitrary or user-based tuning parameters to influence predictions. So even though RELR is based upon the logistic regression method that many would say is not “machine learning” because it is not a black box, RELR is also completely automated. RELR’s automation is unlike the automated Stepwise Regression, Decision Trees or Support Vector Machines (SVM) sold by some software companies in that it is not based upon arbitrary parameters picked by the automated software designer. Different software designers are likely to pick different arbitrary modeling parameters, and these often generate widely different predictions, as we know happens with these methods from the research literature reviewed in Calculus of Thought. For this reason, RELR is arguably machine learning in the truest sense of artificial intelligence that does not require human intervention with all of its biased, arbitrary or subjective human choices. That is why we call RELR a machine learning method. In addition, RELR automatically controls the sources of error related to multicollinearity, sampling and other noise to generate predictions that depend much less on the observation sample than those of other machine learning methods. This article will introduce readers to RELR machine learning and present data on the kinds of replicated and reliable machine learning results that we are obtaining with these methods.
As reviewed in Calculus of Thought, RELR produces two distinct types of machine learning models. These are called Implicit and Explicit RELR and are designed to model the brain’s implicit and explicit cognitive learning systems. Implicit Learning is unconscious learning where we do not interpret how our brains are making the predictions that result from this learning. Examples of implicitly learned memories are very basic reflexive motor tasks such as hand gestures and eye movements, along with more complex automated movements such as typing, riding a bike, or driving a car. In all cases, our brains may not have conscious awareness of these automatic memories and how they are able to predict correct and accurate responses to changing environmental inputs. Explicit Learning is conscious learning where we do interpret what our brains are doing in terms of being aware of an explanation for our behaviors. Telling a story from memory is an example of explicit cognition where we are aware of all the details in the story and also how previous details seem to cue our brains and help predict the next detail. In order to understand causality in the form of how events are connected causally in our stream of consciousness, we need to utilize explicit learning.
Explicit and implicit learning and memory have been intensively studied by cognitive neuroscience since the 1950s. For example, they each have well known neural and neurological bases. Explicit learning and memory is now thought to require feedback in neural circuitry that involves the medial temporal lobe and hippocampus. When these temporal lobe circuits begin to go awry early in Alzheimer’s, explicit learning is disrupted so we have problems in learning and recalling recent events. Yet, implicit learning seems to remain normal early in Alzheimer’s disease, and the explicit memory for more remotely learned facts and events seems to be much less disrupted early in Alzheimer’s disease.
Explicit and implicit learning usually happen together in the human brain, as we can walk, chew gum, use facial and hand gestures, and remember and tell stories at the same time. Still, we can characterize Implicit Learning as the type of learning that does not seem to require feedback and is not required to be interpretable in the sense of being conscious, whereas Explicit Learning requires feedback and is interpretable to our brain’s stream of consciousness. The RELR methodology is an attempt to model implicit and explicit learning in terms of two distinct learning methods, Implicit and Explicit RELR. As I review in Calculus of Thought, these two RELR learning methods correspond to the predictive vs. explanatory modeling that characterizes Breiman’s two distinct cultures of predictive analytics.
Implicit RELR is our purely predictive method that generates complex predictive models with many predictive features that usually cannot be interpreted. Implicit RELR also does not require any feedback to generate its feature selection learning, so it can be programmed as a parallel processing algorithm across parallel processors, as is implemented in our SkyRELR machine learning product. Explicit RELR generates extremely parsimonious predictive models that may be interpreted as causal hypotheses because they are so parsimonious. Explicit RELR does require feedback in its feature selection learning, so Explicit RELR can only do this in non-parallel, sequential processing. Indeed, our explicit stream of consciousness also seems to be a sequential process, as we can only hold one conscious thought at any one time. These implicit and explicit cognitive operations that RELR models are the fast and slow learning that cognitive scientist Daniel Kahneman highlights in his well-received book titled Thinking, Fast and Slow. Whether these Implicit and Explicit RELR learning methods are an accurate model of real implicit and explicit learning in the brain is an open question, but they do have these and many other similarities.
Another important proposed similarity is the basic thesis of my book Calculus of Thought. This is that the brain’s neurons may rely upon logistic regression in learning at the individual neural level. Unlike the “neural network” models introduced in the 1980s, RELR’s learning mechanisms, as this book reviews, are based upon known neural mechanisms that operate at the individual neuron level and not at the network level. Starting with the individual neuron makes for a much easier problem, as much is known at that level, unlike at the network level.
There is a basic mathematical equivalence between the maximum entropy and maximum likelihood solutions in binary logistic regression. Calculus of Thought reviews this equivalence and how well-known physical mechanisms that lead to most probable maximum entropy behavior at the level of the neuron under the Second Law of Thermodynamics would be what is needed to have neural computation that naturally generates maximum entropy/maximum likelihood binary logistic regression. This neural logistic regression is proposed to determine the probability of a neuron generating a binary “yes” signal given its input features, just as logistic regression works in machine learning practice, with one major caveat. This is that the RELR formulation includes constraints to handle error due to multicollinearity and sampling in a mechanism that assumes symmetrical error in the sense that the probabilities of positive and negative error are equal. Such a mechanism is proposed to operate in the real neuron in its summation mechanism that determines its binary firing at the axon hillock. In real-world usage of either binary or ordinal logistic regression, this symmetrical error can always be forced by balancing the samples and then correcting the intercepts for any imbalance after the learning has occurred.
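The intercept correction mentioned above can be illustrated with the standard prior-correction formula for logistic regression fitted on a rebalanced sample. This is only a sketch: the article does not spell out the exact correction that RELR applies, so the function names and the formula choice here are assumptions made for illustration.

```python
import math

def correct_intercept(intercept, sample_rate, population_rate):
    """Shift an intercept fitted on a rebalanced sample back to the population base rate.

    Assumption: the standard prior correction, which removes the log odds of the
    target in the training sample and adds the log odds in the population.
    """
    sample_log_odds = math.log(sample_rate / (1.0 - sample_rate))
    population_log_odds = math.log(population_rate / (1.0 - population_rate))
    return intercept - sample_log_odds + population_log_odds

def predict_probability(intercept, coefficients, features):
    """Ordinary logistic prediction: P(yes) = 1 / (1 + exp(-(b0 + sum(b * x))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# A model with no features, trained on a 50/50 balanced sample, has intercept 0.
# After correction to an 11% population rate it predicts the base rate.
corrected = correct_intercept(0.0, sample_rate=0.5, population_rate=0.11)
base_rate_prediction = predict_probability(corrected, [], [])
```

Only the intercept moves in this correction; the feature coefficients learned on the balanced sample are left unchanged.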
Yet another important aspect of the brain’s neural learning is that it is able to learn from relatively small samples of observations. The inputs to this learning at the individual neural level from other neurons are very high dimensional and highly multicollinear, and the neuron seems to compute interactions between these inputs and to model nonlinear effects. In spite of such very high dimensional, multicollinear and noisy inputs and very small training samples, all evidence is that the brain’s learning is very reliable and replicates well, as different people all learn the same thing when given comparable training.
Some mechanism must operate within neurons to get rid of the noise and error. Whether the RELR symmetrical error mechanism or something similar is that actual mechanism remains an empirical question. Yet, like the brain’s neural learning, Implicit and Explicit RELR also allow reliable learning with very high dimension, multicollinear features based upon interaction and nonlinear effects in very small sample learning in data with substantial noise and sampling error. And RELR does appear to model this neural learning capacity to reject error.
This article will provide basic tests of RELR’s Implicit and Explicit Learning methods with a focus on how well the predictions that arise from this learning replicate with such high dimensional and small sample noisy data. It will also try to replicate a basic finding reviewed in Calculus of Thought that is consistent with the distinction between explicit and implicit learning as it occurs in the mammalian brain. In the mammalian brain, implicit learning appears to be superior with fewer training observations, whereas accurate explicit learning becomes possible with more training observations. Likewise, previous research has indicated that Implicit RELR models are superior with very small training samples, whereas Explicit RELR models are likely to be superior with larger training samples. The current research will try to replicate this basic finding.
This work used a widely downloaded data set available from the UC-Irvine Machine Learning Repository through this link – Bank Marketing Data. In this case, we will use the first set of data which is bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010). The conversion rate percentage, in terms of converting to sales, was roughly 11%, so these are highly imbalanced data in terms of target and non-target responses. As indicated in the description of the data, these data are related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls where more than one contact to the same client was often required. The goal of the marketing campaign was to assess if the service (bank term deposit) would or would not be bought as indicated by a ‘yes’ or ‘no’ response in the y field which is the target variable field in their csv file.
A very small amount of data cleaning was performed on those data as indicated in the above table. In addition, the original y field, which was renamed to yconverted, was moved to be the second field right after the newly created Index field. SkyRELR always looks for the target variable field as the second column/field and requires that the first field is named Index with a consecutive numerical ordering of rows starting with 0, just as was done here. Missing values in fields need to be handled by placing blanks in those fields, so this is why the text values unknown were replaced by blanks in this csv file that was input into SkyRELR. Note that text values for the fields of month, day_of_week and education were replaced by numerical values in this data cleaning. In general, RELR will perform much better if any variable that has values that can be interpreted as numbers has those values transformed into numbers. SkyRELR also assumes that any field that has a name starting with ‘ordinal’, as in how pdays was renamed ordinalpdays, has values that are interpreted as ordinal or rank-ordered values. Variables which have values that can be interpreted as interval level in measurement, like monthnumber and weekdaynumber, should not have names starting with ‘ordinal’. However, there was actually an error in the data cleaning here of the variable education. Its transformed variable should have been called ordinaleducationlevel instead of educationlevel, because its values need to be interpreted as ranked values and so it cannot be interpreted as an interval level measurement. This error was discovered after we had already analyzed the results. Because this variable has a small number of levels, this error is unlikely to have a large effect, so we decided not to fix it. But it is a good example of human error, and we leave it as an exercise for others to determine if our predictive modeling results would change when this error is corrected.
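The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the script actually used. The numeric codes for months and weekdays (1 to 12 and 1 to 5) are assumptions, the education recoding is omitted because its mapping is not given here, and clean_rows is a hypothetical helper name.

```python
# Assumed numeric codes; the actual codes used in the cleaning are not stated.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
WEEKDAYS = {"mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5}

def clean_rows(rows):
    """Return cleaned copies of rows (dicts keyed by the original field names)."""
    cleaned = []
    for index, row in enumerate(rows):
        row = dict(row)  # work on a copy so the input is not modified
        # Index must be the first field and the target the second field.
        out = {"Index": index, "yconverted": row.pop("y")}
        for name, value in row.items():
            if value == "unknown":
                value = ""  # missing values are represented as blanks
            if name == "month":
                name, value = "monthnumber", MONTHS.get(value, "")
            elif name == "day_of_week":
                name, value = "weekdaynumber", WEEKDAYS.get(value, "")
            elif name == "pdays":
                name = "ordinalpdays"  # the 'ordinal' prefix marks rank-ordered values
            out[name] = value
        cleaned.append(out)
    return cleaned

example = clean_rows([
    {"job": "unknown", "month": "may", "day_of_week": "mon", "pdays": "999", "y": "no"},
    {"job": "admin.", "month": "nov", "day_of_week": "fri", "pdays": "6", "y": "yes"},
])
```

The cleaned rows could then be written out with csv.DictWriter, using the keys of the first row as the header so that Index and yconverted stay in the first two columns.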
The design of this analysis was very simple. We wanted to compare models generated with the two types of RELR learning called Implicit and Explicit RELR in terms of average predictive accuracy and reliability/replication of individual-level predicted target variable probabilities. This comparison was made across two independent swapped training/validation samples using small and large sample sizes. We built all the models using SkyRELR, our automated cloud machine learning application written in Python and available later this month (Sept. 2015) through this www.skyrelr.com website. These models were built with a 2 CPU core implementation of SkyRELR using the C3.large compute optimized instance offered by Amazon Web Services, which runs roughly 10 times faster than the 2 CPU cores on my two-year-old Asus notebook PC, which has Intel Core processors that run at 2.2 GHz. We generated roughly 10,000 candidate features based upon the original 20 variables. These included binary coded category features, binary coded missing value status features, all main effect features, all two-way interaction features, all three-way interaction features, and all nonlinear effect features up to the quartic power. SkyRELR automatically generates all of these features and handles missing value imputation. SkyRELR also automatically handles the feature reduction, feature selection, and model building for its Implicit and Explicit RELR learning algorithms.
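To give a feel for how a handful of input variables expands into many candidate features, here is a rough sketch that builds missing-value indicators, powers up to the quartic, and two-way and three-way interaction products for one row of numeric inputs. It is not SkyRELR's actual feature construction. The function name, the feature naming scheme, and the use of zero as the imputed value for missing inputs are all assumptions, and the binary coding of categories is left out.

```python
from itertools import combinations
from math import prod

def candidate_features(row):
    """Expand one row (dict of name -> float, or None if missing) into candidate features."""
    features = {}
    # Binary coded missing value status features.
    for name, value in row.items():
        features[f"{name}_missing"] = 1.0 if value is None else 0.0
    # Assumed imputation: missing values become 0.0 (e.g. the mean of standardized inputs).
    filled = {name: (0.0 if value is None else value) for name, value in row.items()}
    # Main effects (power 1) and nonlinear effects up to the quartic power.
    for name, value in filled.items():
        for power in range(1, 5):
            key = name if power == 1 else f"{name}^{power}"
            features[key] = value ** power
    # All two-way and three-way interaction products.
    for order in (2, 3):
        for combo in combinations(sorted(filled), order):
            features["*".join(combo)] = prod(filled[name] for name in combo)
    return features

expanded = candidate_features({"a": 2.0, "b": 3.0, "c": None})
```

With three inputs this already yields 19 candidate features, and the three-way interaction count grows with the cube of the number of inputs, which is how a modest set of variables, once categories are binary coded, can reach thousands of candidates.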
This SkyRELR application also allows different samples to be drawn from the same input dataset with a random seed parameter, and this random seed was set to 100 in all cases. It also allows a user to swap the initial training and validation conditions. This swapping was used to effectively allow two distinct models to be built in all cases: a model built from the initial training sample, and a model built from the initial validation sample when the roles of the initial training and validation samples were swapped. This swapping allows independent replication to be assessed in a way that is comparable to having two different and independent modeling teams, each given an independent sample, build models with our RELR methods, as there are no arbitrary or tuning parameters in our RELR methods.
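The swapped design can be sketched as a seeded split whose two halves exchange roles. This is a generic illustration and not SkyRELR's sampling code. The function name, the 50/50 default split, and the use of Python's random module are assumptions; only the seed value of 100 comes from the text.

```python
import random

def split_indices(n_rows, train_fraction=0.5, seed=100, swap=False):
    """Return (train, validation) row indices; swap=True exchanges the two roles."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same partition
    cut = int(n_rows * train_fraction)
    train, validation = indices[:cut], indices[cut:]
    if swap:
        train, validation = validation, train
    return train, validation

initial_train, initial_validation = split_indices(10)
swapped_train, swapped_validation = split_indices(10, swap=True)
```

Because the two training sets never share a row, the two resulting models are built from independent samples, and each can be validated on the rows that the other was trained on.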
The performance results are summarized in Table 2.
Note that larger magnitude Brier Scores in above Table 2 reflect greater error or worse performance, as Brier Scores are simply the Mean Squared Error (MSE) in the RELR probability predictions for the binary target across all observations in the given condition. On the other hand, larger magnitude Matthews Correlation Coefficients (MCC) reflect stronger correlation or better performance. The MCC is the Pearson Product Moment Correlation between the binary target outcomes and the binary coded predicted target response, which is obtained by applying a threshold of .5 in this case to the RELR predicted outcome probabilities, across all observations in the given condition. All Brier and MCC validation sample score conditions shown in Table 2 in green were very significantly (p<.001 in the worst case even when Bonferroni corrections were imposed for multiple comparisons) larger than comparable validation sample score conditions shown in yellow, as assessed by paired sample Student’s t-tests in the case of the Brier Scores and Zou’s method to compare Pearson Product Moment correlations in the case of MCC.
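For readers who want to compute these two measures on their own predictions, the definitions above translate directly into code. The sketch below uses plain Python; the function names are ours, and treating a probability of exactly .5 as a predicted target response is an assumption, since the handling of ties at the threshold is not stated.

```python
import math

def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)

def matthews_corrcoef(probabilities, outcomes, threshold=0.5):
    """MCC between thresholded predictions and 0/1 outcomes.

    Assumption: probabilities at or above the threshold count as a predicted 1.
    """
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for ph, y in zip(predicted, outcomes) if ph == 1 and y == 1)
    tn = sum(1 for ph, y in zip(predicted, outcomes) if ph == 0 and y == 0)
    fp = sum(1 for ph, y in zip(predicted, outcomes) if ph == 1 and y == 0)
    fn = sum(1 for ph, y in zip(predicted, outcomes) if ph == 0 and y == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when any margin of the table is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

probabilities = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 0, 1, 0]
example_brier = brier_score(probabilities, outcomes)
example_mcc = matthews_corrcoef(probabilities, outcomes)
```

The confusion-matrix form used here is algebraically the same as the Pearson correlation between the two binary vectors, which is why the MCC can be compared with methods designed for Pearson correlations.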
Table 3 compares the replication reliability as measured by the Pearson Product Moment Correlation in the predicted probability of target outcomes when models are built across the completely independent initial and swapped training samples. With very small training samples in terms of number of target responses in the low 200s, the Implicit RELR learning generates predictions that correlate well here (r=0.942) across the completely independent models built across completely independent training samples. However, that correlation drops (r=0.833) when the Explicit RELR learning is used. On the other hand, with larger training samples in terms of the number of target responses (N=2290), Explicit RELR shows better correlation/replication of predicted probabilities across the two independently built initial and swapped conditions than Implicit RELR here. This parallels the Table 2 results where Implicit RELR performs better with the smaller training sample condition, but Explicit RELR performs better with the larger training sample.
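The replication measure itself is simple to compute once two independently built models have scored the same observations. The following sketch shows the Pearson correlation in plain Python; the function name and the toy probabilities are illustrative and are not taken from the study, and it assumes both lists are aligned observation by observation.

```python
import math

def pearson_r(x, y):
    """Pearson Product Moment Correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predicted probabilities from the initial and the swapped model
# for the same four observations.
initial_model = [0.05, 0.10, 0.40, 0.80]
swapped_model = [0.07, 0.12, 0.35, 0.75]
replication_r = pearson_r(initial_model, swapped_model)
```

A value near 1 means the two models rank and scale individual observations in nearly the same way, which is a stricter requirement than the two models merely having similar average accuracy.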
Table 4 shows how closely the target outcome predicted probabilities match when models are built with the Implicit vs. Explicit RELR learning. The sample sizes indicated in that table reflect the number of target responses, but the correlations that are reported are Pearson Product Moment Correlations across the validation samples with sample sizes shown in Table 2. Notice that at the larger target response conditions here (Target N=2290), the Implicit RELR and Explicit RELR probabilities are more highly correlated compared to the smaller target response conditions.
The present pattern of results was that Implicit RELR performs better with very small training samples, whereas Explicit RELR performs better with larger samples. This replicates our previously reported findings as reviewed in Calculus of Thought.
Humans do not even have explicit learning until after the first several years of life, as we cannot consciously remember anything from our earliest years. Yet, substantial implicit learning is obviously taking place in our earliest years when we learn to walk and talk and use hand and facial movements. In the human brain, as in other animals that exhibit explicit cognition, explicit learning appears to require larger amounts of training than implicit learning. The present pattern of results suggests that RELR’s implicit and explicit learning algorithms parallel this distinguishing aspect of the brain’s implicit vs. explicit learning, where small sample learning characterizes implicit learning and larger training samples are necessary for explicit learning.
The fact that correlations larger than .9 are shown in RELR implicit learning with very small training samples makes this approach ideal for predictive problems where we have small training samples and want reliable individual level predictions. Many problems in business, medicine and social science are like this today, as larger sample data are not as common as one would like. Still, the ideal would be to move beyond the predictive and obtain causal understanding, because predictive models may only be accurate for very short periods of time unless they are based upon the causal drivers. The Explicit RELR methodology often may be useful to generate causal hypotheses.
In previous controlled studies as reviewed in Calculus of Thought, RELR has been reported to have better accuracy than commonly used algorithms including Bagging Regression (similar to Random Forests in its bagging), Artificial Neural Networks, Support Vector Machines, LASSO/L1 and L2 regularized regression, decision trees, and stepwise regression, amongst others. This was research either conducted by us or by our users who gave public presentations of their findings, and much other research by our users has found similar results but has not been published. As mentioned above, Implicit RELR was most accurate at smaller training samples in previous work, whereas Explicit RELR was superior at larger training samples, just as was found here. This current study did not compare to other predictive modeling algorithms, as we did not know how to compare for replication/reliability across swapped samples given that all these other algorithms require arbitrary or subjective tuning decisions. Yet, we expect that when skilled modelers are blinded to one another, they will not completely replicate one another in terms of these tuning or modeling parameter decisions when given independent swapped samples. This inference is based upon what is known about predictive models built using traditional algorithms that require arbitrary human or tuning decisions: they often do not replicate one another in terms of predictions when they are based upon high dimensional observational/convenience sample data. But the data set used is one of the sample data sets that ship with our SkyRELR product, so we are hopeful that our users will make comparisons to other algorithms in their own research to determine if this inference is accurate with the presently used data.
In any case, this current research does show that RELR models replicate well in terms of generating almost the same predictions across independent and fairly small training samples. It should be repeated that neither Implicit nor Explicit RELR requires any arbitrary or tuning decisions on the part of modelers, so replication in predictions across independent samples should generalize to cases where independent, blinded modelers are given independent samples from the same population from which to develop models. This case with blinded, independent modelers needing to develop similar predictions is a good test for the real world requirement that predictions in machine learning should not be an arbitrary artifact of the modelers’ biases or subjective choices.
The current research paper did not report on the selected features and the interpretation of these very parsimonious features in the case of Explicit RELR. A companion white paper on this topic will be provided here on this SkyRELR blog in the next week or two.