Deep machine learning methods have become increasingly popular in recent years, based upon reports that so-called artificial neural networks (ANNs) that model higher order interactions are significantly more accurate than shallower ANNs and traditional data mining methods. The critical difference between deep and shallow learning is that higher order interactions and nonlinear effects are included as potential effects. These deep learning effects are hidden when only low dimension features, typically main effects or simple interactions, are evaluated, as in generic regression modeling like standard logistic regression or LASSO. Deep learning effects also are not well modeled by decision trees, which are likewise limited to lower dimension features and which may be biased by the arbitrary cutpoints used to generate their if-then rules. Deeper ANNs often appear to be more accurate than many other machine learning algorithms because they add an enormous number of new potentially predictive features as interactions and nonlinear effects. The evaluation of high dimension features traditionally was limited by the curse of dimensionality at work in serial computers, but today's much faster parallel processing systems have eliminated this curse for many practical problems.
In the 1980s and early 1990s, ANNs were also shown to exhibit surprising accuracy in many observation-based studies, but those results were always tempered by their black box nature and by the fact that ANNs exhibit substantial overfitting. In fact, many scientists simply did not trust ANNs for fear that their overlearning and overfitting were biased by confounding factors that would prevent good generalization of predictive accuracy. These fears were shown to be justified on more than a few occasions, as in this linked article describing a well known industry example in which ANNs incorrectly learned that gray skies were indicators of a hidden tank in a field because the model development data had been biased to include this random, confounded association.
Today's deep ANNs exhibit even more surprisingly accurate performance, such as the performance of deep reinforcement learning in human games of chess and 1980s Atari electronic games. However, today's deep learning ANNs may be even more prone to learning random confounds than their predecessors, because these models are based upon even more complex interactions between input variables, and this should cause even more overfitting. Evidence for such overfitting is seen in adversarial learning artifacts. One example is where two pictures of a cat that look identical to human eyes have one image classified as a cat and the other classified as a dog by the deep learning ANN. The problem arises with very small amounts of noise in photos, as may happen when they are digitized and published online, so photos that look identical to humans may be overfit in the ANN learning. When this occurs, confounding features not associated with the category of cat may be learned instead of the features inherent to cats.
Unlike face or object recognition, which typically has limited small training samples, parlor games like chess and 1980s Atari games can be replayed and repeated to generate very large training samples in the many millions, which allow ANNs to avoid their classic overfitting problems. Also, unlike parlor games such as chess and 1980s Atari electronic games with relatively small numbers of moves and positions, there is a vastly greater number of possible moves and positions in human financial markets, and the available training data in real world financial markets are quite limited. For example, we cannot replay famous financial market crashes like the 2008-2010 crash millions of times to generate the requisite training data for a deep learning ANN without overfitting these training data, which leads to poor generalization.
Another problem with deep learning ANNs is that they require a large number of human judgment calls in the model building process. Different human modelers, even expert modelers, are likely to make different choices and thus return different predictive models, often even when given the exact same data. The model building process may also be very difficult and time consuming, which is not well suited for production environments where models need to be built very rapidly. As discussed in this linked article, some researchers are now trying to improve deep learning ANNs so that they are not so dependent upon human modeling decisions. Still another problem is that once a model is built, a large manual effort is required to update and maintain it. Until improvements are made, these human factors issues will be a significant limitation of deep learning ANNs, possibly as limiting as the black box, overfitting and confoundedness problems.
Reduced Error Logistic Regression (RELR) offers an alternative way to generate deep learning, as its candidate feature space can include high dimension hidden interaction and nonlinear effects that may easily number in the hundreds of thousands and even in the millions in cloud applications today. This dimensionality in RELR's candidate feature space is substantially higher than in most deep learning ANN implementations today, like those referenced above that learn chess and Atari games. RELR is a much simpler computational procedure than ANNs and uses much less memory in its computations, which is why RELR is able to model much higher dimensions. Also, RELR does not utilize ensemble modeling to average out the noise, which many deeper ANNs require today, so this is another reason RELR is a much more efficient computational process. In principle, RELR could model interactions and nonlinear effects to very high orders of complexity, but our second generation Python SkyRELR implementation only models up to three-way interactions and up to 4th power nonlinear effects. To date, the RELR solutions that are returned are typically much simpler and more parsimonious than the most complex solutions possible in these implementations. This suggests that RELR's solutions may have a strong tendency to be simple rather than complex, so the current limitation of allowing only up to three-way interactions and 4th power terms as candidate predictive features may be more than enough potential complexity, at least in most applications.
RELR is completely automated and thus avoids the human factors problems inherent in deep learning ANNs. Unlike ANNs, which are based upon 1980s speculative attempts to model complex neural networks, RELR is based upon the more tractable problem of modeling neural computation in the single neuron. Whereas very little is still known about real neural networks in real brains, so that even today all attempts to model them must be almost entirely speculative, a substantial amount is known about neural computation in the single neuron. RELR is an attempt to model this computational behavior of individual neurons, though RELR is also necessarily speculative in parts. As reviewed in my book Calculus of Thought, RELR models both the implicit or purely predictive capacity of neural learning and the explicit or explanatory aspect of neural learning. These are the Implicit and Explicit RELR algorithms, which were introduced in my previous blog article here.
In contrast to ANNs, RELR shows relatively little overfitting even with small sample data, and through its Explicit RELR feature selection algorithm it returns very parsimonious features that are relatively easy for a skilled statistician or scientist to visualize and interpret. In the previous blog article, example RELR models based upon the Bank Marketing data available through the UC-Irvine Machine Learning Repository were reviewed in terms of how RELR can generate highly reliable predictions that replicate across swapped validation and training samples. In the swapped sampling paradigm, two independent models are developed: one based upon the original training sample and the other based upon the original validation sample, which has been swapped to be the training sample for the second model. These swapped samples are roughly equally sized. Once these models are developed, the predictions and selected features are compared to determine how well these models replicate one another. Because these RELR models were so reliable and replicated well, with correlations of .99 in the case of Explicit RELR, the interpretation of these models is possible. This interpretation is much more than just interpreting predictions; instead, it is possible to interpret the selected parsimonious predictive features in Explicit RELR in terms of causal hypotheses.
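The swapped sampling paradigm itself does not depend on RELR and can be sketched with any classifier. The sketch below uses scikit-learn's ordinary LogisticRegression as a stand-in (RELR is not publicly available, so only the paradigm, not the algorithm, is illustrated) on synthetic data:

```python
# Swapped-sample replication check: fit one model per half, then compare
# the two models' predicted probabilities across all observations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Two roughly equal halves; each serves as the training sample for one
# model and the validation sample for the other.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

model_a = LogisticRegression().fit(X_a, y_a)   # "initial" model
model_b = LogisticRegression().fit(X_b, y_b)   # "swapped" model

# A high correlation between the two prediction vectors indicates that
# the models replicate one another.
p_a = model_a.predict_proba(X)[:, 1]
p_b = model_b.predict_proba(X)[:, 1]
r = np.corrcoef(p_a, p_b)[0, 1]
print(round(r, 2))
```

The selected features are compared in an analogous way, as described in the Table 2 discussion below.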
The swap sample Explicit RELR models that were introduced in the previous blog and that will be interpreted here were based upon marketing campaigns conducted by a Portuguese bank between May 2008 and November 2010 to sell a term deposit service. Term deposits are short term deposits that may have terms as short as one week. The interest rates offered in European term deposits at continental banks are closely related to the euribor 3 mo. rate, which is a daily reference rate based upon an average of the interest rates at which banks lend money to other banks. This article will show that the Explicit RELR features that predict the sale of this term deposit service are stable and so can be interpreted in terms of causal hypotheses. The fact that this marketing campaign took place between 2008 and 2010 is crucial to interpreting the Explicit RELR features as meaningful hypothesized causal features that can be explained by the behavior of investors in the financial markets during this famous financial crash. The euribor 3 mo. rate showed enormous volatility during this 2008-2010 period, peaking in later 2008 and early 2009 at around 5 and then falling rapidly to below 1 in later 2009 and throughout most of 2010.
As reviewed in my book Calculus of Thought, RELR models may return very simple feature sets that include no more than one or two interaction or nonlinear effects. For example, the Explicit RELR 2004 US Presidential Election model was one such very simple model: it had 9 selected effects, but almost all were simple linear main effects. Those simpler linear main effects can be interpreted without any visualization. But sometimes RELR does return models mostly composed of complex interaction and nonlinear effects. The Explicit RELR models reviewed in this present article are more complex in this sense of having interactions and nonlinear effects as the selected features. Yet, in spite of this complexity, it will be apparent that these models are relatively easy to interpret when their feature selection effects are properly visualized using methods known to all skilled statisticians and data scientists.
METHODS AND RESULTS
Selected Features and Replication
Table 1a shows the selected features, their estimated regression coefficients and their estimated standard errors for the Initial Explicit RELR model built from the initial training sample (N=20594) described in the previous blog post. Table 1b shows the selected features and the same estimates for the Swapped Explicit RELR model, which was built from the sample (N=20594) that served as the validation sample for the Initial model. Table 1c shows this same information for a new predictive model that was built from the entire original sample (N=41188), which combined the initial training sample and its swapped validation sample. In all cases, these models were very parsimonious, selecting only 6 features from roughly 10,000 candidate features. More details on our methods are reviewed in the previous blog article. A corrected intercept is shown in these tables and is necessary because the model training occurs with balanced data with an equal number of target and non-target responses; intercept correction is then applied, as is now standard in logistic regression modeling. Many of these selected features are actually compound features, including in many cases interaction and nonlinear effects which are built from simpler elementary features.
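The exact correction SkyRELR applies is not documented in this article, but the standard prior-correction formula for a logistic intercept fit on balanced data can be sketched as follows (the function name and example numbers are illustrative):

```python
# Intercept correction after training on balanced (50/50) data: shift the
# intercept from the training target rate back to the population rate.
import math

def corrected_intercept(b0_balanced, pop_rate, train_rate=0.5):
    """Shift an intercept estimated at train_rate targets back to pop_rate."""
    return b0_balanced - math.log(
        (train_rate / (1 - train_rate)) / (pop_rate / (1 - pop_rate))
    )

# With 50/50 balanced training and a population subscription rate of
# about 11.2%, an uncorrected intercept of 0 shifts to roughly -2.07:
b0 = corrected_intercept(0.0, 0.112)
print(round(b0, 2))  # -> -2.07
```

Only the intercept needs this correction; the slope coefficients of a logistic regression are unaffected by balancing on the outcome.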
Across the 41188 observations, the predicted target outcome probabilities in this Final full predictive model training sample showed a correlation of approximately .97 with the average of the predicted target probabilities across all training and validation samples from the initial and swapped models. So, this was less than the .99 correlation observed across the two swapped sample models shown in Tables 1a and 1b and reported in the last blog post. This suggests that some new learning took place with more training observations, as the training sample was doubled in the full sample Final model.
At first glance, these selected Explicit RELR features in Table 1 may appear to be complex features that are hard to interpret because some involve interactions and nonlinear effects. In fact, they are relatively simple to interpret when visualized as shown below. But prior to interpreting them, it is a very good idea to have some confidence that they are not obviously spurious. A good test to rule out obviously spurious features is to see whether those features were replicated across the swapped modeling samples shown in Tables 1a and 1b.
Table 2 compares the selected features from each model with the best matching feature from the other models to determine how well the feature selection was replicated. This comparison evaluated the selected features to determine how many of the same features or closely correlated substitutes were selected and, in those cases, how well their associated regression coefficients matched, to assess the degree of replication. Regression coefficients are shown in parentheses for each feature. The Correlation column in Table 2 reports a Pearson correlation that measures how closely the standardized feature values across all observations match between the middle and right columns. Negative correlations are possible, and a match still may occur when the signs of the regression coefficients are opposite. So the magnitude of the correlation, rather than its sign, is the indicator of a good match here.
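The matching step can be sketched as follows; the data and feature names here are synthetic, and this is only an illustration of the magnitude-based matching rule, not SkyRELR's actual code:

```python
# For each feature selected in one model, find the feature selected in the
# other model whose standardized values have the largest |Pearson r| across
# all observations.
import numpy as np

def best_match(feature, candidates):
    """Return (name, r) of the candidate with the largest |Pearson r|."""
    best_name, best_r = None, 0.0
    for name, values in candidates.items():
        r = np.corrcoef(feature, values)[0, 1]
        if abs(r) > abs(best_r):
            best_name, best_r = name, r
    return best_name, best_r

rng = np.random.default_rng(0)
x = rng.normal(size=500)
candidates = {
    "near_copy": x + 0.05 * rng.normal(size=500),  # strong positive match
    "flipped":  -x + 0.30 * rng.normal(size=500),  # weaker negative match
    "noise":     rng.normal(size=500),             # unrelated
}
name, r = best_match(x, candidates)
print(name)  # a negative r with an opposite-sign coefficient still counts
```

Note that a strongly negative correlation paired with an opposite-sign regression coefficient is a consistent match under this rule, which is why sign consistency is examined separately below.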
As seen in the top pane, all of the selected features in the model built from the initial training sample had a correlation magnitude of .89 or higher with the model built from the swapped sample that had been the original validation sample. In addition, all of the regression coefficients have consistent signs, in the sense that when there is a negative Pearson correlation shown in the far left column there is also a sign mismatch in the regression coefficients of the matching features, whereas when there is a positive Pearson correlation those signs match. The magnitudes of these matching features' regression coefficients are at least close enough to be in the ballpark, although not always within 2 standard errors of the estimated regression coefficients. Notice that two of the substitutes are the very same feature, ordinalpdaysBypoutcome_failureBypoutcome_success. This is possible because RELR can select very highly correlated features, as its error modeling handles the classic problems with multicollinearity in predictive modeling. In this case, these features are negatively correlated, but unlike the classic multicollinearity problems seen in standard logistic regression, they do not have regression coefficients that are inflated in magnitude, and their standard errors are reasonably tight. The middle pane shows this same comparison when the model built from the swapped sample (the original validation sample) has its features used as the standard and compared to those selected in the initial training sample. The degree of replication is potentially order dependent because substitutes must be found for features that are not exactly the same, so in this second comparison shown in the middle pane, the degree of replication is slightly worse than in the first pane, as one of the Pearson correlations is -0.77, which is substantially lower in magnitude than any of those reported in the top pane.
A lower magnitude correlation likely reflects a spurious matching here, so any interpretation of either of these selected features that correlate this poorly would need to be made very cautiously if they are selected in a final full model.
The process of comparing selected features was then repeated to determine how well the same features or closely correlated substitutes selected in the two swapped sample models were replicated in the final full sample model, shown in the bottom pane of Table 2. This involved matching the features selected in the Initial/Swapped models with the Final full model features in terms of those that had the best match. A best match was defined as the feature from the set selected in the two swapped sample models that had the highest correlation with a consistent sign, in the sense defined above. Consistency of sign was not enforced in the first two matchings described in the top two panes of Table 2, as simply taking the highest magnitude correlation worked in those cases. Yet, in this last matching shown in the bottom pane of Table 2, it was necessary in one case, where a match with a correlation of only -.61 was selected. Notice that the substitute in this match in the far right column is ordinalpdays, one of the features flagged as potentially spurious in the matching involving the correlation of -.77 described in the previous paragraph. But this feature was not selected in the Final Full model. The new learning that took place might have produced a more reliable feature in what was selected here, but since there is no good replication (r=-.61) to base this selection on, this particular feature should be interpreted cautiously, especially if it does not have good face validity in terms of business meaning. Notice that all other features selected in the Final Full model were at least reasonable or even very good replicates of features that had been selected in one or the other swapped sample models.
All of these replicates had Pearson correlations of .96 or higher, although the second feature euribor3mBypoutcome_nonexistent^3 did miss the mark somewhat on having its regression coefficient replicated by its substitute (.2805 vs. .4945). So this second effect also should be interpreted more cautiously in the visualization results.
Visualizing and Interpreting Final Full Model Selected Features
The target outcome in these models was whether a prospect subscribed to the term deposit service, which is simply called ‘sale’ in these visualizations.
The features selected in the Final full sample model to predict prob(sale) were visualized to determine whether they make sense. This visualization used the Python Matplotlib module. The Python code used for this visualization is open source and is included with the SkyRELR cloud product in an IPython Notebook called MatplotlibExample. It is not shown here because it is customized to query and merge the output from SkyRELR into a form suited for showing these results with Matplotlib, so it would be unlikely to have general appeal beyond SkyRELR users. But a very important point is that although RELR necessarily processes data in complex ways, such as producing standardized features and handling imputation and interaction and nonlinear effects, the ultimate interpretation should be based upon the raw input variables and their relationship to RELR's estimated target outcome. So we are interested in the probability of sale, or prob(sale), as a function of the effects selected in the Final model. In particular, we break the selected features into their more elementary constituent simple features and plot the raw data for these elementary features in relationship to the prob(sale) prediction. Because this visualization is based upon raw input data values for the predictor features, it becomes relatively simple to interpret. Once again, Python scripts that run in IPython Notebook are shipped with the SkyRELR product to merge SkyRELR's output predictions with the raw input data and produce these Matplotlib charts.
Note that prob(sale) for each given effect is generated by holding all other effects at zero except the selected effect along with the intercept. Given that the intercept alone, holding all effects at zero, would produce predicted target probabilities of roughly .112 (.112 = exp(-2.0639)/(1+exp(-2.0639))), this visualization of prob(sale) given only the elementary features in a potentially more complex selected feature or effect allows us to run a thought experiment and ask how prob(sale) varies in a controlled way for each effect, and across the features within that complex effect, while holding all other effects at zero.
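This thought experiment is just the logistic formula with a single active effect; the sketch below reproduces the baseline arithmetic, with a nonzero coefficient that is illustrative rather than an actual SkyRELR estimate:

```python
# prob(sale) from the logit with one effect active and all others at zero.
import math

def prob_sale(intercept, coef, effect_value):
    """Logistic probability with a single active effect."""
    logit = intercept + coef * effect_value
    return math.exp(logit) / (1 + math.exp(logit))

intercept = -2.0639                        # corrected intercept from Table 1c
baseline = prob_sale(intercept, 0.0, 0.0)  # roughly .112, as stated above
# A negative coefficient on a standardized feature raises prob(sale) as
# the feature moves below its mean (coefficient here is hypothetical):
low_value = prob_sale(intercept, -0.5, -2.0)
print(round(baseline, 3), round(low_value, 3))
```

Varying effect_value over the observed range of one effect while the others stay at zero produces exactly the controlled curves plotted in the figures.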
Figure 1 shows prob(sale) as a function of the emp.var.rate feature shown in Table 1c. This effect is obviously very straightforward to understand because it is a simple linear effect in the logit. In this UC-Irvine data set, emp.var.rate is defined as “employment variation rate – quarterly indicator (numeric)”. This is an indicator of the employment in the bank’s service area. As can be seen, with worse employment as measured by this indicator, the model predicts better sales of this term deposit service. Note that in all of these figures, filled dark green discs imply higher values of the target variable, filled light green discs imply middle range values, and open non-filled discs imply lower values of the target probability that is visualized.
Figure 2 shows the effect of (euribor3mBYpoutcome_nonexistent)^3. This is a compound effect that is a nonlinear interaction between euribor3m and poutcome_nonexistent. The euribor3m measure is the euribor 3 mo. interest rate explained previously. The poutcome measure is a categorical variable that reflects the previous marketing campaign outcome for each given prospect: Success means that a sale was made previously, Failure means that no sale was made, and Nonexistent means that no previous record is available. Note that Nonexistent was not interpreted to mean missing values here, but instead was treated as its own category in the RELR model’s data preparation described in detail in the last blog. Nonexistent here appears to refer to people who have never been marketed to in a previous campaign.
When SkyRELR encounters a nominal/categorical raw variable like poutcome, it first breaks each of the constituent categories into binary features using dummy binary coding. Then it standardizes these features to have means of 0 and standard deviations of 1, just like all of its other candidate predictors. So poutcome_nonexistent is the originally dummy coded feature where 1 reflected the Nonexistent category and 0 reflected any other category of prospect. Even though this particular effect is specific to the Nonexistent category, it is still helpful to display all of the nominal categories from the poutcome variable, as is done in Figure 2. Note that this is an interaction between the euribor3m and poutcome_nonexistent features, which is then raised to a cubic power. The precise details of how these interaction and nonlinear effects are computed are beyond the scope here, but they are reviewed in Calculus of Thought.
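The dummy coding, standardization, and compound-feature construction just described can be sketched as follows; the small arrays are made up for illustration, and the exact SkyRELR computation may differ in details:

```python
# Nominal variable -> dummy-coded, standardized candidate features, then
# interactions and powers form compound candidate features.
import numpy as np

poutcome = np.array(["success", "failure", "nonexistent", "nonexistent",
                     "failure", "success", "nonexistent", "failure"])
euribor3m = np.array([4.9, 4.8, 1.3, 0.9, 1.1, 0.7, 4.5, 0.8])

def standardize(x):
    """Scale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

# Dummy-code each category, then standardize each dummy feature.
dummies = {c: standardize((poutcome == c).astype(float))
           for c in ["success", "failure", "nonexistent"]}

# Build the compound feature (euribor3m BY poutcome_nonexistent)^3:
interaction = standardize(euribor3m) * dummies["nonexistent"]
cubic = interaction ** 3
print(cubic.shape)
```

Because every dummy is standardized before interactions are formed, the compound features remain comparable in scale to the simple linear candidates.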
With that meaning, Figure 2 indicates that when either a previous marketing effort had been successful or had failed and when interest rates are low there was better subscription to this short term deposit product, as opposed to when interest rates were higher or if there had been no previous marketing in the case of the Nonexistent category. This indicates that the purchase of these deposits by the Success and Failure subgroups might be closely connected to the low euribor rates, whereas this was not the case for the Nonexistent subgroup. Why would some people be more prone to put money in a short term bank deposit when interest rates are low? This question will not be answered until later when the Figure 6 results are presented.
Although this effect was well replicated in terms of being highly correlated to another substitute feature as shown in Table 2, its regression coefficient was not well replicated. This effect is the most complex effect of all as is apparent from the visualization. More complex effects will not be well estimated until enough training data are present, so it is possible that this effect will continue to change somewhat in terms of its regression coefficient with more training data. Because data error might be causing this effect to be biased high or low, caution should be exercised in any interpretation that puts too much stock in the reliability of the regression coefficient in this Final Full sample model. Although this actual regression coefficient might change somewhat, the actual selected effect would not appear to be spurious because it is a replicate of another effect as shown in the lower pane of Table 2.
Figure 3 shows the effect of (euribor3mBynr.employed)^3. In the Bank Marketing UC-Irvine data, the nr.employed variable is defined as “number of employees – quarterly indicator”. So this appears to be another indicator of employment in the bank’s service area. As shown in that figure, when euribor 3 mo. rates are low and when the number of employees is also low, the probability of closing a sale of this short term deposit service is substantially higher than otherwise. This effect concurs with the first two effects. Recall that the first effect predicted that as employment drops, sales will be better, and the second effect predicted that as euribor 3 mo. interest rates drop, sales also will be more likely to close. This effect appears to combine the first two and say that sales also will be likely when both employment and euribor interest rates are low. Why would people subscribe to these term deposits when the economy looks to be in dire condition as measured by low employment and low interest rates? That question will not be answered until the Figure 6 results are presented.
Figure 4 displays the effect of ordinalpdaysBYpoutcome_nonexistentBYpoutcome_success. The pdays variable is defined in the UC-Irvine data set as “number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)”. In the data preparation, this was interpreted as an ordinal or ranked variable where the 999 means infinity, or never since the last contact. So, this ordinalpdays variable was created in the data preparation by converting the pdays input variable into a ranked variable. Any input variable whose name begins with ordinal in the csv input file to SkyRELR is automatically converted into a ranked variable using the average ranking method. This method ensures that the t value related to this effect in SkyRELR will be based upon a meaningful and valid Pearson Product Moment Correlation, which is what the RELR algorithm requires for its error modeling, because the Pearson Product Moment Correlation involving ranked variables using the average ranking method is equivalent to the Spearman Rank-Order correlation, which is a valid approach to ranked variable correlations. In practice, Pearson correlations involving non-ranked variables are always very close to those involving ranked variables unless strong outliers are present.
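The average ranking conversion can be sketched in a few lines (scipy.stats.rankdata with method="average" does the same thing); the tiny pdays sample below is made up for illustration:

```python
# Average ranking: rank values 1..n, with tied values sharing their mean
# rank, so 999 ("never contacted") ties all share the top rank. A Pearson
# correlation computed on two variables ranked this way equals the
# Spearman rank-order correlation of the original values.
import numpy as np

def average_rank(x):
    """Ranks 1..n; tied values all receive the mean of their ranks."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):          # average the ranks within each tie group
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

pdays = np.array([3, 6, 999, 999, 6, 2, 999, 10])   # 999 = never contacted
sale  = np.array([1, 1, 0, 0, 1, 1, 0, 0])

ordinalpdays = average_rank(pdays)                   # the three 999s share rank 7
r_spearman = np.corrcoef(ordinalpdays, average_rank(sale))[0, 1]
print(ordinalpdays.tolist())
```

In this toy sample, recent contact (low pdays rank) goes with sales, so the rank correlation is strongly negative.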
Figure 4 shows that this is a very focal and simple effect: when poutcome_success is present (a prospect who had previously subscribed to this service), the probability of a sale will be much higher provided that only a few days have passed since last contact. Table 2 above showed that poutcome_success is a perfect substitute for this effect. So, when people who had recently subscribed to this service are recontacted within a few days, they are highly likely to buy again. It is possible that people are guarded about investing a lot of money at first, since these are telemarketing sales, but after they see that the first transaction was secure they may invest more money.
Note that there is a parsimony precedence hierarchy in how SkyRELR selects features, and normally simple linear main effect features like poutcome_success are selected instead of more complex interactions when each can be well substituted for the other. But this case is an anomaly because this feature had many missing values, which caused it to have more error in relation to the target outcome and lower precedence in the SkyRELR hierarchy. This may happen, but it is not extremely likely unless variables with enough missing values are present. If desired, poutcome_success could replace this more complex interaction effect in a manually customized model, as SkyRELR always allows users to include or exclude specific selected Explicit RELR features. This was not done here. It might often be easier just to go with the final full model's SkyRELR selected features even when they are not as parsimonious as their substitutes, because the more complex features are still easy to interpret, as in this example.
Figure 5 shows the effect of contact_cellularBYcontact_telephoneBYordinalpdays. This indicates that when prospects were contacted by telephone, they were much more likely to purchase than when contacted by cell phone, but only when they had recently purchased a subscription, in terms of few days since last contact as measured by the variable ordinalpdays. This effect also makes very good sense. It is reasonable that people would not like to be called by salespeople on their cell phones. It is also reasonable that people who are called by telephone are most likely to buy within a short time of their last contact; this also makes sense in light of the effect just described in Figure 4, which relates to ordinalpdays similarly.
Figure 6 shows the effect of (poutcome_failureBYpoutcome_nonexistentBYpoutcome_success)^2, which is the last effect reported in Table 1c. This interaction was formed across all three nominal categories in the poutcome variable and then raised to the quadratic power. This interaction is possible because RELR uses all the binary coded dummy features from a nominal variable and also forms interactions between them as candidate features. Unlike other predictive modeling algorithms, RELR does not have to drop one of them to avoid multicollinearity problems, although when there are just two it will drop one because they are then perfectly negatively correlated.
At first glance, this effect would appear to contradict the other effects related to poutcome in Figure 2 and Figure 4 above, because this effect results in the Nonexistent subgroup having a greater probability of purchasing than the Failure and Success subgroups. Yet, this effect is a reasonable replicate of an effect seen in the Initial model, as shown in the lowest pane of Table 2, so it would not appear to be obviously spurious, although its regression coefficient may not yet be completely stable and may still need more training data for such stability. Thus, it would seem not to be a function of obviously spurious random associations, though like all of the effects reported in this model it still could be shown to be spurious in controlled experiments.
In any case, an important piece of information that helps resolve this seeming contradiction is that all of the Nonexistent observations had ordinalpdays of 999, whereas all of the Success observations shown in Figure 4 had very small ordinalpdays values. In other words, there is a confound between ordinalpdays and poutcome_success, so any dramatically better response related to poutcome_success instead could be related to the small ordinalpdays values. This is because, as noted previously, it appears that the bank always followed up a successful sale with another sales attempt within a few days in the poutcome_success subgroup.
Other pieces of information help resolve this seeming contradiction. Figure 4 shows that there was not much difference between the Failure and Nonexistent groups in terms of that effect, and Figure 2 shows that the Nonexistent group actually had a comparable or even slightly better prob(sale) estimate than the Failure and Success groups when euribor 3m interest rates were high. So, when all the other effects are taken into account and controlled, the Figure 6 result suggests that the Nonexistent subgroup has a higher prob(sale) by itself than the Failure and Success groups. That is, the prediction is that any superiority of the poutcome_success condition in terms of prob(sale) is a function of the re-contact-within-a-few-days interaction effect shown in Figure 4 and the low euribor 3m rate interaction effect shown in Figure 2, and not of anything independent of those effects. Unless people in the poutcome_success group are re-contacted within a relatively short period of being initially sold, and unless interest rates are very low, they have a lower chance of being sold this financial service than those in the poutcome_nonexistent group. Likewise, unless poutcome_failure prospects are contacted when euribor interest rates are very low, they also have a lower chance of being sold this product than poutcome_nonexistent prospects.
Perhaps the poutcome_success and poutcome_failure groups are people who were involved in the "flight to quality" observed in 2009, when many fearful investors left the stock markets in droves and took safe short-term cash positions in term deposits. Even though these term deposits paid very poor interest rates, they might have appeared much safer than keeping money in stocks at that time. That interpretation makes sense because the poutcome_success and poutcome_failure groups were targeted in previous marketing campaigns, so they must have been known to be purchasers of these financial instruments, perhaps because they made similar movements of money in the past, as during the stock market crash of the early 2000s. Possibly when stock prices are higher and they are not so fearful, they keep their money invested in the stock market and are not interested in term deposits, which pay much lower returns than their expected stock market returns. That is one interpretation; others may be possible. But this interpretation does resolve the apparent conflicts with other results. It also resolves the question, described earlier, of why people are more likely to put money in term deposits when interest rates are lower rather than higher and when the economy is doing poorly.
This is a follow-up to the article from earlier this month, which showed that the RELR models were well replicated in terms of their predictions across the swap sampling conditions. Replication is absolutely necessary before taking the next step of interpreting RELR models. One next step is simply to interpret the predictions, knowing that the predicted probabilities are highly reliable and not an artifactual result of data and human bias or error. Because the Implicit and Explicit RELR procedures generate reliable predictions that correlate well across models developed from independent swapped training/validation samples, even with a training sample that is not large, it is perfectly reasonable simply to interpret the predictions without trying to identify a potentially valid causal reason behind them.
However, many business executives want predictions that make sense from a causal perspective. The Implicit RELR procedure does not suffer from the black box disadvantage of ANNs, in that it returns transparent models that can be understood as regression models, but Implicit RELR selects too many features for any reasonable interpretation in terms of putative causal hypotheses. Explicit RELR, on the other hand, usually generates extremely parsimonious and accurate models given a minimal sample size, and these can be interpreted in terms of causal hypotheses if they replicate across swapped samples. The Explicit RELR models showed very good replication with this UC-Irvine Bank Marketing data, as reported in the previous companion article. Hence, this Explicit RELR model of financial market behavior during 2008-2010 at the individual account holder level can be interpreted both in terms of its predictions and the putative causal reasons behind those predictions.
The present analyses demonstrated that at least 5 of the 6 effects selected in the companion article's swapped sample models replicated across swap samples, either as the identical feature or as a closely correlated substitute. Hence, we expect even better stability in the Final Full model, and that is the model we interpreted. With one exception, the features selected in that Final model replicated well against those selected in the swapped sample models. This is further evidence that these selected features are stable enough to warrant interpretation in terms of causal hypotheses. The one feature that was not a good replicate of what had been observed in the smaller swap sample models is shown in Figure 5, which shows that those followed up by telephone within a few days of a previous contact, as captured by the ordinalpdays measure, were highly likely to purchase. This effect needs to be viewed more cautiously as possibly spurious because it lacks a good previous replicate. But it does at least make sense in terms of the Figure 4 effect, which also suggested that those re-contacted within a few days, as measured by ordinalpdays, were highly likely to purchase if they were in the Success subgroup.
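The swap-sample replication check itself is generic, even though RELR is proprietary. A minimal sketch on synthetic data, with L1-penalized logistic regression standing in as the feature selector (every name and parameter here is an assumption for illustration, not the RELR procedure itself):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the swap-sample replication check: fit a sparse selector on one
# half of the data, swap halves, refit, and keep only the features that
# survive with nonzero coefficients in BOTH fits.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
# Only features 0 and 1 actually drive the outcome in this synthetic example.
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

half_a, half_b = np.arange(200), np.arange(200, 400)
selected = []
for train_idx in (half_a, half_b):   # the "swap": each half trains once
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=0.1, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    selected.append(set(np.flatnonzero(model.coef_[0])))

# Features selected in both swapped fits are the replicated, interpretable ones.
replicated = selected[0] & selected[1]
print(sorted(replicated))
```

The design choice mirrors the article's logic: only effects that reappear across independently trained swapped samples are stable enough to be worth a causal reading.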
The effects reported and interpreted through Figures 1-6 could be putative causal effects by themselves, or one or more could be manifestations of a latent effect that is driving this behavior. In fact, all effects except the one reported in Figure 6 could be interpreted without any other information, and thus can be interpreted in and of themselves in terms of causal hypotheses. Based upon the Figure 1 effect, we could hypothesize that low employment figures are a causal factor in the decision to buy this term deposit product, but this would not seem to make sense as a causal factor. Based upon the Figure 2 effect, we could hypothesize that the low euribor 3 mo. interest rate is a causal factor only in those prospects who were previously targeted in marketing campaigns. But why would anyone want to subscribe to a bank deposit only when interest rates are low? Based upon the Figure 3 effect, we could hypothesize that the combination of low euribor 3 mo. rates and low employment numbers is also a causal factor in the decision to purchase these term deposits. But why? Based upon Figure 4, we could hypothesize that re-contacting those who bought the term deposits within several days is a causal factor in the new purchase of term deposit services by these same customers. That is a plausible causal reason, but it does not help answer the questions about the causal interpretation of the first three figures. Based upon Figure 5, we could hypothesize that calling prospects on their telephone rather than their cell phone, and very quickly after a recent purchase of a term deposit, is an important causal factor in their repurchasing. Like the Figure 4 effect, this seems a reasonable causal effect, but it also does not answer the questions about the effects shown in the first three figures.
The Figure 6 effect is much flatter and less focal than the other effects in terms of optimal targeting, but it occurs across a much larger number of observations, so it is potentially important. It suggests that, with no information other than the average contributions of the other effects, it would be better to target the group that had the lowest average response in a univariate cross-tabulation: the Nonexistent group, the people who were never previously targeted for marketing. This is because the average univariate cross-tabulation response rates of these groups were Failure = .142, Success = .651, and Nonexistent = .088.
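Those univariate response rates come from a simple group-wise cross-tabulation. A minimal sketch on synthetic rows (the quoted .142/.651/.088 figures come from the real UC-Irvine data, not from this toy, and the column names are assumptions):

```python
import pandas as pd

# Toy rows shaped like the Bank Marketing data: poutcome group membership
# and a 0/1 sale outcome. The group-wise mean of a 0/1 outcome IS the
# univariate response rate being quoted in the text.
df = pd.DataFrame({
    "poutcome": ["failure"] * 7 + ["success"] * 3 + ["nonexistent"] * 10,
    "sale":     [1, 0, 0, 0, 0, 0, 0,   1, 1, 0,   1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})
rates = df.groupby("poutcome")["sale"].mean()
print(rates)
```

The trap the article warns about is visible even here: the raw rates rank Success far above Nonexistent, yet the model attributes that superiority to the re-contact and low-rate interaction effects rather than to group membership itself.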
How can this be? Perhaps other explanations are possible, but the only explanation that makes immediate sense here is that the Failure and Success groups were buying this term deposit service as part of the "flight to quality" during the stock market crash of 2009. This also explains the effects shown in Figures 1-3. Low employment and low interest rates per se would not appear to be the causal reason that these people were putting their money in these deposits. Instead, it appears that it might have been their fear of what was going on in the stock markets that drove them to these deposits. A short-term cash deposit is one of the safest places for money during a stock market crash. So, most of these results from this Explicit RELR model may be explained by the latent causal factor of Fear, as the selected effects in Figures 1-3 and Figure 6 seem to be direct or indirect manifestations of that latent factor. This is just a working hypothesis, which could be falsified the next time the variables that were at play between 2008 and 2010 come into play again in the form of a stock market crash. So we will have to wait until then to test this hypothesis.
The deep interaction and nonlinear effects here were absolutely crucial to the overall interpretation. With only shallow, linear main effects, it would have been possible to conclude erroneously that the Success and Failure groups should always be given precedence in targeted marketing. In fact, nothing would appear to be further from the truth. The only time these groups were targeted with much success was during the most dire economic period, when interest rates completely tanked later in 2009; during that specific period they were targeted with enormous success, but only then. This highlights the danger with most explanatory models used in business that are not based upon deep interaction and nonlinear effects and instead are based upon low dimension candidates. So the Explicit RELR method reviewed here may offer a remedy not only to the black box nature of ANNs, but also to the risky causal interpretation of models based upon Stepwise Regression, LASSO, and Decision Trees, which do not replicate well and do not estimate deep interaction and nonlinear effects.