This is a basic tutorial written for the quantitative scientists and other analysts who wish to use SkyRELR.
If you are new to SkyRELR or need a refresher, this tutorial should be helpful. SkyRELR is based upon a form of logistic regression called Reduced Error Logistic Regression (RELR). SkyRELR allows predictions that include the probability of binary categorical outcomes, along with point estimates as in classical linear regression. These point estimates are possible through the Ordinal RELR implementation, which weights each ordinal category by its probability, multiplies this weight by the mean value of that category’s interval, and combines the results into a point estimate for a target outcome that may be used as a prediction. Multinomial/multi-category target predictions are also possible by making a separate binary logistic regression model for each category. SkyRELR provides both Softmax modeling for higher dimension multi-category targets, as is used in reinforcement and facial recognition learning, and classic multinomial logistic regression where one category can be viewed as an appropriate reference because it is a control group or control condition. However, unlike standard logistic regression and linear regression, RELR allows automatic and very deep learning of predictive features that include higher order interaction and nonlinear effects. Unlike the deep learning in black box neural networks, RELR’s deep learning returns transparent and interpretable predictive features which can be used to generate reliable causal hypotheses. This is possible through the parsimonious Explicit RELR feature selection. Through its Implicit RELR feature selection, RELR also allows less parsimonious, purely predictive models that cannot be interpreted. Explicit and Implicit RELR have largely non-overlapping applications, which together provide good coverage for a wide range of real world applications.
Unlike many of today’s automated machine learning products, SkyRELR is not designed for the masses of business users in the democratization of analytics. Instead, it is designed for the skilled scientist, statistician, or data scientist who wishes to work alongside business executives and users and ensure that they do not fall into one of the many traps that are possible with predictive modeling. The skilled analytics person who is likely to find SkyRELR useful is someone who wants to use artificial intelligence in model building and also build automated artificial intelligence applications for reliable prediction and putative causal explanations for their audience of business users. In fact, SkyRELR has been developed by me as a scientist for fellow scientists, statisticians and artificial intelligence experts, as it is based upon methods to generate reliable predictions and insights that I have been using for a very long time as described in my book Calculus of Thought. A number of these methods, like the generation of deep features up to three way interactions, the usage of standardized features to avoid scaling and marginality issues, and the fundamental use of replication as a check on the quality of a prediction and feature selection, were critical to our discovery that Alzheimer’s has a 10 year pre-dementia period caused by degeneration in the brain’s temporal lobe regions that process explicit learning. In the hands of the right scientists and statisticians who have the ability to think creatively about what is causing the reliable effects that they observe with SkyRELR, many new and important causal discoveries could be possible. We expect that many of these discoveries will be made for businesses, so they may not be published. But they should be important nonetheless for the life of a business.
SkyRELR is like a new telescope in the sense of allowing reliable, replicable predictive/explanatory features to be discovered from enormously high dimension data, so this is why new discoveries are likely.
RELR is best understood as a very simple fix to standard logistic regression to avoid the multicollinearity and error problems inherent to standard logistic regression. A key assumption in RELR that is also found in standard logistic regression is that all observations are independent. In general, when one builds models based upon independent observations that are not from sequential repeated measures of the same person or entities over time, it is reasonable to assume that they are independent in the sense that each observation’s outcome does not depend upon any other observation. (Note that RELR also allows incremental learning over time in the case of sequential data, but this is not yet implemented in SkyRELR.) Other assumptions in RELR are fairly benign also, and are listed on pp. 45-46 of my book Calculus of Thought. The most important assumption that departs from standard logistic regression concerns the form of the error that is estimated for each of the predictive features. A full discussion of the argument for why the error modeling in RELR works is beyond the scope here and I refer readers to pp. 45-47 in Calculus of Thought. However, suffice it to say here that RELR’s error form for each predictive feature is inversely proportional to the Student’s t value measured effect on the target variable. And that error estimate is not arbitrary because it fits all known relationships in logistic regression, although the Student’s t is only a good approximation to the standardized logistic distribution.
Standard logistic regression does not estimate any error from any predictive features and instead assumes that this error is zero. For this reason, standard logistic regression fails when the predictors have substantial error as when they are from small samples and high dimension data, and when the predictors are multicollinear. That is, unless one has fairly uncorrelated predictors or correlated predictors that are low dimension in relation to the sample size, the estimates in standard logistic regression cannot be trusted because standard logistic regression is so corrupted by error. RELR avoids the inherent problems of standard logistic regression with multicollinearity, but RELR forces a new assumption in the process. This assumption is that the probabilities of positive and negative error across all predictors relative to a mean error level are equal with no inherent bias for positive or negative errors to be more likely in odd powered vs. even powered predictive features. This assumption may be reasonable with balanced target variables in RELR. Balanced target variables are always possible by proper stratified random sampling to balance the target category, as intercepts can be adjusted for actual imbalanced target categories after a balanced model is built. This is how RELR forces the balanced target categories with binary targets. SkyRELR does not do such automatic balancing with post-modeling intercept correction for its Ordinal RELR implementation, so users need to ensure that they are dealing with relatively balanced target categories in these cases or else the RELR error model assumption may not be valid. SkyRELR does however automatically convert numeric target variables into reasonably balanced ordinal categories and yields a point estimate using Ordinal RELR, as explained in more detail below. 
This may be useful for one of the more common application needs which is to build point estimates as predictions given a numeric target variable similar to how linear regression works.
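The post-modeling intercept adjustment mentioned above for binary targets can be illustrated with the standard prior-correction formula for a logistic regression that was fitted on a re-balanced sample. This is only a sketch of the general idea and is not SkyRELR’s internal code; the function names and the 5% population rate are assumptions made for the example.

```python
import math

def correct_intercept(b0_balanced, population_rate, sample_rate=0.5):
    """Shift an intercept fitted on a re-balanced sample back to the
    population base rate (standard prior correction, assumed here)."""
    offset = math.log(((1 - population_rate) / population_rate)
                      * (sample_rate / (1 - sample_rate)))
    return b0_balanced - offset

def predict_probability(logit):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical case: the model was built on a 50/50 stratified sample,
# but only 5% of the real population is in the positive category.
b0 = correct_intercept(b0_balanced=0.0, population_rate=0.05)
p = predict_probability(b0)
print(round(p, 4))
```

An observation that the balanced model scores at a logit of 0 (even odds) is moved back to the population base rate once the intercept is corrected, while the feature coefficients are left unchanged.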
Data Preparation and Research Design
The Implicit and Explicit RELR machine learning algorithms are entirely automated with no real arbitrary modeling choices to be made by users other than the necessary data preparation and research design decisions that have to be made in every machine learning project. These include what input variables to include as a basis to form candidates for predictor features, what data sample to use, what type of model to build, and how to prepare the data. In addition, there is interpretation of the meaning of results, such as whether predictions may be trusted when there might be greater error such as at the individual level. Another example where interpretation always will come into play is with respect to the putative causal insights that are obtained. Data preparation and research design decisions, along with predictive and explanatory interpretations, need to be made by skilled scientific and statistical practitioners. The actual pressing of the GUI buttons or coding of any Python accessory scripts can be done by others, but a skilled analytics professional always should be fairly intimately involved.
All data preparation and research design choices should be made prior to building a model for a real world test. To help determine which input variables to use as potential predictors, preliminary research needs to be done that interviews subject-matter-experts about what they believe are good input variables to include. In this preliminary research, it is a very good idea to get diversity of expert opinion and blind these experts from each other so that they are not biased by each other’s suggestions. In this way, your model will be less likely to be biased by the particular choice of experts who guided you in choosing the candidate predictors. After consulting with these experts and adding whatever variables the members of your own team including the analytic professionals and business users think might also need to be included, you should then include all of these variables if that is possible and let RELR make the final feature selection decision. RELR allows very high dimension candidates in its modeling and efficient processing to capture the most important candidate features. The limiting factor in the dimensionality of the candidates will be the time and speed of processing. Even with the starter Two Week SkyRELR and Four Week SkyRELR instances which only have 2 CPU cores, you should be able to handle models with candidate features in the tens of thousands in a reasonable period of time like perhaps in well less than a day if your training sample size is perhaps less than 100,000 observations. But this will also depend upon how unbalanced your data might be and other considerations. You can also scale up to be substantially faster with the 32-40 CPU cores that are possible in higher end SkyRELR instances that will be only initially offered to users who have already used the starter product.
As in all of predictive modeling, your sample of observations should be constructed to be as representative as possible of the population to which you wish to generalize. RELR does not require an enormous sample size, as it avoids multicollinearity problems in regression coefficients even at small samples. But RELR does require a large enough sample for reliable correlations between predictors and the target variable. Unfortunately, these errors are data dependent and cannot be known in advance. Though, the reliability of RELR’s feature reductions and selections across independent swapped validation and training samples can be evaluated through SkyRELR. In this way, users will at least have a good idea about split sample reliability, along with which effects selected in smaller split sub-sample models also are selected in full sample models.
Student’s t values that reflect the extent to which correlations between predictors and the target outcome differ from zero are used by RELR to reduce the total number of features to yield a manageable dataset for feature selection. This feature reduction based upon t-values is possible because RELR’s regression coefficients are directly proportional to t-values as reviewed in Calculus of Thought, and this follows directly from the algebra of the RELR error model. This relationship is only true though when there are larger numbers of features, but this is usually seen even with as few as 30-50 selected features. A large enough feature reduction subset is needed so that the final model after subsequent Implicit or Explicit feature selection is reliably similar to what would be obtained with the entire set of predictive features. In general, this requires a large enough feature reduction subset to include all selected most important features and get good replication of those features across independently built models. Practical experience to date suggests that using 400 features in a feature reduction subset is adequate, as these are 400 independent features that are not duplicates of any others. Starting SkyRELR’s model building with feature reduction subsets of 400 will almost always ensure that the large majority of those 400 features will have relatively lower t values than those that are selected in the final model. This is true for both Explicit RELR, which returns extremely parsimonious feature selections, and for Implicit RELR, which returns more features. However, you may see occasional Implicit RELR models that select close to the 400 initial features in the final model. But since Implicit RELR models are purely predictive, a larger initial feature reduction subset will be unlikely to matter in terms of substantially different predictions of individual level target outcome observations.
We do allow users to experiment with larger feature reduction set sizes than this 400 value in SkyRELR, but again our experience has been that this value is more than adequate in all data that we have seen to date.
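The t-value based feature reduction described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions and not the RELR implementation: it uses the ordinary Student’s t for a Pearson correlation, and the names t_values, reduce_features, max_features and min_abs_t are hypothetical.

```python
import numpy as np

def t_values(X, y):
    """Student's t for the Pearson correlation of each column of X with y."""
    n = X.shape[0]
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    yz = (y - y.mean()) / y.std()               # standardized target
    r = Xz.T @ yz / n                           # Pearson correlations
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

def reduce_features(X, y, max_features=400, min_abs_t=0.0):
    """Keep the columns with the largest |t|, up to max_features,
    dropping any whose |t| falls below min_abs_t."""
    t = t_values(X, y)
    order = np.argsort(-np.abs(t))              # strongest first
    keep = [j for j in order[:max_features] if abs(t[j]) >= min_abs_t]
    return np.array(keep, dtype=int), t

# Simulated data: 1000 candidate features, only columns 3 and 7 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))
y = (X[:, 3] + 0.5 * X[:, 7] + rng.normal(size=500) > 0).astype(float)
kept, t = reduce_features(X, y, max_features=400)
print(len(kept), kept[:5])
```

In this simulation the two truly predictive columns sit at the top of the ranking, and most of the 400 retained candidates have much smaller t values, which is the pattern described above for a reduction subset that is large enough.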
The Form of the Source Input CSV File
SkyRELR is driven by the csv input file and the modeling parameters that are selected in the MainGUI page. The csv files are assumed to be constructed with the Excel dialect. We include three sample input csv files in the SkyRELR cloud product that can be downloaded from the Download Files tab shown in the above figure. All three of them can also be downloaded here by clicking on their links. These are called binary4.csv, binary4ordtargetacorrbin.csv, and directmarketingbanksample.csv. The binary4.csv and binary4ordtargetacorrbin.csv files were originally based upon a public domain file available at the UCLA website, but fake data was then added just to demonstrate how SkyRELR handles all the various types of variables and inputs including missing data. The directmarketingbanksample.csv file was used to generate the sample models in these two recent blog postings: article on replicating predictions and article on interpreting deep Explicit RELR models.
As in these sample files, SkyRELR assumes that the first row is a list of the variable names. It also assumes that the first field has the name ‘index’ without quotes. It further assumes that all records in ‘index’ have values that are ordered sequentially beginning with 0 where each new record has the next whole number ordered value. In SkyRELR, the second field is always assumed to be the target variable. This variable can be called whatever you wish. However, it cannot have any missing values and all values need to be numbers, as no textual character-based values are accepted in this target variable field other than in the first row, which is the name of the variable. If the values have two levels, such as 0 and 1, SkyRELR will build a binary logistic regression model. If there are more than two levels in this target variable, SkyRELR will build an ordinal logistic regression. SkyRELR also allows models based upon qualitative non-ordered target variables with multiple levels. Depending upon the research design, softmax or multinomial may be most appropriate for such qualitative categories. When users click on the IPython Notebook tab shown above, IPython Notebook is opened and a number of sample notebooks with Python scripts are provided. One of these is called SoftmaxMultinomial.ipynb. It provides sample scripts to build softmax and multinomial logistic regression models for situations where you have a categorical variable as a target variable that is non-ordered. These scripts run the basic RELR API provided in SkyRELR and build separate binary logistic regression models for each target level comparison, and then impose the softmax or multinomial logic at the very end.
These same scripts also show how Python Pandas can be used to create binary-coded dummy indicators from textual-based categories, as again users do need to do such pre-processing of their target variable to ensure that its values do not have any textual characters, but SkyRELR will automatically perform such processing on all the predictor fields.
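As a rough illustration of the two steps just described, the sketch below dummy-codes a text target with Pandas and then normalizes per-category scores with a softmax. It does not reproduce SoftmaxMultinomial.ipynb or call the RELR API; the color labels, the logit values and the function name are made up, and combining separately built binary models through a softmax over their logits is only one common way to impose the softmax logic.

```python
import numpy as np
import pandas as pd

# Step 1: turn a text-valued target into 0/1 indicator columns,
# one binary target per category.
labels = pd.Series(["red", "green", "blue", "green", "red"], name="color")
dummies = pd.get_dummies(labels, prefix="is").astype(int)

# Step 2: combine the scores of the separate binary models.
def softmax_from_binary_logits(logits):
    """Normalize per-category logits (rows = observations) into
    probabilities that sum to 1 across categories."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits from three separately built binary models.
logits = np.array([[2.0, -1.0, 0.5],
                   [-0.5, 0.0, 1.5]])
probs = softmax_from_binary_logits(logits)
print(dummies)
print(probs.round(3))
```

Each indicator column would serve as the numeric second field of its own input csv file, which satisfies the requirement that the target contain no textual values.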
SkyRELR also handles target outcomes that are ordered and have continuous values or many interval level values. That is, if the target field levels are not ordered in a whole number sequence starting from 0 and include more than 4 levels, then SkyRELR has an Automatic Binning of Target MainGUI and RELR API parameter that you may select that will form an ordered target variable by binning these input values into four bins that are as equal in terms of number of observations as is possible. Hence, if the second field’s input values are from a ratio level variable like Income, this MainGUI parameter can be used to bin this variable into a reasonable ordered target variable for RELR’s ordinal logistic regression.
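For readers who want to see what equal-count binning looks like before relying on the Automatic Binning of Target parameter, here is a minimal Pandas sketch. It only approximates the idea with pandas.qcut and is not SkyRELR’s own binning code; the simulated income values are invented.

```python
import numpy as np
import pandas as pd

# Simulated ratio-level target, e.g. Income.
rng = np.random.default_rng(1)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.6, size=1000), name="income")

# Quartile binning: four ordered categories coded 0..3 with (nearly)
# equal numbers of observations; `edges` holds the bin boundaries.
binned, edges = pd.qcut(income, q=4, labels=False, retbins=True)
counts = binned.value_counts().sort_index()
print(counts.tolist())
print(edges.round(0))
```

With heavily tied values the bins cannot be made exactly equal, which is one reason to merge and bin levels manually when the target has only a few distinct values.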
Unlike RELR’s binary logistic regression which has been used in production for over 5 years, RELR’s ordinal logistic regression has not been used in a production environment previously. Hence, this Ordinal RELR should be viewed as experimental and users are cautioned to be especially careful in testing it. Yet, Ordinal RELR does fulfill a critical need to have a regression method capable of target outcome point estimates which is able to avoid multicollinearity problems and work with high dimension data and deep learning. Our experience is that RELR’s ordinal logistic regression will be most effective with relatively balanced target variables and with four categorical levels, as this RELR implementation only models polynomial effects to the 4th power term. Four ordinal levels are needed to model up to 4th power effects in Ordinal RELR’s ‘predictedtargetpoint’ output values that are the interpolated predicted target point estimates. Greater numbers of target categories or imbalanced target categories that result in small samples within one or more categories have greater likelihood of error, as RELR’s error modeling has no effect on the intercepts and more ordinal levels implies more of these error-prone intercepts.
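The probability-weighted point estimate that Ordinal RELR reports can be illustrated as an expected value over the ordinal categories. This sketch assumes the estimate is simply the sum of each category’s probability times the mean target value in that category; whether the ‘predictedtargetpoint’ output applies any further interpolation is not shown here, and the bin means and probabilities are invented.

```python
import numpy as np

def point_estimate(category_probs, bin_means):
    """Expected target value: sum over categories of P(category) * mean(category)."""
    category_probs = np.asarray(category_probs, dtype=float)
    bin_means = np.asarray(bin_means, dtype=float)
    return category_probs @ bin_means

# Hypothetical mean income within each of four ordinal bins.
bin_means = [20_000.0, 40_000.0, 65_000.0, 120_000.0]

# Hypothetical predicted category probabilities for two observations.
probs = [[0.1, 0.2, 0.3, 0.4],
         [0.7, 0.2, 0.1, 0.0]]

estimates = point_estimate(probs, bin_means)
print(estimates)
```

The first observation, with most of its probability in the upper bins, receives a much higher point estimate than the second, whose probability is concentrated in the lowest bin.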
As mentioned above, SkyRELR does provide an Automatic Binning of Target variable that transforms numeric target variables into ordinal target variables. We suggest that you should not depend upon this Automatic Binning of Target feature in Ordinal RELR if you have relatively few levels in your target variable. Instead, you should manually merge and bin levels yourself in terms of what makes the most sense. In addition, you always can have more than four levels if they are ordered sequentially in whole number steps starting with 0, but four equally balanced levels should be optimal although this is not always possible. For example, many data originate from Likert scales which have 5-10 levels, and it may make more sense to keep these original 5 or more ratings scales in the data that are input into Ordinal RELR models. However, users should be warned when that is done that RELR’s symmetrical error assumptions may not be valid, so they should also track the performance of models in well constructed tests prior to implementing.
All other fields of the input csv file are assumed to be input variables used to construct predictors. If a field contains text values like the ‘state’ and ‘zipinquotes’ variables in these example files, these values will be automatically used to form binary dummy coded predictor variables for each category. If a numeric valued field has a name that starts with the keyword ‘ordinal’ with no quotes as in ‘ordinalrank’ in these example files, SkyRELR will assume that it is an ordinal level variable and form an appropriate ranking transformation. If a numeric valued field starts with the keyword ‘category’ as with the ‘categoryfour’ field in the binary4ordtargetabinord.csv file, SkyRELR will assume that this is a categorical input and form binary predictors from all levels of this field’s values. SkyRELR does pretty extensive error checking of the csv input file. It should give you very specific feedback about what is wrong with your csv input file when format errors are discovered.
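Before uploading a file, it can save time to run a quick local check of the layout rules described above. The sketch below uses only the Python standard library and is not SkyRELR’s own error checking; check_layout and guess_field_kind are hypothetical helper names, and the ‘gpa’ field in the sample is invented.

```python
import csv
import io

def check_layout(csv_text):
    """Return a list of problems with the index and target fields."""
    rows = list(csv.reader(io.StringIO(csv_text, newline=""), dialect="excel"))
    header, records = rows[0], rows[1:]
    problems = []
    if header[0] != "index":
        problems.append("first field must be named 'index'")
    for expected, row in enumerate(records):
        if row[0] != str(expected):          # index must run 0, 1, 2, ...
            problems.append(f"index {row[0]!r} found where {expected} was expected")
            break
    for row in records:
        try:
            float(row[1])                    # target: numeric, never blank
        except ValueError:
            problems.append(f"non-numeric or missing target value in record {row[0]}")
            break
    return problems

def guess_field_kind(name, values):
    """Guess how an input field would be treated, from its name and values."""
    if name.startswith("ordinal"):
        return "ordinal"
    if name.startswith("category"):
        return "categorical"
    for v in values:
        if v == "":
            continue                         # blanks are missing values
        try:
            float(v)
        except ValueError:
            return "categorical (text)"
    return "numeric"

sample = ("index,target,state,ordinalrank,categoryfour,gpa\n"
          "0,1,CA,3,2,3.5\n"
          "1,0,NY,1,4,\n"
          "2,1,TX,2,1,2.9\n")
rows = list(csv.reader(io.StringIO(sample, newline="")))
kinds = {name: guess_field_kind(name, [r[i] for r in rows[1:]])
         for i, name in enumerate(rows[0]) if i >= 2}
print(check_layout(sample))
print(kinds)
```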
MainGUI and RELR API Parameters
As briefly hinted above, you can either build models using the point and click MainGUI or using Python code in IPython Notebook that calls the RELR API.
Once the csv input file is built correctly, SkyRELR’s MainGUI and/or RELR API parameters allow you to define your training sample vs. validation sample size, to define whether the feature selection will be explicit vs. implicit, and to define all other parameters. Each MainGUI parameter includes comments in the SkyRELR MainGUI tab on what it does and how it should be used. These same comments can be applied to understand the equivalent RELR API parameters if you use Python code in IPython Notebook to call this API. One parameter, Maximum Feature Reduction Set Size, deserves a few comments here. It controls how many features are evaluated by RELR in the logistic regression model build. This is the feature reduction set size that was mentioned above. A setting of 200 corresponds to 200 linear/cubic features and 200 quadratic/quartic features, or 400 total features, which is the recommended value.
A second MainGUI/RELR API parameter called ‘Minimum t Value Magnitude for Candidate Feature’ also deserves a comment here. It is a filter that determines the minimum t value for predictors to be considered as candidates. This setting can save much in the way of memory consumption and processing, as RELR automatically discards all predictors that have t values below this threshold. In addition, SkyRELR provides a report in the Progress of Run tab on what t values were used as thresholds to select different levels of interactions. In its processing, RELR automatically adjusts this parameter to be as efficient as possible. But after seeing this report a few times, a user also may be able to set this parameter to save a lot more processing from the start in a more memory heavy model that has a large number of input variables and associated interaction effects.
Outputs and Scoring
Once all of this preliminary work is done, the actual model building task is quite simple as SkyRELR automatically handles missing values, conversion to categorical binary dummy variables, missing value status dummy predictors, interactions, nonlinear effects, feature reduction, feature selection and output reporting. Everything that it does can be followed in the Progress of Run tab while a model is being built. Also, once a model completes, output files can be downloaded from Download Files that include a copy of the Progress of Run report. There is also an output csv file which shows the predictive features that RELR selected and the transformed values that it used, as RELR uses transformed standardized predictive features. This output file also shows the predicted values for the target variable. Once a model is built, new data can be scored with this model by using the MainGUI/RELR API Scoring process. This is self-explanatory, but does require the csv file that holds the data to be scored to be structured identically to the original csv source file used to build a model in terms of the names of the fields and similar types of field values, with the exception that the target variable field may have all blank values, as these are ignored in a Scoring run. In addition, this csv file needs to be named in the following form: ‘scoring[name of original csv file used to build model]’. So if you wanted to score data based upon a model that used the binary4.csv file, the input file to be scored should be named ‘scoringbinary4.csv’.
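The naming rule for scoring files, and the option of leaving the target field blank, can be handled with a small helper such as the following sketch. It uses only the Python standard library; the function names are hypothetical and the tiny csv text is invented.

```python
import csv
import io
from pathlib import Path

def scoring_file_name(model_csv):
    """Prefix the model-building file name with 'scoring'."""
    p = Path(model_csv)
    return str(p.with_name("scoring" + p.name))

def blank_target(csv_text):
    """Copy a csv, keeping every field but emptying the second (target) field."""
    rows = list(csv.reader(io.StringIO(csv_text, newline=""), dialect="excel"))
    out = io.StringIO(newline="")
    writer = csv.writer(out, dialect="excel")
    writer.writerow(rows[0])                       # same field names
    for row in rows[1:]:
        writer.writerow([row[0], ""] + row[2:])    # index, blank target, inputs
    return out.getvalue()

print(scoring_file_name("binary4.csv"))
print(blank_target("index,target,x\n0,1,5\n1,0,7\n"))
```

Blanking the target is optional, since any values present in that field are ignored during a Scoring run.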
Memory Management, Backup of Output Files and Other Considerations when using a Temporary Instance
The current product that you will use as a starter version of SkyRELR runs on a C3.large Amazon EC2 instance. This is only a dual core CPU processor and only has 3.75 GiB of RAM. This starter SkyRELR product is not designed to be a heavy repeat usage production system; instead, it is designed for one-off or occasional exploratory and confirmatory model building and scoring in a temporary and intermittent usage system. At best, it is useful for intermittent and light production work. In this sense, it is designed for what many people currently use R or traditional statistical package software to do, which is exploratory and confirmatory research and model building, as many organizations are now moving to Python for production implementations, especially in the Internet of Things and the Cloud. Users who wish to use a larger instance of SkyRELR with 32-40 CPU cores and 60-240 GiB of RAM in the form of Amazon EC2 instances may purchase such access from us and should contact us to get details.
In this temporary usage environment that is the SkyRELR Two Week and SkyRELR Four Week starter products, if you wish to score models that had been developed in a previous access period to SkyRELR, you will need to rebuild the models. Models developed earlier in the same access period can be used to automatically score new data, but only as long as you do not write over the saved model files by using input csv model building files with the same name. That is, SkyRELR automatically overwrites any saved files and that overwrite includes the saved files needed for scoring. So it is a good idea to give each new model source csv input file a different name. Because you will not be able to score previous models built from previous access periods, we strongly recommend that you save outputs and especially reports from SkyRELR’s models so you easily may see the parameters that were used and so you may reconstruct the same model building process if you wish to score future data with that model in a future SkyRELR access period.
Note that the initial implementation of SkyRELR in the form of the starter instance does make use of the autoscaling and load balancing feature within AWS which can automatically save instances and replace them when and if they crash. Yet, we do not save the data that is stored alongside that instance in what is comparable to a hard drive storage system. Amazon has very good uptime with their instances that approaches 100%. For example, we have never had an instance crash in our experiences in developing SkyRELR. Yet, users need to know that having an instance crash and losing all stored data and ongoing processing is a possibility that could eventually happen. If an instance crash would ever occur, there will be no ability to recover lost processing or data and we will not reimburse users for any lost data or processing. So, users need to understand that protecting against an instance crash is their responsibility, and they should constantly download or otherwise save output files from SkyRELR to other safer locations in terms of long term storage. In addition, users also need to understand that the only access to their instance is from the IP address that they provided to us when they first registered the instance after they bought it. That is, this starter version of SkyRELR is designed as a single user machine similar to a personal computer that only the user from that IP address may access. But please do not think of any data storage in the SkyRELR instance in the outputs of SkyRELR models or scoring as long term storage in the sense of a computer’s disk storage. Instead think of it as temporary storage. So files always need to be put in a safer place like through the Download Files tab that is provided in SkyRELR to store files on a user’s local machine. Or, files may be processed through scripts that users may write in IPython Notebook to store SkyRELR’s output files in their AWS S3 location or in another cloud storage area. 
Note that users may reboot their instance and this does not affect their file storage system.
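A backup script of the kind mentioned above, written in IPython Notebook to copy output files to an S3 bucket, might look like the following sketch. The output directory, bucket name and key prefix are placeholders, the helper name is hypothetical, and the sketch assumes that the boto3 package and AWS credentials are available in your notebook environment, which we do not confirm here. The only library call relied upon is the S3 client’s upload_file(Filename, Bucket, Key).

```python
from pathlib import Path

def backup_outputs(output_dir, bucket, s3_client, prefix="skyrelr-outputs/"):
    """Upload every regular file in output_dir to the given S3 bucket.

    s3_client is expected to behave like boto3.client("s3"), i.e. to
    provide upload_file(Filename, Bucket, Key).
    """
    uploaded = []
    for path in sorted(Path(output_dir).glob("*")):
        if path.is_file():
            key = prefix + path.name
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded

# Usage in a notebook (placeholders; requires boto3 and AWS credentials):
#   import boto3
#   backup_outputs("/path/to/skyrelr/outputs", "my-backup-bucket", boto3.client("s3"))
```

Passing the client in as an argument keeps the helper easy to test and makes it simple to swap in another cloud storage client with a similar upload method.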
This starter C3.large implementation does have limited RAM storage. So you will want to track your RAM usage as your models run. We do offer a view of RAM usage in the Progress of Run reports, which should make it easy to track. You can also use a basic Linux command that we show in the IPython Notebook called LinuxCommands.ipynb to check on all running processes. This will show RAM usage also. Python’s memory management system does not immediately clear RAM usage from no longer needed resources, so you may notice that the peak RAM usage does build up for some time before it is curtailed. However, if you ever click on Kill Run shown in the above figure, this will immediately clear all memory usage from all previous runs since the last reboot or Kill Run as it does shut down the Linux process completely. Because of this, never click on Kill Run during an actual run unless you really do wish to kill that run. Instead, you can click on this after a modeling or scoring run is over if you just wish to reset the RAM usage back to the reboot stage. If you ever need to reboot, we also offer that capability as noted above and as shown by the tab by that name in the above SkyRELR figure. This might be helpful when your Python scripts that you run in IPython Notebook need to be halted and IPython Notebook is unresponsive to the Restart Kernel command.
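If you prefer to check RAM from a Python cell rather than from the Progress of Run report, one option on a Linux instance is to read /proc/meminfo. This is a generic sketch and not the command shown in LinuxCommands.ipynb; the helper names are hypothetical and the sample text is invented.

```python
def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer values
    (kB for the memory entries)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info

def ram_used_fraction(info):
    """Fraction of total RAM that is not currently available."""
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return 1.0 - available / info["MemTotal"]

# Invented sample text in the /proc/meminfo format.
sample = ("MemTotal:        3840000 kB\n"
          "MemFree:          500000 kB\n"
          "MemAvailable:     960000 kB\n")
print(ram_used_fraction(parse_meminfo(sample)))

# On the instance itself:
#   with open("/proc/meminfo") as f:
#       print(ram_used_fraction(parse_meminfo(f.read())))
```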
The End Result: RELR’s Deep Explicit and Implicit Feature Learning
Deep learning implies that you are going beyond the obvious linear effects that we can all observe without any data analytics or machine learning. Examples of linear effects are that as the weather temperature rises, people drink more beer or spend more time at the beach. Deep learning implies that predictive features are being learned that are hidden and not obvious. In classical statistical analysis, these hidden effects are called interactions. Interactions tell you how an outcome would change differentially across different features. The difficulty with modeling interactions was shown by the late John Nelder, a well known statistician in the last century, who pointed out that interaction effects in regression models may give completely arbitrary predictions that change as a function of a relative scale of predictors like Currency or Temperature. So Currency scaled in Euros and Temperature scaled in Celsius can give substantially different predictions compared to Dollars and Fahrenheit degrees as units of measurement in regression models. RELR avoids these arbitrary relative scale problems by scaling all predictive features in standardized units using standardized variables. All RELR features are standardized to have a mean of 0 and a standard deviation of 1. Because of this, the prediction that RELR makes does not depend upon the original relative scale of the variable and will be the same without regard to whether Dollars vs. Euros were used as original units.
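The scale invariance described above is easy to verify numerically. In the sketch below, an interaction formed as the product of two standardized variables is identical whether the inputs are expressed in Celsius and Euros or in Fahrenheit and Dollars, whereas the product of the raw variables is not even a linear function of its counterpart. The simulated data and the 1.1 exchange rate are invented for the illustration.

```python
import numpy as np

def standardize(x):
    """Rescale to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(2)
celsius = rng.normal(20, 8, size=1000)
euros = rng.gamma(2.0, 30.0, size=1000)

fahrenheit = celsius * 9 / 5 + 32
dollars = euros * 1.1                      # hypothetical exchange rate

# Interaction built from raw variables: depends on the units chosen.
raw_metric = celsius * euros
raw_imperial = fahrenheit * dollars
raw_corr = np.corrcoef(raw_metric, raw_imperial)[0, 1]

# Interaction built from standardized variables: unit-free.
z_metric = standardize(celsius) * standardize(euros)
z_imperial = standardize(fahrenheit) * standardize(dollars)
same_standardized = np.allclose(z_metric, z_imperial)

print(same_standardized, round(raw_corr, 3))
```

Because the two raw products are not perfectly correlated, a regression using one cannot in general reproduce the predictions of a regression using the other, which is the arbitrariness that standardization removes.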
To get an understanding of RELR’s interaction effects, let’s assume that older adults visit an international chain of convenience stores more often from April through September, whereas younger adults visit more often from October through March. This effect would be recorded as an AgeByMonth two-way interaction effect with RELR, but would not be observable otherwise as more simple Age or Month main effects. This is because the young and old groups would cancel, so there is no difference related to age on average in terms of shopping behavior. Likewise, if we just looked at Month, any effects across different months would also cancel. This shopping behavior target outcome pattern described in this example is only apparent when you look at the combination of Age and Month as an interaction.
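The convenience store example can be simulated to show main effects that cancel while the interaction carries all of the signal. The visit probabilities of 0.7 and 0.3 and the variable names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
older = rng.integers(0, 2, size=n)   # 1 = older adult, 0 = younger adult
warm = rng.integers(0, 2, size=n)    # 1 = April-September, 0 = October-March

# Older adults visit more in the warm months, younger adults in the cold months.
p_visit = np.where(older == warm, 0.7, 0.3)
visit = (rng.random(n) < p_visit).astype(float)

def z(x):
    """Standardize to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

age_z = z(older.astype(float))
month_z = z(warm.astype(float))
age_by_month = age_z * month_z       # two-way interaction of standardized variables

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

r_age = corr(age_z, visit)
r_month = corr(month_z, visit)
r_interaction = corr(age_by_month, visit)
print(round(r_age, 3), round(r_month, 3), round(r_interaction, 3))
```

The Age and Month correlations with the visit outcome are near zero, while the AgeByMonth product is clearly correlated with it, so a model restricted to main effects would miss the pattern entirely.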
RELR models up to three way interactions and up to 4th power polynomial terms. RELR forms all interactions as products between simpler standardized variables to avoid the marginality scaling problems that John Nelder pointed out. The details are beyond the scope here, but are provided in Calculus of Thought. Unlike deep learning neural networks, RELR’s features are completely transparent. In addition, RELR automatically selects the optimal final architecture which may be very simple with few if any deeper interaction and/or nonlinear features. RELR’s final feature set also may have very few total features with RELR’s Explicit Feature Selection learning, which is useful to generate interpretable, parsimonious models that can be used to form causal hypotheses. The only way to interpret more complex features like nonlinear interactions is to visualize them. SkyRELR provides a sample Python script in its MatplotlibExample.ipynb IPython Notebook that shows how to process SkyRELR’s output data and visualize these features. This article on interpreting deep Explicit RELR models gives examples of how this visualization of Explicit RELR models aids enormously in interpreting the models.
RELR’s Implicit Feature Selection learning is useful to form very stable ensemble-like models which may have large numbers of more complex selected features where a simple causal hypothesis cannot be formed. These Implicit RELR models are only “ensemble-like” because they are so stable, but unlike real ensemble models they are not developed by averaging many different elementary models with many different arbitrary or possibly biased choices in a very time consuming and manual process. Instead, Implicit RELR models are automatically developed and return regression coefficients that would be expected in very large samples in an ensemble average of many elementary models when the variability across those models is considered. This is because the regression coefficients in Implicit RELR are proportional to Student’s t-values that reflect the reliability of the relationship between the predictive feature and the target variable. As mentioned above, RELR’s Implicit Feature Selection is most useful with very small training samples and when a model is only required to generate a prediction. With larger training samples, RELR’s Explicit Feature Selection is likely to be more accurate and especially when it discovers parsimonious causal effects.
If you would like more in-depth information about RELR modeling, we recommend that you read Calculus of Thought (Elsevier: Academic Press, 2014). This book is written by me, Daniel M. Rice, the inventor of this machine learning methodology and the architect of SkyRELR. This book gives a general theoretical background to the RELR machine learning methods, which would be a good supplement to the hands-on experience in using SkyRELR.