A New Google Paper Shows Deeper Learning Gives Less Accurate Speech Recognition

Deep learning has been all the rage over the past few years; stories appear in major media almost daily suggesting that the entire future of artificial intelligence depends on ever more complex and deeper learning.  A case in point is this recent Wired article, which stated that “a technology called deep learning has proven so adept at identifying images, recognizing spoken words, and translating from one language to another, the titans of Silicon Valley are eager to push the state of the art even further”.  The same Wired article describes how Facebook has designed a new open system that will allow for faster deep learning.

Deep learning here means learning patterns that reflect interactions and nonlinear effects that shallower algorithms do not see.  A classic example is the XOR function, which Minsky and Papert (Perceptrons: An Introduction to Computational Geometry, The MIT Press, 1969) showed cannot be learned by Perceptrons.  A Perceptron is essentially a standard logistic regression that does not include interaction or nonlinear effects.  Today it is straightforward to model higher order interactions and nonlinear effects in logistic regression and overcome the curse of dimensionality, provided that one has a way to avoid overfitting and multicollinearity problems.  This is how Reduced Error Logistic Regression (RELR) is able to engage in deep learning and learn XOR functions quite generally.
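As a minimal illustrative sketch (using scikit-learn's LogisticRegression rather than RELR itself), a logistic regression with only main effects cannot separate XOR, but adding the x1*x2 interaction as an explicit feature lets the same kind of model learn it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four XOR patterns and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# "Shallow" model: main effects only -- a linear boundary cannot get all four right
shallow = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
print("main effects only:", shallow.score(X, y))   # at most 0.75, typically 0.5

# Add the two-way interaction x1*x2 as a third feature
X_int = np.column_stack([X, X[:, 0] * X[:, 1]])
with_int = LogisticRegression(C=1e6, max_iter=1000).fit(X_int, y)
print("with x1*x2 interaction:", with_int.score(X_int, y))  # 1.0
```

The interaction term makes the XOR problem linearly separable in the expanded feature space, which is the basic idea behind modeling deeper effects within a logistic regression framework.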

Traditional algorithms like Stepwise Logistic Regression, Decision Trees, and Random Forests can also be sensitive to higher order interactions, but they may still miss deeper patterns such as XOR.  This is because these algorithms only treat deep effects built from strong shallow effects as good candidates for deep learning, so they miss XOR-like patterns in which the shallow effects are weak.
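A hypothetical forward-selection sketch makes the point: on XOR, neither main effect alone improves on the intercept-only baseline, so a greedy search that only builds interactions on top of strong shallow effects never reaches the x1*x2 term that actually solves the problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

baseline = max(np.mean(y), 1 - np.mean(y))  # intercept-only accuracy: 0.5
for j, name in enumerate(["x1", "x2"]):
    acc = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, [j]], y).score(X[:, [j]], y)
    print(f"{name} alone: {acc:.2f} vs. baseline {baseline:.2f}")
# Neither shallow effect beats the baseline, so a greedy strategy stops here
# and the x1*x2 interaction that actually solves XOR is never considered.
```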

The Multilayer Perceptrons known as Artificial Neural Networks (ANNs) that were introduced in the 1980s are sensitive to deeper effects.  Like RELR, they do not choose which deeper effects to learn based upon how strong the shallow effects are.  So, like RELR, ANNs can learn XOR functions in general and overcome the problem posed by Minsky and Papert.  However, unlike RELR, ANNs require enormous amounts of training data.  Also unlike RELR, ANNs require a modeler to experiment manually with whether deeper layers are more effective.  In addition, ANNs do not handle overfitting and multicollinearity error very well, so they may also require ensemble averaging even when built with very large samples, or other experimental trials such as randomly dropping features, whereas RELR requires no ensemble modeling or random dropout of features.  For all of these reasons, ANNs demand enormous processing power, especially when deeper learning patterns are tested, whereas RELR requires substantially less processing.
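A minimal sketch with scikit-learn's MLPClassifier (my own choice of tool for illustration, not the software discussed in this post) shows both points at once: a small multilayer network can represent XOR, but whether a given training run actually finds that solution depends on the random initialization, which is one small-scale form of the manual experimentation described above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

for seed in range(5):
    mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                        solver="lbfgs", max_iter=2000, random_state=seed)
    acc = mlp.fit(X, y).score(X, y)
    print(f"seed {seed}: training accuracy {acc:.2f}")
# Training accuracy can vary across seeds; restarting with different
# initializations, layer sizes, or dropout settings is the kind of
# trial-and-error that ANN model building typically involves.
```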

Our current introductory Two Week and Four Week SkyRELR products use only 1/20th of the parallel processing available to us on the AWS cloud.  Customers can easily scale to the full processing power by purchasing a longer term license, which makes 20 times more processing available for RELR’s deep learning.  This will be desirable for ultra-high-dimension problems like full human genome modeling, but it is not necessary for many standard business problems, which tend to be lower dimensional, provided users can tolerate longer model processing times.  Note that the new hardware at Facebook referenced above will only double the processing power available to ANNs.  Yet a substantial amount of the time spent on ANNs goes into manual model building, which requires experimentation with different layers, ensemble models, and random dropout of features.  The new Facebook architecture will not remove this substantial and time-consuming manual testing requirement, nor will it remove the requirement for very large training samples in these ANN algorithms from the 1980s.

Our current SkyRELR implementation models interaction effects up to three-way interactions.  This level of interaction would only be seen in higher order ANNs such as four-layer networks.  RELR automatically selects which interaction and nonlinear effects are even necessary for the best model, and over the years in standard business applications we have seen that three-way interactions are selected very rarely in RELR models.  For this reason, we have not implemented any higher order interaction modeling to date.  However, advanced users who purchase SkyRELR Amazon Machine Images for more complex problems that are hierarchical by nature, like image recognition, may stack independent RELR models to get deeper layers.  But that is a well-defined problem where hierarchical division of models into different content specializations, such as facial and lexical, can be an advantage.  With higher and higher levels of interaction (as seen in deeper-layer ANNs), the number of observations that are substantially affected becomes smaller and smaller.  For this reason, higher order interactions or deeper learning may not make a large difference even when real effects are present, unless those effects are substantial as in image recognition.  Because higher order interactions concern relatively few observations, these deeper learning effects are much more susceptible to noise and error.  So higher order interactions and deeper learning may actually hurt the accuracy of a model’s learning if they are ill-defined.
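A back-of-the-envelope sketch (with made-up, independent binary features) shows why deeper interactions touch fewer and fewer observations: if each feature is "on" for roughly 10% of rows, a two-way interaction is on for roughly 1% and a three-way interaction for roughly 0.1%, so the deeper effect is estimated from very few cases and is correspondingly more exposed to noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1, x2, x3 = (rng.random(n) < 0.10 for _ in range(3))

print("x1 alone  :", np.mean(x1))             # ~0.10
print("x1*x2     :", np.mean(x1 & x2))        # ~0.01
print("x1*x2*x3  :", np.mean(x1 & x2 & x3))   # ~0.001
```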

A very good example of poorer learning with deeper layers is demonstrated in this new paper by a group at Google.  They use an ANN algorithm called Long Short-Term Memory (LSTM), a Recurrent Neural Network (RNN) algorithm developed in the 1990s for sequential learning.  Whether LSTM works anything like human short term memory is debatable, as LSTM tries to predict the next word, character, or phoneme in a sequence based upon all previous words, characters, or phonemes.  Perhaps parrots learn to speak through this kind of rote stimulus-response causal chain, but humans understand what their words mean, so some kind of higher order and possibly hierarchical processing is likely taking place in human language.  In any case, LSTM is the most popular ANN algorithm today for many natural language processing tasks, but RELR also allows sequential processing and learning, as reviewed in Calculus of Thought; this is only available to advanced users who purchase our longer term licensing.  My only point in reference to this Google paper is that it clearly shows that deeper learning is not more effective than very shallow LSTM networks.  Their 5L and 7L (five- and seven-layer) deeper networks actually performed worse than the 2L and 3L networks in their Table 1, where the two- and three-layer LSTM RNNs have very similar performance that topped all the others.  The authors write: “We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.”
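As an aside, the sequential-prediction framing mentioned above (predict the next word, character, or phoneme from what came before) can be made concrete with something far simpler than an LSTM.  The toy bigram count model below is only a stand-in for the prediction task itself, not for the recurrent architecture:

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat the cat ran"
words = text.split()

# Count which word follows each word in the training sequence
next_counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    next_counts[prev][nxt] += 1

# Predict the most likely continuation of "the"
print(next_counts["the"].most_common(1))  # [('cat', 2)]
```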

To reiterate, we need to be careful about all the hype today that deeper learning is always better.  In fact, deep learning can be substantially worse.  In RELR learning, we do not perform manual experiments of the kind needed to build ANN deep learning models; instead, we automatically select optimal features.  But just as in this Google report on LSTM learning, we rarely see in RELR that very high order interactions are necessary, and thus they are rarely selected in our RELR models.  We do more often see two-way interactions selected, as in this blog posting about a SkyRELR explanatory model of financial behavior during the 2009 financial crisis.  Even the rare three-way interactions that were selected in that SkyRELR example were highly correlated with what could be interpreted as lower order effects, and so gave similar predictions and identical explanatory interpretations as models that selected only lower order effects.

The famous ANN researcher Michael Jordan has warned in this IEEE interview that the same hype and over-complexity that brought down ANNs and artificial intelligence in the 1980s and 1990s, leading to what is referred to as the Winter Period when nobody trusted ANNs, could be happening again because of all the Big Data and Deep Learning hype today.  In contrast to the new hype, deeper is not always better.  Because much deeper and more complex networks are possible today than in the 1990s, much of the time of ANN researchers will be spent manually testing their networks, only to discover that relatively shallow networks work perfectly well or are even much more effective.  A large advantage of RELR learning is that this testing is automated once the candidate predictors and type of model are established: RELR automatically discovers what should be obvious, which is that the simplest model that fits the data well is best.  And this simplest model is often what could be learned with the most shallow learning.

Faster processing is certainly always better, so RELR will benefit from innovations in faster parallel processing over the coming years.  On this point, Intel is introducing a new Knights Landing Xeon Phi co-processor next year that may have advantages over the NVIDIA GPU architecture being used at Facebook.  But unlike ANNs, RELR does not need substantial innovations in parallel processor speed to solve many important high-dimension problems.  RELR can solve these problems today; it will simply become more user-friendly with faster parallel processing.
