Friday, February 17, 2017

Cloud Spanner, Apache Kudu, TensorFlow; Deep Learning Is a Hammer in Search of a Nail (and Why That Is Okay)

It is only February, yet we already have a slew of important announcements in the Big Data arena. Google made its monster Spanner database available to everybody via the Cloud Spanner API. Spanner has the potential to make obsolete armies of in-house IT staff maintaining primary and DR sites, copying, synchronizing, backing up and restoring, GoldenGate-ing, and generally engaging in futile attempts to serve consistent data across the enterprise.
Spanner is the ultimate database: automatically geo-replicated, monstrously scalable, ACID-compliant, and relational (SQL support). In other words, it is a true cloud relational database. It might finally push hesitant enterprises into the public cloud, as the advantages far outweigh general concerns about security and loss of control over data and infrastructure. Spanner is a generational leap over established RDBMS vendor offerings (Oracle, IBM, Microsoft).

Another Google-driven development is the release of TensorFlow 1.0. It confirms TensorFlow's traction and fast pace of innovation, as well as Google's commitment to it, with major performance improvements (the XLA compiler) and usability improvements (layers, packaged prebuilt models, new APIs, the announced integration of Keras with TensorFlow, the Estimator API).

In another development, Cloudera and the Apache community announced the Apache Kudu 1.0 release. Kudu fills an important gap in the Hadoop ecosystem: it is an engine with SQL support and random read/write capability. Hadoop/HDFS files can only be appended to (with the exception of the MapR distribution), which ripples through the whole ecosystem and causes problems (no updates, no indexes, or ugly workarounds for the same). HBase offers random reads and writes but never caught on, partly because it lacks a SQL interface (Phoenix tried to solve this, but that scaffolding is too flimsy and fractured, even by Hadoop standards).
With Kudu we can expect a further Hadoop-ecosystem onslaught on Teradata, Netezza, and Exadata turf.

In this little post we also address critiques that Machine/Deep Learning is a solution in search of a problem. Yes, it is, and no, it is not the first time that approach has been taken, with great outcomes. The transistor, the computer, and satellites were also invented first and found their many applications later. Just imagine how the top-down, pick-a-problem-then-find-a-solution approach would work if banks, for example, said: we have a problem handling all this data; why don't we create a project in which we will invent the transistor, then build computers, compilers, languages, and databases to help manage our business.
So yes, Machine Learning is a bottom-up approach. ML/DL is a foundational technology that is becoming a solution to many problems (for example, an RNN/LSTM can act as a generic computer, able to compute anything a conventional computer can compute, and is thus a basic building block). We have a hammer searching for nails it can successfully hit, and it looks like there are many out there.

Friday, January 27, 2017

Enterprise Grade Risk Modeling Using Machine Learning

All the pieces of the puzzle are now in place for productive, successful large-scale risk modeling using Machine Learning. A slew of recent software and hardware announcements means we finally have a full, brand-new stack of components for a productive enterprise-class Machine Learning risk modeling exercise. Google's TensorFlow makes it possible to quickly train, test, and run predictive models on a variety of target devices (CPU, GPU). The latest TensorFlow releases incorporate the tf.learn library, which makes it much easier to extract features and pass datasets on to the train, test, and predict modules. This trend towards ease of use will continue with the announced incorporation of Keras into the TensorFlow build. On the hardware side, IBM just released the PowerAI platform/appliance that, aside from TensorFlow, also incorporates Nvidia hardware and software (GPUs, CUDA, NVLink).

Monday, January 02, 2017

A Comical Break: Moody's Prefers Intuition or Economic Theory to Machine Learning Because It Is Important to Have Theoretical Underpinnings

Here is Moody's (2012) take on variable (feature) selection in the context of risk analysis (Methodology for Forecasting and Stress-Testing U.S. Vehicles ABS Deals):

A key aspect of model development is variable selection - identifying which credit and economic variables best explain the dynamic behavior of the dependent variable in question. Aligned with principles of modern econometrics, we prefer to choose the variables based on a combination of economic theory or intuition, together with a consideration of the statistical properties of the estimated model.
We believe models built using pure data-mining techniques or principles such as machine learning, though they may fit the existing data well, are more likely to fail in a changing external environment because they lack theoretical underpinnings. The best prediction models employ a combination of statistical rigor with a healthy dose of economic principle. Models built this way enjoy the additional benefit of ease of interpretation.

I am not sure how they can make the above claim with a straight face. Moody's is one of the agencies that completely failed to predict the 2008 housing-originated crash. There are no known, scientific, or even commonly agreed-upon "theoretical principles" or economic theory here. It is now clearer why the agencies have a problem with prediction, which is hard, especially about the future. The FCIC commission found that the agencies' credit ratings were influenced by "flawed computer models, ...". Yet they stick to the same practices. Continuing:

Adding each economic variable helps the model improve predictive power. Generally speaking, the economic variables should be useful in both producing accurate out-of-sample forecasts and providing good in-sample fit. However, we sometimes have to make tradeoff decisions to balance between these two goals when they are conflicting. If the discrepancy is unavoidable and very significant, we prioritize forecast accuracy rather than in-sample fit, as forecasts are the end results of our models.

Translated: when the above practice fails, we fudge by taking whatever works better, which is exactly the approach they (Moody's) dismissed earlier.

Here they finally convince us it is actually an alchemy approach, based on art and intuition (which doesn't prevent them from sprinkling in some scary-looking math, just for artistic impression):

And here Moody's finally leaves no shadow of a doubt that we are dealing with artists, entertainers, and illusionists:
Variable selection is more art than science. The criteria mentioned above are not black or white. The bottom line is to build a theoretically sound and empirically workable model and get reasonable and consistent forecasts that are supported by both economic intuition and statistical significance.

To their credit, and unlike many in-house modeling practices, Moody's actually checks how the model performs, though they rarely admit the model is wrong:

The consistency check is the comparison of model performance across different production runs. We keep track of the model performance by comparing the forecast statistics over time. The results of the analysis may suggest revisions to the model. However, differences do not necessarily indicate that the model is in error. We should look into what causes the discrepancy and how this affects the end results. If the statistics get really worse and fall into an unacceptable range, we should modify the original model to accommodate revised performance data and changing economic conditions and make sure that the model reflects the most recent development in the auto ABS market.