Friday, July 18, 2014

Big Data/Hadoop and Incumbent RDBMS Vendors - Saga Continues

All major RDBMS vendors (Oracle, IBM, Teradata) are developing and executing strategies for coping with the rise of Hadoop, as well as for riding the Big Data wave.
As far as Hadoop is concerned, the initial idea was to use simple RDBMS-Hadoop connectors and loaders and thus contain Hadoop, or reduce its role to that of a storage or perhaps ETL platform. We are now witnessing the next stage of Hadoop-related strategies: a proliferation of federated query engines such as Teradata QueryGrid, SAP HANA Smart Data Access, the recently announced Oracle Big Data SQL, etc.
Oracle Big Data SQL, Teradata QueryGrid and other federated query approaches over heterogeneous data (Composite Software and other data virtualization vendors don't belong to this category) have Hadoop in their cross-hairs, i.e. they are legacy vendors' attempts to cope with the inevitable rise of Hadoop as the centerpiece of Big Data initiatives.
Federated query engines typically originate queries from the respective legacy vendor's software and/or hardware platform. Oracle Big Data SQL, for example, runs on custom hardware only for now (it doesn't, actually - it is still vaporware as of this date, in the customary, time-honored Oracle manner); the Hadoop-related part is essentially a port of the Exadata cellsrv software to Hadoop datanodes. Distributed queries are executed across heterogeneous data sources (Hadoop, databases, etc.) with varying degrees of intelligence (predicate pushdown, local data processing via smart scan in Oracle's case), query optimization and performance.
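To make the predicate pushdown idea concrete, here is a minimal sketch in Python. It is not how any of the products above are implemented; `RemoteSource`, `federated_query` and the `rows_shipped` counter are hypothetical names invented for illustration. The sketch only shows why evaluating a filter at the data source, rather than at the query coordinator, reduces the number of rows that have to travel between systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class RemoteSource:
    """Hypothetical stand-in for a remote store (e.g. a Hadoop datanode)."""
    name: str
    rows: List[Row]
    rows_shipped: int = 0  # rows sent from this source to the coordinator

    def scan(self, predicate: Optional[Predicate] = None) -> List[Row]:
        # If a predicate is supplied, it is evaluated here, at the source,
        # so only matching rows are shipped.
        selected = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_shipped += len(selected)
        return selected


def federated_query(sources: List[RemoteSource],
                    predicate: Predicate,
                    pushdown: bool) -> List[Row]:
    """Run the same filter over all sources, with or without pushdown."""
    result: List[Row] = []
    for source in sources:
        if pushdown:
            # The filter travels to the data.
            result.extend(source.scan(predicate))
        else:
            # The data travels to the filter: ship everything, filter here.
            result.extend(r for r in source.scan() if predicate(r))
    return result
```

Both modes return the same rows; the difference shows up only in `rows_shipped`, which is the cost a federated engine tries to minimize when the sources are large and remote.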
The fundamental problem with this approach is that it centers Big Data activities in the wrong place.
Hadoop is synonymous with the Big Data initiative and is the hub around which other data sources will revolve, i.e. Hadoop is the system of record for Big Data. Big Data activities should not be centered around legacy data platforms like Oracle, Teradata, etc., which is exactly what the above-mentioned products enforce.
Federated query solutions like Oracle Big Data SQL cover only minor Big Data use cases (even if they deliver on performance, which is in itself a tough problem to solve in heterogeneous environments).
This class of products should be viewed as the legacy vendors' attempt to defend and expand their turf by leveraging a large installed base.
Some of these products are fairly advanced, as they build on decades of experience in data management and are backed by the huge financial and other resources of legacy vendors. The old guard can thus innovate on the Hadoop platform quite fast. Hadoop was not initially built for BI or corporate data management. Dedicated Hadoop vendors like Cloudera, relatively inexperienced (at least in the enterprise database management software development arena), are tiptoeing around basic DBMS concepts and rediscovering tricks that legacy vendors mastered over decades.
Not surprisingly, Teradata was one of the first vendors to release such a federated query solution (first called SQL-H, now QueryGrid) - probably because the potential Hadoop squeeze is felt most strongly in their niche of high-end, very large data warehouses, which also happens to be Hadoop's entry point into the DBMS market.
Hadoop and Big Data are new approaches to building a completely new analytic infrastructure and developing a whole new class of applications based on nearly infinite scalability and near-zero storage prices. While we can borrow some concepts and technologies from the old world, Big Data folks are also experimenting with newish concepts like schema-on-read that will redefine how we deal with all aspects of the analytics pipeline.
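For readers unfamiliar with schema-on-read, the following is a rough Python sketch of the idea, under stated assumptions: the raw events, the field names and the `read_with_schema` helper are all made up for illustration and do not correspond to any particular Hadoop tool. Raw records are stored exactly as they arrive, and each consumer applies its own schema only when reading.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw data is kept as-is; nothing is validated or typed at write time.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2014-07-18"}',
    '{"user": "bob", "amount": "3", "country": "US"}',
    'not json at all',
]

# A schema here is just: field name -> converter applied at read time.
Schema = Dict[str, Callable[[object], object]]


def read_with_schema(raw_lines: Iterable[str],
                     schema: Schema) -> Iterator[Dict[str, object]]:
    """Project raw lines onto a schema chosen by the reader.

    Unparsable lines are skipped; fields absent from a record become None.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


# Two consumers read the same raw data through different schemas.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
geo_view = list(read_with_schema(RAW_EVENTS, {"user": str, "country": str}))
```

The contrast with a traditional RDBMS is that no single schema had to be agreed upon before the data was loaded; the price is that every reader has to cope with missing fields and malformed records on its own.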
