For several decades, SQL and relational databases have provided a solution to two quite different categories of data-intensive computing: online transaction processing (OLTP) and data mining (DM).
OLTP requires real-time responsiveness and guaranteed concurrency control, but it is not a “big data” or a “big state” application. In most cases, for example in credit/debit handling, OLTP is dealing with relatively small amounts of data and small amounts of state per user, although there may be very demanding atomicity, consistency, and fault tolerance requirements (ACID compliance) that necessitate complex locking procedures. So we can classify OLTP as Big Concurrency and Low Latency, but not Big Data or Big State.
DM, on the other hand, deals with big data, but there is no live, real-time state that needs to be maintained, and no continuously changing data. DM, whether carried out using a SQL database, or MapReduce/Hadoop, or some SQL/Hadoop hybrid such as CloudBase/Hive/Aster/Greenplum/Cascading, is essentially an offline process, with no complex state management or complex concurrency requirements. So it’s Big Data, but not Big Concurrency, Big State or Low Latency.
The requirements of OLTP and DM are so different that it is remarkable that, until recently, relational databases and SQL were almost exclusively used for both categories of problems. Today, however, as the scale of data-intensive computing is growing rapidly, many of the SQL vendors are only targeting their products at the DM area, where they are now in serious competition with MapReduce/Hadoop and other flat-file/non-database alternatives.
Over the past few years, a new “third category” of extremely challenging data-intensive problems has emerged, for which neither OLTP nor DM provides any kind of solution. Given the huge commercial importance of this area, and the explosive rate at which it is growing, it is astounding that virtually nothing has been written about it, in contrast to, say, what has been written about MapReduce, a simple data-parallel programming model. In the absence of any name for this new area I will refer to it as “ultraparallel computing” (UPC).
I’ve written about the challenge of UPC a number of times in this blog over the past two years. For example, in Cloud Dataflow I wrote:
“apps that do complex analysis on the live data streaming out from the 20 million simultaneous users on a social networking site (social graph info, recent communications and actions, current location,…), or from the real-time market data, news and blog information streaming out on 5000 public companies, or from the software monitoring a national telco network carrying 10 million simultaneous calls”
So, what is ultraparallel computing? UPC refers to a broad, and rapidly growing, category of continuously running data-intensive computations that are characterized by four requirements:
- Big Data. Need to handle torrential streams of live and historical data.
- Big Concurrency. Need to handle huge numbers (thousands, millions, billions) of concurrent data generators or concurrent entities, e.g. twenty million concurrent Facebook users. Other types of data generators/entities might be Amazon shoppers, Twitter users, IP addresses, sensors, Live Mesh devices, or NYSE companies.
- Big State. Need to handle live, complex, evolving state for each data generator or entity, e.g. live state for each of the twenty million concurrent Facebook users might include current physical location, social graph profile - likes, dislikes, friends, interests, recent messages sent/received, recent news items and pages viewed, customer history, search history, ...
- Low Latency. Need to run continuously, with instant, intelligent response to real-time changes in the live data. Examples: (a) detect an opportunity for high-impact mobile advertising, using data on the current location and live profile of a specific iPhone service user (one of ten million concurrent service users), and respond with an optimally targeted ad to that user within eight seconds; (b) detect the threat of web crime via live analysis of complex patterns in data from millions of IP addresses, sensors and datacenter monitoring tools, and respond and stop the threat within five seconds; (c) detect an opportunity to profit from buying stock A and selling stock B, based on live analysis of market data on 8000 companies together with sentiment analysis of live datafeeds carrying posts on traded companies from millions of blogs and hundreds of mainstream news services, and respond with a trade within three seconds.
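To make the four requirements a little more concrete, here is a minimal Python sketch of the shape of such a computation: live state kept per entity, updated by every incoming event, with a response decided as soon as the event arrives. It is only a single-process toy and not a description of any real system. The names `EntityState`, `LiveStateStore`, `react` and `process`, the event fields, and the "near_store" rule are all hypothetical, invented purely for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityState:
    """Live state for one data generator (hypothetical fields)."""
    location: Optional[str] = None
    # Bounded history so per-entity state cannot grow without limit.
    recent_events: deque = field(default_factory=lambda: deque(maxlen=100))


class LiveStateStore:
    """Keeps one evolving EntityState per entity id (Big Concurrency, Big State)."""

    def __init__(self):
        self._states = defaultdict(EntityState)

    def apply(self, entity_id, event):
        # Update the entity's state in place and return it.
        state = self._states[entity_id]
        if "location" in event:
            state.location = event["location"]
        state.recent_events.append(event)
        return state

    def __len__(self):
        return len(self._states)


def react(state):
    """Toy rule (an assumption): respond when a user is near a store."""
    if state.location == "near_store":
        return "send_offer"
    return None


def process(store, stream):
    """Handle events one at a time, deciding on a response per event (Low Latency)."""
    responses = []
    for entity_id, event in stream:
        state = store.apply(entity_id, event)
        action = react(state)
        if action is not None:
            responses.append((entity_id, action))
    return responses
```

For example, feeding `process` a stream such as `[("u1", {"location": "home"}), ("u2", {"location": "near_store"})]` would update the state of two entities and produce a response only for `u2`. The hard part of UPC is everything this toy leaves out: doing the same thing for millions of entities, across many machines, within a few seconds.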
The table below summarizes the characteristics of the three categories of data-intensive computation.
| | OLTP | DM | UPC |
|---|---|---|---|
| Big Data | N | Y | Y |
| Big Concurrency | Y | N | Y |
| Big State | N | N | Y |
| Low Latency | Y | N | Y |
In future posts I’ll discuss various approaches to tackling this major new challenge in data-intensive computing. I’ll look at two possibilities where we attempt to teach a couple of old dogs, relational databases and MapReduce systems, some new tricks, e.g. using huge numbers (millions?) of ultra-lightweight databases for ultraparallel computing, or using a new generation of online low-latency MapReduce architectures. We’ll consider some of the issues involved in getting those approaches to work with the performance and scale required. I’ll also introduce Cloudscale’s dataflow approach to delivering ultraparallel computing to the mass market - a form of “consumer ultraparallelism” that is easy to use, easy to scale, and specifically designed from the ground up to handle many of the most demanding problems in this new category of ultraparallel computing. Cloudscale’s mission is to enable millions of computer users, for example the Excel power users, to become “ultraparallel programmers” without really noticing that they’ve acquired that skillset.
In the meantime, if you want to see the broad range of applications across business, web and government that require ultraparallel computing, then take a look at the introductory video on Cloudscale Applications here.