Jun 10

A central theme of this blog over the past few years has been the transition of business and the web to real-time, and the need for a new generation of powerful stream computing technologies to enable that transition. In recent months a large number of articles have been written about the real-time web in general, and about Twitter and Facebook in particular. Nova Spivack has written a long and interesting blog post on “The Stream” that covers a number of the points I would also make. The full article is here. The following are just a few of the points he makes:

  • “The Stream is the next phase of the Internet’s evolution… Perhaps the best and most current example of the Stream is the rise of Twitter, Facebook and other microblogging tools.”
  • “The era of the Web was mostly about the past - pages that were published months, weeks, days or at least hours before we looked for them. Search engines indexed the past for us to make it accessible… But in the era of the Stream, everything is shifting to the present - we can see new posts as they appear and conversations emerge around them, live, while we watch.”
  • “But how can we all keep up with this ever growing onslaught of information effectively? Will we each be knocked over by our own personal firehose, or will tools emerge to help us filter our streams down to manageable levels? … Human attention is a tremendous bottleneck in the world of the Stream… The ability to view different streams for different contexts is very important and enables us to filter and focus our attention effectively. As a result, it’s unlikely there will be a single activity stream — we’ll have many, many streams. And we’ll have to find ways to cope with this reality.”
  • “A Stream oriented Internet also offers new opportunities for monetization. For example, new ad distribution networks could form to enable advertisers to buy impressions in near-real time across URLs that are trending up in the Stream, or within various slices of it. For example, an advertiser could distribute their ad across dozens of pages that are getting heavily retweeted right now. As those pages begin to decline in RT’s per minute, the ads might begin to move over to different URLs that are starting to gain… Ad networks that do a good job of measuring real-time attention trends may be able to capitalize on these trends faster and provide better results to advertisers.”
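
The ad-placement idea in the last quote depends on measuring retweets per minute for each URL and following the URLs that are currently gaining. As a rough illustration only, here is a minimal Python sketch of that measurement using a sliding time window. The class name `RetweetTrendTracker`, its methods, and the example URLs are all hypothetical; nothing here comes from Spivack's article or from any real ad network.

```python
from collections import defaultdict, deque


class RetweetTrendTracker:
    """Counts retweets per URL over a sliding time window (illustrative only)."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # url -> timestamps of recent retweets

    def record(self, url, timestamp):
        """Register one retweet of `url` at `timestamp` (seconds)."""
        self.events[url].append(timestamp)

    def rate(self, url, now):
        """Number of retweets of `url` in the window (now - window_seconds, now]."""
        timestamps = self.events[url]
        # Drop retweets that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        return len(timestamps)

    def top_urls(self, now, k=3):
        """The k URLs with the most retweets in the current window."""
        rates = {url: self.rate(url, now) for url in list(self.events)}
        ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
        return [url for url, count in ranked[:k] if count > 0]


tracker = RetweetTrendTracker(window_seconds=60)
for t in (0, 10, 20):
    tracker.record("a.example/post", t)
for t in (50, 55, 58, 59):
    tracker.record("b.example/post", t)

# At t=60 the second URL has more retweets in the last minute, so an ad
# would follow it; once its retweets age out of the window it drops off.
print(tracker.top_urls(now=60, k=1))   # ['b.example/post']
print(tracker.top_urls(now=120, k=1))  # []
```

A production system would need to handle out-of-order events and far larger volumes; the sketch assumes timestamps arrive in order for each URL.
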
Jun 10

Over the past few years we’ve watched as single-core chips were replaced by multicore, and we’re now well on our way to the manycore age. Similarly, we’ve watched single-processor servers give way to multiprocessor servers, and now to massively parallel servers.

These changes are causing us to rethink what we mean by a “computer system”. Back in the heyday of Sun Microsystems, their defining phrase was “The Network Is The Computer”. This phrase helped explain the important trend that was taking place at that time in networked computing. In recent years, Google has shown that we are now in a new phase where “The Datacenter Is The Computer”. Google’s MapReduce, and open source clones such as Hadoop, have shown that we can now program with whole datacenters.

What’s next?

We’re rapidly heading to a world in which there will be three main types of clouds: general-purpose clouds, special-purpose clouds, and private clouds. Within five years we might have dozens of general-purpose clouds, hundreds of special-purpose clouds, and thousands of private clouds. General-purpose clouds will be offered by major vendors such as Amazon, Microsoft, Google, HP and others. Special-purpose clouds will be created to provide the specific data and services required in particular vertical markets, in particular public sector areas, and in particular areas of the web. We can expect to see rapid growth in financial clouds, small business clouds, healthcare clouds, education clouds, energy clouds, government clouds, defense and intelligence clouds, social web clouds, sensor clouds, research clouds etc. The biggest growth area, however, will undoubtedly be private clouds, as major companies and government agencies around the world deploy cloud computing architectures internally within their organizations.

At Cloudscale we see the emergence of a new style of computing which we might call “Multicloud Computing” or “Manycloud Computing”. As in the case of manycore programming and datacenter-level programming, we need new programming models and architectures that enable large numbers of users to develop and deploy manycloud computing applications. Just as MapReduce has enabled large numbers of users to easily write programs that run across thousands of servers and whole datacenters, we need simple mechanisms that allow users to develop apps that run across multiple clouds. In cloud computing it’s vital that the computation is located close to the data, so since data will be spread across many clouds, we will require computation to be spread across those clouds too.
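
To make the "computation close to the data" point concrete, here is a minimal Python sketch, not Cloudscale's actual approach: each cloud computes a small summary next to its own data, and only those summaries travel to a coordinator to be merged. The cloud names, the sample events, and the functions `local_summary`, `merge_summaries` and `multicloud_count` are all assumptions made up for illustration, and each "cloud" is simulated as an in-memory list.

```python
from collections import Counter

# Hypothetical data: each "cloud" holds its own events locally.
CLOUD_DATA = {
    "general_cloud": ["login", "purchase", "login"],
    "healthcare_cloud": ["login", "appointment"],
    "private_cloud": ["purchase", "purchase"],
}


def local_summary(events):
    """Runs inside one cloud, next to its data; returns a small summary."""
    return Counter(events)


def merge_summaries(summaries):
    """Runs at a coordinator; sees only the per-cloud summaries, not raw data."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


def multicloud_count(cloud_data):
    """Counts events across clouds by merging locally computed summaries."""
    summaries = [local_summary(events) for events in cloud_data.values()]
    return merge_summaries(summaries)


print(multicloud_count(CLOUD_DATA))
```

The design choice being illustrated is simply that the raw events never leave their cloud; only the much smaller counts do. Real deployments would also have to deal with authentication, failures and latency between clouds, which this sketch ignores.
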

The simple and seamless coordination of stream computing and database/datastore computing across multiple clouds presents an exciting and important set of new challenges and opportunities for cloud computing. I’ll write some more on this in the coming weeks.

Jun 09

At Cloudscale we’re always looking to hire great software developers for our team. The following was just posted on our website and may be of interest to some of the readers of this blog.

“Cloudscale is looking to complement its outstanding Linux development team with new hires that will enable us to bring the full power of the Cloudscale platform to Windows users. If you’re a software systems developer with experience of Visual Studio, .NET, C#, VSTO and interested in working on a breakthrough cloud analytics platform for Windows Azure then send your resume and details of your dev skills in one or more of the above areas to jobs@cloudscale.com

To recruitment agencies: Cloudscale doesn’t accept agency resumes, so please don’t send any. Cloudscale is not responsible for any fees associated with unsolicited resumes.”

Jun 09

Being involved with a company at the intersection of cloud computing, the real-time web, and scalable data analytics has made it tough to find much time for blogging so far in 2009. Given the incredible amount that’s going on in this space at the moment, I’m looking to get back into my stride again at CloudN over the next few weeks. The focus will be on some exciting areas of next-generation cloud computing that are still “under the radar” at the moment. I’ll be touching on new directions in areas such as multicloud/manycloud computing, multistream analytics, smart infrastructure, real-time government clouds, and desktop-cloud fusion.

Jan 21

From Cloudscale News.

I’ll be speaking at two important conferences in the next couple of months - at “Ahead In The Clouds” in San Diego, where the focus is on the technologies and platforms that will power the next generation of cloud computing, and in the Hot Topics Session at the Cloud Computing Expo, New York.

Dec 30

For several decades, SQL and relational databases have provided a solution to two quite different categories of data-intensive computing - online transaction processing (OLTP) and data mining (DM).

OLTP requires real-time responsiveness and guaranteed concurrency control, but it is not a “big data” or a “big state” application. In most cases, for example in credit/debit handling, OLTP is dealing with relatively small amounts of data and small amounts of state per user, although there may be very demanding atomicity, consistency, and fault tolerance requirements (ACID compliance) that necessitate complex locking procedures. So we can classify OLTP as Big Concurrency and Low Latency, but not Big Data or Big State.

DM, on the other hand, deals with big data, but there is no live, real-time state that needs to be maintained, and no continuously changing data. DM, whether carried out using a SQL database, or MapReduce/Hadoop, or some SQL/Hadoop hybrid such as CloudBase/Hive/Aster/Greenplum/Cascading, is essentially an offline process, with no complex state management or complex concurrency requirements. So it’s Big Data, but not Big Concurrency, Big State or Low Latency.

The requirements of OLTP and DM are so different that it is remarkable that, until recently, relational databases and SQL were almost exclusively used for both categories of problems. Today, however, as the scale of data-intensive computing is growing rapidly, many of the SQL vendors are only targeting their products at the DM area, where they are now in serious competition with MapReduce/Hadoop and other flat-file/non-database alternatives.

Over the past few years, a new “third category” of extremely challenging data-intensive problems has emerged for which neither OLTP nor DM provides any kind of solution. Given the huge commercial importance of this area, and the explosive rate at which it is growing, it is astounding that virtually nothing has been written about it, in contrast to, say, what has been written about MapReduce, a simple data-parallel programming model. In the absence of any name for this new area I will refer to it as “ultraparallel computing” (UPC).

I’ve written about the challenge of UPC a number of times in this blog over the past two years. For example, in Cloud Dataflow I wrote about:

“apps that do complex analysis on the live data streaming out from the 20 million simultaneous users on a social networking site (social graph info, recent communications and actions, current location,…), or from the real-time market data, news and blog information streaming out on 5000 public companies, or from the software monitoring a national telco network carrying 10 million simultaneous calls”

So, what is ultraparallel computing? UPC refers to a broad, and rapidly growing, category of continuously running data-intensive computations that are characterized by four requirements:

  • Big Data. Need to handle torrential streams of live and historical data.
  • Big Concurrency. Need to handle huge numbers (thousands, millions, billions) of concurrent data generators or concurrent entities, e.g. twenty million concurrent Facebook users. Other types of data generators/entities might be Amazon shoppers, Twitter users, IP addresses, sensors, Live Mesh devices, or NYSE companies.
  • Big State. Need to handle live, complex, evolving state for each data generator or entity, e.g. live state for each of the twenty million concurrent Facebook users might include current physical location, social graph profile - likes, dislikes, friends, interests, recent messages sent/received, recent news items and pages viewed, customer history, search history,..
  • Low Latency. Need to run continuously, with instant, intelligent response to real-time changes in the live data. Examples: (a) detect an opportunity for high-impact mobile advertising, using data on current location and live profile of a specific iPhone service user (one of ten million concurrent service users), respond with optimally targeted ad to user within eight seconds, (b) detect threat of web crime via live analysis of complex patterns in data from millions of IP addresses, sensors and datacenter monitoring tools, respond and stop threat within five seconds, (c) detect opportunity to profit from buying stock A and selling stock B, based on live analysis of market data on 8000 companies together with sentiment analysis of live datafeeds carrying posts on traded companies from millions of blogs and hundreds of mainstream news services, respond with trade within three seconds.
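
The combination of Big State and Low Latency can be pictured as a loop that updates one entity's live state on every incoming event and checks a rule immediately. The following Python sketch is a toy, single-process illustration under stated assumptions: the names `EntityState` and `LiveStateEngine`, the event kinds, and the "downtown plus coffee" rule are hypothetical, and it says nothing about how such a system would be made to scale to millions of concurrent entities.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityState:
    """Live state kept for one entity, e.g. one user (illustrative fields)."""
    location: Optional[str] = None
    recent_pages: deque = field(default_factory=lambda: deque(maxlen=5))


class LiveStateEngine:
    """Updates per-entity state on every event and checks a rule right away."""

    def __init__(self, on_match):
        self.state = defaultdict(EntityState)  # entity id -> live state
        self.on_match = on_match               # callback fired when the rule holds

    def handle(self, entity_id, kind, value):
        state = self.state[entity_id]
        if kind == "location":
            state.location = value
        elif kind == "page_view":
            state.recent_pages.append(value)
        # Hypothetical rule: the user is downtown and recently viewed a coffee page.
        if state.location == "downtown" and "coffee" in state.recent_pages:
            self.on_match(entity_id, state)


matches = []
engine = LiveStateEngine(on_match=lambda uid, state: matches.append(uid))
engine.handle("u1", "page_view", "coffee")
engine.handle("u2", "location", "downtown")
engine.handle("u1", "location", "downtown")
print(matches)  # ['u1']
```

Note that the rule is re-evaluated on every later event for the same entity, so a real system would also need to decide when a response has already been sent.
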

The table below summarizes the characteristics of the three categories of data-intensive computation.

                   OLTP   DM    UPC
Big Data            N     Y     Y
Big Concurrency     Y     N     Y
Big State           N     N     Y
Low Latency         Y     N     Y

In future blogs I’ll discuss various approaches to tackling this major new challenge in data-intensive computing. I’ll look at two possibilities where we attempt to teach a couple of old dogs, relational databases and MapReduce systems, some new tricks, e.g. using huge numbers (millions?) of ultra-lightweight databases for ultraparallel computing, or using a new generation of online low-latency MapReduce architectures. We’ll consider some of the issues involved in getting those approaches to work with the performance and scale required. I’ll also introduce Cloudscale’s dataflow approach to delivering ultraparallel computing to the mass market - a form of “consumer ultraparallelism” that is easy to use, easy to scale, and is specifically designed from the ground up to handle many of the most demanding problems in this new category of ultraparallel computing. Cloudscale’s mission is to enable millions of computer users, for example the Excel power users, to become “ultraparallel programmers” without them really noticing that they’ve acquired that skillset.

In the meantime, if you want to see the broad range of applications across business, web and government that require ultraparallel computing, then take a look at the introductory video on Cloudscale Applications here.

Dec 28

It’s the time of year to look ahead, the time for predictions. Here’s mine…

In recent years, information overload has come to be regarded as a massive and growing problem for all of us, both in our personal lives and in our work. Keeping up with emails, blogs, status updates, Twitter streams and newsfeeds has become exhausting for many, and some have given up, declaring “email bankruptcy” and deleting the contents of their Inbox, with its 10,000 urgent, high-priority but unread messages. Businesses are in a similar situation, facing torrential streams of raw data about customers, marketing, advertising, sales and distribution from a growing array of enterprise software systems and data warehouses. Like the rest of us, businesses are just letting most of this data fall on the floor, as they have no means of handling it all. Many corporate data warehouses have become essentially a “write-only” part of the infrastructure. Data goes in every minute of every day, in exponentially increasing quantities, but nothing much of real actionable business value ever comes out.

My prediction for 2009 is that it will be the “Year Of Data”, and that we will begin to regard huge volumes of data not as a huge problem, but as a huge opportunity, both in our personal lives and in our work. As the scale of this opportunity is increasingly recognized, we will see that cloud computing offers the means of maximizing this opportunity, and start to see a major shift away from in-house databases and data warehouses, and towards a world in which almost all large-scale data processing moves to the cloud. Here’s why…

Data is revolutionizing how we live and work. It’s the energy powering modern business, and it’s growing exponentially, from exabytes to zettabytes. For thirty years, in-house databases and data warehouses have been used to extract, store and query structured data, but they are no longer able to deliver what’s required today in business, web, science and government. Businesses now need to continuously analyze and process exploding volumes of live data in real-time in order to be able to act immediately on opportunities and threats. Latency is a killer. Downtime is deadly. In delivering new web experiences, in marketing intelligence, and in new forms of mobile advertising, we also need to continuously harness and exploit massive volumes of real-time data from social networks, location devices, lifestreams, newsfeeds and blogs, in order to achieve maximum impact. If data is the energy powering business, then live data is the most powerful form of that energy.

The shift from in-house to cloud computing, and the exponential growth of parallel processing, from multicore to manycore, present us with a fantastic opportunity to radically rethink IT architecture. We can design a new generation of cloud-based architectures that are much easier to use, easier to scale, and able to process the torrential streams of live data now flooding out from the web, enterprise software, social networks and sensors while that data is still hot. At Cloudscale (www.cloudscale.com) we’re developing the first products that will enable customers to “activate their data” with this new kind of massively parallel cloud computing architecture, and enable them to discover the magic of live data analytics and what it can do for competitive advantage.

Nov 26

Twenty years ago, “advanced information retrieval” was a tiny niche activity, carried out by IT experts on behalf of users. Today, Google and other search engines provide self-service advanced information retrieval to billions of users around the world. Advanced business intelligence is similarly about to undergo dramatic change and democratization over the next few years. The new BI will be:

  • Self-service, not controlled by IT departments.
  • Cloud-based.
  • Powered by a new generation of massively parallel in-memory architectures.

Earlier this year, Gartner summarized the reasons for this disruptive new direction in BI:

  • “By 2012, emerging technologies will make it easier to build and consume analytical applications… Individuals and workgroups will be less dependent on central IT departments to meet their BI requirements.”
  • “BI is used aggressively by just 15 to 20 percent of business users. For the BI sector to thrive, it needs to overcome the fact that most business users feel BI tools are hard to use… Other technologies, such as personal productivity, collaboration and Internet search have been widely adopted by mainstream users in both their business and personal lives. BI has the same opportunity for massive adoption, but it must overcome its well-earned reputation of being difficult to use.”
  • “Because BI explores huge amounts of data, it has traditionally relied on IT to build aggregate and summary tables to optimize performance on disc-based data storage. This requirement to build a performance layer impeded self-service BI. Falling memory prices and the prevalence of 64-bit computing is making in memory analytics a more attractive alternative. With this approach, business users no longer require IT to build a performance layer.”
  • “Smaller companies that lack the base of investments in BI systems will increasingly turn to service companies to deliver services that integrate, analyze and report on data from numerous systems. Wider adoption of SaaS business models will make analytical applications more widely used, particularly among midsize companies… The increasing trend toward business process outsourcing and cloud computing will only accelerate this trend, enabling the delivery of BI-related information and analysis for particular subject area domains via the SaaS model.”

QlikTech, an early leader in the self-service desktop BI space, now faces the prospect of serious competition from Microsoft with its Gemini and Madison projects. As scale, parallelism, and real-time response become critical in the BI space, other companies like Cloudscale are also looking to take advantage of the looming disruption in this huge ten-billion-dollar market.

Nov 25

Jeff Jarvis has an interesting article in The Guardian on the new post-meltdown digital economy that is now about to unfold. He argues that Google displays many of the key requirements for success in this new economy, and can be a role model for other companies looking to achieve rapid growth:

  • “In this crisis, we are witnessing more than the failure of mortgages, derivatives, banks, and regulation. We are also seeing the dawn of a new economy; one best viewed and understood through the lens of Google, the one company that – by design or by luck – is built for the emerging world order.”
  • “Google itself is built on a derivative: its data on data. Like the derivatives that got us into this mess, Google’s are based on creating abundance. But unlike those corrupted financial products, Google’s metaknowledge creates new and real value.”
  • “To succeed like Google, companies will build networks and platforms as it does.”

In my post yesterday I remarked that

  • “Data is the energy powering modern business… and live data is the most powerful form of that energy.”

Following Jeff’s lead, I can now describe my new company succinctly in a simple phrase: Cloudscale is built on a derivative: its live data on live data. It has a kind of ring to it! A bit like Cisco’s original tagline “We network networks”.

Leaving Cloudscale and cute phrases to one side for a moment, the important point Jeff is making is that knowledge about data (both live and historical) is now the key driver for success in this new economy. Whatever business you’re in, you ought to be thinking hard about how you can extract every piece of knowledge and insight, as fast as possible, from the torrents of data that you now have pouring out everywhere in your organization, i.e. thinking hard about how you can quickly become more Google-like in harnessing that deluge. [If you're already thinking along those lines then it might be a good idea to get in touch with us at Cloudscale.]

Nov 25

The recent explosion of interest in cloud computing has triggered a long overdue re-examination of all aspects of computing. One of the most important of these is the way in which we interact with IT systems. The consumerization of software, with its drive to produce interfaces that are much easier to learn and to use, is also forcing us to revisit how we design user interfaces.

Today, the desktop is still the standard metaphor for human computer interaction. Like some kind of pre-internet librarian or administrator, desktop interfaces require us to operate in a world of files, folders and directories, keeping track at all times of where (mostly static) information can be found.

Is the desktop an appropriate metaphor for cloud computing in a world of massive information overload? That seems unlikely. I have blogged here in the past on the shift from information to attention in IT, and I had planned to write again about the challenge of developing a new generation of “Cloud User Interfaces”, but Nova Spivack has saved me the effort. A few months ago he wrote a long post on the future of the desktop that covered several of the main points I would also have made. While you may not agree with all of his many points, I certainly recommend reading the article as an interesting and thought provoking personal statement on the subject. The following two general points from the article are ones I have also made in the past, but which nevertheless merit repetition:

The focus of the desktop will shift from information to attention.

“we will see a shift from organizing information spatially (directories, folders, desktops, etc.) to organizing information temporally (feeds, lifestreams, microblogs, timelines, etc.). The Web is constantly changing and the biggest challenge is not finding information, it is keeping up with it.

The desktop of the future is… going to feel more like an RSS feed reader or a social news site than a directory. The focus will be on helping the user to manage and keep up with all the stuff flowing in and out of their environment.”

Users are going to shift from acting as librarians to acting as daytraders.

“the scarcest resources will no longer be storage or bandwidth, it will be attention… we are going to increasingly rely on tools that help us manage our attention more productively — rather than tools that simply help us manage our information.

It is a shift from the mindset of being librarians to that of being daytraders. In the PC era we… were acting as librarians. Filing things was a big hassle, and finding them was just as difficult. But today filing information is really not the problem: Google has made search so powerful and ubiquitous that many Web users don’t bother to file anything anymore - instead they just search again when they need it.

Instead we are now struggling to cope with a different problem - the problem of filtering for what is really important or relevant now and in the near-future. With limited time and attention, we have to be careful what we look for and what we pay attention to. This is the mindset of the daytrader. Bet wrong and you could end up wasting your precious resources, bet right and you could find the motherlode before the rest of the world and gain valuable advantages by being first. Daytraders are focused on discovering and keeping track of trends. It’s a very different focus and activity from being a librarian, and it’s what we are all moving towards.”