There is remarkable unanimity among industry analysts and thought-leaders on the nature of big data. Rather than a phenomenon of volume alone, big data is almost universally described as having the dimensions of velocity and variety as well. The term velocity recognizes the speed with which many types of big data, such as sensor output or social network interactions, are generated. Variety recognizes the many forms that big data can take, from very compact “blips” from sensors or clickstreams to text documents to multi-gigabyte geospatial images.
A fourth “v” that many analysts also recognize is “variability.” The variability dimension is a somewhat subtle concept that recognizes that data not only takes different forms, but also varies according to where it is managed, whether its use and distribution are restricted or public, who owns it, whether or not it is time-sensitive, and many other factors. Variability is especially important in large enterprises with information in many different applications. It is less important when all of your data comes from a single source, such as a social network, web clicks or sensors.
So big data is neither uniform nor homogeneous, and it “lives” in different places. And of course the questions we want to answer using big data differ radically. This theme came to the forefront earlier this month when Vivisimo held a “Tech Day” gathering of customers, partners and technologists from around the Washington DC area to focus on a number of topics, including big data. As part of the Tech Day exercise, we took a close look at the ecosystem of tools available to manage and extract value from big data. It turns out that, just as it took a while for most people to recognize that big data can’t be measured by volume alone, most are still coming to grips with the array of tools and techniques available to exploit big data, and are still learning when and how to use them. These tools need to be selected and applied differently depending on each organization’s big data and what it hopes to gain from that data. It helps to evaluate an organization’s requirements along the four dimensions: volume, velocity, variety and variability.
Today any discussion of big data inevitably centers on Hadoop, the open source project named after a toy elephant that belonged to the son of its co-creator, Doug Cutting. While Hadoop does seem to occupy the center of the big data universe, it’s not a panacea. There are many pieces of the puzzle, including some within the Apache Hadoop project and some from the commercial world, that handle the varied demands of big data.
Getting back to our Tech Day discussion, we looked at how a number of different tools—some open source and some commercial—address the four Vs. As part of the exercise we mapped those on a set of diagrams to show how each tool measures up against each of the four dimensions. This is covered in detail in an upcoming Vivisimo white paper, but here’s a preview based on our Tech Day exercise.
For starters, take Hadoop, and throw into the mix its complementary components MapReduce and the Hadoop Distributed File System (HDFS). The Hadoop/MapReduce/HDFS triumvirate scores high on handling volume and variety, but low on velocity and variability. In other words, while Hadoop provides the capability to distribute and process large amounts of heterogeneous data, as a batch-oriented system it is not designed to handle velocity in the form of rapid ingest of data or interactive analysis. It also lacks connectors or interfaces to various enterprise systems and security models, and as a result is not strong when it comes to variability.
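To make the batch-oriented point concrete, here is a minimal sketch of the map, shuffle and reduce pattern in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_batch) and the sample records are made up for illustration. Note that the job only produces an answer once the whole batch has been collected and processed, which is why this style is a poor fit for rapid ingest or interactive analysis.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_batch(records):
    """Run the whole job over a complete, already-collected batch of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input: three short text records.
    batch = ["sensor blip", "sensor reading", "clickstream blip"]
    print(run_batch(batch))
```

In a real cluster the map and reduce steps run in parallel across many machines, with the input split across HDFS blocks, but the shape of the computation is the same.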
On the other hand, traditional relational database management systems are designed to rapidly process transactions involving structured data in neat rows and columns, joined by keys that associate different tables of data, so they address the need for speed for certain types of data. However, traditional RDBMSs essentially “break” when called upon to process data sets with many columns in a single row, no fixed schema, and varying data types. This is where the so-called NoSQL databases (the name is commonly read as “not only SQL”) such as HBase (another Apache project in the Hadoop family) come into play. By their nature, databases fall short on the variability axis, since they are designed to process data stored inside the database, or in similar databases in a network or distributed configuration.
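The ratings discussed so far can be written down as a small lookup table, which makes it easy to check a tool against what an organization actually needs. This is only a sketch: the names DIMENSIONS, PROFILES and gaps are invented here, the high/low values are my reading of the paragraphs above rather than measured scores, and None marks a dimension the discussion does not rate.

```python
from typing import Dict, Iterable, List, Optional

DIMENSIONS = ("volume", "velocity", "variety", "variability")

# Qualitative ratings as read from the discussion above (an interpretation,
# not benchmark results). None means the text gives no rating.
PROFILES: Dict[str, Dict[str, Optional[str]]] = {
    "hadoop_mapreduce_hdfs": {
        "volume": "high",
        "velocity": "low",
        "variety": "high",
        "variability": "low",
    },
    "traditional_rdbms": {
        "volume": None,
        "velocity": "high",  # for structured, transactional data
        "variety": "low",
        "variability": "low",
    },
}


def gaps(tool: str, needs: Iterable[str]) -> List[str]:
    """Return the needed dimensions on which the tool is not rated high."""
    profile = PROFILES[tool]
    wanted = set(needs)
    return [d for d in DIMENSIONS if d in wanted and profile[d] != "high"]


if __name__ == "__main__":
    # An organization that needs both volume and velocity.
    print(gaps("hadoop_mapreduce_hdfs", ["volume", "velocity"]))
```

A non-empty result points to a dimension where the tool would need to be complemented by something else in the ecosystem.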
Now suppose you need fast answers from an assortment of distributed data sets, some representing high volume and variety, some with high variability and velocity, such as a company’s supply chain and e-mail systems. This situation calls for a search platform that can access and process data from many different back-end systems, in addition to data you may have processed using Hadoop and stored in HDFS. The ability to maintain and manage indices of content across all of these systems and to query and fuse data from all of them without regard for schemas and formal data structures opens many options for exploiting big data. In short, it allows us to quickly answer some of the most difficult questions and discover relationships that would otherwise be difficult, if not impossible, to detect.
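As a rough illustration of querying across back-end systems without regard for schema, the sketch below builds a single inverted index over records from several sources. It is a toy and not any vendor's product: the FederatedIndex class, the source names and the sample records are all hypothetical, and a real search platform would add connectors, security checks, ranking and much more.

```python
from collections import defaultdict


class FederatedIndex:
    """A toy inverted index over records drawn from several back-end systems."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> {(source, doc_id), ...}
        self._docs = {}                    # (source, doc_id) -> original record

    def add(self, source, doc_id, record):
        """Index every value in a record, whatever fields it happens to have."""
        self._docs[(source, doc_id)] = record
        for value in record.values():
            for term in str(value).lower().split():
                self._postings[term].add((source, doc_id))

    def search(self, query):
        """Return sorted (source, doc_id) pairs that contain all query terms."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


if __name__ == "__main__":
    index = FederatedIndex()
    # Hypothetical records with different shapes from three different systems.
    index.add("supply_chain", "po-17",
              {"supplier": "Acme", "part": "valve", "status": "delayed"})
    index.add("email", "msg-3",
              {"subject": "Acme valve shipment", "body": "delayed until Friday"})
    index.add("hdfs", "log-9", {"line": "valve sensor nominal"})
    print(index.search("acme delayed"))
```

Because the index treats every record as a bag of terms, a single query can surface the purchase order and the e-mail that mention the same delayed supplier, even though the two systems share no schema.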
Looking at the ecosystem of tools available for addressing big data from the perspective of the four Vs can help to make sense of the big data ecosystem, and it is the first step toward tapping the enormous value that lies in big data. This is definitely a topic for further exploration and discussion.



Great to see the industry finally get its head around the 3Vs (or variants thereof) that I first defined over 10 years ago in a Gartner piece. For those who want to see or refer to the original piece: https://www.sugarsync.com/pf/D354224_7061872_35276.
Cheers,
Doug Laney, VP Research, Gartner
Thanks for this link Doug. I had seen mention of this report somewhere but didn’t have a copy. Very interesting read. As a history buff I love jumping into Peabody’s “Wayback Machine” and looking at a trend from the perspective of an earlier time. You really nailed it. Going forward I’ll be able to provide proper attribution for the model.