"Big Data" is a very broad term, used often - and for a variety of different purposes. But it can be simply defined as any collection of data or datasets so complex or large that traditional data management approaches become unsuitable (see Cielen and Meysman 2016).
This definition gives us a framework for categorizing:
- the types of data that exist
- the business problems the data can be applied to and
- the methods used for capturing data and extracting insight from it
And, in turn, this framework helps us to understand:
- opportunities for leveraging new sources of data
- the risks that co-exist with the opportunities and
- where there are connections to more traditional market research approaches
The characteristics of Big Data
The term "Big Data" is often applied to datasets simply because of their size and/or their complexity. And some definitions are so broad that they lack clarity - for example, in 2014, the Executive Office of the President of the United States issued a report on Big Data that defined it as: "Large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future".
The typical industry description of Big Data rests on the "Three Vs", a framing that has its origins in META Group (now part of Gartner). The three Vs are the elements that fundamentally define Big Data: volume, variety and velocity.
While some commentators say all three elements should be present before data can be considered to be "Big Data", other researchers and clients believe that only one of these elements is needed.
As one might imagine from the use of the word "Big", the size of the data sources is a core characteristic, though there are few numerical measures of how large a dataset must be before it can be considered "Big". What matters is that the data is so large that traditional processing tools and techniques become ineffective.
By some estimates, more than 2.5 exabytes (2,500,000,000,000,000,000 bytes) of data are being produced each day. And this creates challenges beyond scale – this data comes in many varied forms, including:
- simple datasets
- natural text
- online communication (discussions and blogs in social listening as well as web page content itself)
- sensor data (from road cameras, satellites and other recording devices)
- digital exhaust (the information leftover from transactions and other activities online)
Not only are these things complex on their own, the challenges grow further as we combine them. The different forms of data have different units of analysis (i.e. individual, household, neighborhood etc.), and traditional data processing techniques often cannot help us to gain insight from them.
Modern data is fast, not just because it is being produced quickly, but because it moves quickly and is consumed quickly too. There are more than 500 million tweets every day on Twitter, and Walmart handles over a million customer transactions every hour. So, there is a real need to think about delivering insight in near real-time.
While this speed creates new challenges and opportunities, it also gives rise to questions around the duration of measurement for specific examinations – i.e. how long is long enough in order to discover what we want to know?
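To put the speeds mentioned in this section in concrete terms, the arithmetic can be sketched in a few lines of Python. The daily and hourly figures are the ones quoted above; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def per_second(events, period_seconds):
    """Average number of events per second over a period."""
    return events / period_seconds


# Figures quoted in the text: 500 million tweets a day,
# one million Walmart transactions an hour.
tweets_per_second = per_second(500_000_000, SECONDS_PER_DAY)
transactions_per_second = per_second(1_000_000, SECONDS_PER_HOUR)

print(f"Tweets: about {tweets_per_second:,.0f} per second")
print(f"Walmart transactions: about {transactions_per_second:,.0f} per second")
```

On these figures, roughly 5,800 tweets and close to 280 transactions arrive every second on average, which gives a sense of why near real-time analysis is a different problem from periodic reporting.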
Types of Big Data
Ipsos generally classifies data as:
- Active - where respondents are explicitly asked their views in surveys, focus groups etc
- Passive – which is any observational or extracted information where the individual is not queried directly
- Interactive - such as from our social communities where dialogue is the source of the insights
Typically, most Big Data has its origins in the Passive.
Common sources of data are:
- Behavioural - where data is collected through tracking customers and individuals online, from web patterns, mobile activity, and software use, through to the Internet of Things (IoT) where data is gathered by sensors on TVs, refrigerators, cars etc
- Geo-location – the capturing of GPS data - often as part of the 'digital exhaust' of mobile and other services - allows for analysis of geographic and movement patterns. This could be seen as a particular form of behavioural data
- Purchase – such as credit card, point of sale, and online transactions
- Social Listening – allowing unprecedented ability to 'listen in' on actual exchanges emerging from digital discourse, including tweets, Facebook updates, blogs, vlogs, and online discussions
- Unstructured – the sources that are not captured in standard data layouts that allow for straightforward summary and analysis. Typically, this will be photos, video and text created by individuals
- Logistical – resource organization and aggregated data, which can be machine generated, and often provides contextual information like power consumption in national grids, or air traffic in metropolitan areas. It can be valuable on its own or as context for understanding the behaviour of individuals
The risks of Big Data
The same concerns of traditional research also apply to the study of Big Data. We highlight some here, not because they are unique to Big Data, but because they are often overlooked due to assumptions about the data - or in the excitement that surrounds this new and developing field.
Bad Data – to Ipsos, the quality of the data we use is fundamental to all of our projects, and we do not take it for granted when analysing Big Data. What we collect and how it is classified is crucial – and care must be taken when integrating and fusing data, as this can introduce additional measurement and prediction errors which may result in noise or bias.
We understand that data quality is not simply a question of good or bad but a range from one to the other, and understanding biases and the models to deal with them is a critical element that takes us beyond the simple collection and analysis of data.
False Positives – Nassim Taleb (Beware the Big Errors of 'Big Data') argued that the number of spurious correlations grows disproportionately with the number of variables in the data. And Big Data studies aren't big just in terms of the number of observations but also the number of variables.
Although there are statistical adjustments that can be made to lessen the risk (e.g. the Bonferroni correction, holdout validation, and cross validation), we must take care to identify relationships that are meaningful both statistically and substantively. Just as with traditional research, we must balance the approaches to deal with the increased risk of false positives with approaches to guard against creating false negatives.
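A small simulation can illustrate the point about spurious correlations. The sketch below, in Python using only the standard library, generates variables that are pure random noise, tests every pair for correlation, and counts how many pairs look 'significant' with and without the Bonferroni correction. The sample size, significance level and function names are assumptions chosen for illustration, and the p-value uses the Fisher z approximation rather than an exact test.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def p_value(r, n):
    """Approximate two-sided p-value for r, using the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious(n_vars, n_obs=200, alpha=0.05, seed=1):
    """Count 'significant' pairs among variables that are pure noise."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    p_values = [p_value(pearson_r(a, b), n_obs)
                for a, b in combinations(data, 2)]
    n_tests = len(p_values)
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the significance level by the number of tests.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return n_tests, naive, bonferroni


for n_vars in (5, 20, 60):
    n_tests, naive, bonferroni = count_spurious(n_vars)
    print(f"{n_vars:>3} variables, {n_tests:>4} pairs: "
          f"{naive} naive hits, {bonferroni} after Bonferroni")
```

Because the number of pairs grows roughly with the square of the number of variables, the naive count of 'significant' results tends to rise quickly even though no real relationships exist. The Bonferroni threshold removes most of them, at the cost of making genuine but weak relationships harder to detect, which is the false negative risk noted above.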
Generalisability – issues around the representativeness of Big Data are sometimes acknowledged, but the full implications are often underplayed. For instance, while it is known that the content of Twitter is unrepresentative, these and other online discourses are often treated as generalizable. While the data they produce is very valuable, we know that those passionate enough to discuss toilet paper or nappies online do not represent a random cross section of potential purchasers.
This extends to situations where companies look at all the current users of their products and services. We are mindful that this information overlooks those who previously used and those who are potential purchasers but not current subscribers. This is where blending more traditional active data sources with Big (and other passive) Data is helpful.
In a historical parallel, the Literary Digest poll for the 1936 U.S. Presidential election had an unprecedented sample of 2 million respondents. In spite of this, their prediction was abysmally wrong due to the lack of generalizability of the sample. This lesson that larger does not always mean more representative extends to Big Data today.
Non-representative data is common in market research, so understanding and modeling the biases is crucial if we are to make generalisations. We are careful not to mistake size for representativeness.
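The difference between size and representativeness can be shown with a toy simulation. The sketch below is in Python and uses invented numbers throughout: the population, the group shares and the support levels are assumptions for illustration, not the actual 1936 figures. It compares a very large sample drawn only from an unrepresentative group with a small sample drawn at random from everyone.

```python
import random

rng = random.Random(42)

# Hypothetical electorate: 30% belong to a "listed" group that the big
# poll can reach, and only 35% of them support candidate A. The other
# 70% support candidate A at 65%.
POPULATION_SIZE = 200_000
population = []
for _ in range(POPULATION_SIZE):
    is_listed = rng.random() < 0.30
    supports_a = rng.random() < (0.35 if is_listed else 0.65)
    population.append((is_listed, supports_a))


def share(sample):
    """Proportion of a sample that supports candidate A."""
    return sum(sample) / len(sample)


true_share = share([supports for _, supports in population])

# Large but biased: 50,000 people, all from the listed group.
listed_only = [supports for is_listed, supports in population if is_listed]
big_biased = rng.sample(listed_only, 50_000)

# Small but random: 1,000 people drawn from the whole population.
small_random = rng.sample([supports for _, supports in population], 1_000)

print(f"True support for A:        {true_share:.1%}")
print(f"Biased sample of 50,000:   {share(big_biased):.1%}")
print(f"Random sample of 1,000:    {share(small_random):.1%}")
```

In this setup the small random sample lands close to the true figure, while the sample that is fifty times larger reflects only the group it was drawn from. Modeling the bias, for example by weighting to known population characteristics, is what allows a non-representative source to be used more safely.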
Stability Over Time - with so many variables, the possibility of shifts over time should be considered. For example, the Google Flu Trends project initially found that looking at people's search patterns could reveal where outbreaks were happening much more quickly than traditional approaches. But when the exercise was repeated, the model led to dramatic over-predictions, because people's behaviour had changed in the meantime.
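The mechanism behind this kind of over-prediction can be sketched with a toy model. The Python example below is not the Google Flu Trends model; it is a hypothetical illustration in which search volume tracks illness during a training period, and then a later period adds searches that have nothing to do with illness. All numbers and function names are assumptions.

```python
import random


def simulate_weeks(rng, n_weeks, unrelated_searches):
    """Hypothetical weekly (searches, cases) pairs.

    Searches are assumed to be 10 per case, plus a volume of searches
    unrelated to illness, plus random noise.
    """
    weeks = []
    for _ in range(n_weeks):
        cases = rng.uniform(100, 1000)
        searches = 10 * cases + unrelated_searches + rng.gauss(0, 50)
        weeks.append((searches, cases))
    return weeks


def fit_line(pairs):
    """Ordinary least squares fit of cases = slope * searches + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    return slope, mean_y - slope * mean_x


def mean_error(pairs, slope, intercept):
    """Average of predicted minus actual cases; positive is over-prediction."""
    return sum(slope * x + intercept - y for x, y in pairs) / len(pairs)


rng = random.Random(7)
training = simulate_weeks(rng, 52, unrelated_searches=0)
later = simulate_weeks(rng, 52, unrelated_searches=3000)

# The model is fitted once, on the training period only.
slope, intercept = fit_line(training)
print(f"Mean error, training period: {mean_error(training, slope, intercept):.1f}")
print(f"Mean error, later period:    {mean_error(later, slope, intercept):.1f}")
```

The fitted model is unbiased on the period it was trained on, but over-predicts consistently once the relationship between searching and illness has shifted. Re-estimating the model regularly, or checking it against an independent source, is one way to catch this kind of drift.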
Underlying Causes - much of the early value of Big Data is in uncovering surprising relationships (e.g. the sales of Strawberry Pop-Tarts at Wal-Mart rising in advance of hurricanes), but deciding what to do with these insights can still be difficult. Understanding what is behind the patterns is usually important to make the most of the opportunity they present. As we said in Quirk's, one area where market researchers can add value amid the dramatic changes technology has brought us is in showing "how to act as a result of what the data tells us".
Privacy - even blinded Big Data can be used to identify individuals, as seen in the competition Netflix ran to create an improved recommendation engine. Ipsos has direct experience of these issues from partnering with EE in the UK. So we apply the same standard of care as we do when protecting the rights of research participants in 'active data' studies.
Ipsos Point Of View:
The Ipsos method is about integrating different types of data, and is founded on our deep understanding of the risks and opportunities of each. Our knowledge of data quality, analytical techniques, and potential biases is invaluable in making all of this work, so that we can extract actionable insights for our clients.
Two more Vs
We think two further qualities of Big Data are important: its veracity and its value. Unlike the three original Vs (volume, variety, and velocity), they do not describe what Big Data is, but we believe they are critical to consider when moving a discussion about Big Data from theory to actionable insight.
The accuracy of Big Data is important, and this is true whether examining a single data source or integrating or fusing different sources. As we have already said, traditional research issues of bias and quality are just as important when studying Big Data.
Any study that we do focuses on providing meaningful and useful insights. So we apply all of our knowledge to avoiding the pitfalls and risks listed earlier.
Some of the promises made about Big Data have been dramatic and ambitious, raising huge expectations. In turn this has created something of a backlash and some disillusionment.
For example, when the U.S. Executive Office of the President says that "Big Data is saving lives… making the economy work better… making government work better and saving taxpayer dollars", it is not surprising that many are wondering why they aren't benefiting from the remarkable opportunities.
Our view remains that Big Data is very real, and that there are significant insights to be gained from examining it, many of which are unique to it.
But it isn't always clear:
- what data exists
- what questions can be answered with it
- what its limitations are
- how one can access and mine it
So, while there are technical challenges, we think the main ones are actually more about research in a more general sense. Big Data is not a black box beyond the understanding of ordinary people or decision makers.
We believe that value from Big Data only comes when expectations are realistic, and proper objectives are set.
Knowing what v knowing why
A significant area of debate is whether the discovery of correlations within Big Data is where most future value will lie, and whether attempts to understand causal relationships in the data will become much less important. Put more simply, "knowing what, not why, is good enough" (Mayer-Schönberger & Cukier, Big Data).
Of course, there is a spectrum of views about this. At one extreme is the view that "Petabytes allow us to say: Correlation is enough" (Anderson, The End of Theory), but a more moderate position is provided by Mayer-Schönberger & Cukier, namely that "Causality won't be discarded, but it is being knocked off its pedestal as the primary fountain of meaning".
Sometimes, knowing what is enough for action. As mentioned above, the classic case of Wal-Mart examining its own sales data in advance of hurricanes and finding spikes in sales of Strawberry Pop-Tarts allowed it to further increase sales. But even this analysis was built by starting with a business question.
But often companies need to understand the why to really understand what they should do in response to what is discovered in the data. Even the Wal-Mart example above would benefit from understanding why, so that the company could find out whether other products could be promoted to tap into the same underlying impulses.
Big Data needs big Theory
Our view is that there are 3 components needed to get the maximum value from Big Data:
- Business Questions and Theory
- Statistical Considerations and
- appropriate Analytic Techniques
Having only 2 of the 3 creates risks:
- lacking theory may lead to the 'so what' findings that have driven some companies to question the value of Big Data
- lacking statistical considerations may lead to invalid findings and inferences
- lacking advanced analytics may lead to missing the patterns and insights themselves
With all three components present, we can understand the data, use increasingly sophisticated tools to identify patterns, and determine what is meaningful. Without all three components, results may lack value or, worse, be misleading.
We agree that there is some merit in just knowing what (not why), but believe that even the basic decision about which data source(s) to examine and how to capture the data requires a focus on a business question.
Ipsos has been growing its expertise and capability across every aspect of Big Data – from capture, through analysis, to insight. And we are committed to innovation and staying at the forefront of trends in research including Big Data and Big Data Analytics. Although this is changing the balance of research that we do, we will continue to leverage our expertise across all types of data, including traditional forms, and make sure we use the most appropriate tools and resources to best position our clients for success.
So, we do not see Big Data as a simple alternative to traditional methods like surveys. Our view is that the different types of data possess distinct merits for strategic and tactical questions. Depending on the question, these can be employed separately or in coordination.
As discussed earlier, Big Data includes mostly 'passive' data where individuals are not explicitly engaged to answer questions or otherwise interact with researchers.
While this provides new opportunities to examine what individuals are doing and saying, pairing it with more traditional active and interactive sources often provides the deepest and most actionable insights.
How we analyse
While many units in Ipsos have worked with clients and partners to extract the data, we are also developing our own internal skills and resources, from acquiring social listening data to working with data sources in the hundreds of gigabytes that require new cutting-edge tools. This work is being led by the Ipsos Science Centre, which was formed specifically to be a resource of innovation in Data Science.
The tools of Data Science are often discussed interchangeably with Big Data Analytics (Cielen and Meysman 2016) and Ipsos has been growing its capability by linking computational modeling with statistical analysis in traditional research domains as well as Big Data. In addition to developing new tools and techniques, the Ipsos Science Centre has used tools to help with massively parallel processing of data sets in the hundreds of gigabytes and sometimes terabytes (trillions of bytes).
At the end of the process, results must be communicated with clients in a way that helps them understand what opportunities exist for them. Understanding leads to successful action, and that is our measure of success.
Chris Anderson in Wired, June 2008: "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" (note: this is not consistent with Ipsos POV, as it strongly endorses the atheoretic approach)
Davy Cielen and Arno D.B. Meysman, 2016: Introducing Data Science: Big Data, Machine Learning, and more, using Python Tools
Mike Egner and Rich Timpone in The Oracle, July 2015. "Forecasting with Big Data"
Seth Grimes in Information Week, August 2013: "Big Data: Avoid 'Wanna V' Confusion" (note: this focuses only on the 3 Vs that define Big Data - we agree with the definition but feel the added 2 are critical for studies to be useful)
Constance L Hays in New York Times, November 2004: "What Wal-Mart Knows About Customers' Habits"
Viktor Mayer-Schönberger and Kenneth Cukier, 2013: Big Data (note: not entirely consistent with Ipsos POV, but a good broad overview of the potential value of Big Data)
Andrew McAfee and Erik Brynjolfsson in Harvard Business Review, October 2012: Big Data: The Management Revolution
John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdren, and Jeffrey Zients for U.S. Executive Office of the President, May 2014: Big Data: Seizing Opportunities, Preserving Values
Joseph Rydholm in Quirk's Marketing Research, October 2013: Advice to Researchers: Change with a Changing World
Nassim N Taleb in Wired, February 2013: Beware the Big Errors of 'Big Data'