Data fusion is the practice by which two or more separate data sources are brought together to form a single database that contains all the previously separate information. The purpose of such integration is to obtain a reliable estimate of the true relationship between statistics that are not currently available from a single source.
During the fusion, individuals from one survey are matched to individuals in the other and the two sets of behaviours are jointly ascribed to the matched individuals. To make the process easier to follow, it is convenient to designate one survey as the donor and the other as the recipient.
A fundamental assumption of fusion is that the "hooks", or linking variables, contain enough information to describe the correlations between the variables in the Donor and Recipient datasets. For example: "If given two datasets, Recipient, consisting of variable sets X and Y, and Donor, consisting of variables X and Z, we can perform fusion under the assumption that Y and Z are independent given X".
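In symbols, the conditional independence assumption quoted above can be written as follows. The notation f for a generic probability density or mass function is an assumption introduced here for illustration; it does not come from the source.

```latex
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
\quad\Longrightarrow\quad
f(x, y, z) = f(x)\, f(y \mid x)\, f(z \mid x)
```

The Recipient survey can inform f(y | x) and the Donor survey can inform f(z | x), so the joint behaviour of Y and Z can be estimated even though the two are never observed together. Any relationship between Y and Z that remains after accounting for X is not captured by this factorisation.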
It is important to note that fusion is not a single technique – there are different approaches that might be taken depending on the objectives. The principles for the different approaches of data fusion are quite similar and follow these general steps:
- Set the objectives for the data fusion.
- Analyse the datasets to select the variables and metrics that are subject to data fusion, and the relationships that need to be preserved.
- Define the universe and common variables.
  - Some common variables may be described as critical and absolutely have to be maintained. Socio-demographic variables such as Gender (so that males are always fused onto males), Age and Geography are commonly used as critical variables. Other critical variables may relate to other measured behaviour, such as media consumption.
- Select matching (non-critical) variables used to further predict or explain the variables being linked within each critical cell. This step should also include assigning the importance weights to each of the non-critical common variables. Various approaches for the selection and weighting of matching variables can be used (e.g. ANOVA, regression, CHAID, Principal Component Analysis).
- Choose an overall data fusion technique – the main techniques used are:
  - "Row wise" (The entire record of each donor is fused onto a matching recipient or group of recipients. The same set of 'common' variables is used to estimate the true relationship between donor and recipient for all variables.)
  - "Column wise" (Datasets are fused variable by variable, or 'column-by-column'. Each variable has its own set of 'common' variables: the set that best explains the relationship for that particular variable.)
  - "Hybrid" (The dataset is divided into blocks of variables with similar characteristics. Within each block, the fusion is row-based, copying all variables in that block from a donor to a recipient. However, each block is fused independently and can use whichever set of common variables best explains the relationship for that block.)
- Select a single distance metric to be used in the matching process. Common choices include Euclidean distance, Mahalanobis distance, or Manhattan (city-block) distance.
- Choose a matching technique: "constrained" or "unconstrained."
  - In approaches using "unconstrained fusion", the donor variables are passed across and attached to the recipient's record. Unconstrained matching simply means that a donor can be matched to any number of recipients, or to none. This approach has the advantage of permitting the closest possible match of donors with recipients, because it does not require every potential donor to be used. However, there are a number of disadvantages. The most important is that there is no guarantee that the marginal and joint distributions of the donor variables in the fused dataset will be identical to the corresponding distributions in the donor dataset. This is partly because each donor is used as often as necessary and partly because the donor's weight is "left behind".
  - In the "constrained fusion" approach, all respondents from both surveys have to be used, and their data, together with their weights, are transported into the new synthetic dataset. Respondents from either survey may still be used more than once, but in such cases their weight is shared out. This method has a number of advantages, the main one being that marginal distributions from both surveys can be preserved, because both sets of weights are preserved and shared out. Though in principle the matching may not be as good as for unconstrained fusion, in practice this is not an issue, especially when two surveys with large samples are being fused.
- Run the matching (or modelling) process.
- Create a new fused dataset.
- Validate the results and provide fusion diagnostics. For example:
  - Evaluation of the matching algorithm to measure the success in finding fused pairs of individuals with similar profiles.
  - Comparison of the incidences for key fused variables between the original and fused dataset.
  - Assess the effect of "regression-to-the-mean."
- Analyse the fused dataset, combining variables from the donor and the recipient datasets.
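The matching steps above can be sketched in a few lines of Python. This is a minimal illustration of unconstrained, row-wise fusion under stated assumptions, not a production method: the toy records, the field names (`gender`, `age`, `tv_hours`, `brand_a`, `reads_mag`), the importance weights and the choice of a weighted Manhattan distance are all hypothetical, and respondent weights are ignored for simplicity.

```python
from collections import Counter

# Hypothetical toy data: every field name and value is an illustrative assumption.
recipients = [
    {"id": "r1", "gender": "F", "age": 25, "tv_hours": 2.0, "brand_a": 1},
    {"id": "r2", "gender": "M", "age": 40, "tv_hours": 1.0, "brand_a": 0},
    {"id": "r3", "gender": "F", "age": 58, "tv_hours": 4.0, "brand_a": 0},
]
donors = [
    {"id": "d1", "gender": "F", "age": 27, "tv_hours": 2.5, "reads_mag": 1},
    {"id": "d2", "gender": "F", "age": 60, "tv_hours": 3.5, "reads_mag": 0},
    {"id": "d3", "gender": "M", "age": 38, "tv_hours": 0.5, "reads_mag": 1},
    {"id": "d4", "gender": "M", "age": 70, "tv_hours": 5.0, "reads_mag": 0},
]

CRITICAL = ["gender"]                      # must match exactly (critical cell)
WEIGHTS = {"age": 0.1, "tv_hours": 1.0}    # assumed importance weights for matching variables
DONOR_VARS = ["reads_mag"]                 # variables copied from donor to recipient


def distance(a, b):
    """Weighted Manhattan distance over the non-critical matching variables."""
    return sum(w * abs(a[v] - b[v]) for v, w in WEIGHTS.items())


def fuse_unconstrained(recipients, donors):
    """Attach the nearest donor within the same critical cell to each recipient.

    Unconstrained: a donor may be used many times or not at all.
    """
    fused = []
    for r in recipients:
        cell = [d for d in donors if all(d[c] == r[c] for c in CRITICAL)]
        if not cell:
            raise ValueError(f"no donor in the critical cell of recipient {r['id']}")
        best = min(cell, key=lambda d: distance(r, d))
        row = dict(r)
        row["donor_id"] = best["id"]
        for v in DONOR_VARS:
            row[v] = best[v]
        fused.append(row)
    return fused


def incidence(rows, var):
    """Unweighted share of rows where a 0/1 variable equals 1."""
    return sum(row[var] for row in rows) / len(rows)


fused = fuse_unconstrained(recipients, donors)
donor_usage = Counter(row["donor_id"] for row in fused)

# Simple diagnostic: compare the incidence of a fused variable before and after fusion.
print("donor usage:", dict(donor_usage))
print("reads_mag incidence, donor sample:", incidence(donors, "reads_mag"))
print("reads_mag incidence, fused sample:", round(incidence(fused, "reads_mag"), 3))
```

In this toy run donor `d4` is never used, and the unweighted incidence of `reads_mag` moves from 0.50 in the donor sample to about 0.67 in the fused data. That shift illustrates why unconstrained fusion does not guarantee that donor distributions are preserved, and why comparing incidences is a useful diagnostic.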
Ipsos Point Of View
Marketers want to know everything they can about their target consumer in order to maximise the return on their research investment:
- Who they are (demographics, geo-demographics, psychographics etc.)
- What they think about brands in the category they are asking about
- How they behave (purchasing levels, brand choice etc.)
- What they intend to purchase in the future
Those planning and buying advertising campaigns need to uncover the best ways of reaching and influencing their target audiences:
- Which media do they come into contact with at different times of day (TV programmes, newspapers, magazines, radio stations, web sites, apps, poster panels…)?
- Which media are they more or less attentive to or engaged with at different times of day?
- When is the best time to reach people with an advertising message (message receptiveness, when are they in the market to buy…)?
But no individual respondent will agree to answer so many questions, and many of the questions would be impossible to answer accurately.
This conundrum of needing more information while finding it harder to collect from surveys alone is likely to get harder rather than easier as time passes. One of the statistical techniques used to help address this is data fusion.
The practical application of data science demands a high level of skill and expertise, as well as experience – many of the decisions and choices made in building fusions, for example, are not black and white, demanding judgement and a deep knowledge of the context. Ipsos has been at the forefront of bringing data fusion, and other data integration techniques, into the market research industry. We now have a number of specialised Data Science teams that are capable of performing complex data fusion projects integrating both survey and non-survey based data sources.