So, we have a database and we need to come up with a data visualisation of what it contains. Sound familiar? This may be a straightforward task, but what if the database is not formatted in the way you expect? Or the data is completely unstructured? Sounds like you may need to massage the data.
The term data massaging, also referred to as “data cleansing” or “data scrubbing”, may sound a bit naughty. But it’s commonly used to describe the process of extracting data and removing unnecessary information, cleaning up a dataset to make it usable. Databases come in different shapes and sizes and each must be treated as unique. A few data massaging techniques are required to adapt the data to the algorithms we are working with. Common tasks include stripping unwanted characters and whitespace, converting number and date values into desired formats, and organising data into a meaningful structure. Simply put, massaging the data is usually the "transform" step of an extract, transform, load (ETL) process.
Big organisations can automate this process and use data massaging tools to systematically examine data for flaws by using rules, algorithms, and simple lookup tables. Typically, a data massaging tool includes programs that can correct specific types of mistakes, for example by adding missing codes or finding duplicate records. Using these tools can save a data scientist a significant amount of time and can be less costly than manually fixing errors.
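To make the idea of rules and simple tables a little more concrete, here is a minimal Python sketch of two such checks: finding duplicate records and adding missing codes from a lookup table. The field names (`email`, `country`, `code`) and the lookup table are assumptions made up for illustration; a real tool would be considerably more thorough.

```python
def find_duplicates(records, key_fields):
    """Return records whose key fields repeat an earlier record.

    Keys are compared after trimming whitespace and lower-casing,
    so " A@X.com " and "a@x.com" count as the same value.
    """
    seen = set()
    duplicates = []
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
    return duplicates


def fill_missing_codes(records, lookup, name_field="country", code_field="code"):
    """Fill an empty code from a lookup table; unknown names are left as they are."""
    fixed = []
    for record in records:
        record = dict(record)  # work on a copy so the input is not modified
        if not record.get(code_field):
            code = lookup.get(record.get(name_field))
            if code is not None:
                record[code_field] = code
        fixed.append(record)
    return fixed


# Hypothetical sample data
records = [
    {"email": "a@x.com", "country": "France", "code": ""},
    {"email": " A@X.com ", "country": "France", "code": "FR"},
    {"email": "b@x.com", "country": "Narnia", "code": ""},
]
duplicates = find_duplicates(records, ["email"])
fixed = fill_missing_codes(records, {"France": "FR"})
```

Note that the sketch only reports duplicates rather than deleting them, so a person can still decide which copy to keep.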
The transformational steps
Things we do to massage the data include:
- Change formats from what the source system emits to what the target system requires, e.g. change date format from m/d/y to d/m/y.
- Replace missing values with defaults, e.g. "0" when a quantity is not given.
- Filter out records that are not needed in the target system.
- Check validity of records and ignore or report on rows that would cause an error.
- Normalise data to remove variations in values that should be the same, e.g. replace upper case with lower case, replace "01" with "1".
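The steps above can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the record layout (`date`, `quantity`, `region`, `store`, `status`) and the rule that rows marked `"test"` are not needed are hypothetical, not taken from any particular system.

```python
from datetime import datetime


def massage(rows):
    """Apply the transformation steps to a list of dict records."""
    clean, rejected = [], []
    for row in rows:
        # Filter: skip records the target system does not need (assumed rule).
        if row.get("status") == "test":
            continue
        try:
            # Change format: parse the source m/d/y date.
            date = datetime.strptime(row["date"].strip(), "%m/%d/%Y")
            # Replace missing values: an empty or absent quantity becomes 0.
            quantity = int(row.get("quantity") or 0)
            # Normalise: "01" and "1" should be the same store.
            store = str(int(row["store"]))
        except (KeyError, ValueError, TypeError, AttributeError) as err:
            # Check validity: report the row instead of letting it fail later.
            rejected.append((row, str(err)))
            continue
        clean.append({
            "date": date.strftime("%d/%m/%Y"),  # target d/m/y format
            "quantity": quantity,
            "region": row.get("region", "").strip().lower(),  # normalise case
            "store": store,
        })
    return clean, rejected


# Hypothetical sample data
rows = [
    {"date": "03/25/2018", "quantity": "", "region": " North ", "store": "01", "status": "live"},
    {"date": "25/03/2018", "quantity": "2", "region": "south", "store": "2", "status": "live"},
    {"date": "01/02/2018", "quantity": "5", "region": "East", "store": "3", "status": "test"},
]
clean, rejected = massage(rows)
```

Here the first row is cleaned, the second is rejected because "25" is not a valid month in an m/d/y date, and the third is filtered out. Keeping the rejected rows alongside the reason makes it possible to report on them rather than silently dropping them.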
Beyond the initial hypothesis
Basic exploratory analysis and data crunching are also part of data massaging; these techniques can help us explore the stories behind our data.
Using exploratory analysis, we can summarise the data’s main characteristics, often with visual methods. Some statistical modelling can be used, but exploratory techniques are primarily for seeing what the data can tell us beyond the initial hypothesis; it’s the first “taste” of the data.
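As a rough illustration of that first taste, the sketch below summarises a column of numbers and draws a crude text histogram using only Python's standard library. The function names and the sample values are made up for this example; in practice a plotting library would usually take the place of the text histogram.

```python
import statistics
from collections import Counter


def summarise(values):
    """Return the main characteristics of a list of numbers."""
    values = sorted(values)
    return {
        "count": len(values),
        "min": values[0],
        "max": values[-1],
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def text_histogram(values):
    """Return one line per distinct value, with a bar of '#' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} {'#' * counts[value]}" for value in sorted(counts)]


# Hypothetical sample data
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarise(sample)
for line in text_histogram(sample):
    print(line)
```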
Data crunching follows the initial exploratory analysis if the data we need to analyse is completely unknown (usually a client’s raw data). Essentially any technique is allowed here, and the main objective is to find a story in the data that could raise more hypotheses and, consequently, more interesting statistical analysis.
Crunching data also involves creating a system to carry out specific analysis. The result is data that is processed, structured, and sorted so that algorithms and program sequences can be run on it. So, crunched data means data that has already been imported and processed in a system. “Data wrangling” and “munging” are two other words used to describe the same process, although these two terms usually refer to the initial manual and semi-manual processing of data.
But what does this mean in practice? Let’s say analytics for a company’s website shows that it has 10,000 visitors a day. That figure, while impressive, is meaningless on its own because it doesn’t tell the company what it’s doing right to achieve those numbers or what it can do to increase them.
Now let’s say the company can separate the number of visitors into those looking for product information, those looking for business partnerships, those looking for jobs, and casual visitors. This begins to make more sense to the company. If it sells directly to consumers, it knows it needs to strengthen its product information and ordering pages. If it depends on resellers, it knows it needs to strengthen its partnership offerings and create a channel for potential business partners to approach it. If it finds that the cost of recruitment through campus interviews, etc. is high, it can tweak the job offer pages so that it gets the right candidates to apply. Doing all this is data crunching in a nutshell.
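One simple way to picture this separation is to classify each visit by the page it landed on and count the results. The sketch below assumes, purely for illustration, that the site uses the URL prefixes `/products`, `/partners` and `/careers`; a real site would need its own rules, and probably richer signals than the landing page alone.

```python
from collections import Counter

# Hypothetical mapping from URL prefix to visitor segment.
SEGMENT_RULES = [
    ("/products", "product information"),
    ("/partners", "business partnerships"),
    ("/careers", "jobs"),
]


def segment(path):
    """Return the segment for a landing page path, or 'casual' if no rule matches."""
    for prefix, label in SEGMENT_RULES:
        if path.startswith(prefix):
            return label
    return "casual"


def count_segments(paths):
    """Count visits per segment."""
    return Counter(segment(path) for path in paths)


# Hypothetical sample of landing pages
visits = ["/products/widget", "/careers", "/", "/products"]
totals = count_segments(visits)
```

With counts per segment in hand, the single figure of 10,000 visitors turns into something the company can act on.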
A bad massage
Unfortunately, there is such a thing as a bad massage. The term “data massaging” is also associated with the practice of “cherry-picking”: selectively excluding or altering data based on what researchers want (or don’t want) it to reflect. This is the worst and most harmful example of being dishonest with data. Cherry-picking changes the message that the final visualisation communicates to the audience. Though it is unethical, and in some settings illegal, there are companies still doing it, mainly with the objective of pleasing clients or influencing publications. For data scientists, facts are sacred and we must always respect what the data is telling us, even when it hurts.
Finally, there is also the less savoury practice of massaging the data by throwing out data points (or adjusting the numbers) where they affect data quality; these are usually outliers. This is absolutely acceptable, but a topic for another article.
As you can see, a lot can be gained from a good massage. Ipsos Connect Data Science is always keen to help and will be running several external data science training sessions during 2018. If you want to know more about Data Science get in touch and we can run sessions specific to your company’s needs.