What Is Data Munging?

Learn about the role data munging plays in data analysis, its significance, and its processes, as well as the range of tools you can choose from for data munging.

Table of Contents
  1. The importance of data munging in the data pipeline
  2. The processes comprising data munging
  3. Challenges to data munging
  4. Automating data munging

Data munging, also called data wrangling, is an important step in data analysis: it processes raw data and turns it into a usable form. This article explains the basic ideas behind data munging.

The importance of data munging in the data pipeline

To understand the role of data munging, let us look first at the data infrastructure needed for smooth data analytics. 

The most important part of the data infrastructure is the data warehouse. A data warehouse is a type of data management system intended solely for queries and analysis, and it often contains large amounts of historical data. Data warehouses contain the following elements: 

Data munging enters the picture as an important part of the ELT solution. 

The ELT (extract, load, transform) operations, or sometimes ETL, where transformation comes before loading, form the so-called data pipeline. The data pipeline delivers data from the sources to the databases comprising the warehouse. Data munging is an important part of the transformation processes of the ELT operations. 

Inconsistencies between the data coming from different sources pose a challenge to data analysis. Data munging is important in the data pipeline because it smooths out these inconsistencies and the variation in formats between sources, which makes the data usable. You can therefore think of data munging as the preparation stage for data analysis.

The processes comprising data munging

Trifacta lists six steps comprising data munging. We will discuss them briefly and then demonstrate them by using an example data set.

  1. Discovering: this process involves understanding the data that you are about to process. To do so, you look at its source and the context in which it was created. 

We have the following dataset as an example. As you can see, the columns are neatly defined but are separated by spaces, which may cause problems if we blindly load the file into a data analysis app. 

The original data file, with columns separated by spaces. Notice the inconsistency of data in the second and fourth columns.

There are also inconsistencies in the formats used in the Branch and Date of birth columns. Fortunately, the other columns are consistent.
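One way to spot this kind of format inconsistency during discovery is to reduce each value to its "shape" and count the shapes per column. The following is a minimal sketch rather than a definitive method. The sample values are hypothetical (the article's dataset is only shown as an image), and the `shape` and `profile` helpers are names made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical raw values standing in for the Branch and Date of birth columns.
branch_values = ["5th Avenue", "5th", "Main Street", "Main"]
dob_values = ["1/2/2000", "January 2, 2000", "3/15/1998"]

def shape(value: str) -> str:
    """Reduce a value to its format: digit runs become 9, letter runs become A."""
    value = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "A", value)

def profile(values):
    """Count how many values share each format shape."""
    return Counter(shape(v) for v in values)

branch_profile = profile(branch_values)
dob_profile = profile(dob_values)

# More than one shape in a column suggests mixed formats worth a closer look.
print(branch_profile)
print(dob_profile)
```

A column whose profile contains a single shape is probably consistent, while several shapes point to the kind of mixed formats described above.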

  2. Structuring: this process organizes the data to make it easier to analyze. 
Data structured in a table with cells.

We have sorted these entries into a table that separates them into individual cells. The analysis software can now use the header to identify the variables to be stored. While not shown here, the type of each variable is also identified at this point, which is important in data analysis. 
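The structuring step can be sketched in a few lines of Python. This example assumes that columns are separated by two or more spaces while values such as "5th Avenue" contain only single spaces, which may not hold for every file. The sample rows and the Name and Year hired columns are hypothetical; only Branch and Date of birth are named in the article.

```python
import re

# Hypothetical raw file contents: columns separated by runs of spaces.
raw_text = """\
Name        Branch        Year hired    Date of birth
Ana Cruz    5th Avenue    2019          1/2/2000
Ben Lee     5th           2020          January 2, 2000
"""

def structure(text):
    """Split space-aligned text into a header and a list of row dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    records = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(header):
            raise ValueError(f"unexpected number of cells in: {line!r}")
        records.append(dict(zip(header, cells)))
    return header, records

header, records = structure(raw_text)

# Identify variable types: here only the year is converted to an integer.
for record in records:
    record["Year hired"] = int(record["Year hired"])

print(header)
print(records)
```

Raising an error on rows with an unexpected number of cells is a deliberate choice: it surfaces structural problems early instead of silently misaligning columns.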

  3. Cleaning: this process irons out possible errors and outliers and standardizes the format of the data.
Data structured in cells, with the inconsistent data rewritten to conform to a standard format.

Using the information from the header of the table, each entry is scanned for inconsistencies with the standard format for data analysis. For example, the entries in the Branch column are cleaned by completing them or rewriting them in the standard format (from 5th to 5th Avenue). Some entries in the Date of birth column are also rewritten to conform to the standard format (from January 2, 2000 to 1/2/2000). 
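The two cleaning rules just described can be sketched as follows. This is an illustration under stated assumptions: the `BRANCH_NAMES` mapping is hypothetical, the standard date format is taken to be month/day/year without zero padding (as in 1/2/2000), and month names are assumed to be in English.

```python
from datetime import datetime

# Hypothetical mapping from abbreviated branch names to the standard form.
BRANCH_NAMES = {"5th": "5th Avenue"}

# Input formats we expect to encounter: 1/2/2000 and January 2, 2000.
DATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y")

def clean_branch(value: str) -> str:
    """Complete an abbreviated branch name; leave other names unchanged."""
    value = value.strip()
    return BRANCH_NAMES.get(value, value)

def clean_date(value: str) -> str:
    """Rewrite a date in any known input format as month/day/year."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next known format
        return f"{parsed.month}/{parsed.day}/{parsed.year}"
    raise ValueError(f"unrecognized date format: {value!r}")

print(clean_branch("5th"))
print(clean_date("January 2, 2000"))
```

Values that match none of the known formats raise an error rather than passing through unchanged, so they can be reviewed by hand.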

  4. Enriching: this process identifies new data or information that can be derived from the existing dataset. 

From the given information, we can derive the following:

  1. Current age
  2. Age at employment
  3. Years employed (the employment contracts begin on June 1 each year)

We can calculate these for any reference date. Let’s say we use January 1, 2021 as the reference date. The table now looks like the following, with additional columns:

Dataset with additional columns for the current age, age employed, and years of employment added. 
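The derived columns can be computed with plain date arithmetic. The sketch below assumes that all three values are counted in completed years, which the article does not state explicitly, and it uses a hypothetical employee born on January 2, 2000 and hired in 2019. The `full_years` and `enrich` helpers are names chosen for this illustration.

```python
from datetime import date

AS_OF = date(2021, 1, 1)  # reference date used in the article

def full_years(start: date, end: date) -> int:
    """Number of completed years between two dates."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1  # the anniversary has not been reached yet in the end year
    return years

def enrich(date_of_birth: date, year_hired: int) -> dict:
    """Derive the three additional columns for one employee."""
    hired_on = date(year_hired, 6, 1)  # contracts begin on June 1 each year
    return {
        "Current age": full_years(date_of_birth, AS_OF),
        "Age at employment": full_years(date_of_birth, hired_on),
        "Years employed": full_years(hired_on, AS_OF),
    }

example = enrich(date(2000, 1, 2), 2019)
print(example)
```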

  5. Validating: this process cross-checks the dataset for data consistency, quality, and security. It gives you a chance to catch inconsistencies that were missed in the earlier steps. 

This process can involve pre-processing the data to look for inconsistencies. In our example, the last two entries were flagged during validation because the employees' ages when they started working are well below the lawful minimum age. These entries therefore have to be checked against the source. Fortunately, they were just typos. The validated dataset now looks as follows:

Dataset with revised data after revising the flagged entries.
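A validation rule like the one that flagged these entries can be sketched as a simple filter. The minimum age of 18 is a placeholder assumption, since the article does not state the lawful minimum, and the records are hypothetical.

```python
# Placeholder threshold: the actual lawful minimum age depends on jurisdiction.
MINIMUM_WORKING_AGE = 18

# Hypothetical enriched records; the second one contains a suspicious value.
records = [
    {"Name": "Ana Cruz", "Age at employment": 19},
    {"Name": "Ben Lee", "Age at employment": 9},
]

def flag_underage(records, minimum_age=MINIMUM_WORKING_AGE):
    """Return the records whose age at employment is below the minimum."""
    return [r for r in records if r["Age at employment"] < minimum_age]

flagged = flag_underage(records)
for record in flagged:
    print(f"Check against the source: {record['Name']}")
```

The function only flags entries. Correcting them still requires going back to the source, as described above.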

  6. Publishing: this process prepares the data for use in analysis. In our example, the analysis software requires the data in CSV format. The data file now looks as follows:
The dataset saved as CSV file.
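Publishing to CSV can be done with Python's standard `csv` module. The sketch below writes to an in-memory buffer so that it is self-contained. The records are hypothetical, and `to_csv` is a helper name chosen for this illustration.

```python
import csv
import io

# Hypothetical cleaned records ready for publishing.
records = [
    {"Name": "Ana Cruz", "Branch": "5th Avenue", "Date of birth": "1/2/2000"},
    {"Name": "Ben Lee", "Branch": "5th Avenue", "Date of birth": "3/15/1998"},
]

def to_csv(records, fieldnames):
    """Serialize row dictionaries as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(records, ["Name", "Branch", "Date of birth"])
print(csv_text)
```

To write a file instead, the same writer can be attached to a file opened with `open(path, "w", newline="")`. The `csv` module also quotes any value that contains a comma, which hand-built strings would not do.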

Challenges to data munging

Alooma lists three main challenges that affect the ETL processes, which include data munging:

  1. Scaling. The sheer volume of raw data you need to process grows over time, even when no data sources are added. A good data munging setup can adjust to increases in the volume of raw data.
  2. Transforming data correctly. Sometimes bad data can slip through your pipeline and affect the resulting data analysis. Correctly implementing a data munging setup requires careful planning and extensive testing.
  3. Handling diversity of data sources. Raw data from different sources needs different data munging treatments, all of which are only possible through planning and testing.

These challenges can be addressed by studying the raw data, planning how to process it, implementing the plan, and then testing whether the setup works. These steps take time, but they free up a lot of time for analyzing the data later on. 

Automating data munging

As you have seen, data munging is not a pretty process; in fact, it is a fairly routine but time-intensive task. Data munging accounts for 50% to 80% of the work done by data analysts! There is therefore a strong push to automate it. 

Fortunately, tools are being developed to help automate at least a portion of the process. Alooma lists the following types of tools:

  1. Batch: designed to move large volumes of data at the same scheduled time
  2. Cloud-native: leverage the expertise and infrastructure of the vendor and are optimized to work with cloud-native data sources
  3. Open source: can be modified and shared because their design is publicly accessible, and are free or lower in cost than commercial alternatives
  4. Real-time: process data in real time, which is optimal for streaming data or for data associated with time-sensitive decision making 

To help you choose the right tool, ChartIO’s Off the Charts blog came up with the following deciding criteria, copied verbatim:

  1. A tool that is both quick and easy for data replication from one source to another 
  2. A cost efficient tool that’s within budget
  3. Finding a tool that does not require heavy set-up/maintenance help from our engineers
  4. Ability to connect to the sources we use and could potentially want 

These are essentially what make a good tool for data munging. 

I know an app that does that. 

It’s called Lido.

In fact, Lido goes beyond data munging: it includes its own data analysis tools, so it not only frees you from the potentially laborious process of data munging but also lets you see what is happening to your business in real time. You can use it without technical expertise. It lets you gather data from the most common ecommerce tools and services on the market. Most of all, you can afford it. Let our platform do it for you!