Do Data Scientists Spend 80% of Their Time on These Cleaning Tasks? (Practical Data Analysis 6)
Learn how data scientists spend 80% of their time on data cleaning tasks, using Python tools like Pandas to handle missing values, duplicates, and more!
Welcome to the "Practical Data Analysis" Series
Let’s use cooking as an analogy. For many people, the most enjoyable part of cooking is tossing ingredients into hot oil and stir-frying in the wok. However, this process only accounts for 20% of the cooking time. The remaining 80% is spent on preparation, such as grocery shopping, washing, and cutting ingredients.
In data mining, data cleaning is like this preparatory work.
As data scientists, we often encounter various types of data. Before we can analyze them, we must invest significant time and effort to "organize and shape" the data into the desired format.
Why is that?
Because the data we collect often has numerous issues.
Let’s look at an example. Suppose your boss gives you the following dataset for analysis. What’s your first impression upon seeing it?
You might initially feel confused because the data has no labels. Column headers are essential when collecting and organizing data.
Without column names, it is unclear what each column represents, so you cannot interpret the values in a business context or judge whether they are accurate. Unlabeled data like this is common in real-world work.
Let me explain what the data represents. This is membership data from a clothing store.
The top row shows the column indices, and the leftmost column shows the row indices. The columns are as follows:
Column 0 represents the serial number,
Column 1 is the member's name,
Column 2 is their age,
Column 3 is their weight,
Columns 4–6 are the measurements of male members,
Columns 7–9 are the measurements of female members.
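One way to make such a table readable is to attach column names in Pandas. The sketch below is only an illustration: the two rows are made-up stand-ins rather than the store's real data, and the column names (for example `male_m1` or `female_m3`) are hypothetical labels chosen to match the layout described above.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the headerless membership table
# (illustrative values only, not the real store data).
raw = pd.DataFrame([
    [0, "Alice", 28, 55.0, np.nan, np.nan, np.nan, 84, 62, 88],
    [1, "Bob", 35, 78.0, 98, 84, 96, np.nan, np.nan, np.nan],
])

# Hypothetical column names following the described layout:
# serial number, name, age, weight, three male and three female measurements.
column_names = [
    "serial_no", "name", "age", "weight",
    "male_m1", "male_m2", "male_m3",
    "female_m1", "female_m2", "female_m3",
]

# Work on a copy so the original, unlabeled frame stays untouched.
labeled = raw.copy()
labeled.columns = column_names

print(labeled.columns.tolist())
```

When the data comes from a headerless CSV file, the same list can be passed to `pd.read_csv` through `header=None` and `names=column_names`.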
Now, looking at the data in detail, you might think, “Why is this data so messy?” There are empty values (NaN) and even empty rows.
Yes, and this is just a portion of the membership data from one store. Even at first glance, we can spot several issues.
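Empty values and empty rows like these can be located programmatically instead of by eye. Here is a minimal sketch, assuming a small made-up DataFrame `df` with hypothetical columns; it counts the missing values in each column and finds the rows in which every field is empty.

```python
import numpy as np
import pandas as pd

# Small made-up example: the middle row is completely empty.
df = pd.DataFrame({
    "name": ["Alice", None, "Bob"],
    "age": [28, np.nan, 35],
    "weight": [55.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Boolean mask marking rows where every field is missing.
empty_rows = df.isna().all(axis=1)

print(missing_per_column)
print(df.index[empty_rows].tolist())
```

Fully empty rows can then be dropped with `df.dropna(how="all")`, which keeps any row that has at least one value.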
In real-world tasks, the data is far more complex. We often need to track many more dimensions, up to 100 indicators, and handle data volumes in the terabyte (TB) or even exabyte (EB) range. As a result, processing the data becomes much harder.
At this point, it’s almost impossible to identify issues with the naked eye.
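When a table is too large to inspect by eye, a per-column summary can surface problems. The sketch below is one possible approach, not a standard recipe: `quality_report` is a hypothetical helper, and `sample` is made-up data. For each column, it reports the data type, the share of missing values, and the number of distinct values. It also counts exact duplicate rows.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, share of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_ratio": df.isna().mean(),
        "n_unique": df.nunique(),  # missing values are not counted
    })

# Made-up sample: one repeated row and one fully empty row.
sample = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", None],
    "age": [28, 35, 35, np.nan],
})

report = quality_report(sample)

# Rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(report)
print(duplicate_rows)
```

The report has one row per column, so it stays readable even when the table has many indicators.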
This simple example illustrates why data cleaning is a crucial preparatory step before analysis.
Experienced analysts know that a great analyst is also a master of data cleaning. By a commonly cited estimate, data cleaning takes up about 80% of the time and effort in the entire data analysis process.