Top Python Libraries

Do Data Scientists Spend 80% of Their Time on These Cleaning Tasks? (Practical Data Analysis 6)


Learn how data scientists spend 80% of their time on data cleaning tasks, using Python tools like Pandas to handle missing values, duplicates, and more!

Meng Li
Nov 26, 2024

Welcome to the "Practical Data Analysis" Series

Table of Contents


Let’s use cooking as an analogy. For many people, the most enjoyable part of cooking is tossing ingredients into hot oil and stir-frying in the wok. However, this process only accounts for 20% of the cooking time. The remaining 80% is spent on preparation, such as grocery shopping, washing, and cutting ingredients.

In data mining, data cleaning is like this preparatory work.

As data scientists, we often encounter various types of data. Before we can analyze them, we must invest significant time and effort to "organize and shape" the data into the desired format.

Why is that?

Because the data we collect often has numerous issues.

Let’s look at an example. Suppose your boss gives you the following dataset for analysis. What’s your first impression upon seeing it?

You might initially feel confused because the data lacks labels. When collecting and organizing data, labeling is essential. Column headers in a dataset are critical.

For instance, this dataset lacks column names, making it unclear what each column represents. Without this understanding, it’s impossible to interpret the values in the context of the business or determine if the data is accurate.

In real-world scenarios, data may also lack proper labeling, as in this case.

Let me explain what the data represents. This is membership data from a clothing store.

The top row shows the column indices, and the leftmost column shows the row indices. By column index:

  • Column 0 represents the serial number,

  • Column 1 is the member's name,

  • Column 2 is their age,

  • Column 3 is their weight,

  • Columns 4–6 are the measurements of male members,

  • Columns 7–9 are the measurements of female members.
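Once the meaning of each column is known, the labels can be attached in code. The snippet below is a minimal sketch using pandas. The two sample rows and all of the column names (`serial_no`, `m_measure_1`, and so on) are made up for illustration, since the store's real data and the exact measurements are not given here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; these are NOT the store's real records.
raw = pd.DataFrame([
    [1, "Member A", 25, 60.0, 90.0, 75.0, 92.0, np.nan, np.nan, np.nan],
    [2, "Member B", 31, 52.0, np.nan, np.nan, np.nan, 84.0, 62.0, 88.0],
])

# Assumed names that follow the column layout described above.
column_names = [
    "serial_no", "name", "age", "weight",
    "m_measure_1", "m_measure_2", "m_measure_3",  # columns 4-6: male members
    "f_measure_1", "f_measure_2", "f_measure_3",  # columns 7-9: female members
]

# Work on a copy so the unlabeled original stays available for comparison.
labeled = raw.copy()
labeled.columns = column_names

print(labeled.columns.tolist())
```

With named columns, an expression such as `labeled["age"]` can be read in business terms, which the bare index `2` does not allow.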

Now, looking at the data in detail, you might think, “Why is this data so messy?” There are empty values (NaN) and even empty rows.

Yes, and this is just a portion of the membership data from one store. Even at first glance, we can spot several issues.
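The two problems spotted above, empty values and fully empty rows, can also be counted rather than eyeballed. Here is a small sketch with pandas. The frame is invented for the example and only mimics the kind of gaps described. It is not the store's data.

```python
import numpy as np
import pandas as pd

# Made-up frame with NaN cells and one row (index 1) that is entirely empty.
df = pd.DataFrame({
    "name": ["Member A", np.nan, "Member B"],
    "age": [25, np.nan, np.nan],
    "weight": [60.0, np.nan, 52.0],
})

# Number of missing cells in each column.
missing_per_column = df.isna().sum()

# Boolean mask marking rows in which every cell is missing.
empty_rows = df.isna().all(axis=1)

# Drop only the fully empty rows; partially filled rows are kept.
cleaned = df.dropna(how="all")

print(missing_per_column)
print(cleaned)
```

Using `how="all"` is a deliberate choice in this sketch: it removes rows that carry no information at all, while rows with some missing cells stay in place for later handling.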

In real-world tasks, the data is far more complex. We often need to track many more dimensions (up to 100 indicators) and handle data volumes in the terabyte (TB) or even exabyte (EB) range. As a result, the difficulty of processing the data rises sharply.

At this point, it’s almost impossible to identify issues with the naked eye.
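When there are too many columns to inspect by eye, a per-column summary can point to where the problems are. The following is one possible sketch, not a standard recipe: `profile` is a hypothetical helper written for this example, and the 1000 x 100 table is randomly generated with gaps injected into one column.

```python
import numpy as np
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one summary row per column: dtype, share of missing cells, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


# Synthetic wide table: 1000 rows, 100 numeric columns (illustrative only).
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(1000, 100)))

# Inject gaps into column 3 so there is something to find.
wide.iloc[::10, 3] = np.nan

report = profile(wide)

# Columns with the largest share of missing cells come first.
worst = report.sort_values("missing_share", ascending=False).head(5)
print(worst)
```

Sorting the report puts the most incomplete columns at the top, so attention goes to a handful of columns instead of all 100.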

This simple example illustrates why data cleaning is a crucial preparatory step before analysis.

Experienced data analysts know that great analysts are also masters of data cleaning. In fact, data cleaning is often estimated to take up about 80% of the time and effort in the entire data analysis process.
