Top Python Libraries

Data Collection: How to Automate Data Gathering? (Practical Data Analysis 5)

Discover how to automate data collection using open data sources, web crawling, sensors, and log tracking. Optimize data mining for better insights!

Meng Li
Nov 22, 2024
Welcome to the "Practical Data Analysis" Series

In the previous section, we discussed how to model user personas; before any modeling can happen, data must first be collected.

User Profiling: Tagging as the Art of Data Abstraction (Practical Data Analysis 4) (Meng Li, November 20, 2024)

Data collection forms the foundation of data mining. Without data, mining has no meaning.

Often, the number of data sources, the volume of data, and the quality of the data determine the outcomes of our data mining efforts.

For example, in quantitative investing, you predict future stock fluctuations based on big data and execute buy/sell operations accordingly.

If you have access to all historical stock data, can you build a highly accurate predictive analysis system based on that data?

In reality, if you only have historical stock data, you still cannot understand why stocks experience significant fluctuations.

For instance, a SARS outbreak or a regional war might have occurred at the time, both of which could heavily influence stock prices.

Thus, it’s crucial to recognize that the trends of a dataset are influenced by multiple dimensions.

To achieve high-quality data mining results, we need to collect data from multiple sources to cover as many dimensions as possible while ensuring data quality.

So, from a data collection perspective, what are the main data sources? I categorize data sources into four types:

  1. Open Data Sources

  2. Web Crawling

  3. Sensors

  4. Log Collection

Each of these has its own characteristics. Open data sources are generally databases tailored to specific industries.

For example, the U.S. Census Bureau provides open data on population demographics, regional distributions, and education levels.

Apart from governments, businesses and universities also release relevant big data. North America performs relatively well in this regard.

In China, regions like Guizhou have made bold attempts to build cloud platforms and to gradually open up data on tourism, transportation, and commerce.

It’s worth noting that many research projects are based on open data sources. Otherwise, there wouldn’t be so many academic papers published annually; researchers need comparable datasets to evaluate algorithm performance.
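Once an open dataset has been downloaded, working with it usually starts with parsing and a quick summary. Here is a minimal sketch using Python's standard library; the CSV content and column names are hypothetical stand-ins, not the actual schema of any Census Bureau dataset:

```python
import csv
import io

# Hypothetical extract from an open demographics dataset.
# A real open data source (e.g., the U.S. Census Bureau) would be
# downloaded as a file or fetched from an API first.
raw_csv = """region,population,median_age
North,120000,34.5
South,98000,31.2
East,154000,36.8
"""

def summarize(csv_text):
    """Parse the CSV and return total population and the most populous region."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(int(r["population"]) for r in rows)
    largest = max(rows, key=lambda r: int(r["population"]))["region"]
    return total, largest

total, largest = summarize(raw_csv)
print(total, largest)  # 372000 East
```

The same pattern (read, parse into records, aggregate) applies regardless of which open data portal the file came from.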

Web Crawling is generally used for specific websites or apps.

If you want to scrape data from a specific website, such as product reviews on shopping sites, you would need tailored crawling tools.
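As a sketch of what such a tailored crawler does after fetching a page, the snippet below extracts product reviews from a static HTML fragment using only the standard library. The page markup and the `class="review"` selector are invented for illustration; a real crawler would first download the page with `urllib` or a similar client:

```python
from html.parser import HTMLParser

# A static HTML snippet standing in for a downloaded product page.
page = """
<html><body>
<div class="review">Great quality, fast shipping.</div>
<div class="review">Battery life could be better.</div>
</body></html>
"""

class ReviewExtractor(HTMLParser):
    """Collect the text of every <div class="review"> element."""
    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "review") in attrs:
            self._in_review = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review and data.strip():
            self.reviews.append(data.strip())

parser = ReviewExtractor()
parser.feed(page)
print(parser.reviews)
```

Each target site needs its own extraction rules like these, which is why crawling tools tend to be site-specific.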

Sensors, as a data source, typically collect physical information like images, videos, or data on an object’s speed, temperature, and pressure.
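Raw sensor streams are usually noisy, so a common first processing step is smoothing. The sketch below uses simulated temperature readings; in practice the samples would arrive from a serial port, an IoT message broker, or a device SDK:

```python
from statistics import mean

# Simulated temperature readings (in degrees C) from a sensor.
readings = [21.0, 21.2, 25.9, 21.1, 21.3, 21.2]

def moving_average(samples, window=3):
    """Smooth raw sensor samples with a simple sliding-window mean."""
    return [round(mean(samples[i:i + window]), 2)
            for i in range(len(samples) - window + 1)]

print(moving_average(readings))
```

The spike at 25.9 is damped in the smoothed series, which is why this kind of filtering often precedes any downstream analysis of physical measurements.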

Finally, Log Collection involves tracking user operations.

By embedding tracking points in the front end or using scripts in the back end, you can analyze website visits, user bottlenecks, and more.
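A back-end script for this kind of analysis typically parses the server's access log and aggregates requests per page. Below is a minimal sketch; the log format here is a simplified invention (IP, method, path, status), not the exact format any particular server emits:

```python
import re
from collections import Counter

# Sample access-log lines in a simplified format; real logs would be
# read from a file written by the web server.
log_lines = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /signup 200",
    "10.0.0.1 POST /signup 500",
    "10.0.0.3 GET /home 200",
]

LOG_PATTERN = re.compile(r"(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})")

def page_views(lines):
    """Count requests per path, a first step toward spotting user bottlenecks."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts

views = page_views(log_lines)
print(views.most_common())
```

From counts like these you can move on to funnels and drop-off analysis, which is where the embedded tracking points pay off.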

© 2025 Meng Li