Top Python Libraries

Data Collection: How to Automate Data Gathering? (Practical Data Analysis 5)

Discover how to automate data collection using open data sources, web crawling, sensors, and log tracking. Optimize data mining for better insights!

Meng Li
Nov 22, 2024
Welcome to the "Practical Data Analysis" Series

In the previous section, we discussed how to model user personas; before any modeling can happen, data must first be collected.

User Profiling: Tagging as the Art of Data Abstraction (Practical Data Analysis 4) (Meng Li, November 20, 2024)

Data collection forms the foundation of data mining. Without data, mining has no meaning.

Often, the number of data sources, the volume of data, and the quality of the data determine the outcomes of our data mining efforts.

For example, in quantitative investing, you predict future stock fluctuations based on big data and execute buy/sell operations accordingly.

If you have access to all historical stock data, can you build a highly accurate predictive analysis system based on that data?

In reality, if you only have historical stock data, you still cannot understand why stocks experience significant fluctuations.

For instance, a SARS outbreak or a regional war might have occurred at the time, both of which could heavily influence stock prices.

Thus, it’s crucial to recognize that the trends of a dataset are influenced by multiple dimensions.

To achieve high-quality data mining results, we need to collect data from multiple sources to cover as many dimensions as possible while ensuring data quality.

So, from a data collection perspective, what are the main data sources? I categorize data sources into four types:

  1. Open Data Sources

  2. Web Crawling

  3. Sensors

  4. Log Collection

Each of these has its own characteristics. Open data sources are generally databases tailored to specific industries.

For example, the U.S. Census Bureau provides open data on population demographics, regional distributions, and education levels.

Apart from governments, businesses and universities also release relevant big data. North America performs relatively well in this regard.

In China, regions like Guizhou have made bold attempts to build cloud platforms and to gradually open up data on tourism, transportation, and commerce.

It’s worth noting that many research projects are based on open data sources. Otherwise, there wouldn’t be so many academic papers published annually; researchers need comparable datasets to evaluate algorithm performance.
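Once an open dataset has been downloaded, working with it usually starts with parsing and a quick summary. Here is a minimal sketch using Python's standard library; the CSV content and column names are hypothetical stand-ins, not the actual schema of any Census Bureau dataset:

```python
import csv
import io

# Hypothetical extract from an open demographics dataset.
# A real open data source (e.g., the U.S. Census Bureau) would be
# downloaded as a file or fetched from an API first.
raw_csv = """region,population,median_age
North,120000,34.5
South,98000,31.2
East,154000,36.8
"""

def summarize(csv_text):
    """Parse the CSV and return total population and the most populous region."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(int(r["population"]) for r in rows)
    largest = max(rows, key=lambda r: int(r["population"]))["region"]
    return total, largest

total, largest = summarize(raw_csv)
print(total, largest)  # 372000 East
```

The same pattern (read, parse into records, aggregate) applies regardless of which open data portal the file came from.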

Web Crawling is generally used for specific websites or apps.

If you want to scrape data from a specific website, such as product reviews on shopping sites, you would need tailored crawling tools.
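As a sketch of what such a tailored crawler does after fetching a page, the snippet below extracts product reviews from a static HTML fragment using only the standard library. The page markup and the `class="review"` selector are invented for illustration; a real crawler would first download the page with `urllib` or a similar client:

```python
from html.parser import HTMLParser

# A static HTML snippet standing in for a downloaded product page.
page = """
<html><body>
<div class="review">Great quality, fast shipping.</div>
<div class="review">Battery life could be better.</div>
</body></html>
"""

class ReviewExtractor(HTMLParser):
    """Collect the text of every <div class="review"> element."""
    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "review") in attrs:
            self._in_review = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review and data.strip():
            self.reviews.append(data.strip())

parser = ReviewExtractor()
parser.feed(page)
print(parser.reviews)
```

Each target site needs its own extraction rules like these, which is why crawling tools tend to be site-specific.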

Sensors, as a data source, typically collect physical information like images, videos, or data on an object’s speed, temperature, and pressure.
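Raw sensor streams are usually noisy, so a common first processing step is smoothing. The sketch below uses simulated temperature readings; in practice the samples would arrive from a serial port, an IoT message broker, or a device SDK:

```python
from statistics import mean

# Simulated temperature readings (in degrees C) from a sensor.
readings = [21.0, 21.2, 25.9, 21.1, 21.3, 21.2]

def moving_average(samples, window=3):
    """Smooth raw sensor samples with a simple sliding-window mean."""
    return [round(mean(samples[i:i + window]), 2)
            for i in range(len(samples) - window + 1)]

print(moving_average(readings))
```

The spike at 25.9 is damped in the smoothed series, which is why this kind of filtering often precedes any downstream analysis of physical measurements.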

Finally, Log Collection involves tracking user operations.

By embedding tracking points in the front end or using scripts in the back end, you can analyze website visits, user bottlenecks, and more.
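A back-end script for this kind of analysis typically parses the server's access log and aggregates requests per page. Below is a minimal sketch; the log format here is a simplified invention (IP, method, path, status), not the exact format any particular server emits:

```python
import re
from collections import Counter

# Sample access-log lines in a simplified format; real logs would be
# read from a file written by the web server.
log_lines = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /signup 200",
    "10.0.0.1 POST /signup 500",
    "10.0.0.3 GET /home 200",
]

LOG_PATTERN = re.compile(r"(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})")

def page_views(lines):
    """Count requests per path, a first step toward spotting user bottlenecks."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts

views = page_views(log_lines)
print(views.most_common())
```

From counts like these you can move on to funnels and drop-off analysis, which is where the embedded tracking points pay off.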

© 2025 Meng Li