9 Projects That Prove Web Scraping Is Revolutionizing Research

Web scraping is revolutionizing academic and professional research by enabling data collection at scale. Advanced collection techniques extract more data at faster rates, opening new research opportunities in healthcare, finance, ecology, politics, and economics.

The Digital Landscape Makes New Research Possible

New data sources from across the world are continuously being created as people increasingly conduct business, personal, and professional transactions online. As these sources expand, researchers are finding new opportunities to develop their research and obtain new insights.

Advanced insights can also lead to new questions, creating a cycle that drives further research and increases understanding of the subject matter. As a result, researchers improve their findings, derive increasingly accurate conclusions, and produce better solutions to problems affecting people, businesses, and governments.

How Researchers Obtain High-Quality Data

Legacy data sources include journals, purchased data sets, and information collected manually from the internet. Besides being resource-intensive, these methods typically require hours of manual entry into spreadsheets, a process that is tedious, time-consuming, and prone to error.

Today’s research landscape is vastly superior. Researchers now access a trove of online data covering nearly every subject. Examples include financial websites with historical stock information, public databases with clinical drug trials, and online marketplaces with detailed product and pricing information.

Modern data gathering methods enable researchers to extract that information at scale and automatically update their databases. For example, imagine an online resource with thousands of stocks, including historical pricing information, current news, and trading volumes. Web scraping makes it possible to issue thousands of data requests to that website per second and deliver the information in a spreadsheet format that analysts can easily read.
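As a rough illustration, the sketch below fetches quote data for a handful of tickers in parallel and writes the results to a CSV file. The site, URL scheme, and JSON fields are hypothetical placeholders; real sources have their own structure, rate limits, and terms of use.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

import requests

TICKERS = ["AAA", "BBB", "CCC"]  # placeholder stock symbols


def fetch_quote(ticker: str) -> dict:
    """Request one ticker's page and return the fields of interest."""
    url = f"https://example-finance.com/quote/{ticker}"  # hypothetical endpoint
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()  # assume the hypothetical site returns JSON
    return {"ticker": ticker, "price": data["price"], "volume": data["volume"]}


# Issue many requests in parallel, then write the results to a CSV file
# that analysts can open directly in a spreadsheet.
with ThreadPoolExecutor(max_workers=8) as pool:
    quotes = list(pool.map(fetch_quote, TICKERS))

with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["ticker", "price", "volume"])
    writer.writeheader()
    writer.writerows(quotes)
```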

How Web Scraping Works

Advanced web scraping requires the creation of scripts (or “bots”) written in a programming language like Python to crawl websites and extract data. Alternatively, smaller or personal data extraction projects can be executed using browser extensions that parse website HTML and export the information in a spreadsheet format.
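For a sense of what such a script looks like, here is a minimal sketch using the requests and BeautifulSoup libraries. The target URL and CSS selectors are placeholders, not a real data source.

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/listings"  # placeholder target page

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select("div.listing"):  # hypothetical container element
    name = item.select_one("h2.title")
    price = item.select_one("span.price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Export the parsed records in a spreadsheet-friendly CSV format.
with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```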

Another alternative is a web scraping API that can be easily customized. Researchers opting for this solution can quickly extract information at scale and avoid many common process challenges, allowing them to focus on obtaining insights for research purposes.
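The exact interface varies by provider, but a hosted scraping API typically accepts the target URL plus some extraction options and returns structured data. The endpoint, parameters, and key below are hypothetical, not a specific vendor's API.

```python
import requests

API_ENDPOINT = "https://api.example-scraper.com/v1/extract"  # placeholder service
API_KEY = "YOUR_API_KEY"  # placeholder credential

payload = {
    "url": "https://example.com/products",  # page to scrape
    "render_js": True,                      # ask the service to render JavaScript
    "output": "json",                       # request structured output
}

response = requests.post(
    API_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

records = response.json()
print(f"Received {len(records)} records")
```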

9 Research Projects Enhanced By Web Scraping

Web scraping enables new research into economics, healthcare, ecology, and politics by allowing researchers to gather data from emerging online resources. Without automation, some of these projects would have required hundreds of hours of manual data collection, entry, and processing, if they could have been completed at all.

Some recent examples include:

Opioid-Related Death Tracking

Oxford researchers downloaded over 3000 PDF documents to study opioid deaths in the United Kingdom. Web scraping made it possible to scale the project considerably so they could focus on other research-related tasks. “We could manually screen and save about 25 case reports every hour,” reads an article in Nature describing the project. “Now, our program can save more than 1,000 cases per hour while we work on other things, a 40-fold time saving.”

Automating data collection also opened the door to collaboration. By publishing the database and re-running the program frequently, the researchers could share up-to-date findings with the academic community, enriching the project.

Tracking Clinical Trials

The Oxford researchers studying opioid deaths in the previous example also used web scraping to gather information from clinical-trial registries, further developing their published data set tracking primary-care prescribing in England.

GDP Nowcasting

Government entities typically announce gross domestic product (GDP) figures on a quarterly basis. Web scraping enables researchers to estimate GDP far more frequently, ahead of the official releases.

GDP is calculated by adding consumption, investment, government spending, and net exports. Proxies for most of these components can be observed at higher frequency and scraped from online sources, allowing researchers to build models that predict GDP ahead of official announcements.
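As a toy illustration of the expenditure identity behind such nowcasts, the snippet below sums made-up component estimates; the figures are placeholders, not real data.

```python
# Toy illustration of the expenditure identity GDP = C + I + G + (X - M);
# the component values below are made-up placeholders, not real estimates.
components = {
    "consumption": 14.0,   # C: household spending (e.g. scraped retail activity)
    "investment": 4.0,     # I: business investment
    "government": 3.5,     # G: government spending
    "exports": 2.5,        # X
    "imports": 3.0,        # M
}

net_exports = components["exports"] - components["imports"]
gdp_nowcast = (
    components["consumption"]
    + components["investment"]
    + components["government"]
    + net_exports
)
print(f"GDP nowcast: {gdp_nowcast:.1f}")  # 14.0 + 4.0 + 3.5 - 0.5 = 21.0
```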

Central banks throughout the world currently leverage this method, including those of the United States, the European Union, South Africa, China, Brazil, and Japan.

COVID-19 Tracking

The Bank of Japan (BOJ) actively uses alternative data, meaning information outside official government and corporate reports, to evaluate key economic sectors and develop policy. Recent applications include data collected during COVID-19 on pedestrian traffic, financial transactions, and airport visits.

Price Inflation

Researchers from Poland gathered food and non-alcoholic beverage prices from major online retailers and created a framework to estimate inflation rates in the near term (also called a “nowcast”). They demonstrated that accounting for online food prices in a simple, recursively optimized model effectively predicts inflation and even outperforms traditional approaches.
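The snippet below is only an illustrative sketch of the general idea of a recursively re-estimated nowcasting regression, using synthetic data rather than the researchers' actual model or price series: each period, inflation is regressed on scraped online price growth using only the observations available so far, and the fitted model nowcasts the next value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: monthly growth of a scraped online food-price index
# and the official inflation rate it helps to predict.
online_price_growth = rng.normal(0.3, 0.2, size=36)
official_inflation = 0.1 + 0.8 * online_price_growth + rng.normal(0, 0.05, size=36)

nowcasts = []
for t in range(12, len(official_inflation)):
    # Re-estimate the regression using only observations available before t.
    X = np.column_stack([np.ones(t), online_price_growth[:t]])
    beta, *_ = np.linalg.lstsq(X, official_inflation[:t], rcond=None)
    # Nowcast period t from the already-scraped online price signal.
    nowcasts.append(beta[0] + beta[1] * online_price_growth[t])

errors = official_inflation[12:] - np.array(nowcasts)
print(f"Mean absolute nowcast error: {np.abs(errors).mean():.3f}")
```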

Unemployment Insurance Claims

Unemployment insurance claims reached record highs during COVID-19, highlighting the need for timely forecasts of jobless rates. In a recent paper, researchers explored which information sets and data structures from the spring of 2020 could be used to predict job losses in the United States and to forecast unemployment claims.

Ecological Insights

Environmental researchers are extracting data from Google Trends, news articles, and social media to get insights into species occurrences, behaviours, traits, phenology, functional roles, and abiotic environmental features. Referred to as “iEcology”, this emerging research approach aims to quantify patterns and processes in the natural environment using digital data from public sources.
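As one hedged example of this kind of data pull, the sketch below queries Google Trends through the third-party pytrends package (an unofficial client whose interface can change); the species keywords are purely illustrative.

```python
from pytrends.request import TrendReq

# Query weekly search interest for two illustrative species names.
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["monarch butterfly", "lionfish"],
                       timeframe="today 5-y")

interest = pytrends.interest_over_time()  # DataFrame of interest indexed by date
print(interest.tail())
```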

Political Sentiment

Internet users are becoming increasingly vocal about political matters on social media networks and public forums. Political groups are leveraging this trend by scraping online sources to identify critical issues and using that data to formulate campaign content.

ESG Data

Environmental, Social, and Governance (ESG) investment guidelines are designed to address concerns such as climate change, greenhouse gas emissions, water management, and waste reduction. Investment managers and financial analysts can assess an entity's adherence to these guidelines by scraping online databases containing ESG data.

Discover Scraping Solutions for Your Next Research Project

Publicly available online sources can be scraped in various ways depending on your project's size and scope. Discover the best solution for your needs by reading our free guide: Choosing the Right Scraping Solution in 2022: Essentials You Need to Know.