Need web data? Here’s how to harvest them
When Ensheng Dong co-created the Johns Hopkins University COVID‑19 Dashboard in January 2020, it was a labor of love. Dong, a systems engineer at Johns Hopkins in Baltimore, Maryland, had friends and family in China, including some in Wuhan, the site of the original outbreak. “I really wanted to see what was happening in their area,” he says. So Dong began collecting public-health data from the cities known to be affected.
At first, the work was manual. But as the outbreak became a pandemic, and the COVID-19 dashboard became the go-to source for governments and scientists seeking information on the spread of the disease, Dong and his colleagues struggled to keep up. In the United States alone, the team was tracking medical reports from more than 3,000 counties, he says. “We were updating at least three to four times a day,” he recalls, and there was no way for the team to maintain that relentless pace manually. Fortunately, he and his graduate adviser, systems engineer Lauren Gardner, found a more scalable solution: web scraping.
Scraping algorithms extract relevant information from websites and report it in a spreadsheet or other user-friendly format. Dong and his colleagues developed a system that could capture COVID-19 data from around the world and update the numbers without human intervention. “For the first time in human history, we can follow in real time what is happening with a global pandemic,” he says.
Similar tools collect data from various disciplines. Alex Luscombe, a criminologist at the University of Toronto in Canada, uses scraping to monitor law enforcement practices in Canada; Phill Cassey, a conservation biologist at the University of Adelaide, Australia, follows the global wildlife trade on internet forums; and Georgia Richards, an epidemiologist at the University of Oxford, UK, analyzes coroners’ reports in search of preventable causes of death. The technical skill required is not negligible, but neither is it overwhelming – and the benefits can be immense, allowing researchers to quickly collect large amounts of data without the errors inherent in manual transcription. “There are so many resources and so much information available online,” says Richards. “It’s just sitting there waiting for someone to come and use it.”
Get the goods
Some scientific databases, such as PubMed, and social networks, such as Twitter, provide application programming interfaces (APIs) that grant controlled access to their data. But for other sites, what you see is what you get, and the only way to turn website data into something you can work with is to laboriously copy the visible text, images and embedded files. And even when an API exists, sites can limit which data are available and how often they can be accessed.
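Where an API does exist, a query can be as simple as a well-formed URL. Here is a minimal sketch using PubMed’s public E-utilities service; the search term and result limit are arbitrary choices for illustration:

```python
from urllib.parse import urlencode

# NCBI's documented E-utilities endpoint for searching PubMed.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(term, retmax=20):
    """Build an esearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return BASE + "?" + urlencode(params)

url = pubmed_query_url("covid-19 surveillance")
# Fetching this URL (for example, with urllib.request) returns a JSON
# payload listing the matching PubMed record IDs.
```

Because the site controls the API, it also controls the quotas — which is why scraping remains the fallback for everything an API doesn’t expose.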
Scrapers offer an efficient alternative. After being “trained” to focus on particular elements of a page, these programs can collect data manually or automatically, and even on a schedule. Commercial tools and services often include user-friendly interfaces that simplify the selection of web-page elements to target. Some, such as the Web Scraper and Data Miner browser extensions, permit free manual or automated scraping of small numbers of pages. But scaling up can get expensive: services such as Mozenda and ScrapeSimple charge a minimum of US$250 per month for scraping projects. These tools might also lack the flexibility needed to tackle diverse websites.
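At its core, a scraper is just a parser that walks a page’s HTML and keeps only the targeted elements, then reports them in a spreadsheet-friendly format. A minimal sketch using only Python’s standard library — the HTML here is an invented stand-in for a page that a real scraper would first download, for example with urllib.request:

```python
import csv
import io
from html.parser import HTMLParser

# A trimmed stand-in for a fetched page; the table contents are invented.
PAGE = """
<table id="cases">
  <tr><td>Region A</td><td>120</td></tr>
  <tr><td>Region B</td><td>87</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)

# Report the result as CSV, ready for a spreadsheet.
buf = io.StringIO()
csv.writer(buf).writerows(scraper.rows)
print(buf.getvalue())
```

Running this on a schedule — say, with cron — is what turns a one-off extraction into the kind of automated pipeline the article describes.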
Crack the code
Simple web-scraping projects require relatively modest coding skills. Richards says her team solves most problems “by googling how to fix an error”. But a good understanding of web design and coding fundamentals provides a valuable edge, she adds.
“I mostly use developer mode now,” says Luscombe, referring to the browser setting that lets users strip away a website’s familiar facade to see the raw HTML and other programming code below. But there are tools that can help, including the SelectorGadget browser extension, which provides a user-friendly interface for identifying the ‘tags’ associated with specific website elements.
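Once developer mode or SelectorGadget has revealed which tag or class marks the data of interest, the scraper can filter on it. A sketch, again using only Python’s standard library; the class name “price” and the HTML snippet are invented examples of what developer mode might expose:

```python
from html.parser import HTMLParser

# Invented raw HTML of the kind developer mode reveals.
PAGE = ('<div><span class="price">$40</span>'
        '<span class="note">used</span>'
        '<span class="price">$55</span></div>')

class ClassScraper(HTMLParser):
    """Collect text from elements carrying a target class attribute.

    A simplification: void tags such as <br> (which never close) would
    confuse the depth counter, so this sketch assumes well-paired tags.
    """
    def __init__(self, target):
        super().__init__()
        self.target, self.hits, self._depth = target, [], 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.target in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.hits.append(data.strip())

s = ClassScraper("price")
s.feed(PAGE)
print(s.hits)  # → ['$40', '$55']
```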
The complexity of a scraping project is largely determined by the site being targeted. Forums usually have fairly standard layouts, and a scraper that works on one can often be adapted to another with little effort. But other sites are more problematic. Cassey and his colleagues monitor sales of plants and animals that are either illegal or potentially ecologically harmful, and the forums hosting such transactions can appear and disappear without warning, or change their design. “They tend to be a lot more changeable, to try to limit how easily out-of-the-box web scrapers can just come along and collect information,” Cassey explains. Other websites might contain encrypted HTML elements or complex dynamic features that are difficult to decipher. Even sloppy web design can sabotage a scraping project – a problem Luscombe often grapples with when scraping government-run websites.
The desired data may not be available as HTML-encoded text. Chaowei Yang, a geospatial researcher at George Mason University in Fairfax, Virginia, oversaw the development of the COVID-Scraper tool, which extracts data on pandemic cases and mortality from around the world. He notes that in some jurisdictions, this data was locked into PDF documents and JPEG image files, which cannot be extracted with conventional scraping tools. “We had to find the tools that could read the datasets, and also find local volunteers to help us,” Yang explains.
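Once a PDF-reading library (such as pdfplumber) or an OCR tool has turned such files into plain text, ordinary text parsing can recover the numbers. A sketch of that second step, with invented report wording; the extraction itself is assumed to have already happened:

```python
import re

# Invented example of text that a PDF reader or OCR tool might return
# from a jurisdiction's daily situation report.
EXTRACTED = """Daily report, 12 March
Confirmed cases: 1,204
Deaths: 17"""

def parse_report(text):
    """Pull labelled counts out of free text, dropping thousands separators."""
    counts = {}
    for label, number in re.findall(r"(Confirmed cases|Deaths):\s*([\d,]+)", text):
        counts[label] = int(number.replace(",", ""))
    return counts

print(parse_report(EXTRACTED))  # → {'Confirmed cases': 1204, 'Deaths': 17}
```

The brittle part, as Yang’s experience suggests, is that every jurisdiction words its reports differently — which is where the local volunteers came in.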
Data due diligence
Once you figure out how to scrape your target site, you need to think about how to do it ethically.
Websites typically specify terms of service that establish rules for collecting and reusing data. These are often permissive, but not always: Luscombe thinks some sites weaponize the terms to prevent bona fide research. “I work against tons of powerful criminal justice agencies that really have no interest in me having data on the race of the people they arrest,” he says.
Many websites also provide “robots.txt” files, which specify acceptable operating conditions for scrapers. These are designed in part to prevent automated queries from overwhelming servers, but generally leave room for routine data collection. Following these rules is considered good practice, even if it prolongs the scraping process, for example by creating delays between each page request. “We don’t mine things at a faster rate than a user would,” Cassey explains. Researchers can also minimize server traffic by scheduling scraping jobs during off-peak hours, such as the middle of the night.
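Honoring those rules takes only a few lines, because Python’s standard library can read robots.txt directly. A sketch with an invented robots.txt; a real scraper would load the file from the target site rather than a string:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt of the kind many sites publish.
ROBOTS = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

ok = rp.can_fetch("my-scraper", "https://example.org/data/page1.html")
blocked = rp.can_fetch("my-scraper", "https://example.org/private/x.html")

# Respect the requested pause between page requests, with a conservative
# default if the site specifies none; call time.sleep(delay) between fetches.
delay = rp.crawl_delay("my-scraper") or 10
```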
If private and personally identifiable data is collected, additional precautions may be necessary. Researchers led by Cédric Bousquet at the University Hospital of Saint-Étienne in France have developed a tool called Vigi4Med, which scrapes medical forums to identify adverse events associated with drugs that may have escaped notice during clinical trials. “We anonymized user IDs, and they were separated from other data,” says Bissan Audeh, who helped develop the tool as a postdoctoral researcher in Bousquet’s lab. “The team that worked on the data annotation had no access to these usernames.” But contextual cues from online posts still potentially allow re-identification of anonymized users, she says. “No anonymization is perfect.”
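One common approach to the kind of pseudonymization Audeh describes — though not necessarily Vigi4Med’s actual implementation — is to replace usernames with salted hashes, with the salt kept apart from the data that annotators see:

```python
import hashlib

# The salt must be stored separately from the released data; anyone
# holding both could recompute the mapping.
SALT = b"keep-this-secret-and-separate"

def pseudonymize(username):
    """Replace a username with a salted hash so annotators never see it."""
    return hashlib.sha256(SALT + username.encode("utf-8")).hexdigest()[:12]

posts = [("patientA", "felt dizzy after dose"),
         ("patientA", "better now"),
         ("docfan99", "no side effects")]
released = [(pseudonymize(user), text) for user, text in posts]

# The same user maps to the same pseudonym, so discussion threads stay
# linkable without exposing identities.
```

As Audeh notes, this protects the identifier but not the context: free text in the posts themselves can still re-identify people.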
Scraping projects don’t end when the harvest is over. “All of a sudden, you’re dealing with massive amounts of unstructured data,” Cassey says. “It becomes more of a data processing problem than a data getting problem.”
Johns Hopkins’ COVID dashboard, for example, requires careful fact-checking to ensure accuracy. The team ended up developing an anomaly detection system that flags unlikely changes in numbers. “Let’s say a small county that was reporting 100 cases every day is reporting maybe 10,000 cases,” Dong says. “It could happen, but it’s very unlikely.” Such cases trigger closer inspection of the underlying data — a task that depends on a small army of multilingual volunteers who can decipher each country’s COVID-19 reports. Even something as simple as a typo or a change in date formatting can stall a data analysis pipeline.
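A rudimentary version of such an anomaly detector needs only a recent baseline and a threshold. This sketch is not the dashboard team’s actual system, and the factor of ten is an arbitrary choice:

```python
def flag_anomaly(history, today, factor=10):
    """Flag a count that jumps more than `factor` times the recent average."""
    baseline = sum(history) / len(history)
    return today > factor * max(baseline, 1)  # guard against a zero baseline

daily = [100, 98, 103, 99, 101]           # a county reporting ~100 cases/day
print(flag_anomaly(daily, 10_000))        # → True: worth a human look
print(flag_anomaly(daily, 130))           # → False: plausible fluctuation
```

Flagged values go to a human, not straight to the bin — as Dong says, a huge jump could happen; it is just unlikely.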
For Cassey’s wildlife-trade project, figuring out which species are actually being sold — and whether those transactions are legal — keeps the team on its toes. If sellers know they are breaking the law, they will often mask transactions with deliberately misleading or ‘street’ names for the plants and animals, much as online drug dealers do. For one particular parrot species, for example, the team has found 28 “trade names”, he says. “A lot of fuzzy data matching and natural-language-processing tools are needed.”
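Python’s standard library includes a simple fuzzy matcher that illustrates the idea, though real pipelines use far richer lexicons and NLP models. The trade names below are invented examples:

```python
from difflib import get_close_matches

# An invented toy lexicon of known trade names for one species.
KNOWN_NAMES = ["african grey parrot", "grey parrot", "congo grey"]

def match_species(listing, cutoff=0.75):
    """Map a noisy listing title onto a known trade name, if any is close."""
    hits = get_close_matches(listing.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_species("African Gray Parrot"))  # → 'african grey parrot'
print(match_species("orchid seeds"))         # → None
```

The cutoff is a trade-off: too low and unrelated listings match; too high and deliberate misspellings slip through.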
Still, Richards says would-be scrapers shouldn’t be afraid to explore. Start by repurposing an existing web scraper: Richards’ team adapted its software for analyzing coroners’ reports from a colleague’s tool for collecting clinical-trials data. “There are so many platforms and there are so many online resources,” she says. “Just because you don’t have a colleague who’s done web scraping before doesn’t stop you from trying.”