Scraped data is being misused to scam people, but it also helped Covid-19 monitoring worldwide and benefited researchers
By Kiran N. Kumar
Facebook owner Meta has sued Octopus, a US subsidiary of a Chinese tech firm, and an individual from Turkey for scraping data from Facebook, Instagram and other big tech platforms, triggering a debate over the ethics of the practice as well as its advantages.
Octopus, which has more than one million customers, charges a fee to scrape data from Amazon, eBay, Twitter, Yelp, Google, Target, Walmart, Indeed, LinkedIn, Facebook and Instagram.
Turkey-based Ekrem Ates used automated Instagram accounts to scrape data on 350,000 Instagram users and published it on his own websites, or “clone sites”.
The software was able to scrape data about Facebook users, including email addresses, phone numbers, gender and date of birth. On Instagram, it collected data on followers, including name, user profile URL, location and the number of likes and comments per post.
Meta has long been fighting the menace of cloning and has succeeded in reducing 100 different Instagram clone sites to ten, as the scraped data is being misused to scam people and to damage the credibility of Meta's original Facebook and Instagram sites.
Yet scraping also helped Covid-19 monitoring worldwide and has benefited researchers in medicine, law and even environmental protection.
Scraping data during pandemic
As many will remember, Ensheng Dong and his team at Johns Hopkins University created a Covid-19 Dashboard in January 2020, which became a barometer for governments and scientists globally.
Dong, a systems engineer at the university in Baltimore, Maryland, and his team found scraping useful for getting data from Wuhan, China, where the Covid-19 outbreak was first reported.
As the outbreak became a pandemic and the Covid-19 Dashboard became the sole authentic source and had to scale up rapidly, Dong and his team turned to web scraping to capture information from thousands of websites and report it in a spreadsheet without human intervention.
“For the first time in human history, we can track what’s going on with a global pandemic in real time,” Dong told Nature.
Evolving tool for researchers
Web scraping is not new. Alex Luscombe, a criminologist at the University of Toronto in Canada, uses scraping to monitor law-enforcement practices in the country, while Phill Cassey, a conservation biologist at the University of Adelaide in Australia, has been using it to track the global wildlife trade on Internet forums.
Georgia Richards, an epidemiologist at the University of Oxford, UK, vets coroners’ reports for preventable causes of death. “There’s so many resources and so much information available online,” Richards says. “It’s just sitting there waiting for someone to come and make use of it.”
Scraping has since evolved, with sophisticated tools available commercially from service providers such as Mozenda and ScrapeSimple, which charge $250 per month.
But many academics still prefer open-source alternatives such as the Beautiful Soup package, Selenium and RSelenium, which they can build on and customize.
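To give a sense of what such a custom scraper does, here is a minimal sketch in Python. It is only an illustration: it uses the standard library's html.parser rather than Beautiful Soup or Selenium so that it runs without extra packages, and the sample page, the TableScraper class and the function names are all hypothetical. It pulls the cells out of an HTML table and writes them as spreadsheet-style CSV rows.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Cases</th></tr>
  <tr><td>Region A</td><td>120</td></tr>
  <tr><td>Region B</td><td>45</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in `html` as a list of rows of cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, ready to open in a spreadsheet."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

A real scraper would first download the page and would have to cope with messier markup, which is where packages such as Beautiful Soup come in.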
Web scraping has its own challenges
For instance, Cassey found that monitoring illegal sales of animals is a far more dynamic task. Forums hosting such transactions appear and disappear without warning, and the culprits use dubious and misleading names for plants and animals. For one particular parrot species, the team said it found 28 ‘trade names’.
Chaowei Yang, a geospatial researcher at George Mason University in Fairfax, Virginia, cites another challenge: most data is locked in PDF documents and JPEG image files, which cannot be mined using conventional scraping tools.
Some organizations are unwilling to share their data at all. “I work against tons of powerful criminal-justice agencies that really have no interest in me having data about the race of the people that they’re arresting,” Luscombe says.
Researchers at the University Hospital of Saint-Étienne in France anonymized user IDs when scraping medical forums to identify drug-associated adverse events.
But context clues can still reveal users' identities, says Bissan Audeh, who helped to develop the tool as a postdoctoral researcher in Bousquet’s lab. “No anonymization is perfect,” she says.
Still, respecting the rules of ethical scraping is considered best practice, even though it can make the process almost as slow as manual collection.
Even the Johns Hopkins Covid Dashboard team faced similar ethical questions, as the scraped data urgently required fact-checking for accuracy, which took an army of multilingual volunteers to decipher each country’s Covid-19 reports.
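One common way to hide user IDs in scraped posts is to replace each ID with a keyed hash. The Python sketch below is a hypothetical illustration of that idea, not the Saint-Étienne team's actual method; the make_pseudonymizer function and the sample user names are invented for the example. It also shows why such masking has limits: the text of a post is left untouched, so context clues in it can still point to a person.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(secret_key: bytes):
    """Return a function that maps a user ID to a stable pseudonym.

    The same ID always gets the same pseudonym under one key, so posts
    by one user can still be grouped, but the pseudonym cannot be
    recomputed without the key.
    """

    def pseudonymize(user_id: str) -> str:
        digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
        # Shortened for readability; assumption: 16 hex characters suffice here.
        return digest.hexdigest()[:16]

    return pseudonymize


if __name__ == "__main__":
    # The key is generated once and must be kept private by the researchers.
    pseudonymize = make_pseudonymizer(secrets.token_bytes(32))

    # Hypothetical scraped forum posts.
    posts = [
        {"user": "forum_user_1", "text": "Felt dizzy after the new tablets."},
        {"user": "forum_user_2", "text": "No side effects so far."},
        {"user": "forum_user_1", "text": "Dizziness gone after a week."},
    ]
    for post in posts:
        # Only the ID is masked; the post text is stored as it was written.
        print(pseudonymize(post["user"]), post["text"])
```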