Web Scraping: The Good, the Bad and the Ugly

19 September 2024


From Innovation to Privacy Risks, and How Websites Defend Against It

What once started as an experiment to measure the true size of the internet (Norman, 2020) has long since become an integral part of it. Web scraping is not a new topic; it first emerged and gained popularity in the early 1990s. But what exactly is web scraping? In short, it is the extraction of data from a website for analysis or retrieval (Zhao, 2017). The current excitement around large language models (LLMs) like OpenAI's GPT has renewed the importance of web scraping. These models rely on massive, diverse, and current datasets to improve their performance, and such datasets can be aggregated at scale using web scraping.
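
To make the definition concrete, here is a minimal sketch of what "extracting data from a website" looks like in code. It is an illustrative example only: the URL, the user-agent string, and the choice to collect <h2> headlines are assumptions, not details from any particular site.

```python
# Minimal web-scraping sketch (illustrative; URL and selector are placeholders).
# Requires the third-party packages `requests` and `beautifulsoup4`.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"  # placeholder page to scrape

# Fetch the page, identifying ourselves with a custom user agent.
response = requests.get(URL, headers={"User-Agent": "research-bot/0.1"}, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out every <h2> headline on the page.
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

for headline in headlines:
    print(headline)
```

A few dozen lines like these, run repeatedly across many pages, are essentially what powers large-scale data collection.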

But is web scraping more helpful or harmful, and what can websites do to prevent it?

The Good, the Bad and the Ugly

The Good

Web scraping can be a valuable tool for research and innovation. For instance, search engines rely on scraping to index websites and provide answers directly on search pages. Beyond this, scholars use web scraping to gather data that would otherwise be inaccessible. For example, monitoring Dark Web activity benefits fields like cybersecurity and social science (Bradley & James, 2019).

The Bad

However, scraping often disregards website terms and conditions, raising ethical and legal questions (Krotov et al., 2020). In the U.S., scraping has been challenged numerous times under laws like the Computer Fraud and Abuse Act (CFAA), with high-profile cases such as hiQ Labs v. LinkedIn. In Europe, scraping likewise carries legal risks, especially when it is performed without consent.

The Ugly

At its worst, scraping can lead to serious breaches of privacy. Scrapers can collect sensitive data, including login credentials and personal information. Worse still, LLMs trained on scraped data may unintentionally memorize and expose this information, creating privacy concerns in AI (Al-Kaswan & Izadi, 2023).

Defending Against Scraping

To protect against web scraping, websites employ various techniques. Common defenses include requiring users to log in, implementing CAPTCHA challenges, and restricting access to private content (Turk et al., 2020). For instance, some websites require registration before allowing access to certain information, while others enforce multi-factor authentication (MFA) to make automated logins harder. Additionally, rate limiting is used to block scrapers after a certain number of requests, and blacklisted IP addresses are detected and blocked. A sketch of rate limiting follows below.
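
As a rough illustration of rate limiting, the sketch below counts requests per client IP inside a sliding window and rejects the excess with HTTP 429. The framework (Flask), the window length, and the request limit are assumptions chosen for the example, not details from the post.

```python
# Minimal per-IP rate limiting sketch (framework, window, and limit are illustrative).
import time
from collections import defaultdict, deque

from flask import Flask, request, abort

app = Flask(__name__)

WINDOW_SECONDS = 60        # look-back window
MAX_REQUESTS = 100         # requests allowed per IP within the window
hits = defaultdict(deque)  # client IP -> timestamps of recent requests

@app.before_request
def rate_limit():
    now = time.time()
    recent = hits[request.remote_addr]
    # Discard timestamps that have fallen out of the window.
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    if len(recent) >= MAX_REQUESTS:
        abort(429)  # Too Many Requests
    recent.append(now)

@app.route("/")
def index():
    return "Hello, human (we hope)!"
```

Real deployments typically push this logic into a reverse proxy or CDN rather than the application itself, but the principle is the same.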

However, these mechanisms are not foolproof. Scrapers, increasingly powered by AI, can now mimic human behavior such as typing delays and can even solve CAPTCHAs (Yu & Darling, 2019). In addition, proxy networks are used to circumvent rate limiting and IP bans, as sketched below.
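
To show why such defenses can be sidestepped, here is a sketch of a scraper that paces its requests with human-like random delays and rotates its traffic through a pool of proxies. The proxy addresses and the delay range are placeholders, not values from any real setup.

```python
# Sketch of human-like pacing and proxy rotation (all values are illustrative).
import random
import time

import requests

PROXIES = [
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
]

def fetch(url: str) -> str:
    proxy = random.choice(PROXIES)        # different exit IP on each request
    time.sleep(random.uniform(2.0, 8.0))  # human-like pause defeats naive rate limits
    response = requests.get(
        url, proxies={"http": proxy, "https": proxy}, timeout=10
    )
    response.raise_for_status()
    return response.text
```

From the website's perspective, each request now arrives slowly and from a different address, which is exactly what makes simple rate limiting and IP blacklisting insufficient on their own.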

This back-and-forth between website hosts and scraping technologies has turned into an ongoing arms race, with AI being leveraged on both sides.

Fun fact: CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart (Google, n.d.).

References

Al-Kaswan, A., & Izadi, M. (2023). The (ab)use of Open Source Code to Train Large Language Models. https://api.semanticscholar.org/CorpusID:257219963

Bradley, A., & James, R. (2019). Web scraping using R. https://www.semanticscholar.org/paper/Web-Scraping-Using-R-Bradley-James/f5e8594d28f8425490a17e02b5697a26c5b54d03

Google. (n.d.). What is reCAPTCHA? Google Support. Retrieved September 19, 2024, from https://support.google.com/recaptcha/?hl=en

Krotov, V., Johnson, L., & Silva, L. (2020). Legality and ethics of web scraping. Communications of the Association for Information Systems, 47, 539–563. https://doi.org/10.17705/1cais.04724

Norman, J. (2020, September). Matthew Gray develops the world wide web wanderer. Is this the first web search engine? HistoryofInformation.com. Retrieved September 19, 2024, from https://historyofinformation.com/detail.php?id=1050

Turk, K., Pastrana, S., & Collier, B. (2020). A tight scrape: Methodological approaches to cybercrime research data collection in adversarial environments. 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). https://doi.org/10.1109/eurospw51379.2020.00064

Yu, N., & Darling, K. (2019). A low-cost approach to crack Python CAPTCHAs using AI-based chosen-plaintext attack. Applied Sciences, 9(10), 2010. https://doi.org/10.3390/app9102010

Zhao, B. (2017). Web scraping. In Springer eBooks (pp. 1–3). https://doi.org/10.1007/978-3-319-32001-4_483-1
