Web Scraping: The Good, the Bad and the Ugly

19 September 2024

From Innovation to Privacy Risks, and How Websites Defend Against It

What once started as an experiment to measure the true size of the internet (Norman, 2020) has long since become an integral part of it. Web scraping is not a new topic: it first emerged and gained popularity in the early 1990s. But what exactly is web scraping? In short, it is the extraction of data from a website for analysis or retrieval (Zhao, 2017). The current excitement around large language models (LLMs) like OpenAI’s GPT has renewed the importance of web scraping. These models rely on massive, diverse, and current datasets to improve their performance, and web scraping makes it possible to aggregate such datasets at scale.

But is web scraping more helpful or harmful, and what can websites do to prevent it?

The Good, the Bad and the Ugly

The Good

Web scraping can be a valuable tool for research and innovation. For instance, search engines rely on scraping to index websites and provide answers directly on search pages. Beyond this, scholars use web scraping to gather data that would otherwise be inaccessible. For example, monitoring Dark Web activity benefits fields like cybersecurity and social science (Bradley & James, 2019).

The Bad

However, scraping often disregards website terms and conditions, raising ethical and legal questions (Krotov et al., 2020). In the U.S., scraping has been challenged numerous times under laws like the Computer Fraud and Abuse Act (CFAA), with high-profile cases such as hiQ Labs v. LinkedIn. In Europe, scraping also carries legal risks, especially when it is performed without consent.

The Ugly

At its worst, scraping can lead to serious breaches of privacy. Scrapers can collect sensitive data, including login credentials and personal information. Worse still, LLMs trained on scraped data may unintentionally memorize and expose this information, creating privacy concerns in AI (Al-Kaswan & Izadi, 2023).

Defending Against Scraping

To protect against web scraping, websites employ various techniques. Common defenses include requiring users to log in, implementing CAPTCHA challenges, and restricting access to private content (Turk et al., 2020). For instance, some websites require registration before allowing access to certain information, while others require multi-factor authentication (MFA), which is intended to make automated logins harder. Additionally, rate limiting is used to block scrapers once they exceed a certain number of requests. Other tactics include detecting known scraper IP addresses and blocking them with a blacklist.
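Two of the defenses above, rate limiting and IP blacklisting, can be illustrated with a minimal Python sketch. It is not how any particular website implements them: the class name `RateLimiter`, its parameters, and the example IP addresses (taken from the ranges reserved for documentation) are assumptions made for illustration, and a real deployment would usually enforce this at the web server or proxy level.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per IP.

    Hypothetical helper for illustration only.
    """

    def __init__(self, max_requests, window_seconds, blocklist=None):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.blocklist = set(blocklist or [])  # IPs that are always refused
        self.history = defaultdict(deque)      # IP -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request from `ip` should be served."""
        if ip in self.blocklist:
            return False
        now = time.monotonic() if now is None else now
        recent = self.history[ip]
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # too many requests in the current window
        recent.append(now)
        return True


# Example setup: 3 requests per 10 seconds, one blacklisted address.
limiter = RateLimiter(max_requests=3, window_seconds=10,
                      blocklist={"203.0.113.9"})
```

With this setup, a fourth call to `limiter.allow("198.51.100.1")` within ten seconds is refused, while requests from other addresses are still served. Because the limit is tracked per IP address, spreading requests over many addresses sidesteps it.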

However, these mechanisms are not foolproof. Scrapers, which are increasingly powered by AI, can now mimic human behavior such as typing delays and can solve CAPTCHAs (Yu & Darling, 2019). In addition, proxy networks are used to circumvent rate limiting and IP bans.

This back-and-forth between website hosts and scraping technologies has turned into an ongoing arms race, with AI being leveraged on both sides.

Fun fact: CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart (Google, n.d.).

References

Al-Kaswan, A., & Izadi, M. (2023). The (ab)use of Open Source Code to Train Large Language Models. https://api.semanticscholar.org/CorpusID:257219963

Bradley, A., & James, R. (2019). Web scraping using R. https://www.semanticscholar.org/paper/Web-Scraping-Using-R-Bradley-James/f5e8594d28f8425490a17e02b5697a26c5b54d03

Google. (n.d.). What is ReCAPTCHA? Google Support. Retrieved September 19, 2024, from https://support.google.com/recaptcha/?hl=en#:~:text=A%20%E2%80%9CCAPTCHA%E2%80%9D%20is%20a%20turing,users%20to%20enter%20with%20ease.

Krotov, V., Johnson, L., & Silva, L. (2020). Legality and ethics of web scraping. Communications of the Association for Information Systems, 47, 539–563. https://doi.org/10.17705/1cais.04724

Norman, J. (2020, September). Matthew Gray develops the world wide web wanderer. Is this the first web search engine? HistoryofInformation.com. Retrieved September 19, 2024, from https://historyofinformation.com/detail.php?id=1050

Turk, K., Pastrana, S., & Collier, B. (2020). A tight scrape: Methodological Approaches to cybercrime Research data collection in adversarial environments. 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). https://doi.org/10.1109/eurospw51379.2020.00064

Yu, N., & Darling, K. (2019). A Low-Cost approach to crack Python CAPTCHAs using AI-Based Chosen-Plaintext attack. Applied Sciences, 9(10), 2010. https://doi.org/10.3390/app9102010

Zhao, B. (2017). Web scraping. In Springer eBooks (pp. 1–3). https://doi.org/10.1007/978-3-319-32001-4_483-1

Your Profile Is Being Scraped

18 September 2020

Facial recognition has been gaining interest over the last few years. All around the internet, and also on this forum, more and more is being written about facial recognition itself, its positive and negative effects, and the underlying technologies. Major companies are competing to develop better algorithms and are selling their technologies as cloud services. Easy APIs make it possible for every tech-savvy person to use those services within minutes. Still, facial recognition remains more theory than action. Current news items often discuss a few local tests or the introduction of video tracking in law enforcement. The major steps in facial recognition are being made in China, where facial identification and payment are becoming mainstream. But over the last year one company’s name has popped up several times, attracting the interest of several tech journalists: Clearview AI.

Many people nowadays have a social media profile, often with a public name, profile picture and some basic information. Of course it would be possible to visit every page and collect user information, but no one ever took the time to do this or saw the benefits of doing so, except the startup Clearview AI.

Scraping is the act of automatically extracting public data from the internet. Every website can be scraped, including, for example, all the data and text on this blog. Clearview AI performed these scraping operations on a huge scale: it started scraping the public profiles on Facebook and saved the data in one big database. If your profile picture and name are public on one of your social media accounts, as they probably are for most profiles, it is likely that they are included in the Clearview AI database.
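To make "automatically extracting data" concrete, here is a minimal sketch that uses only Python's standard library. It pulls the headings and link targets out of a small, made-up HTML page; the class name `HeadingAndLinkParser`, the sample markup and the `/posts/...` paths are assumptions for illustration and do not reflect this blog's real structure. A real scraper would first download the page over HTTP and should respect the site's terms.

```python
from html.parser import HTMLParser


class HeadingAndLinkParser(HTMLParser):
    """Collects the text of every <h2> and the href of every <a>."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Only keep text that appears inside a heading.
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


# Made-up page standing in for a downloaded blog overview.
SAMPLE_HTML = """
<html><body>
  <h2>Web Scraping: The Good, the Bad and the Ugly</h2>
  <p>Read <a href="/posts/web-scraping">the post</a>.</p>
  <h2>Your Profile Is Being Scraped</h2>
  <p>Read <a href="/posts/profile-scraped">the post</a>.</p>
</body></html>
"""

parser = HeadingAndLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.headings)
print(parser.links)
```

Running the same parser over thousands of pages instead of one is, in essence, what turns this small script into scraping at scale.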

Would not every law enforcement agency be interested in finding a suspect with a few clicks? Robbers, fraudsters and cyberbullies are people too, most of them with a personal social media account. This is exactly what Clearview AI had in mind when developing its business model: scrape all publicly available data, train huge neural networks on it, and sell the result worldwide to law enforcement agencies, bundled in a good-looking application. According to a graph in the New York Times, this raises the number of searchable photos from the 411 million in the FBI’s own database to a staggering 3 billion in the Clearview AI application, all supported by an impressive artificial intelligence model.

This raises some important questions. Do we support facial recognition as a law enforcement tool? Is it legal to scrape information from social networks? Does making your profile public also imply that you give permission for your data to be saved and used for AI training purposes?

Besides the negative sides of web scraping, these methods also offer interesting possibilities. You could, for example, scrape this blog and analyze word usage or identify trends and topics of interest over time. Web scraping also enables innovations that aggregate data from multiple sources in creative ways, creating information that was not available before.
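As a rough illustration of the word-usage idea, the sketch below counts the most frequent words in a handful of texts. It assumes the post texts have already been scraped; the three sample sentences, the small stop-word list and the function name `top_words` are all made up for this example.

```python
import re
from collections import Counter

# Tiny, hand-picked stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in",
             "on", "for", "it", "this", "that", "at", "from"}


def top_words(text, n=3):
    """Return the n most common non-stop-words in `text` with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Made-up stand-ins for scraped blog posts.
posts = [
    "Web scraping is the extraction of data from a website.",
    "Scraping public profiles raises privacy questions.",
    "Facial recognition and scraping: privacy at risk.",
]

print(top_words(" ".join(posts), n=2))
# prints [('scraping', 3), ('privacy', 2)]
```

Grouping the posts by publication year before counting would give a simple view of how topics shift over time.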

The New York Times has an article that goes into more depth on the background of Clearview AI (Hill, 2020). The full article and the accompanying podcast are worth your time if you are interested.

I would love to hear your opinion on web scraping and the use of facial recognition. If you would like a more technical background on how to implement web scraping techniques, please let me know in the comments.

 

Sources

Hill, K. (2020, January 18). The Secretive Company That Might End Privacy as We Know It. The New York Times. https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html

Matsakis, L. (2020, January 27). Scraping the Web Is a Powerful Tool. Clearview AI Abused It. Wired. https://www.wired.com/story/clearview-ai-scraping-web/

 

 
