If you read The Hindu or any other daily newspaper and want to see an edition published two years ago, you have to visit a library or the newspaper's office and look through its archives. If you are reading this article, you probably already know what an archive is.
On the web, this job is done by the Internet Archive, a non-profit organization that lets users view articles and websites as they appeared years ago through its Wayback Machine. You may have heard the name; it is well known and widely used by students, researchers, and journalists. The goal is not to scrape websites but to preserve them for the future, whether for reference, research, or any other purpose.
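To make the idea of looking up an old copy of a page concrete, here is a minimal Python sketch built around the Wayback Machine's public availability endpoint. It only builds the query URL and reads a response; it makes no network call. The function names are hypothetical, and the sample response is made up for illustration, not real archive data.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp):
    """Build a query asking for the snapshot closest to `timestamp` (YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot(response_text):
    """Return the URL of the closest available snapshot, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative response only (invented for this example, not real archive data).
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20230101000000/http://example.com/",
            "timestamp": "20230101000000",
            "status": "200",
        }
    }
})

query_url = build_availability_url("example.com", "20230101")
snapshot_url = closest_snapshot(sample_response)
```

In real use, the text fetched from `query_url` would be passed to `closest_snapshot`; a page that was never archived would simply give `None`.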
Recently, however, news organizations such as The Guardian, The New York Times, The Financial Times, and USA Today have been cutting off the Internet Archive's access to their sites. The main reason they are blocking the Internet Archive's bot is AI models. AI models need a huge amount of content to answer users' queries, but much of that content is taken from major news websites and blogs without proper permission, and big tech companies are reluctant to pay for the data they scrape. OpenAI is already facing a lawsuit over allegedly scraping data illegally. News companies claim that AI companies have not stopped since then and have instead turned to the Internet Archive, because it holds copies of so much of the web.
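The article does not say how these publishers block the bot. One common mechanism on the web is a rule in a site's robots.txt file, so the sketch below shows that approach only as an assumption. The robots.txt content, the site `news.example`, and the helper `is_allowed` are all hypothetical, and the crawler name `archive.org_bot` is assumed here to be the Internet Archive's user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site: it shuts out one
# crawler by name (assumed user agent) and leaves everyone else alone.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


archive_allowed = is_allowed(
    SAMPLE_ROBOTS_TXT, "archive.org_bot", "https://news.example/article"
)
other_allowed = is_allowed(
    SAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://news.example/article"
)
```

A rule like this only works if the crawler chooses to honor it; sites that want a harder block may filter requests on the server side instead.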
This kind of block is not new. A few months ago, Reddit, the popular social media platform, also blocked the Internet Archive's bots from crawling its platform, giving the same reason as the news companies above. Other news outlets may well cite the same reason in the coming days. Yet the big media companies are willing to share their data with AI companies if they are paid well. If this continues, no one will get to see the beauty of the open web in the future, and people will no longer be able to use archived websites for reference or anything else.
Still, despite all of this, the Internet Archive is keeping the dream of transparency alive.