Hi, I need help.
What is the correct way to block OpenAI (openai.com) from crawling my page?
I know I need to add it to the user agents to block in .htaccess, but I don't know what to enter.
OpenAI makes it very easy; they have pages telling you how to opt out of ChatGPT using your site: https://platform.openai.com/docs/gptbot and https://platform.openai.com/docs/plugins/bot. The GPTBot user agent is their indexer, i.e. the crawler which collects content for training. The ChatGPT-User agent is used for RAG (Retrieval-Augmented Generation), i.e. when ChatGPT does a web search on behalf of a user. I assume you want to prevent the former, not the latter; blocking the latter basically makes your site invisible to the plethora of users who (idiotically) use ChatGPT to search the web.
Essentially, you have two options:
Alternative A (BEST): Edit your robots.txt and add these lines to the bottom:
User-agent: GPTBot
Disallow: /
If you do not have a robots.txt file, copy Joomla's robots.txt.dist file to robots.txt.
Alternative B. Add GPTBot to the list of user agents to block in the .htaccess Maker page, and click on Save & Create .htaccess.
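For reference, here is roughly what such a block looks like if you were to write it by hand; this is a minimal sketch assuming Apache with mod_rewrite enabled, not the actual output of the .htaccess Maker, which generates more involved rules:
<IfModule mod_rewrite.c>
# Return 403 Forbidden to any request whose User-Agent header contains "GPTBot" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]
</IfModule>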
Why do I recommend Alternative A? Because it uses far fewer server resources. OpenAI respects robots.txt, the de facto web standard: their crawler reads that file when it visits your site, and if you tell it to stay away, it stays away. End of story. It won't try to crawl your site anymore. Alternative B returns either a 403 Forbidden or a 404 Not Found HTTP error when OpenAI tries to go through your site. They will still try to access each and every page they know about on your site, and your server still has to parse each request at least to the point of loading the .htaccess file before returning the HTTP 403 or 404 response. This ends up consuming more server resources.
Bonus: Alternative C. If you want to block more than just OpenAI from scraping your site, you may want to use CloudFlare and its AI Scrapers and Crawlers block feature; see https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click. This protects against the scumbags who ignore robots.txt and use misleading User-Agent strings to hide the fact that they are bots. This is not something Admin Tools or any other WAF installed on your server can do. CloudFlare can only do that because they have insight into billions of page loads taking place all over the world every day.
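For the crawlers which do honour robots.txt, you can also simply extend Alternative A with more entries. A sketch, using the user agent strings these vendors document at the time of writing (they do change, so double-check each vendor's documentation before relying on this):
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /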
Hot take: why I recommend against doing that. While we techies understand the near-farcical limitations of the autocomplete-on-steroids that is generative "AI", most people don't. They are used to, or have even been raised in, a world where if you have a question you go to the computer, ask it, and expect to receive a more or less accurate reply. In their mind, ChatGPT is no different from Google or Bing; it just returns results in a different way that takes less effort for them to understand. Preventing AI crawlers from indexing your site only makes the information on your site invisible to these large language models (LLMs), which means that the increasing number of people using them to look for information will no longer have this information available to them. You become invisible.
I am old enough (I first got online as a teen back in the early 1990s, on BBSes and then the early World Wide Web) to remember when people had the same reaction to search engines in the latter part of the 1990s. They staunchly believed that the web was best experienced through what was the status quo at the time: manually curated directories of sites. Does anyone still remember that Yahoo! stood for Yet Another Hierarchically Organized Oracle, a tongue-in-cheek description of what was at the time a link directory? Those who chose to block search engines from indexing their sites became invisible and disappeared.
There's a point to be made that LLMs can create derivative content without attribution, but you do realise that this was already possible, just slower? Someone could do a series of searches, collect information, and collate it without attribution into a document. It's not just a possibility; it's been happening for decades. The only thing LLMs change is that this kind of copyright abuse became faster, and even less accurate.
So, yeah, sure, I will give you the tools and the knowledge to block "AI" crawlers if that's what you want, but I have to make it clear I think it's worse than a lost cause; it's shooting oneself in the foot. For better or worse (and I reckon it is by far for worse) LLMs are here to stay. Instead of putting our heads in the sand we need to find ways to best exploit them, just like we did with search engines thirty years ago. It's the circle of (tech) life.
Nicholas K. Dionysopoulos
Lead Developer and Director
🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!