The .htaccess Maker in Admin Tools has this setting: 'User agents to block'. I was just wondering, is this list updated when Admin Tools is updated?
Latest post by jjst135 on Thursday, 12 September 2024 07:40 CDT
No, it is not updated automatically. The list is published in https://www.akeeba.com/documentation/admin-tools-joomla/htaccess-maker.html#basic-security. The reason is that you were never meant to use it as-is. By default, it may block genuinely useful software such as WGet, which may be used in CLI CRON jobs, your host's URL-based CRON jobs, or third party services such as WebCRON. You are supposed to figure out what you want to allow and remove it from the list.
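To illustrate the "figure out what you want to allow and remove it" step, here is a minimal Python sketch. It assumes you have copied the block list into a plain list of strings. The function name and the sample entries are illustrative only and are not taken from Admin Tools itself.

```python
def prune_blocklist(blocked, allowed):
    """Return the blocked user agents minus any entry you want to allow.

    The comparison is case-insensitive, so "WGet" in the block list is
    removed when "wget" appears in the allow list.
    """
    allowed_lower = {name.lower() for name in allowed}
    return [agent for agent in blocked if agent.lower() not in allowed_lower]


# Illustrative sample, not the real Admin Tools list.
blocked = ["WGet", "Indy Library", "ExampleBot"]
# Tools your own CRON jobs rely on (assumption for this example).
allowed = ["wget"]

pruned = prune_blocklist(blocked, allowed)
print("\n".join(pruned))
```

The pruned entries are what you would then enter in the 'User agents to block' setting.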
Also keep in mind that this feature is far less relevant today than it was 5 years ago, let alone 14 years ago when it was first introduced. If you expect to block companies building open text corpora for large language model ("AI") training, you cannot block them with this feature. These companies use deceptive User-Agent strings identical to those of contemporary browsers, and their traffic comes from different IP blocks each time. Blocking them requires active monitoring of traffic across millions of sites, which is precisely what services like CloudFlare do.
The User-Agent block is still useful for blocking traffic from sources you know you don't want to serve (for example, blocking a specific search engine from listing your site). This holds even if you're using a service like CloudFlare: you can do that kind of blocking without paying an arm and a leg to enable this simple feature in a third party service.
Nicholas K. Dionysopoulos
Lead Developer and Director
🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!
Thanks Nicholas.
I found this list as well:
https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/_generator_lists/bad-user-agents-htaccess.list
I think this one is maintained pretty well and is a bit longer. Not sure if that is better though ;-)
But as you said, maybe this won't block everything you would like to. I also noticed an OpenAI bot eating up CPU by scanning a site with a complex content filter. I manually blocked an IP range to stop this, and that worked, but another way to block this might be useful. Cloudflare might be helpful if we run into more issues. Not sure if Cloudflare can be used with our server. Or if we want to...
Anyway, back to the 'User agents to block' list: I know we have to be careful with this list. We do use cron jobs that use wget, so we always exclude that one. Hopefully all extension makers will enable us to use the Joomla cli / scheduler.
If we would like to update the list in Admin Tools or replace it with another list, how can we do this quickly? I think right now we need to remove all existing items one by one and add new items one by one?
There are several of these lists. Many user agents are similar across all lists. What I give you is a curated combination of a few lists.
As for what is better or worse, there is no such thing in absolute terms. Do old and obsolete user agents make it a bad or a good list? It depends. Let me give you an example. The Indy Library was a networking library for Delphi (a Pascal-based rapid application development solution) which was insanely popular in the aughts, to the point many bots were written in it. I still include it in our list even though it's not on other lists because idiots may still use old bots which do nothing but suck bandwidth.
About your use case with OpenAI, watch what you wish for. If you completely block them, you become invisible to ChatGPT's RAG (retrieval-augmented generation) which many people are now using as a replacement for a search engine or, worse, through Copilot (Bing uses ChatGPT with RAG to display relevant results to the user's query). I would only block LLMs from sites which have a tightly-knit community and are unlikely to receive search traffic anyway. You can, however, put filters behind a login, or use caching, or do any number of things to prevent wasting resources on third parties treating your site as a scrapbook.
Hopefully all extension makers will enable us to use the Joomla cli / scheduler.
Also watch what you wish for. Joomla Scheduled Tasks needs a “something” to trigger them. That can be the lazy scheduler, a CLI script, or a URL. The lazy scheduler is doubleplusungood as it will only ever run tasks when your site is busy, and through the web server, which applies a heck of a lot more constraints. The CLI script is the best approach, but it requires you to set up one (and only one) CLI CRON job which runs every minute; it will fail to deliver the results you wish for if you have many tasks starting at the same time, or if you have many overlapping long-running tasks. The URL needs to be fetched with… wget, or curl, or a service using one of these libraries, which brings us back to where we started. Read all about it in Akeeba Backup's documentation.
There is no substitute for good, old-fashioned CLI scripts and CRON jobs. Fifty years and countless attempts at doing it “better” have invariably failed because they were more complicated, more error-prone, more limited, or a combination thereof.
If we would like to update the list in Admin Tools or replace it with another list, how can we do this quickly? I think right now we need to remove all existing items one by one and add new items one by one?
Short of exporting and re-importing the .htaccess Maker settings, you don't really have an option.
You can always use the CLI commands to export the .htaccess Maker settings as JSON, use standard tools (e.g. awk) to replace the user-agent list, and a CLI command to import the .htaccess Maker config, and regenerate the .htaccess file.
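As a rough sketch of the "replace the user-agent list" step, the Python below edits an exported JSON document. It makes two assumptions that you must check against your own export: that the list is stored under a single top-level key (the name `user_agents_to_block` here is hypothetical) and that its value is a JSON array of strings. The export, import and regenerate steps themselves are done with the Admin Tools CLI commands and are not shown.

```python
import json


def replace_user_agents(config_text, new_agents, key):
    """Return the exported config as JSON text with the list under `key` replaced.

    Blank entries are dropped and duplicates are removed case-insensitively,
    keeping the first spelling seen. All other settings are left untouched.
    """
    config = json.loads(config_text)
    if key not in config:
        # Fail loudly instead of silently adding a key the importer ignores.
        raise KeyError(f"{key!r} not found in the exported settings")

    seen = set()
    cleaned = []
    for agent in new_agents:
        agent = agent.strip()
        if agent and agent.lower() not in seen:
            seen.add(agent.lower())
            cleaned.append(agent)

    config[key] = cleaned
    return json.dumps(config, indent=2)


# Hypothetical export; the real file has different keys and many more settings.
exported = '{"user_agents_to_block": ["WGet", "Indy Library"], "other_setting": 1}'
new_list = ["Indy Library", "", "indy library", "ExampleBot"]

updated = replace_user_agents(exported, new_list, key="user_agents_to_block")
print(updated)
```

Raising an error on a missing key is deliberate: if the key name is wrong, you find out before importing anything.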
Nicholas K. Dionysopoulos
Food for thought ;-) Thanks Nicholas.
I have been trying the Joomla scheduler on one of our sites and was thinking about the benefits and drawbacks. Your comments help with that. Since we have good access to our server and crons, maybe the Joomla scheduler is not adding much benefit. Timing of crons is always an issue, also on the server side. But the Task Scheduler does seem to add an extra layer (plugins / triggers) and indeed more possible points of error.
I think we will stick to our good old server crons for now.