Thanks for the clarification. I'll need to research banning user agents. The point of using something like Cleartalk is to have things done automatically, blocking spam based on known IP addresses. I don't think it deals with user agents. It looks like the .htaccess file needs editing to ban user agents, but I'm guessing that the user agents of illegal site crawlers change frequently?
Who or what determines the user agent name a crawler uses? For example, what prevents an illegal site crawler from pretending to be a Mozilla browser?
I guess that's why you suggest detection based on behavior, i.e. looking for recurring IP addresses at certain intervals.
So some sort of automated process is needed.
"I guess that's why you suggest detection based on behavior": yes, we mainly use a machine learning model on SageMaker for this. It compares average traffic with bot traffic and flags requests accordingly. Flagged traffic then gets a hard rate limit rather than an outright block, so scraping is still possible, which covers the rare event that we wrongly flag a customer. We simply take the hit, since our application scales automatically anyway (using Kubernetes).
Regarding user agents: the reason people get banned for using the default user agent is that any request made through Python gets a Python user agent by default. The same applies to curl and most other open source tools. Blocking those defaults should already signal to companies that they should not index your domain, as they will get an error (which will certainly puzzle their developers).
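To make the default user agent check concrete, here is a minimal sketch in Python. The prefix list and the function name `is_default_tool_agent` are my own choices for illustration, not something from a particular product; the prefixes are the defaults these tools are known to send (for example `python-requests/…`, `Python-urllib/…`, `curl/…`).

```python
# Prefixes sent by common HTTP tools when the caller does not set a user agent.
# The list is an illustrative assumption; extend it from what your own logs show.
DEFAULT_AGENT_PREFIXES = (
    "python-requests/",
    "python-urllib/",
    "curl/",
    "wget/",
    "go-http-client/",
)


def is_default_tool_agent(user_agent):
    """Return True when the header is missing or looks like a library default."""
    if not user_agent:
        return True
    # str.startswith accepts a tuple and matches any of its entries.
    return user_agent.strip().lower().startswith(DEFAULT_AGENT_PREFIXES)


if __name__ == "__main__":
    for agent in ("python-requests/2.31.0", "curl/8.4.0", "Mozilla/5.0 (X11; Linux x86_64)"):
        print(agent, "->", is_default_tool_agent(agent))
```

Keep in mind that the user agent is just a header the client fills in itself, so a crawler can send a browser string and pass this check. It only catches tools left on their defaults, which is why the behavior-based checks still matter.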
The next mitigation is to check your logs every now and then for IP addresses that recur at set intervals. If these do use a distinct user agent, you can try sending the operator a message (a reverse IP lookup gets you their domain most of the time).
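The log check can be automated roughly like this. It is only a sketch: it assumes you have already parsed your access log into (IP, Unix timestamp) pairs, and the function name `find_periodic_ips` and the thresholds `min_hits` and `max_jitter` are hypothetical values to tune, not recommendations.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_periodic_ips(requests, min_hits=5, max_jitter=0.1):
    """Return {ip: average_gap_seconds} for IPs that hit at near-constant intervals.

    requests: iterable of (ip, unix_timestamp) pairs parsed from the access log.
    max_jitter: allowed ratio of the gaps' standard deviation to their mean.
    """
    stamps_by_ip = defaultdict(list)
    for ip, timestamp in requests:
        stamps_by_ip[ip].append(timestamp)

    periodic = {}
    for ip, stamps in stamps_by_ip.items():
        if len(stamps) < min_hits:
            continue  # too few requests to judge
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        average_gap = mean(gaps)
        # Low relative spread means the requests arrive on a schedule.
        if average_gap > 0 and pstdev(gaps) / average_gap <= max_jitter:
            periodic[ip] = average_gap
    return periodic


if __name__ == "__main__":
    sample = [("203.0.113.7", 1000 + 60 * i) for i in range(6)]
    print(find_periodic_ips(sample))
```

For the reverse lookup on a flagged address, Python's standard `socket.gethostbyaddr` is one option, though many addresses have no useful reverse record.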
If none of that works, you can go nuclear and ban their IP ranges, and even send them legal warnings that indexing your domain is against your terms of service. But do make sure that you have some kind of 'Fair Use' policy, as you are otherwise left in the dust: https://resources.di...-the-word-is-is
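If you end up banning ranges in application code rather than at the firewall or in .htaccess, a small sketch with Python's standard `ipaddress` module could look like this. The networks below are reserved documentation ranges used as placeholders, and `BANNED_NETWORKS` and `is_banned` are hypothetical names.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); replace with the ranges you ban.
BANNED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "2001:db8::/32")
]


def is_banned(ip_text):
    """Return True when the address falls inside any banned network."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a valid IP address, nothing to match
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(address in network for network in BANNED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "198.51.100.1", "2001:db8::1"):
        print(candidate, "->", is_banned(candidate))
```

Banning a whole range can also hit legitimate visitors who share it, so it is worth keeping the list short and reviewing it now and then.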
PoppedWeb | sales@poppedweb.com |
https://poppedweb.com
TurnKey Website Design | Add-Ons | Performance Audits | Dedicated Server Management
24/7 Support | Response within an hour (during working hours).