Good vs Bad Bots

Hi everyone!

We monitor our servers' performance every day. And every single day we see a huge pile of requests made by crawling bots.

Some of them are search engines, but some just come to get the information for themselves. Sometimes they become too greedy and we have to block them. Though they do no harm deliberately, they can eat resources and bandwidth, and performance sags in the end.

The illustration is below. Not too impressive, but that is on an average day.

[attachment=13288:image.png]

We've made a list of 5 Bad Bots that waste your resources and 5 Good Bots that add value to your website.

The #1 Bad bot in our practice turned out to be MJ12Bot by Majestic.

The #1 Good bot is Googlebot by Google (no surprise here).

Read more about bots on our blog.

Why don't we create an extended list of Bad bots and put them into a single file for the sake of CS-Cart users? I know that there are millions of them, but let's share those you meet in your practice.


We'll start with Bad bots:

User-agents:

- MJ12Bot

- AhrefsBot

- SEMrushBot

- DotBot

- MauiBot

Please list five of yours.

This recent list has 1200 bad bots that you can block through htaccess:

http://tab-studio.com/en/blocking-robots-on-your-page/
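For reference, the core of such lists is a mod_rewrite rule keyed on the User-Agent header. A minimal .htaccess sketch, assuming Apache with mod_rewrite enabled (the bot names here are just the five from this thread, not the full 1200-entry list):

# Return 403 Forbidden to any request whose User-Agent matches
# one of the listed bots (NC = case-insensitive).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|SEMrushBot|DotBot|MauiBot) [NC]
RewriteRule .* - [F,L]

Blocked bots then cost you a single 403 response instead of PHP and database time.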

We are getting a lot of bot registrations specifically targeting CS-Cart because it has no protection. This seems to be done through the Xrumer spammer software, which changes its user agent and therefore cannot be blocked by user agent.

Looks like recaptcha can cope with it. Blocking all of the thousands and thousands of existing bots feels like shooting a fly with a shotgun. We would rather concentrate on the most "greedy" ones.

Recaptcha is completely broken by Xrumer.

Hello,
It looks like you are promoting this software. Possibly the next step will be a referral link? xD
As you can see, we are not talking here about spam or ways to bypass a captcha. We are talking about malicious load, and about robots that scan the internet and can make your website unavailable.

Also, I'd like to note that there is invisible ReCaptcha v3 and also custom ways like honeypots, you know. But that is a different story, not for this forum topic.

I assure you that I in no way condone this software, but I do want webmasters to be aware that there are very advanced malicious tools that can be used against CS-Cart, and CS-Cart unfortunately has almost no protection. For us this is a major problem which makes it difficult to use CS-Cart. I have opened several threads on related topics. We have just turned all commenting and reviews off because we cannot protect our sites against it.

I posted a link above that shows how to block 1200 such bots through htaccess.

We are experiencing a high load from several types of bots:

1. unwanted crawler bots

2. content scrapers

3. spam bots

4. vulnerability scanner bots.

Spam bots and vulnerability scanners often cloak the user agent. Scanners actively search for CS-Cart installations.

I posted a thread about an SQL injection method that I found in my logs after such bots used it. CS-Cart staff is aware of it.

Crawler bots are especially causing issues by crawling features and filters, thereby generating millions of cache files. Our hosting costs have skyrocketed because of it. It's also reducing our site speed significantly.
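One stopgap for the filter-crawling problem is to keep self-identified crawlers away from filter/sort URLs entirely, so they cannot multiply the cache. A hypothetical .htaccess sketch (features_hash and sort_by are assumed CS-Cart filter parameters; adjust the pattern to whatever your filter URLs actually contain):

# Deny requests that combine a filter/sort query string with a
# crawler-looking User-Agent; everything else passes through.
RewriteEngine On
RewriteCond %{QUERY_STRING} (features_hash|sort_by)= [NC]
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
RewriteRule .* - [F,L]

Note that this also keeps Googlebot off those URLs, which is usually what you want for faceted navigation anyway.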

Actually, ReCaptcha and honeypots can be used to stop malicious bots from crawling the site. It's not just for registering an account.

But there is no functionality for this.

Could you be so kind as to let me know of any honeypot functionality for CS-Cart?
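As far as I know there is none built in, but the idea is easy to sketch: disallow a trap URL in robots.txt, link to it invisibly from your pages, and treat anything that requests it as a bot, since well-behaved crawlers honor the Disallow. A rough .htaccess illustration (/bot-trap/ is a made-up path):

# robots.txt contains:  Disallow: /bot-trap/
# Only clients ignoring robots.txt ever reach this rule.
RewriteEngine On
RewriteRule ^bot-trap/ - [F,L]
# A real honeypot would also log the offending IP and feed it into a
# firewall blocklist instead of just answering 403.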

It would be nice if CS-Cart had bad bot protection similar to:

https://bad-behavior.ioerror.us/ (stops bots by analysis & fingerprinting)

https://wordpress.org/plugins/stopbadbots/

https://swissuplabs.com/magento-bot-protection.html

https://www.extendware.com/magento-bot-blocker.html

https://wordpress.org/plugins/project-honey-pot-spam-trap/

As you can see, many such applications are not just for spam bots but for all kinds of bots.

After analyzing our statistics for most hits & bandwidth by user agent, I have added several malicious user agents to the above block list (see the sketch after the list):

megaindex.ru

dotbot

mauibot
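In .htaccess terms, that addition is the same mod_rewrite pattern as in the blocklist linked above; a sketch, matching on substrings of the user agents, case-insensitively:

# Extend the blocklist with the agents named above.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (megaindex|dotbot|mauibot) [NC]
RewriteRule .* - [F,L]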

@maksim It seems that we are dealing with mostly the same bots as you are.

baiduspider

After blocking baidu, baiduspider took over. We get a lot of supply offers through Baidu, so it hurts us to block it. But CS-Cart's cache file generation, combined with such an active spider, heavily hurts site performance.

I also found one particularly malicious custom bot targeting CS-Cart from UA and RU, which is also blocked by the above htaccess blocklist.

P-Pharma

Thanks for sharing!

Anyway, I can recommend using the "Crawl-Delay" directive in the robots.txt file.
The default robots.txt with the Crawl-Delay improvement will look like this:
User-agent: *
Disallow: /app/
Disallow: /store_closed.html
Crawl-Delay: 1

User-Agent: MJ12bot
Crawl-Delay: 5

Bad bots simply ignore that.

evil bots :)

By "bad bots" I don't mean crawlers/scanners/etc., which should be banned by different methods. If you want, I can describe several of them.
Check any store from the Shopify examples (https://shopify.com/examples). They have a similar "bad bots" list in their robots.txt:
...
User-agent: Nutch
Disallow: /

User-agent: MJ12bot
Crawl-Delay: 10

User-agent: Pinterest
Crawl-delay: 1

Yes, but this is not Shopify. CS-Cart does not have bad bot protection.

MJ12bot ignores robots.txt. You need to block it at your server.

:)

# nginx: return 403 Forbidden when the User-Agent matches the blocklist
if ($http_user_agent ~* (MJ12bot|...) ) {
    return 403;
}
But blocking by a user-agent list is weak, since user agents are easily changed.
What is needed here is a different approach, like web application firewalls (WAFs) that analyze the IP (who/from where), user behavior, the type of requests, etc. Or use a ready-made WAF like https://wallarm.com/ or https://aws.amazon.com/waf/, etc. xD

Yes, WAF support would be very nice to have. But it needs CS-Cart integration or you will block valid customers.

Until then, the only thing we have is blocking at the server firewall level.

Hello

Maybe this addon will be useful for this problem with network robots.

https://marketplace.cs-cart.com/add-ons/site-management/block-network-robots.html

Best regards

Robert.

Good bots" as well as "bad bots" keep on resulting in increasing Web-traffic like never earlier; however, the second type of bots are getting more-and-more prominent.

So claims Distil Networks in its 2018 Bad Bot Report released in the current week. Among innumerable requests related to bad bots exist, possible malicious activities which fraudsters, hackers along with competitors control. Bots are as well utilized for carrying out brute-force assaults, rival data mining, account hijacks, downtime, digital ad scams, data theft and online scam.

CloudFlare is the solution.

Not only Cloudflare; I found a great solution:

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker