Bots Are Killing My Bandwidth

See attachment. Bingbot and "empty user agent string" bots are killing my site. I've noticed numerous times this month that my server is under high load, which makes the site slow. My host isn't much help (they normally are great) as they are suggesting just upgrading my server to see if that helps. I don't think I should need to upgrade a server because of bot traffic, if the bots are legit. Just need to figure out the problem. IPs from the bingbot appear to be legit. Any input would be appreciated.

botuse.jpg

You need to find out what the bots are accessing. My guess is that the bots that are not legit are running multiple search queries, which require a lot of resources. I am facing the same issue right now.

They are hitting my homepage (/public_html/shop/index.php). The only thing I can think of is I don't have a root index page so I forward that to my /shop/index page as my index. Could be caught in some loop? Though that would be odd as I have had this forwarding going on for years now and have only recently seen the issue.

Just accessing the homepage shouldn't be resource intensive but I don't know what it contains. Are you sure you don't have a lot of queries like the following in your logs?

/?subcats=Y&pcode_from_q=Y&pshort=Y&pfull=Y&pname=Y&pkeywords=Y&search_performed=Y&q=

Well my (parent) site error log is clean. No errors since 2016. Now the "shop" error log is jam packed. See http://forum.cs-cart.com/topic/51769-error-log-is-riddle-with-fnfsphp-errors/#entry299119 But I think I got that resolved.

No, don't see anything like that. High server load again tonight, so nothing I have done has fixed it yet. I even set up an account with CloudFlare to see if that would help, but it did not.

Please post your robots.txt file. But note that this is an advisory file and only well-behaved robots will follow its directives.

They don't have much in them. Root robots.txt:

User-agent: *

Disallow: /shopOLD/
Disallow: /shopNEW/
Disallow: /dev/
Disallow: /shop/stick-figure-decals/
Disallow: /links/
Disallow: /cgi-bin/quikstore.cgi?

While my /shop/ one is:

User-agent: *

Disallow: /app/
Disallow: /store_closed.html
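
For anyone who wants to check which URLs a set of rules like these actually covers, here is a minimal sketch using Python's standard-library robots.txt parser. The rule lines are copied from the /shop/ file above; the two sample paths and the `allowed` helper are only illustrations. This shows how Python's parser reads the rules, which may not match every crawler's interpretation, and it says nothing about bots that ignore robots.txt altogether.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the /shop/ robots.txt quoted above.
RULES = [
    "User-agent: *",
    "Disallow: /app/",
    "Disallow: /store_closed.html",
]

parser = RobotFileParser()
parser.parse(RULES)


def allowed(path):
    """Return True if a generic crawler ('*') may fetch the given path."""
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    # Hypothetical sample paths: one under /app/, one storefront search URL.
    for path in ("/app/addons/x.php",
                 "/shop/index.php?dispatch=products.search&q=canada"):
        print(path, "->", "allowed" if allowed(path) else "disallowed")
```

With only these two Disallow lines, nothing tells a well-behaved crawler to stay away from search URLs.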

The 'shop' file is correct.

Without more details on the navigation path of one of the bots (from your access.log) it is hard to identify what is wrong.

See if this helps:

Most of the hits I can see are to the /shop/ part of the site. See below:
--
141.101.76.45 - - [30/Mar/2018:12:04:53 +0000] "GET /shop/aftermarket-logos/b-logos/bridgestone-decals/?items_per_page=20 HTTP/1.1" 200 28491 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
141.101.76.45 - - [30/Mar/2018:12:04:53 +0000] "GET /shop/index.php?dispatch=products.search&pname=Y&pshort=Y&pfull=Y&pkeywords=Y&subcats=Y&search_performed=Y&q=canada&sort_by=price&sort_order=asc&layout=short_list&currency=EUR HTTP/1.1" 200 28652 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
173.245.54.48 - - [30/Mar/2018:12:04:54 +0000] "GET /shop/suzuki-decals/ HTTP/1.1" 200 31709 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
46.229.168.66 - - [30/Mar/2018:12:04:53 +0000] "GET /shop/index.php?dispatch=products.search&items_per_page=80&layout=short_list&page=3&pfull=Y&pkeywords=Y&pname=Y&pshort=Y&q=Subaru&search_performed=Y&sort_by=timestamp&sort_order=asc&type=extended HTTP/1.1" 200 37276 "-" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"
172.68.215.52 - - [30/Mar/2018:12:04:55 +0000] "GET /shop/Heartagram HTTP/1.1" 200 25930 "-" "Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)"
46.229.168.68 - - [30/Mar/2018:12:04:56 +0000] "GET /shop/other-motorcycle-atv-decals/cannondale-decals/cannondale-f3-decal-sticker.html?currency=EUR HTTP/1.1" 200 40005 "-" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)"

I appreciate everyone's help. This is a little over my head and it seems my host just wants me to block access to legitimate bots via robots.txt to make it stop.

Take a look at the search link on your home page. It should carry a rel="nofollow" attribute, which is supposed to direct robots NOT to follow that link. The log lines above show that searches are being performed on your site. Notice that the second search includes page=3, which indicates that the bot is continuing to follow the pagination links on the results page.
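
To see how much of the traffic is searches and which agents are sending them, something like the following could be run over the access log. It is a rough sketch that assumes the combined log format shown in the excerpt above. Treating `dispatch=products.search` or `search_performed=Y` as the sign of a search is my assumption based on the URLs in this thread, and the function names and the `access.log` path are hypothetical.

```python
import re
import sys
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Combined log format: ip, identity, user, [time], "request", status, size,
# "referer", "user agent".
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def is_search(url):
    """Assumption: a search request carries one of these query parameters."""
    query = parse_qs(urlsplit(url).query)
    return ("products.search" in query.get("dispatch", [])
            or "Y" in query.get("search_performed", []))


def search_hits_by_agent(lines):
    """Count search requests per user agent; an empty agent becomes '(empty)'."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and is_search(match.group("url")):
            counts[match.group("agent") or "(empty)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python search_hits.py access.log
    path = sys.argv[1] if len(sys.argv) > 1 else "access.log"
    with open(path, encoding="utf-8", errors="replace") as handle:
        for agent, hits in search_hits_by_agent(handle).most_common(20):
            print(hits, agent)
```

If a handful of agents account for most of the search hits, those are the ones worth blocking or rate limiting first.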

As I have been saying, these are not legitimate bots. It looks like some are coming from Cloudflare in other countries and others from data centers/web hosting.

Are you using CloudFlare?

Then you'll either have to block them by IP or otherwise find something in the USER_AGENT where you can detect them.
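
As a rough illustration of that kind of check (not CS-Cart code, just the decision logic), a filter could look like the sketch below. The marker list and the `should_block` name are hypothetical. AhrefsBot and SemrushBot are used only because they appear in the log excerpt in this thread, and whether to block them is the site owner's call. Treating an empty or "-" agent as blockable is also an assumption.

```python
# Assumption: example substrings to match against the lowercased user agent.
BLOCKED_AGENT_MARKERS = ("ahrefsbot", "semrushbot")


def should_block(user_agent, remote_ip, blocked_ips=frozenset()):
    """Decide whether a request should be refused.

    Blocks when the IP is on the block list, when the user agent is
    empty (or the log placeholder '-'), or when it contains a marker.
    """
    if remote_ip in blocked_ips:
        return True
    agent = (user_agent or "").strip().lower()
    if not agent or agent == "-":
        return True
    return any(marker in agent for marker in BLOCKED_AGENT_MARKERS)


if __name__ == "__main__":
    print(should_block("", "203.0.113.7"))
    print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.7"))
```

The same logic can usually be expressed in the web server or firewall configuration instead, which avoids starting PHP at all for blocked requests.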

Yes, I am using CF but only after the problems started. My host recommended it (free with their VPS). So I don't think that is the problem.

I suppose I can block some IPs and will try the nofollow option too.

Had some time to look at this further. What I found is that several of the IPs listed cannot be found in a reverse lookup. Is that normal? I also use the WG Suggestive Search add-on (not sure if that would cause any issues, but I have been using it for over a year now). There is no "nofollow" attribute that I can see when running a search. Not sure how to fix that.
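
One possible reason for the failed reverse lookups: with Cloudflare in front of the site, the address in the access log can be the Cloudflare proxy rather than the real visitor, unless the server is set up to restore the visitor IP (Cloudflare passes it in the CF-Connecting-IP header). The sketch below checks whether a logged IP falls inside a Cloudflare range. The three ranges are an assumption: a small subset of the list Cloudflare publishes, so check the current list before relying on it. The function name is hypothetical.

```python
import ipaddress

# Assumption: a subset of Cloudflare's published IPv4 ranges; verify against
# the current official list before using this for anything important.
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("141.101.64.0/18", "173.245.48.0/20", "172.64.0.0/13")
]


def is_cloudflare_proxy(ip):
    """Return True if the address falls inside one of the listed ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in CLOUDFLARE_RANGES)


if __name__ == "__main__":
    # IPs taken from the log excerpt posted earlier in this thread.
    for ip in ("141.101.76.45", "173.245.54.48", "172.68.215.52", "46.229.168.66"):
        print(ip, "Cloudflare proxy" if is_cloudflare_proxy(ip) else "not in listed ranges")
```

If most of the logged addresses turn out to be Cloudflare's, blocking by IP at the server will not work until the real visitor address is being logged.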

I should note that my site has about 12,000 product pages and several hundred other information pages. Final bandwidth for March: Bingbot used 47GB and "unknown" used 225GB, which is about 50% more than in the past 2 months and has been increasing little by little over the last few months. "bingbot" didn't even register on my bandwidth for March last year, and "unknown" used about 43GB in March last year.

It's not about how much a bot crawls but what the bots are crawling. Bing bots crawl about 10,000 pages a day between my 2 sites on the same server with very little resource usage and they are all pages and not searches. Searches require a lot of resources to find what the user/bot is requesting. If you are getting hit with simultaneous searches, it will bring your server to a halt.

I don't use CloudFlare, but my bet is that you can configure what it allows bots to crawl and what it does not. However, the bots that are not coming through CloudFlare cannot be stopped so easily.

Here is my post about fighting them off when they hit a couple of weeks ago. http://forum.cs-cart.com/topic/51716-search-abuse-subcats/

Hello

We have prepared an add-on for blocking robots.

https://cs-cart.pl/moduly/narzedzia/block-network-robots/

Best regards

Robert.