Possible fix for CPU usage issues

My server has been experiencing intermittent CPU-related performance issues for a while now, and I may have finally found the problem and solution. (My site runs version 2.2.4 on a VPS with a 3.4 GHz CPU and 2.3 GB of RAM.)



THE PROBLEM: After identifying periods of excessive load (100% CPU usage) and reviewing the access logs for those times, a pattern emerged: during each of those periods, robots (crawlers) were crawling URLs containing “features_hash” (generated by CS-Cart's “Filters” feature).
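For anyone wanting to confirm the same pattern on their own server, a quick tally like the following shows which user agents are hitting “features_hash” URLs. This is just a sketch: the combined log format is assumed, and the inline sample lines stand in for a real access log file (e.g. one read from your web server's log directory).

```python
# Sketch: tally crawler user agents requesting features_hash URLs.
# Combined log format is assumed; the sample lines below stand in
# for a real access log file on your server.
from collections import Counter

sample_log = [
    '1.2.3.4 - - [15/May/2013:10:00:01 +0000] "GET /category/?features_hash=V123 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [15/May/2013:10:00:02 +0000] "GET /category/?features_hash=V456 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [15/May/2013:10:00:03 +0000] "GET /index.php HTTP/1.1" 200 512 "-" "bingbot/2.0"',
]

agents = Counter(
    line.split('"')[5]          # user-agent is the 6th quote-delimited field
    for line in sample_log
    if "features_hash" in line
)

for agent, hits in agents.most_common():
    print(hits, agent)
```

If one or two crawlers dominate the count during your high-load windows, that's a strong hint the filter URLs are the culprit.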



As many of you know, having several Filters, each with several options, produces a very large number of possible URL combinations. Even with caching turned on, these “features_hash” pages seem to be more processor-intensive than standard category and product URLs. (Maybe someone could test this theory and do some benchmarking?) Even if they're not more processor-intensive, the potential volume of pages is huge, and the value of having them indexed is (for me at least) very small.



THE SOLUTION: Change the robots.txt file to include the following:


User-agent: *
Disallow: /*?




or



User-agent: *
Disallow: /*features_hash




The first option disallows crawling of all URLs which contain a question mark. Because my site has SEO-friendly links, I've chosen that option, as I'm happy to keep all non-SEO-friendly URLs out of the crawl. The second option is more specific, and only disallows crawling of “features_hash” pages.
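For anyone curious how those two patterns behave, here is a simplified model of how Googlebot-style crawlers match Disallow rules: “*” matches any run of characters, and a rule matches as a prefix of the URL path. The `blocked` helper is my own illustration, not a real parser (note that Python's built-in urllib.robotparser does not handle wildcards this way).

```python
# Simplified sketch of Googlebot-style wildcard matching for Disallow
# rules: '*' matches any characters, and the rule is a prefix match.
# This is an illustration of the behavior, not an official parser.
import re

def blocked(url_path: str, pattern: str) -> bool:
    # Translate the robots.txt pattern into a regex anchored at the start.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, url_path) is not None

# First option (Disallow: /*?) blocks any URL containing '?'.
assert blocked("/category/?features_hash=V123", "/*?")
# SEO-friendly URLs without a query string stay crawlable.
assert not blocked("/nice-seo-url/", "/*?")

# Second option (Disallow: /*features_hash) blocks only filter pages.
assert blocked("/category/?features_hash=V123", "/*features_hash")
assert not blocked("/category/?sort_by=price", "/*features_hash")
```

As the last assertion shows, the second option leaves other query-string URLs (sorting, pagination, etc.) crawlable, which is why it's the safer choice if you rely on any non-SEO-friendly links.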



For my site, the change was immediate. I understand that not all robots follow the robots.txt protocol, but most of my robot traffic now avoids the CPU-intensive indexing of unwanted pages.



I would appreciate any confirmation (or refutation) of this theory and fix.



Important note on the robots.txt file: be sure to modify the robots.txt file which resides in your site's root web directory (www.yoursite.com/robots.txt). I've seen some posters state (incorrectly) that you can install CS-Cart in a subdirectory (www.yoursite.com/yourfolder) and still expect the standard robots.txt installed by CS-Cart to be found and used. If you doubt this, search your server logs for /yourfolder/robots.txt; crawlers never request it.
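The reason is that crawlers derive the robots.txt location from the host alone, never from the page's directory. A tiny sketch makes the point (`robots_url` is my own hypothetical helper, and www.yoursite.com is a placeholder domain):

```python
# Crawlers derive the robots.txt location from the scheme and host
# alone, so a store installed in a subdirectory is still governed by
# the root file. robots_url is an illustrative helper, not a real API.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.yoursite.com/yourfolder/category/?features_hash=V123"))
# prints http://www.yoursite.com/robots.txt -- never /yourfolder/robots.txt
```

A separate subdomain (shop.yoursite.com) is a different host, which is why a subdomain can have its own robots.txt while a plain subdirectory cannot.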





cheers,

Don

Nothing too new, really. I've been explaining for years how to cut down on unnecessary crawling/indexing by adding rules to robots.txt, but apparently most people don't read and/or heed the advice.


[quote name='tdonj' timestamp='1368738481' post='161910']I've seen some posters state (incorrectly) that you can install CS-Cart in a subdirectory (www.yoursite.com/yourfolder) and still expect the standard robots.txt installed by CS-Cart to be found and used. If you doubt this, search your server logs for /yourfolder/robots.txt[/quote]



This is only true if you just create a subdirectory. If you create a subdomain instead, each subdomain can use its own robots.txt file.