Robots.txt - What Do You Recommend?

Hi,



Which exclude and include rules should we use in our robots.txt?

Here's a great article on an SEO-optimized robots.txt file: "What Is A Robots.txt File? Best Practices For Robot.txt Syntax" (Moz)

This issue has been discussed in several different threads over time. I suggest searching the forum for robots.txt.


hi,



this post may help



best regards,

WSA team

Note that if your site is properly configured (i.e. directory indexing is turned off in Apache), standard directories that are not referenced by HTML anchor tags (links) will not be traversed by a robot, since it has no way to discover them. So directories like var and app do not need to be listed in robots.txt. You should rely on Apache to restrict access to your site to index.php, admin.php, vendor.php (in the case of Multi-Vendor) and api.php. Unless you want your images directory published, you should not allow access to it either.



As an example, when someone requests /sitemap.xml, the actual sitemap is stored in the var directory. The application handles the request and returns the data, even though no /sitemap.xml file exists at that path.



Robots.txt only comes into play when a robot encounters a link: the robot consults it to see whether it has been asked not to go there. Also note that robots.txt is purely advisory and compliance with its directives is voluntary, so some robots will ignore the file completely. So DO NOT USE robots.txt FOR SECURITY. It is not a way to secure your data.
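To make the "advisory" point concrete, here is a minimal sketch in Python of what a robot does when it meets anchor tags. The page, the rules, the `ExampleBot` name and the `polite` switch are all made up for illustration; the only real pieces are the standard-library `html.parser` and `urllib.robotparser` modules. Notice that the check lives entirely in the robot's own code, so a robot that skips it is not stopped by anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_to_visit(html, base_url, robots_txt, agent="ExampleBot", polite=True):
    """Return the linked URLs a robot would go on to request."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    collector = LinkCollector()
    collector.feed(html)
    urls = [urljoin(base_url, href) for href in collector.links]

    if not polite:
        # A badly behaved robot simply skips the check; nothing enforces it.
        return urls
    return [url for url in urls if rules.can_fetch(agent, url)]


# Hypothetical page and rules, for illustration only.
PAGE = '<a href="/products/widget">Widget</a> <a href="/login">Log in</a>'
RULES = "User-agent: *\nDisallow: /login\n"

print(links_to_visit(PAGE, "https://example.com/", RULES))
print(links_to_visit(PAGE, "https://example.com/", RULES, polite=False))
```

The polite call drops the /login link, while the impolite one returns both links. That is why anything sensitive has to be protected by the web server or the application, not by robots.txt.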



The goal is for robots to have access only to the areas of your site that are reachable through the application itself (i.e. returned via HTML). Robots.txt should be used to tell robots you don't want them following certain links, either because they would create duplicate content or because those areas are not intended for general public access (login pages, download links, search pages, etc.).
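As a rough illustration of that kind of rule set, the sketch below feeds a small robots.txt to Python's standard `urllib.robotparser` and checks a few paths against it. The paths (/login, /search, /downloads/) and the `ExampleBot` name are placeholders, not the URLs your store actually uses, so treat this as a way to test your own rules and not as a recommended file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths; substitute the URLs your own store actually uses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /search
Disallow: /downloads/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Ask whether a compliant robot may request the given path."""
    return parser.can_fetch(agent, "https://example.com" + path)


parser = build_parser(ROBOTS_TXT)
for path in ("/", "/products/widget", "/login", "/search", "/downloads/file.zip"):
    verdict = "allowed" if is_allowed(parser, path) else "disallowed"
    print(path, verdict)
```

The standard-library parser does simple prefix matching and does not understand the wildcard extensions some search engines support, so real crawlers may interpret more elaborate rules differently.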

[quote name='tbirnseth' timestamp='1436646328' post='222579']

Unless you want your images directory published, you should not allow access to it either.

[/quote]



Why would one not want the images directory published? I took pains to give all my main image files natural-language names composed of the product name, main category, and several features. The result is a highly descriptive, readable, information-rich filename, which I hoped would help placement on Google Images.

You're confusing what is visible from a page with what is indexed separately. Publishing the directory itself is how your images get stolen. I'm not sure your image naming affects your SEO; if it does at all, the effect is slight. The only way Google is going to see your image names is via the HTML of your pages.