robots.txt

A sample CS-Cart robots.txt file:


User-agent: *
Disallow: /addons/
Disallow: /cgi-bin/
Disallow: /controllers/
Disallow: /core/
#Disallow: /images/
Disallow: /js/
Disallow: /lib/
Disallow: /payments/
Disallow: /schemas/
Disallow: /shippings/
Disallow: /skins/
Disallow: /var/
Disallow: /admin.php
Disallow: /config.php
Disallow: /config.local.php
Disallow: /init.php
Disallow: /prepare.php
Disallow: /store_closed.html
Disallow: /?currency=
Disallow: /index.php?dispatch=
Sitemap: http://www.yourdomain.com/sitemap.xml
# Comments appear after the “#” symbol at the start of a line, or after a directive. A # before Disallow (#Disallow) will deactivate the command.



[SIZE=3]About robots.txt[/SIZE]



Robots.txt is a plain text file located in the root of a website that allows the site administrator to define which content is allowed and which is disallowed for the bots that visit the site.



All major search engines like Google, Yahoo and MSN follow the Robots Exclusion Protocol. There are several things every website owner needs to understand to make crawling of their website easier. Following are the top 10 common mistakes to avoid when creating a robots.txt file.



1. Not placing robots.txt under the root directory



This is one of the most common mistakes webmasters make: they upload the robots.txt file to the wrong place. It must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for robots.txt only in the root of the domain.
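To make the location point concrete (yourdomain.com is just a placeholder and /store/ is an example subdirectory):

Valid: http://www.yourdomain.com/robots.txt

Ignored: http://www.yourdomain.com/store/robots.txt

A minimal valid file that allows everything looks like this: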



User-agent: *

Disallow:



2. Wrong syntax in robots.txt



Another common cause is that the webmaster used the wrong syntax when creating robots.txt. Therefore, always double-check the robots.txt file with a tool like Robots.txt Checker.

Here is an example:



User-agent: *

Disallow: private.html



A file or directory name must start with a leading slash character (example: /private.html).
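For completeness, the corrected version of the example above would be:

User-agent: *

Disallow: /private.html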



3. Adding a comment at the end of a line instead of at the beginning



If you wish to include comments in your robots.txt file, you should precede them with a # sign like this:


# Here are my comments about this entry.

User-agent: *

Disallow:



4. An empty robots.txt file is almost like not having one



If you have created a robots.txt file under your root directory and there is nothing in it, it is the same as not having one. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
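In other words, an empty robots.txt behaves the same as this explicit “allow everything” file:

User-agent: *

Disallow: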



5. Blocking pages that you need to get indexed



If you are blocking spider bots and pages using robots.txt, you should have a thorough understanding of the syntax being used; any mistake can cause huge problems with the spider bots.
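The classic example of this mistake is a single stray slash, which blocks your entire site from every bot:

User-agent: *

Disallow: /

So double-check each Disallow line against the pages you actually want indexed before uploading the file.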



6. URL paths are case sensitive



URL paths are often case sensitive, so be consistent with your site’s capitalization. Warning: many robots and web servers are case-sensitive, so a path written with different capitalization will not match root-level folders named private or PRIVATE.
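For example (the folder name is just an illustration), this rule:

User-agent: *

Disallow: /Private/

will not block /private/ or /PRIVATE/, because the capitalization does not match.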



7. Misspelled robots/user agent names



Spider bots will ignore misspelled User-agent names. Check your raw server log to find the User-agent name you need to block, and check UserAgentString.com for a list of User-agent names.
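For example (the misspelled name below is made up for illustration):

#This is a wrong entry - the User-agent name is misspelled, so the rule is simply ignored

User-agent: googelbot

Disallow: /private/

#The correct entry

User-agent: googlebot

Disallow: /private/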



8. Don’t add all the files in one single line



A common mistake is adding all the files and folders under one Disallow directive.

For example:



User-agent: *

Disallow: /private/ /images/ /javascript/



This is wrong syntax and robots will not understand this format. The correct syntax is given below.



User-agent: *

Disallow: /private/

Disallow: /images/

Disallow: /javascript/



9. No Allow command in the original robots.txt standard



The original standard defines only one exclusion command, Disallow:; there is no universally supported Allow: command (some crawlers, such as Googlebot, support it as an extension). So if you want the bots to visit a page, just don’t list it.
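So to keep a page crawlable, simply leave it out of the file. For example (file and folder names are placeholders):

User-agent: *

Disallow: /private/

#catalog.html is not listed, so every bot may crawl it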



10. Missing the colon



This means missing the colon in a Disallow or User-agent entry. Here is an example of an entry with a missing colon.



#This is a wrong entry

User-agent: googlebot

Disallow /



#The correct entry

User-agent: googlebot

Disallow: /

Disallow: /install/






[quote name=‘ALEXsei_’]Disallow: /install/?[/quote]

After the installation you have to delete the install folder.

"#Disallow: /images/ "



Hi, great post. I just don’t understand why this folder is commented out…

[quote name=‘gabrieluk’]"#Disallow: /images/ "



Hi, great post. I just don’t understand why this folder is commented out…[/quote]

Sometimes it’s a good idea to allow indexing of your images as well (more traffic, visitors, popularity, …), but of course only if the names of your images are optimized for your keywords and SEO attributes like alternate text (alt) and title are in place.

Indy



One thing is wrong!



As you know, robots.txt can be viewed by anyone, and you have included the admin path in it. This way, I know the admin path to your demo website:



riaditel0077.php



Am I right?

Also, the Detailed Images folder won’t be indexed, because there’s another .htaccess inside the images folder blocking it.

[QUOTE]As you know, robots.txt can be looked into by anyone and you have included admin path in it.[/QUOTE]



Right, that’s the problem with robots.txt files. But it’s still better than having the page indexed.


[quote name=‘Noman’]Also, Detailed Images folder won’t be indexed because there’s another .htaccess inside Images folder blocking it.[/quote]

Delete it, or make some changes inside.

Hi Indy,

would you mind to explain the difference between this:



Disallow: /somefolder



and this:



Disallow: /somefolder/



??

[quote name=‘gabrieluk’]Hi Indy,

would you mind to explain the difference between this:



Disallow: /somefolder



and this:



Disallow: /somefolder/



??[/quote]

  1. Disallow: /somefolder



    will disallow all files and folders which start with “somefolder” in their URL after the / (root)



    ---------------------------------------------------------


  2. Disallow: /somefolder/



    will disallow the folder (and all files inside) “somefolder” after the / (root)
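A small illustration with made-up URLs:

Disallow: /somefolder

#blocks /somefolder, /somefolder/, /somefolder/page.html and also /somefolder-sale.html

Disallow: /somefolder/

#blocks /somefolder/ and everything inside it (e.g. /somefolder/page.html), but not /somefolder-sale.html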

Thanks for the explanation… I took this from another post… Is that good?

Disallow: /index.php?dispatch=products.search

Disallow: /index.php?dispatch=wishlist.view

Disallow: /index.php?dispatch=checkout.checkout

Disallow: /index.php?dispatch=profiles.update

Disallow: /index.php?dispatch=profiles.add

Disallow: /index.php?dispatch=auth.login_form&return_url=index.php

Disallow: /index.php?dispatch=checkout.cart

=======================================

I just realized now…

Disallow: /index.php?dispatch=

this covers all the URLs

[quote name=‘gabrieluk’]I just realized now…

Disallow: /index.php?dispatch=

this covers all the URLs[/quote]

It would be OK as long as there is no URL (string) after /index.php?dispatch= that is important for you.
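For example (the dispatch string below is just an illustration), the single rule

Disallow: /index.php?dispatch=

also blocks URLs like /index.php?dispatch=pages.view&page_id=2, so make sure no page that is reachable only through a dispatch URL still needs to be indexed.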

[quote name=‘indy0077’]Right, that’s the problem with robots.txt files. But it’s still better than having the page indexed.[/QUOTE]



It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.
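For reference, a robots noindex meta tag typically looks like the line below in the page’s head section (I haven’t checked the exact tag CS-Cart outputs, so treat this as a generic example):

<meta name="robots" content="noindex, nofollow" />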

[quote name=‘adodric’]It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.[/quote]

The noindex tag can be disregarded by some crawlers, so your page could still appear on the web, e.g. in Google as well.

[quote name=‘indy0077’]The noindex tag can be disregarded by some crawlers, so your page could still appear on the web, e.g. in Google as well.[/QUOTE]



Google won’t index it, at least they shouldn’t since they even promote this meta-tag’s usage:



[URL]http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710[/URL]



Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.

[quote name=‘adodric’]Google won’t index it, at least they shouldn’t since they even promote this meta-tag’s usage:



[URL]http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710[/URL]



Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.[/quote]

[URL]http://www.seoconsultants.com/meta-tags/robots/[/URL]



Take a look at the table at the bottom. There are xx other robots which may disregard just about everything.



PLUS an excerpt from your posted link:

[quote]

To entirely prevent a page’s contents from being listed in the [COLOR=Red]Google web index[/COLOR] even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index. [/quote]

The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.



Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.



Furthermore, my link has the text “Google web index” because it is a Google page talking about the Google search results; of course they will reference themselves.

[quote name=‘adodric’]The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.



Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.



Furthermore, my link has the text “Google web index” because it is a Google page talking about the Google search results; of course they will reference themselves.[/quote]



Next link for you:



[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url]

[quote name=‘indy0077’]Next link for you:



[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url][/QUOTE]



If a page was indexed prior to you using the noindex tag, then it won’t be removed from the index, at least not immediately; perhaps over time it will fall out of the index. Instead you have to use the search engine’s page for removing the link, such as [url]https://www.google.com/webmasters/tools/removals[/url] for Google, and whatever the equivalent is for other engines. There will always be idiots out there that try to use the tag on an already indexed page and then complain about it not working, when the problem was really them, not the meta tag.



But again: the main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it. :wink:

[quote name=‘indy0077’]Next link for you:



[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url][/QUOTE]

Is this supposed to represent some widespread disregard for the meta tag? Their response suggested to me that the page had already been indexed prior to the noindex, nofollow tags being added by the site owner. And the response explained that Google would, in fact, remove it from the index the next time they visit the page, or the site owner can manually remove it.



At any rate, the following link is to an article by Vanessa Fox (who helped develop Google Webmaster Tools) discussing the Robots Exclusion Protocol. The table shows the equivalence between REP directives in robots.txt and the meta tags. There is a good bit of information for people trying to get a general grasp of REP.



[url]http://janeandrobot.com/library/managing-robots-access-to-your-website[/url]



Bob