robots.txt

indy0077 · March 21, 2010, 12:00am

A sample of CS-Cart robots.txt file:

User-agent: *
Disallow: /addons/
Disallow: /cgi-bin/
Disallow: /controllers/
Disallow: /core/
#Disallow: /images/
Disallow: /js/
Disallow: /lib/
Disallow: /payments/
Disallow: /schemas/
Disallow: /shippings/
Disallow: /skins/
Disallow: /var/
Disallow: /admin.php
Disallow: /config.php
Disallow: /config.local.php
Disallow: /init.php
Disallow: /prepare.php
Disallow: /store_closed.html
Disallow: /?currency=
Disallow: /index.php?dispatch=
Sitemap: http://www.yourdomain.com/sitemap.xml

# Comments appear after the “#” symbol at the start of a line, or after a directive. A # before Disallow (#Disallow) will deactivate the command.
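As a rough way to sanity-check rules like these offline, Python's standard urllib.robotparser module can parse them. The sketch below uses a shortened subset of the sample above, and the test URLs are made up for illustration. Real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened subset of the sample file; the image rule stays commented out.
SAMPLE = """\
User-agent: *
Disallow: /addons/
#Disallow: /images/
Disallow: /admin.php
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# The /admin.php rule is active, so this URL is blocked.
admin_allowed = parser.can_fetch("*", "/admin.php")
# The /images/ rule is commented out, so this (hypothetical) URL is allowed.
images_allowed = parser.can_fetch("*", "/images/logo.png")

print(admin_allowed, images_allowed)  # False True
```

The commented-out line has no effect, which is why the image URL is still reported as fetchable.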

About robots.txt

Robots.txt is a plain text file located in the root of a website. It lets the site administrator define which content is allowed and which is disallowed for the bots that visit the site.

All major search engines, such as Google, Yahoo and MSN, follow the Robots Exclusion Protocol. There are several things every website owner needs to understand to make crawling of their site easier. The following are the top 10 common mistakes to avoid when creating a robots.txt file.

1. Placing robots.txt outside the root directory

This is one of the most common mistakes webmasters make: uploading the robots.txt file to the wrong place. It must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for the robots.txt file only in the root of the domain.

User-agent: *

Disallow:
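To make the root-location rule concrete, here is a small sketch that works out where a crawler would look for robots.txt, given any page URL on a site. The helper name robots_url and the example page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yourdomain.com/shop/category/item.html"))
# http://www.yourdomain.com/robots.txt
```

A file at /shop/robots.txt would never be requested, no matter which page the bot is crawling.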

2. Wrong syntax in robots.txt

Another common problem is that the webmaster used the wrong syntax when creating the robots.txt file. Always double-check the file using a tool such as Robots.txt Checker.

Here is an example of the wrong syntax:

User-agent: *

Disallow: private.html

Every file or directory path must start with a leading slash (for example, /private.html).
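The effect of the missing slash can be checked with Python's urllib.robotparser. This is only a sketch of how one parser behaves; the helper name allowed is made up for illustration, and other crawlers may be more or less forgiving.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


wrong = "User-agent: *\nDisallow: private.html"
right = "User-agent: *\nDisallow: /private.html"

print(allowed(wrong, "/private.html"))  # True: the rule never matches
print(allowed(right, "/private.html"))  # False: the page is blocked
```

Because request paths always begin with "/", a rule without the leading slash cannot match any of them in this parser.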

3. Adding comments at the end of a line instead of at the beginning

If you wish to include comments in your robots.txt file, put them on their own line and precede them with a # sign, like this:

# Here are my comments about this entry.

User-agent: *

Disallow:

4. An empty robots.txt file is almost like not having one

If you have created a robots.txt file in your root directory and there is nothing in it, it is much the same as not having one. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
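A quick sketch with Python's urllib.robotparser shows the same thing: an empty file and an explicit "allow everything" record give the same answer. The bot name and path are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


empty_file = ""
allow_all = "User-agent: *\nDisallow:"

print(allowed(empty_file, "/any/page.html", "AnyBot"))  # True
print(allowed(allow_all, "/any/page.html", "AnyBot"))   # True
```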

5. Blocking pages that you need to get indexed

If you block spider bots and pages using robots.txt, you need a thorough understanding of the syntax. Any mistake can block pages you want indexed and cause you serious problems with the spider bots.

6. URL paths are case sensitive

URL paths are often case sensitive, so be consistent with the capitalization your site uses. WARNING! Many robots and web servers are case sensitive. For example, a rule written for a path such as /Private/ will not match root-level folders named private or PRIVATE.
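To illustrate case sensitivity, the sketch below assumes a hypothetical rule for /Private/ and checks three spellings of the folder name with Python's urllib.robotparser. Only the exact capitalization is blocked.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical rule, chosen only to show the effect of capitalization.
rules = "User-agent: *\nDisallow: /Private/"

print(allowed(rules, "/Private/a.html"))  # False: exact match, blocked
print(allowed(rules, "/private/a.html"))  # True: different case, not matched
print(allowed(rules, "/PRIVATE/a.html"))  # True: different case, not matched
```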

7. Misspelled robot/User-agent names

Spider bots will ignore misspelled User-agent names. Check your raw server logs to find the User-agent names you need to block, or see UserAgentString.com for a list of User-agent names.
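The sketch below shows what a typo in the User-agent line does, using Python's urllib.robotparser as a stand-in for a real crawler. The misspelling "googelbot" is deliberate, and the page path is made up.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


misspelled = "User-agent: googelbot\nDisallow: /"
correct = "User-agent: googlebot\nDisallow: /"

print(allowed(misspelled, "/page.html", "Googlebot"))  # True: record ignored
print(allowed(correct, "/page.html", "Googlebot"))     # False: blocked
```

The misspelled record simply never applies to the bot it was meant for, so nothing is blocked.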

8. Don’t put all the paths on a single line

A common mistake is listing all the paths under one Disallow.

For example:

User-agent: *

Disallow: /private/ /images/ /javascript/

This syntax is wrong and robots will not understand it. The correct syntax is given below.

User-agent: *

Disallow: /private/

Disallow: /images/

Disallow: /javascript/
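The difference between the two forms can be checked with Python's urllib.robotparser. This is a sketch of one parser's behaviour, with made-up file names; it treats the whole one-line value as a single path, so none of the three folders ends up blocked.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


one_line = "User-agent: *\nDisallow: /private/ /images/ /javascript/"
separate = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Disallow: /images/\n"
    "Disallow: /javascript/"
)

print(allowed(one_line, "/images/logo.png"))  # True: rule does not match
print(allowed(separate, "/images/logo.png"))  # False: blocked as intended
```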

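9. No Allow command in the original robots.txt standard

The original robots.txt standard defines only one rule, Disallow:, and has no command called Allow:. Major crawlers such as Googlebot do understand Allow: as an extension, but not every bot does. So if you want all bots to be able to visit a page, the safest approach is simply not to list it.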

10. Missing the colon

Another mistake is leaving out the colon in a Disallow or User-agent entry. Here is an example of an entry with a missing colon.

#This is a wrong entry

User-agent: googlebot

Disallow /

#The correct entry

User-agent: googlebot

Disallow: /
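A short sketch with Python's urllib.robotparser shows how easily the missing colon goes unnoticed: the broken line is skipped without any error, and the bot is left free to crawl. The page path is made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


broken = "User-agent: googlebot\nDisallow /"   # colon missing
fixed = "User-agent: googlebot\nDisallow: /"

print(allowed(broken, "/page.html", "Googlebot"))  # True: line was ignored
print(allowed(fixed, "/page.html", "Googlebot"))   # False: blocked
```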

alexsei_8 · March 24, 2010, 12:00am

Disallow: /install/

?

indy0077 · March 24, 2010, 12:00am

[quote name=‘ALEXsei_’]Disallow: /install/?[/quote]

After the installation you have to delete the install folder.

gabrieluk · March 25, 2010, 12:00am

"#Disallow: /images/ "

Hi, great post. I just don’t understand why this folder is commented out…

indy0077 · March 25, 2010, 12:00am

[quote name=‘gabrieluk’]"#Disallow: /images/ "

Hi, great post. I just don’t understand why this folder is commented out…[/quote]

Sometimes it’s a good idea to allow indexing of your images as well (more traffic, visitors, popularity, …). Of course, this only helps if your image names are optimized for your keywords, along with SEO attributes like alternate text (alt) and title.

noman · March 25, 2010, 12:00am

Indy

One thing is wrong!

As you know, robots.txt can be looked into by anyone and you have included admin path in it. This way, I know your admin path to the demo website you have

riaditel0077.php

Am I right?

noman · March 25, 2010, 12:00am

Also, Detailed Images folder won’t be indexed because there’s another .htaccess inside Images folder blocking it.

indy0077 · March 25, 2010, 12:00am

[QUOTE]As you know, robots.txt can be looked into by anyone and you have included admin path in it.[/QUOTE]

Right, that’s the problem with robots.txt files. But it’s still better than having the page get indexed.

[quote name=‘Noman’]Also, Detailed Images folder won’t be indexed because there’s another .htaccess inside Images folder blocking it.[/quote]

Delete it, or make some changes inside.

gabrieluk · March 26, 2010, 12:00am

Hi Indy,

would you mind to explain the difference between this:

Disallow: /somefolder

and this:

Disallow: /somefolder/

??

indy0077 · March 26, 2010, 12:00am

[quote name=‘gabrieluk’]Hi Indy,

would you mind to explain the difference between this:

Disallow: /somefolder

and this:

Disallow: /somefolder/

??[/quote]

Disallow: /somefolder

will disallow all files and folders which start with “somefolder” in their URL after the / (root)

---------------------------------------------------------
Disallow: /somefolder/

will disallow the folder (and all files inside) “somefolder” after the / (root)
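A small sketch with Python's urllib.robotparser makes the prefix matching visible. The paths /somefolder-old/ and /somefolder.html are hypothetical, added only to show what the rule without the trailing slash also catches.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


no_slash = "User-agent: *\nDisallow: /somefolder"
with_slash = "User-agent: *\nDisallow: /somefolder/"

for path in ("/somefolder/page.html", "/somefolder-old/page.html", "/somefolder.html"):
    # Print whether each rule variant still allows the path.
    print(path, allowed(no_slash, path), allowed(with_slash, path))
```

Both rules block /somefolder/page.html, but only the version without the trailing slash also blocks the two look-alike paths.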

gabrieluk · March 28, 2010, 12:00am

Thanks for the explanation…I took this from another post…Is that good?

Disallow: /index.php?dispatch=products.search

Disallow: /index.php?dispatch=wishlist.view

Disallow: /index.php?dispatch=checkout.checkout

Disallow: /index.php?dispatch=profiles.update

Disallow: /index.php?dispatch=profiles.add

Disallow: /index.php?dispatch=auth.login_form&return_url=index.php

Disallow: /index.php?dispatch=checkout.cart

=======================================

i just realized now…

Disallow: /index.php?dispatch=

this covers all the urls

indy0077 · March 28, 2010, 12:00am

[quote name=‘gabrieluk’]i just realized now…

Disallow: /index.php?dispatch=

this covers all the urls[/quote]

That would be OK as long as there is no URL (string) that is important to you after /index.php?dispatch=
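To see how broad that single rule is, here is a sketch using Python's urllib.robotparser. The products.view URL with product_id=1 is only an illustrative example of a page you might want indexed; check against your own store's URLs.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


rules = "User-agent: *\nDisallow: /index.php?dispatch="

print(allowed(rules, "/index.php?dispatch=checkout.cart"))                # False
print(allowed(rules, "/index.php?dispatch=products.view&product_id=1"))  # False
print(allowed(rules, "/index.php"))                                       # True
```

Every URL that starts with /index.php?dispatch= is blocked, including any you may have wanted crawled, while the plain /index.php stays open.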

adodric · March 28, 2010, 12:00am

[quote name=‘indy0077’]Right, that’s the problem of robots.txt files. But still better as the page get indexed.[/QUOTE]

It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.

indy0077 · March 28, 2010, 12:00am

[quote name=‘adodric’]It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.[/quote]

The tag noindex could be disregarded by some crawlers and your site would appear on the web and could appear e.g. in Google as well.

adodric · March 28, 2010, 12:00am

[quote name=‘indy0077’]The tag noindex could be disregarded by some crawlers and your site would appear on the web and could appear e.g. in Google as well.[/QUOTE]

Google won’t index it, at least they shouldn’t since they even promote this meta-tag’s usage:

[URL]http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710[/URL]

Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.

indy0077 · March 28, 2010, 12:00am

[quote name=‘adodric’]Google won’t index it, at least they shouldn’t since they even promote this meta-tag’s usage:

[URL]http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710[/URL]

Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.[/quote]

[URL]http://www.seoconsultants.com/meta-tags/robots/[/URL]

Take a look at the table at the bottom. There are xx other robots which might disregard everything.

PLUS an excerpt from your posted link:

[quote]

To entirely prevent a page’s contents from being listed in the [COLOR=Red]Google web index[/COLOR] even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index. [/quote]

adodric · March 28, 2010, 12:00am

The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.

Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.

Furthermore, my link has the text “Google web index” because it is a Google page talking about the Google search results, so of course they will reference themselves.

indy0077 · March 28, 2010, 12:00am

[quote name=‘adodric’]The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.

Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.

Further more, my link has the text “Google web index” because it is a Google page talking about the Google search results, of course they will reference themselves.[/quote]

Next link for you:

[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url]

adodric · March 28, 2010, 12:00am

[quote name=‘indy0077’]Next link for you:

[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url][/QUOTE]

If a page was indexed before you added the noindex tag, then the tag won’t remove it from the index, at least not immediately; perhaps over time it will fall out of the index. Instead you have to use the search engine’s removal page, such as [url]https://www.google.com/webmasters/tools/removals[/url] for Google and whatever the equivalents are for other engines. There will always be idiots out there that try to use the tag on an already indexed page and then complain about it not working, where the problem was really them, not the meta tag.

But again: the main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.

jobosales · March 28, 2010, 12:00am

[quote name=‘indy0077’]Next link for you:

[url]http://www.google.com/support/forum/p/Web%20Search/thread?tid=1bcc1ec315286385&hl=en[/url][/QUOTE]

Is this supposed to represent some widespread disregard for the meta tag? Their response suggested to me that the page had already been indexed prior to the noindex, nofollow being added by the site owner. And the response explained that Google would, in fact, remove this from the index the next time they visit the page or the site owner can manually remove it.

At any rate, the following link is to an article by Vanessa Fox (who helped develop Google Webmaster Tools) discussing the Robots Exclusion Protocol. Its table shows the equivalent REP directives in robots.txt and in the meta tag. There is a good bit of information for people trying to get a general grasp of REP.

[url]http://janeandrobot.com/library/managing-robots-access-to-your-website[/url]

Bob