User-agent: *
Disallow: /addons/
Disallow: /cgi-bin/
Disallow: /controllers/
Disallow: /core/
#Disallow: /images/
Disallow: /js/
Disallow: /lib/
Disallow: /payments/
Disallow: /schemas/
Disallow: /shippings/
Disallow: /skins/
Disallow: /var/
Disallow: /admin.php
Disallow: /config.php
Disallow: /config.local.php
Disallow: /init.php
Disallow: /prepare.php
Disallow: /store_closed.html
Disallow: /?currency=
Disallow: /index.php?dispatch=
Sitemap: http://www.yourdomain.com/sitemap.xml
# Comments appear after the “#” symbol at the start of a line, or after a directive. A # before Disallow (#Disallow) will deactivate the command.
[SIZE=3] About robots.txt[/SIZE]
Robots.txt is a plain text file located in the root of a web server. It lets the administrator of a website define which content is allowed and which is disallowed for the bots that visit the site.
All major search engines, such as Google, Yahoo and MSN, follow the Robots Exclusion Protocol. There are several points every website owner needs to understand to make crawling of their website easier. The following are the top 10 common mistakes to avoid when creating a robots.txt file.
1. Adding robots.txt not under the root directory
This is one of the most common mistakes webmasters make: uploading the robots.txt file to the wrong place. It must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, since bots check for robots.txt only in the root of the domain.
User-agent: *
Disallow:
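Because bots only look in the root, the one address they will request can be worked out from any page URL. A minimal Python sketch; the helper name robots_url_for is made up for illustration and is not part of any library:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the only robots.txt address a crawler would check for this page."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://www.yourdomain.com/skins/index.php?dispatch=x"))
```

Whatever directory the page lives in, the result points at the root of the host, so a copy of the file stored under /skins/ would never be requested.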
2. Wrong syntax in robots.txt
Another common cause of problems is wrong syntax in the robots.txt file. Always double-check the file using a tool such as Robots.txt Checker.
Here is an example of a wrong entry:
User-agent: *
Disallow: private.html
Start every file or directory name with a leading slash character (example: /private.html).
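A quick way to catch this before uploading is to scan the file for Disallow values that lack the leading slash. This is only a rough sketch: the function name paths_missing_slash is an assumption, and it does not try to handle wildcard patterns that some crawlers accept.

```python
def paths_missing_slash(robots_txt):
    """Return Disallow values that do not start with '/'."""
    bad = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        value = value.strip()
        is_disallow = sep and field.strip().lower() == "disallow"
        # An empty Disallow value is valid (it allows everything), so skip it.
        if is_disallow and value and not value.startswith("/"):
            bad.append(value)
    return bad


sample = "User-agent: *\nDisallow: private.html\nDisallow: /private/\nDisallow:\n"
print(paths_missing_slash(sample))
```

For the sample above, only private.html is reported.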
3. Adding comment at the end of the sentence instead of at the beginning
If you wish to include comments in your robots.txt file, you should precede them with a # sign, like this:
# Here are my comments about this entry.
User-agent: *
Disallow:
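To see why a # in front of a directive switches it off, here is roughly how a parser might drop comments before reading a line. The helper strip_comment is a hypothetical name, and real crawlers may differ in detail:

```python
def strip_comment(line):
    """Return the directive part of a robots.txt line, without any comment."""
    # Everything from the first '#' to the end of the line is a comment.
    return line.split("#", 1)[0].strip()


for raw in (
    "# Here are my comments about this entry.",
    "Disallow: /images/  # comment after a directive",
    "#Disallow: /images/",
):
    print(repr(strip_comment(raw)))
```

The first and third lines come back empty, so nothing is left for the bot to act on. Only the second line still carries a directive.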
4. An empty robots.txt file is almost like not having one
If you have created a robots.txt file in your root directory and there is nothing in it, it is much like not having one. Because nothing is disallowed and no User-agent is given, everything is allowed for every bot.
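This can be tried out with Python's standard urllib.robotparser module. The sketch below only shows how that one parser behaves, and the wrapper name allows is an assumption made for this example:

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, path, agent="*"):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allows("", "/var/"))                                # empty file
print(allows("User-agent: *\nDisallow:", "/var/"))        # explicit allow-all
print(allows("User-agent: *\nDisallow: /var/", "/var/"))  # real rule
```

The empty file and the explicit allow-all entry give the same answer. Only the third file keeps the bot out of /var/.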
5. Blocking the pages which you need to get indexed
If you are blocking spiders and pages using robots.txt, you should have a thorough understanding of the syntax. Any mistake can block pages you want indexed and cause you serious problems with the spiders.
6. URL paths are case sensitive
URL paths are often case sensitive, so be consistent with the capitalization used on your site. WARNING! Many robots and web servers are case sensitive. A rule for a mixed-case path such as /Private/ will not match root-level folders named private or PRIVATE.
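A small check with Python's standard urllib.robotparser shows the effect. The path /Private/ is a made-up example, and other parsers or crawlers may not behave exactly the same way:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# "/Private/" is a hypothetical mixed-case path used only for illustration.
parser.parse(["User-agent: *", "Disallow: /Private/"])

for path in ("/Private/a.html", "/private/a.html", "/PRIVATE/a.html"):
    # can_fetch() returns True when the path is NOT blocked.
    print(path, parser.can_fetch("*", path))
```

Only the path with exactly the same capitalization as the rule is blocked. The other two stay open to the bot.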
7. Misspelled robots/user agent names
Spiders will ignore misspelled User-agent names. Check your raw server logs to find the User-agent names you need to block, or see UserAgentString.com for a list of user agent names.
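Here is a small sketch of what a typo does, using Python's standard urllib.robotparser as a stand-in for a real crawler. The helper blocked and the misspelling Googelbot are made up for this example:

```python
from urllib.robotparser import RobotFileParser


def blocked(agent_in_file, crawler="Googlebot", path="/var/"):
    """Report whether `crawler` is kept out of `path` by a rule for `agent_in_file`."""
    parser = RobotFileParser()
    parser.parse([f"User-agent: {agent_in_file}", f"Disallow: {path}"])
    return not parser.can_fetch(crawler, path)


print(blocked("Googlebot"))  # correctly spelled name
print(blocked("Googelbot"))  # misspelled name (hypothetical typo)
```

With the misspelled name the rule never applies to the crawler you meant, so the path stays open to it.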
8. Don’t add all the files in one single line
A common mistake is adding all the files under one Disallow line.
For example:
User-agent: *
Disallow: /private/ /images/ /javascript/
This syntax is wrong and robots will not understand it. The correct syntax is given below.
User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /javascript/
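If you already have a line like the wrong one above, it can be split mechanically. A minimal sketch, assuming paths contain no spaces; the function name split_disallow is made up for this example:

```python
def split_disallow(line):
    """Turn 'Disallow: /a/ /b/' into one Disallow line per path."""
    field, _, value = line.partition(":")
    # Whitespace separates the paths that were wrongly put on one line.
    return [f"{field.strip()}: {path}" for path in value.split()]


for fixed in split_disallow("Disallow: /private/ /images/ /javascript/"):
    print(fixed)
```

This prints one Disallow line per directory, matching the correct syntax shown above.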
9. No allow command in robots.txt
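The original Robots Exclusion Protocol defines only one command, Disallow:, and no command called Allow:. Some major crawlers, Googlebot among them, accept Allow: as an extension, but not every bot understands it. So if you want to allow bots to visit a page, the safest approach is simply not to list it.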
10. Missing the colon
Leaving out the colon in a Disallow or User-agent entry is another common mistake. Every field name must be followed by a colon.
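A rough check for this can be scripted. The sample lines below are invented for illustration, since no real broken entry is shown here, and the names KNOWN_FIELDS and lines_missing_colon are assumptions:

```python
KNOWN_FIELDS = ("user-agent", "disallow", "sitemap")


def lines_missing_colon(robots_txt):
    """Return (line_number, text) for directive lines that lack a colon."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line or ":" in line:
            continue
        # Flag only lines that start like a known directive.
        if line.lower().startswith(KNOWN_FIELDS):
            problems.append((number, line))
    return problems


sample = "User-agent *\nDisallow /var/\nDisallow: /core/\n"
print(lines_missing_colon(sample))
```

In the sample, the first two lines are reported and the third, which has its colon, is not.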
[quote]Hi, great post. I just don’t understand why this folder is commented out…[/quote]
Sometimes it’s a good idea to allow indexing of your images as well (more traffic, visitors, popularity, …). Of course, this only helps if the names of your images are optimized for your keywords and SEO, along with the alternate text (alt) and title.
As you know, robots.txt can be read by anyone, and you have included the admin path in it. This way, I know the admin path to the demo website you have.
[quote name=‘indy0077’]Right, that’s the problem with robots.txt files. But it is still better than the page getting indexed.[/QUOTE]
It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.
[quote name=‘adodric’]It won’t be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.[/quote]
The tag noindex could be disregarded by some crawlers and your site would appear on the web and could appear e.g. in Google as well.
[quote name=‘indy0077’]The tag noindex could be disregarded by some crawlers and your site would appear on the web and could appear e.g. in Google as well.[/QUOTE]
Google won’t index it, at least they shouldn’t since they even promote this meta-tag’s usage:
Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.
[quote]Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can’t be found to index then it won’t be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.[/quote]
Take a look at the table on the bottom. There are xx other robots which would disregard maybe everything.
PLUS an excerpt from your posted link:
[quote]
To entirely prevent a page’s contents from being listed in the [COLOR=Red]Google web index[/COLOR] even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index. [/quote]
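To make the quoted point concrete: a bot has to fetch and read the page before it can notice the tag. Below is a rough sketch of such a check using Python's standard html.parser. The class and function names are invented for this example, and it says nothing about how Googlebot itself is implemented:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```

A bot that is kept away from the page by robots.txt never runs a check like this, because it never downloads the HTML.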
The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.
Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.
Furthermore, my link has the text “Google web index” because it is a Google page talking about the Google search results; of course they will reference themselves.
[quote name=‘adodric’]The main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.
Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the “noindex” meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.
Furthermore, my link has the text “Google web index” because it is a Google page talking about the Google search results; of course they will reference themselves.[/quote]
If a page was indexed before you added the noindex tag, the tag won’t remove it from the index, at least not immediately; perhaps over time it will fall out of the index. Instead you have to use the search engine’s page for removal of the link, such as [url]https://www.google.com/webmasters/tools/removals[/url] for Google and whatever the equivalents are for other engines. There will always be idiots out there that try to use the tag on an already indexed page and then complain about it not working, where the problem was really them, not the meta tag.
But again: the main point is that there should be no index of any page that has no in-bound links. If the search engines can’t find it, they won’t link it.
Is this supposed to represent some widespread disregard for the meta tag? Their response suggested to me that the page had already been indexed prior to the noindex, nofollow being added by the site owner. And the response explained that Google would, in fact, remove this from the index the next time they visit the page or the site owner can manually remove it.
At any rate, the following link is to an article by Vanessa Fox (who helped develop Google Webmaster Tools) discussing the Robots Exclusion Protocol. The table shows the equivalent REP directives in robots.txt and in the meta tag. There is a good bit of information for people trying to get a general grasp of REP.