robots.txt

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 21 March 2010 - 04:16 PM #1

A sample of CS-Cart robots.txt file:
User-agent: *
Disallow: /addons/
Disallow: /cgi-bin/
Disallow: /controllers/
Disallow: /core/
#Disallow: /images/ 
Disallow: /js/
Disallow: /lib/
Disallow: /payments/
Disallow: /schemas/
Disallow: /shippings/
Disallow: /skins/
Disallow: /var/
Disallow: /admin.php
Disallow: /config.php
Disallow: /config.local.php
Disallow: /init.php
Disallow: /prepare.php
Disallow: /store_closed.html
Disallow: /?currency=
Disallow: /index.php?dispatch=
Sitemap: http://www.yourdomain.com/sitemap.xml
# Comments appear after the "#" symbol at the start of a line, or after a directive. A # before Disallow (#Disallow) will deactivate the command.
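A quick way to sanity-check a file like the sample above is to feed it to Python's standard `urllib.robotparser`. This is only a sketch: the rule list is an abbreviated copy of the sample, the test paths are made up, and real crawlers may interpret some lines differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the sample; the commented-out /images/ line is kept on purpose.
SAMPLE = """\
User-agent: *
Disallow: /addons/
#Disallow: /images/
Disallow: /var/
Disallow: /admin.php
Disallow: /config.php
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler ("*") may fetch the given path."""
    return parser.can_fetch("*", path)


# Hypothetical paths, chosen only to exercise the rules above.
for path in ("/images/logo.png", "/var/cache/", "/admin.php", "/index.php"):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Because the `/images/` line starts with `#`, the parser treats it as a comment and `/images/logo.png` stays crawlable.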

About robots.txt

robots.txt is a plain text file located in the root of a website. It lets the site administrator define which content is allowed and which is disallowed for the bots that visit the site.

All major search engines, such as Google, Yahoo and MSN, follow the Robots Exclusion Protocol. There are several things every website owner needs to understand so that their site can be crawled easily. The following are ten common mistakes to avoid when creating a robots.txt file.

1. Placing robots.txt outside the root directory

This is one of the most common mistakes webmasters make: uploading robots.txt to the wrong place. The file must reside in the root of the domain and must be named “robots.txt”. A robots.txt file uploaded to a subdirectory is not valid, because bots check for robots.txt only in the root of the domain.

User-agent: *
Disallow:

2. Wrong syntax in robots.txt

Another common problem is that the webmaster used the wrong syntax when creating robots.txt. Always double-check the file with a tool such as Robots.txt Checker.
Here is an example of a wrong entry:

User-agent: *
Disallow: private.html

File and directory names must start with a leading slash (for example, /private.html).
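The effect of the missing slash can be checked with Python's standard `urllib.robotparser`. This is a sketch under the assumption that a crawler matches rules as simple path prefixes, as Python's parser does; the helper name `can_fetch` is made up for this example.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Parse the given robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


wrong = "User-agent: *\nDisallow: private.html\n"
right = "User-agent: *\nDisallow: /private.html\n"

# URL paths always begin with "/", so the rule without a slash never matches.
print("without slash, fetch allowed:", can_fetch(wrong, "/private.html"))
print("with slash, fetch allowed:   ", can_fetch(right, "/private.html"))
```

With this parser the first rule blocks nothing, while the second blocks the page as intended.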

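3. Adding comments at the end of a line instead of on their own line

If you wish to include comments in your robots.txt file, precede them with a # sign. The protocol permits a comment after a directive, but some robots may not handle that correctly, so the safest form is a comment on its own line like this: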

# Here are my comments about this entry.
User-agent: *
Disallow:

4. An empty robots.txt file is almost like not having one

If you create a robots.txt file in your root directory and leave it empty, it is much like not having one. Because no User-agent is named and nothing is disallowed, everything is allowed for every bot.

5. Blocking pages that you need to get indexed

If you block bots and pages using robots.txt, you should have a thorough understanding of the syntax, because any mistake can cause serious problems with how spiders crawl your site.

6. URL paths are case sensitive

URL paths are often case sensitive, so be consistent with your site's capitalization. WARNING! Many robots and web servers are case-sensitive, so a rule written for a path such as /Private/ will not match root-level folders named private or PRIVATE.
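A small sketch of the case-sensitivity point, using Python's standard `urllib.robotparser`. The `/Private/` rule and the test paths are assumptions made for illustration, and other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule with a capital "P".
RULES = "User-agent: *\nDisallow: /Private/\n"

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def blocked(path: str) -> bool:
    """Return True if the rule set above blocks `path` for a generic crawler."""
    return not parser.can_fetch("*", path)


# Only the spelling that matches the rule exactly is blocked.
for path in ("/Private/report.html", "/private/report.html", "/PRIVATE/report.html"):
    print(path, "->", "blocked" if blocked(path) else "allowed")
```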

7. Misspelled robot/user agent names

Spider bots will ignore misspelled User-agent names. Check your raw server logs to find the User-agent name you need to block, or see UserAgentString.com for a list of user agent names.

8. Don’t put all the files on a single line

A common mistake is listing all the files under one Disallow line.
For example:

User-agent: *
Disallow: /private/ /images/ /javascript/

This syntax is wrong, and robots will not understand it. The correct syntax is given below.

User-agent: *
Disallow: /private/
Disallow: /images/
Disallow: /javascript/
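To see what a parser can make of the one-line form, here is a sketch with Python's standard `urllib.robotparser`. It treats everything after `Disallow:` as a single path, so none of the three folders ends up blocked. The helper name and test paths are made up, and other crawlers may handle the malformed line differently.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Parse the given robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


one_line = "User-agent: *\nDisallow: /private/ /images/ /javascript/\n"
per_line = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Disallow: /images/\n"
    "Disallow: /javascript/\n"
)

for path in ("/private/notes.html", "/images/logo.png", "/javascript/app.js"):
    print(
        path,
        "| one line, allowed:", can_fetch(one_line, path),
        "| one per line, allowed:", can_fetch(per_line, path),
    )
```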

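9. No Allow command in the original robots.txt standard

The original Robots Exclusion Protocol defines only one command, Disallow:, and no command called Allow:. Major crawlers such as Googlebot do understand Allow: as an extension, but not every robot does. So if you want to allow bots to visit a page, the safest approach is simply not to list it.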

10. Missing the colon

Another mistake is leaving out the colon in a Disallow or User-agent entry. Here is an example of an entry with a missing colon.

#This is a wrong entry
User-agent: googlebot
Disallow /

#The correct entry
User-agent: googlebot
Disallow: /
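The missing colon fails silently, which is what makes it dangerous. As a sketch, Python's standard `urllib.robotparser` simply skips the broken line, so googlebot is not blocked at all. The helper name and the test path are made up, and other parsers may be more or less forgiving.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str, agent: str) -> bool:
    """Parse the given robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


broken = "User-agent: googlebot\nDisallow /\n"   # colon missing after Disallow
fixed = "User-agent: googlebot\nDisallow: /\n"

# The broken line is not recognized as a rule, so nothing is disallowed.
print("broken entry, fetch allowed:", can_fetch(broken, "/page.html", "googlebot"))
print("fixed entry, fetch allowed: ", can_fetch(fixed, "/page.html", "googlebot"))
```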
.
CS-Cart Professional €160.00 | CS-Cart Multi-Vendor €625.00 | CS-Cart Hosting | SSL Certificates
.
CS-Cart Optimized Servers *** USA & UK VPS Servers

 
  • ALEXsei_
  • Senior Member
  • Members
  • Join Date: 27-Jun 08
  • 1423 posts

Posted 24 March 2010 - 06:42 PM #2

Disallow: /install/


?

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 24 March 2010 - 07:00 PM #3

Disallow: /install/?

After the installation you have to delete the install folder.

 
  • gabrieluk
  • Senior Member
  • Members
  • Join Date: 21-Jul 09
  • 133 posts

Posted 25 March 2010 - 05:16 AM #4

"#Disallow: /images/ "

Hi, great post. I just don't understand why this folder is commented out...
Number 1

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 25 March 2010 - 09:46 AM #5

"#Disallow: /images/ "

Hi, great post. I just don't understand why this folder is commented out...

Sometimes it's a good idea to allow indexing of your images as well (more traffic, visitors, popularity, ...). Of course, this only helps if your image names are optimized for your keywords and SEO, as are the alternate text (alt) and title.

 
  • Noman
  • Senior Member
  • Members
  • Join Date: 29-Oct 07
  • 526 posts

Posted 25 March 2010 - 12:52 PM #6

Indy

One thing is wrong!

As you know, anyone can read robots.txt, and you have included the admin path in it. This way, I know the admin path of your demo website:

riaditel0077.php

Am I right?
I'm Number 1, so why try harder?

CIA wannabe or having doubts and need some answers?
Spy Gadgets and CCTV Equipment

 
  • Noman
  • Senior Member
  • Members
  • Join Date: 29-Oct 07
  • 526 posts

Posted 25 March 2010 - 12:54 PM #7

Also, Detailed Images folder won't be indexed because there's another .htaccess inside Images folder blocking it.

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 25 March 2010 - 01:37 PM #8

As you know, robots.txt can be looked into by anyone and you have included admin path in it.


Right, that's the problem with robots.txt files. But it's still better than having the page indexed.

Also, Detailed Images folder won't be indexed because there's another .htaccess inside Images folder blocking it.

Delete it, or make some changes inside.

 
  • gabrieluk
  • Senior Member
  • Members
  • Join Date: 21-Jul 09
  • 133 posts

Posted 26 March 2010 - 10:21 PM #9

Hi Indy,
would you mind to explain the difference between this:

Disallow: /somefolder

and this:

Disallow: /somefolder/

??

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 26 March 2010 - 10:28 PM #10

Hi Indy,
would you mind to explain the difference between this:

Disallow: /somefolder

and this:

Disallow: /somefolder/

??

1. Disallow: /somefolder

will disallow every file and folder whose URL path starts with "/somefolder" after the / (root), for example /somefolder/, /somefolder.html and /somefolder-old/

---------------------------------------------------------

2. Disallow: /somefolder/

will disallow only the folder "somefolder" (and all files inside it) after the / (root)
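The difference can be tried out with Python's standard `urllib.robotparser`, which matches rules as plain path prefixes. This is only a sketch: the helper name and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Parse the given robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


no_slash = "User-agent: *\nDisallow: /somefolder\n"
with_slash = "User-agent: *\nDisallow: /somefolder/\n"

# Hypothetical paths that all begin with "/somefolder".
PATHS = ("/somefolder/page.html", "/somefolder-old/page.html", "/somefolder.html")

for path in PATHS:
    print(
        path,
        "| /somefolder blocks:", not can_fetch(no_slash, path),
        "| /somefolder/ blocks:", not can_fetch(with_slash, path),
    )
```

The rule without the trailing slash blocks all three paths; the rule with the slash blocks only the one inside the folder.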

 
  • gabrieluk
  • Senior Member
  • Members
  • Join Date: 21-Jul 09
  • 133 posts

Posted 28 March 2010 - 03:57 PM #11

Thanks for the explanation....I took this from another post....Is that good?
Disallow: /index.php?dispatch=products.search
Disallow: /index.php?dispatch=wishlist.view
Disallow: /index.php?dispatch=checkout.checkout
Disallow: /index.php?dispatch=profiles.update
Disallow: /index.php?dispatch=profiles.add
Disallow: /index.php?dispatch=auth.login_form&return_url=index.php
Disallow: /index.php?dispatch=checkout.cart
=======================================
i just realized now....
Disallow: /index.php?dispatch=
this covers all the urls

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 28 March 2010 - 04:16 PM #12

i just realized now....
Disallow: /index.php?dispatch=
this covers all the urls

That would be OK as long as no URL that is important to you starts with /index.php?dispatch=
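To see what the single rule catches, here is a simplified sketch that treats a Disallow value as a plain prefix of the URL, which is how the basic protocol matches. The function name is made up, and the `products.view` URL is a hypothetical stand-in for a page you might actually want indexed.

```python
from typing import Iterable


def is_disallowed(url: str, disallow_prefixes: Iterable[str]) -> bool:
    """Simplified check: blocked if the URL starts with any non-empty Disallow value."""
    return any(url.startswith(prefix) for prefix in disallow_prefixes if prefix)


RULES = ["/index.php?dispatch="]

URLS = [
    "/index.php?dispatch=products.search",
    "/index.php?dispatch=wishlist.view",
    "/index.php?dispatch=checkout.checkout",
    "/index.php?dispatch=profiles.update",
    "/index.php?dispatch=profiles.add",
    "/index.php?dispatch=auth.login_form&return_url=index.php",
    "/index.php?dispatch=checkout.cart",
    # Hypothetical page you might want indexed; the broad rule blocks it too.
    "/index.php?dispatch=products.view&product_id=1",
    "/index.php",
]

for url in URLS:
    print("blocked" if is_disallowed(url, RULES) else "allowed", url)
```

Every URL that carries a dispatch parameter is blocked, including the hypothetical product page, while plain /index.php stays allowed.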

 
  • adodric
  • Senior Member
  • Members
  • Join Date: 14-May 09
  • 320 posts

Posted 28 March 2010 - 04:31 PM #13

Right, that's the problem with robots.txt files. But it's still better than having the page indexed.


It won't be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 28 March 2010 - 04:50 PM #14

It won't be indexed, the admin.php file (or whatever you name it) has a meta tag so that robots do not index it. No need for it to be in the robots.txt file.

The noindex tag could be disregarded by some crawlers, and then the page would appear on the web and could appear e.g. in Google as well.

 
  • adodric
  • Senior Member
  • Members
  • Join Date: 14-May 09
  • 320 posts

Posted 28 March 2010 - 05:08 PM #15

The noindex tag could be disregarded by some crawlers, and then the page would appear on the web and could appear e.g. in Google as well.


Google won't index it, at least they shouldn't since they even promote this meta-tag's usage:

http://www.google.co...en&answer=93710

Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can't be found to index then it won't be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 28 March 2010 - 05:38 PM #16

Google won't index it, at least they shouldn't since they even promote this meta-tag's usage:

http://www.google.co...en&answer=93710

Plus, if there are no in-bound links to your admin file then no search engine should index it. If it can't be found to index then it won't be. I have files over 5 years old on one of my business servers that I made way back for testing alternate layouts and they have never been indexed. Why? Because there are no links to them anywhere for the robots to follow.

http://www.seoconsul...ta-tags/robots/

Take a look at the table on the bottom. There are xx other robots which would disregard maybe everything.

PLUS an excerpt from your posted link:

To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index.



 
  • adodric
  • Senior Member
  • Members
  • Join Date: 14-May 09
  • 320 posts

Posted 28 March 2010 - 05:50 PM #17

The main point is that there should be no index of any page that has no in-bound links. If the search engines can't find it, they won't link it.

Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the "noindex" meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.

Furthermore, my link has the text "Google web index" because it is a Google page talking about the Google search results, so of course they will reference themselves.

 
  • indy0077
  • Senior Member
  • Banned
  • Join Date: 03-Nov 09
  • 1431 posts

Posted 28 March 2010 - 06:26 PM #18

The main point is that there should be no index of any page that has no in-bound links. If the search engines can't find it, they won't link it.

Also, I might be looking at the wrong table in your link, but all the robots in it obey and use the "noindex" meta tag. So based on that it is further reason to use the meta tag on the admin page itself (which CS Cart does) and not put an entry for it in the robots.txt file. CS Cart does exactly that too.

Furthermore, my link has the text "Google web index" because it is a Google page talking about the Google search results, so of course they will reference themselves.


Next link for you:

http://www.google.co...315286385&hl=en

 
  • adodric
  • Senior Member
  • Members
  • Join Date: 14-May 09
  • 320 posts

Posted 28 March 2010 - 06:32 PM #19

Next link for you:

http://www.google.co...315286385&hl=en


If a page was indexed before you added the noindex tag, the tag won't remove it from the index, at least not immediately; perhaps over time it will fall out of the index. Instead you have to use the search engine's page for removal of the link, such as https://www.google.c.../tools/removals for Google and whatever the equivalents are for other engines. There will always be idiots out there who try to use the tag on an already indexed page and then complain about it not working, when the problem was really them, not the meta tag.

But again: the main point is that there should be no index of any page that has no in-bound links. If the search engines can't find it, they won't link it. ;)

 
  • jobosales
  • Senior Member
  • Members
  • Join Date: 04-Nov 06
  • 3114 posts

Posted 28 March 2010 - 06:47 PM #20

Next link for you:

http://www.google.co...315286385&hl=en

Is this supposed to represent some widespread disregard for the meta tag? Their response suggested to me that the page had already been indexed prior to the noindex, nofollow being added by the site owner. And the response explained that Google would, in fact, remove this from the index the next time they visit the page or the site owner can manually remove it.

At any rate, the following link is to an article by Vanessa Fox (who helped develop Google Webmaster Tools) discussing the Robots Exclusion Protocol (REP). Its table shows the equivalence between REP directives in robots.txt and in the meta tag. There is a good bit of information for people trying to get a general grasp of REP.

http://janeandrobot....to-your-website

Bob
CS-Cart 2.0.14 (testing)