Robots.txt and redirect crawl issues

We are preparing to have our website indexed. I am running a link checker to verify all pages before generating a Google XML sitemap. The issues listed below have been verified with multiple link checkers and sitemap crawl software. All software was run respecting the robots.txt exclusions and impersonating Googlebot as its user agent.



I have run into the problems listed below.



Issue 1: Page not found errors:

I have run numerous sitemap and link check programs and all show page not found errors.

I have narrowed part of the problem down to pages using the dispatch= parameter, such as:

/store/index.php?dispatch=pages.view&page_id=10

Manually visiting these pages shows they are working and available. Can anyone suggest why these functional pages are being flagged as 404 page not found? I have used multiple link checkers and get similar results.
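
One thing that may be worth checking is the status code the server actually returns for these URLs, since a browser will render whatever body comes back even if the header says 404. A rough sketch of that check, impersonating the Googlebot user agent (the domain below is a placeholder, not our real one):

# Minimal status-code check (placeholder domain) - prints the HTTP status
# the server returns when the request identifies itself as Googlebot.
import urllib.error
import urllib.request

url = "http://www.example.com/store/index.php?dispatch=pages.view&page_id=10"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
)

try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.reason)   # e.g. "200 OK"
except urllib.error.HTTPError as e:
    print(e.code, e.reason)               # e.g. "404 Not Found"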



I am also seeing main category pages listed as 404 page not found, such as:

/store/baby-bedding/baby-crib-bedding/baby-crib-sheets/function.array-merge



Although I have used robots.txt to exclude pages containing function.array-merge, my concern is that the above page is a main category page and should have been spidered as /store/baby-bedding/baby-crib-bedding/baby-crib-sheets/ (without the function.array-merge on the end). Can anyone tell me what may be causing this? I tried increasing the crawl delay, etc., to see if it was a load-time issue, but that did not help.



Issue 2: Unspidered pages

Going into the store backend and checking the totals, I have:

1,636 products and 84 categories (14 main and 70 sub-categories). Consequently, I estimate that at least 1,721 pages should be spidered (1,636 product pages + 84 category pages + the homepage). No matter which link check program or sitemap crawler I use, none of them crawls that minimum number of pages. Can anyone suggest why this is occurring?



Issue 3: Spidered non-existent pages

I am getting page not found errors for certain crawled pages that do not exist, such as:

/store/baby-toys/skins/

The robots.txt file includes exclusions for /skins/, so I do not see why these are even crawled.
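
I am not sure whether the exact form of the exclusion matters here. As I understand it, a plain robots.txt rule is matched as a prefix from the site root, so covering this path would need a wildcard form, and wildcard support varies by bot. Roughly:

# Matches only URLs that begin with /skins/ at the site root:
Disallow: /skins/

# Matches /skins/ anywhere in the path (wildcard support varies by bot):
Disallow: /*/skins/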



Attached files:

For troubleshooting, I am including the URL of our site and copies of the .htaccess and robots.txt in the attached files. Please do not post our domain name in any replies to this post.



Note on the robots.txt file: the blocked entries are duplicated from engine to engine, because a user-agent-specific section overrides the default entries. For example, if I block /images under User-agent: * and then have a User-agent: Googlebot section, I must list /images again, or Googlebot will only follow the rules set within its own section.
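
In other words, the attached file repeats rules along these lines (the paths here are just an illustration, not the full file):

User-agent: *
Disallow: /images

# Googlebot reads only its own group, so the default rules are repeated:
User-agent: Googlebot
Disallow: /images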



If I can get the above squared away I’m ready to begin marketing, so your help or advice is incredibly appreciated.

files.zip

Hi,



Anyway, the sitemap.xml line should be at the end of the file, and rules like these:


Disallow: /store/*dispatch*
Disallow: /store/*sort_by*
Disallow: /store/*subcats*
Disallow: /*/function.array-merge*
Disallow: /index.php?dispatch=*
Disallow: /*?sort_by=*
Disallow: /*?subcats=*
wouldn’t work for some bots because of the [*] wildcard.



…then this one:


# Redirect all non-www requests to the www version.
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule (.*) http://www.domain.com/$1 [R=301,L]




should be



RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^/?$ http://www.domain.com/ [R=301,L]
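
Note that this rule only redirects the bare domain root. If deep links such as /store/... should also be sent to the www host, a path-preserving variant along these lines could be used instead (just a sketch):

RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]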

Thanks indy0077,



I made the changes you suggested as well as removed a couple of default pages in robots.txt. The revised files are attached. I’ve gotten rid of all the 404 errors now, but still cannot get all the site pages spidered.



I enabled the Google sitemap add-on and checked the map it generated. It shows 1,601 product pages. I manually reviewed it, and the main category and subcategory pages are not listed, so those account for some of the missing pages (versus the 1,721 pages that should be listed, including the categories, subcategories and homepage). The category/subcategory/home pages should be 85 in total, so there are still another approximately 36 product pages not being seen.
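
If nothing else, I suppose the missing category pages could be listed by hand in a small supplementary sitemap, something like this (placeholder domain, one category shown):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/store/baby-bedding/baby-crib-bedding/baby-crib-sheets/</loc>
  </url>
</urlset>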



When I use any external software to sitemap or crawl the site, though, I am not getting beyond 1,031 pages. Obviously some product or category pages are not being spidered, but I don’t see why.
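
To pin down exactly which URLs the external tools never reach, my plan is to diff the generated sitemap against each crawler’s exported URL list, roughly like this (both file names are placeholders):

# Compare the URLs in the generated sitemap with the URLs an external
# crawler actually reached (both file names are placeholders).
import xml.etree.ElementTree as ET

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap_urls = {
    loc.text.strip()
    for loc in ET.parse("sitemap.xml").findall(".//sm:loc", ns)
}

with open("crawled_urls.txt") as f:          # one URL per line
    crawled_urls = {line.strip() for line in f if line.strip()}

missing = sorted(sitemap_urls - crawled_urls)
print(len(missing), "sitemap URLs were never crawled:")
for u in missing:
    print(u)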



My concern here is that if I can’t determine why some product pages are excluded, then I can’t reliably assume newly added products will be indexed either, as they may fit the pattern of pages that are not crawled.

files-rev1.zip

If the pages you want to be crawled are in the sitemap, then it should be OK for most major search engines. Indexing is another topic; it depends on the search engine if and when a page will be indexed. Just submit the sitemap in Google Webmaster Tools and wait for the results. There is nothing wrong if some URLs return a 404 error or so; you can then go step by step through those URLs and make the necessary changes.

Thanks again.



The missing pages (essentially the main category and subcategory pages and the 36 missing product pages) do not appear in the cs-cart-generated sitemap. That is my first concern.



My second concern is that while the cs-cart-generated sitemap seemed to find almost all the products (except for the missing 36), when I crawl the site with external tools (sitemappers, broken link checkers, etc.), more than 500 product pages are not found.



Because the cs-cart sitemap omits the category/subcategory/36 product pages and the independently crawled sitemaps miss over 500 product pages, I cannot reliably use either.

As I said, forget the external tools for checking the sitemap. If the URLs are there, they will be crawled by Google. Regarding the 36 URLs missing from the sitemap, I would ask CS-Cart support (or here) why they are not added to the sitemap.

Thank you.



I’ll open a support ticket on the 36 items.



I politely must disagree with you on the sitemap issue, however.



I have used multiple link check and sitemap tools over the last two days and all seem to have issues with a complete crawl of the site. From a functional and SEO view, I would never count on the internal sitemap picking up some pages as a guarantee that my site can be crawled. At its simplest, if multiple independent crawlers and programs cannot get through my site, then search engines can have the same issues. In addition, the internal sitemap only lists product pages, so it gives no indication of the ability to crawl category pages.



A few examples of where the above might be a problem:


  1. If I hadn’t run the external crawlers, I wouldn’t know the sitemap missed pages unless I manually counted them. (I run a site with about 1,600 products, but that will be going up to 6,000 in the next month, so manual checks aren’t an option.)


  2. If I rely on just the internal sitemap for indexing of my site, then a lot of pages could be missed. Here’s why: the general thinking is that a sitemap baits an engine. It gets the sitemap and visits your site, which gives it the bulk of your links, but it should then find internal links and follow them to reach the pages not on the sitemap. The issue is that if you hadn’t run the external tools, like I did, you’d assume the engines could find and crawl those internal links to unsitemapped pages. As my current issue shows, this isn’t always the case, so they’d never see my additional pages.


  3. Running the external tools lets you get an overview of what will be indexed. For example, the internal sitemap seems to cut out all the redundant subcats, dispatch and sort_by generated pages. If I only saw the internal sitemap, the list of pages would look fine. Using the external tools, all that redundant stuff came up and let me know I had to “prune” it out in my robots.txt file to avoid duplicate content and to avoid indexing irrelevant pages.



    I may be wrong here, but I believe that if a site is set up and running properly, nearly any tool or engine should be able to crawl through it and get almost all of your content. I base this assumption on the fact that I’m running a store and not a blog. If my pages aren’t crawled or accessible, then I am losing money; every missed or uncrawlable page is like a closed door to my store.

Regarding the missing categories in the internal sitemap: have you checked the options for the Google sitemap add-on to verify that categories are included in the sitemap? Did you also regenerate the sitemap after making any changes?



Bob

Sure, what you say is right; the question is whether the pages in your sitemap have the necessary inbound links. If yes, then it has to work.

What should I set these to?



Product/page SEF URL format:

Category SEF URL format:

Use single URL for all languages:

Show language in the URL:

Act as HTML catalog:



Thank you in advance.