Googlebot not seeing all URLs

How familiar are you guys with Googlebot and CS-Cart working together? So, check this out… I went to Google Webmaster Tools.



This is crazy. On 7/7, the number of URLs blocked by the robots.txt file on my client's domain was just 3. By 7/14, it was up to 419! What in the world caused this huge shift?



Was there an update to CS-Cart? I don't recall one. I also don't recall changing the sitemap.xml file at any point.



Here is what's inside my robots.txt file now:



User-agent: *
Disallow: /images/thumbnails/
Disallow: /skins/
Disallow: /payments/
Disallow: /store_closed.html
Disallow: /core/
Disallow: /lib/
Disallow: /install/
Disallow: /js/
Disallow: /schemas/



Doesn't that look normal? Does CS-Cart generate a robots.txt like WordPress does? I'm wondering if there is another file somewhere that Googlebot might be scanning. I'm totally baffled here. This is so odd.



My site has dropped from page 1 to page 6 for my desired keywords. Yet I still seem to rank well on Bing and Yahoo, which makes me feel like this is a Googlebot issue and/or the other bots have yet to discover the problem I have.

So, digging deeper: it looks like Google Webmaster Tools is viewing the URL with “www” as one site and the one without “www” as another. Without the www, the sitemap was registering all links. With www, the sitemap was only registering 3 URLs. I thought CS-Cart handled URL canonicalization?

OK, what you need to do is add both the www version and the non-www version to Webmaster Tools.



Then you will be able to use the Change of Address feature in the settings section. You do this on the site you don't want Google to show. For example, if you want the www to show in your URLs, then do it on the non-www site.

Thanks Kickoff3pm!



Is it normal to have to create two different versions of a domain (with and without www) in Google Webmaster Tools? I have never had any issues before. If so, I need to log in and update my account.

[quote name='ckad79' timestamp='1374187405' post='165552']

Thanks Kickoff3pm!



Is it normal to have to create two different versions of a domain (with and without www) in Google Webmaster Tools? I have never had any issues before. If so, I need to log in and update my account.

[/quote]



It's the best way. I have 6 domains pointing at my site, so I have six set up. It makes sure all your SEO credit gets to the right place. You'll probably now see more sites linking in, as links elsewhere using www and non-www resolve to the same place.

In CS-Cart, set your store URL to http://www…; in Webmaster Tools, select the preference for www over non-www in the settings; and finally, redirect non-www to www via 301 permanent redirect .htaccess rules. You should not need to add two domains to WMT for www and non-www. Use the measures described above to set your preference for all search engines.
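
Something along these lines in .htaccess should handle that last step. This is only a rough sketch: it assumes Apache with mod_rewrite enabled and uses yourstore.com as a placeholder domain.

```apache
# Rough sketch: 301-redirect the bare domain to www.
# Replace yourstore.com with your actual domain; assumes mod_rewrite is available.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^yourstore\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourstore.com/$1 [R=301,L]
```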

OK, there seems to be a difference of opinion about this, so let me explain why it is needed even if you follow StellarBytes' suggestions (you should do those too, by the way):


  1. In Webmaster Tools, select the preference for www over non-www. This only determines how Google displays your URLs in searches. It's trimmings.
  2. The 301 redirect. This will do as it says, but it won't stop external links from using the wrong format; it just converts them so your visitors see the correct one. Again, trimmings.



    All of the above is good advice, but none of it deals with the ranking.



    So you prefer to show www, but you still lose ranking juice for any links that don't use www. If bbc.co.uk decides to link to your site as http://domain.com, you'll earn no credit for it, because as far as Google is concerned that is a different site which you have not verified yourself as owning.



    Feel free to ignore this; in SEO there is no right or wrong, just Google's manipulation of what we all think is right or indeed wrong :-)

Here is my other question: the robots.txt file is now visible to Google, and it sees this:



User-agent: *
Disallow: /images/thumbnails/
Disallow: /skins/
Disallow: /payments/
Disallow: /store_closed.html
Disallow: /core/
Disallow: /lib/
Disallow: /install/
Disallow: /js/
Disallow: /schemas/




I believe this setup is normal for the robots.txt file? Google Webmaster Tools is now saying, “Detected as a directory; specific files may have different restrictions”. It is looking at it, which is an improvement. Still, it says it is blocking 418 URLs. Does that seem normal? I'm trying to figure out how many URLs there could possibly be if you did not disallow all those paths in the robots.txt file. Is 418 right?



Before you get all worked up about this, you need to find out what URLs are being blocked. It may be legitimate.

[quote name='The Tool' timestamp='1374607125' post='165745']

Before you get all worked up about this, you need to find out what URLs are being blocked. It may be legitimate.

[/quote]



The majority of the not-found pages are basically CS-Cart pages that we had disabled or products that we had deleted. I thought that when you disabled a page or deleted something, Google wouldn't bother trying to look for it? That may not be the case.



For instance, it's looking for an old page that we disabled about six months ago because, although it was driving traffic, the traffic wasn't relevant and it had a high bounce rate. So we disabled it for the time being. I didn't delete it, but I thought that was enough. Now Google Webmaster Tools is saying it can't find it. Why does it care?



The other sorts of pages it wants to see are old products that we no longer sell.



Again, I don't know if this stuff is what they call soft 404 errors and isn't relevant. The main issue for me is that it hasn't crawled the homepage or many of the popular categories and products in nearly four weeks. I don't know if all this is related or it's a separate issue.



Plus, as far as the errors go, none of the blocked URLs seem to be related to the robots.txt disallow restrictions. In other words, nothing relates to /images/thumbnails/ or /skins/, etc. That is what is strange.

Thanks again, Tool, for checking this out.



Another cause for concern is that I've used the “Submit URL to Index” feature, which is supposed to cache your site within 24 hours. I did this on both 7/15 and 7/16. As of today, 7/23, it's still showing a cache from 7/11 on all of my pages.



I was reading here that sometimes you can inadvertently block Google's IP. That could have been the case, but right now I'm seeing this in Webmaster Tools:







Again, all the checkmarks are there. It looks good for now.

Having a page/product disabled is just the same as deleting it in Google's eyes, because it cannot be opened. Google crawled the page at one point in time, so it knows that it existed. You will either need to let Google go through its process and let it die, or use the URL removal tool. Why it is saying it was blocked by the robots.txt is a mystery to me.



Do you know for sure which URLs are being blocked? I just had a quick look at mine and I cannot find a list of URLs that are blocked by robots.txt.



To be honest, I think that you are worrying over nothing unless you have active page/product URLs being blocked or not found, which doesn't seem to be the case, since you submitted 234 and it has indexed 234.

[quote name='The Tool' timestamp='1374646124' post='165764']

Having a page/product disabled is just the same as deleting it in Google's eyes, because it cannot be opened. Google crawled the page at one point in time, so it knows that it existed. You will either need to let Google go through its process and let it die, or use the URL removal tool. Why it is saying it was blocked by the robots.txt is a mystery to me.



Do you know for sure which URLs are being blocked? I just had a quick look at mine and I cannot find a list of URLs that are blocked by robots.txt.



To be honest, I think that you are worrying over nothing unless you have active page/product URLs being blocked or not found, which doesn't seem to be the case, since you submitted 234 and it has indexed 234.

[/quote]



UPDATE: it looks like Google cached my homepage and the other popular pages and categories it was avoiding on July 23rd. So this is good news. Still, the site does not even appear to rank for any of the popular keywords that it once ranked for.



I don't think the site is banned, but maybe it's banned for certain keywords? Is that normal? It just looks like it's completely fallen off of Google for some keywords. It's strange because it continues to rank #1 or #2 on search engines like Bing and Yahoo for those keywords. It also ranks #1 on Google for random keywords. For instance, let's say you want “t-shirts”… I can't have that, but I can have “buy t-shirts now”, which is a fraction of the traffic and not a top keyword.



I've also used websites that scan through Google to check whether your website appears for particular keywords (the ones I want), and I've gone all the way up to page 50… nothing. Maybe now that the site has been cached, this will change in time?



I also wonder if some spammy backlinks are affecting the site's ranking after Penguin 2.0. I can go into GWT and see a list of sites that link to me. Is there any way to tell if any of these links are damaging my site? Google will allow me to remove particular backlinks from consideration, but they all appear legit to me.



Thanks.

[quote name='ckad79' timestamp='1374766859' post='165829']

I also wonder if some spammy backlinks are affecting the site's ranking after Penguin 2.0. I can go into GWT and see a list of sites that link to me. Is there any way to tell if any of these links are damaging my site? Google will allow me to remove particular backlinks from consideration, but they all appear legit to me.

[/quote]

Remove duplicate content, replace with unique content.



Check WMT and various other tools to find as many of your backlinks as possible. Check each and every one, noting the quality of the page(s) which link back. Check the Google Blog and the WMT guidelines, and have a look at Matt Cutts' blog, then see where you fall foul. Your description is the exact result of both the Penguin and Panda updates: once-high-ranking keywords are now nowhere (these are likely your site URLs and the relevant anchor-text keywords to concentrate on), while seemingly random keywords rank well but yield little traffic.



Once you have checked all of your site's backlinks, contact the site owners where possible, requesting that the link be removed. Chances are they won't, because if you got a spammy link on their site in the first place, the chances are they won't take the time to action the removal or reply to you. Remember, you - or whoever you hired - did this to yourself. Now you need to limit the damage done. Create a text file logging every URL which links back to you, and submit a Google disavow request, detailing the fact that you have made every effort possible to remove the unnatural backlinks.
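
The disavow file itself is just a plain text list. Here is a rough example of the format; the domains and URLs below are placeholders, not real sites.

```
# Example disavow.txt - every domain and URL here is a made-up placeholder.
# Owner contacted, no response.
domain:spammy-directory.example
http://link-farm.example/page-linking-to-you.html
```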

Back to Google not seeing all links. I just realized a major issue (or possibly one). It looks like Google is still trying to scan old URLs from our cart from years ago, when it used to generate URLs like /index.php?target=topics&topic_id=71. I had CS-Cart create a 301 redirect somehow. It was not done via .htaccess, and I wish I knew how they did it, because the code is missing and none of the redirects are working. A lot of the 404 errors in Google are other sites linking to stuff like /index.php?target=topics&topic_id=71. That could be a lot of my issue. It's just strange that Google is still looking for those links, and that they are still active, years later.



As far as content goes, I will continue to write good content, as I always have (IMO). I feel like this may be a backlink issue… as far as my ranking dropping goes! I mean, the site still shows up, so it isn't banned, and a lot of the other sites that I link to (a blog I used to push traffic to my main site) aren't ranking well either. I think it's a chain reaction.

[quote name='ckad79' timestamp='1376088489' post='166597']

Back to Google not seeing all links. I just realized a major issue (or possibly one). It looks like Google is still trying to scan old URLs from our cart from years ago, when it used to generate URLs like /index.php?target=topics&topic_id=71. I had CS-Cart create a 301 redirect somehow. It was not done via .htaccess, and I wish I knew how they did it, because the code is missing and none of the redirects are working. A lot of the 404 errors in Google are other sites linking to stuff like /index.php?target=topics&topic_id=71. That could be a lot of my issue. It's just strange that Google is still looking for those links, and that they are still active, years later.



As far as content goes, I will continue to write good content, as I always have (IMO). I feel like this may be a backlink issue… as far as my ranking dropping goes! I mean, the site still shows up, so it isn't banned, and a lot of the other sites that I link to (a blog I used to push traffic to my main site) aren't ranking well either. I think it's a chain reaction.

[/quote]



I presume your old URLs still have the same or similar content to the new URLs; if so, can you post an example of the old and the new URLs? Writing up .htaccess rules for the 301 redirects should not be too complicated. As for the old links being crawled by Google, this is likely due to the fact that you have backlinks pointing to these old URLs. Check Google Webmaster Tools, which will allow you to download a list of up to 10,000 backlinks, the most-linked-to pages, etc. This will allow you to create redirect rules for the most important pages.
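
For query-string URLs like yours, a plain Redirect directive won't match the query string, so you would use mod_rewrite conditions instead. A rough .htaccess sketch; the target page name here is hypothetical, and the trailing "?" drops the old query string:

```apache
# Rough sketch: 301 one old dynamic topic URL to its new SEO URL.
# "/shipping-info.html" is a hypothetical target; repeat per topic_id as needed.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)target=topics(&|$)
RewriteCond %{QUERY_STRING} (^|&)topic_id=71(&|$)
RewriteRule ^index\.php$ /shipping-info.html? [R=301,L]
```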



Check out Fruition to see if there is any relation between one or more of the Google updates and the drop in your rankings.

So, I went ahead and created a separate Google Webmaster Tools profile for my site as “www”. I want my main site to be http://mysite.com and not http://www.mysite.com, but I think there are some backlinks, etc., that use http://www.mysite.com, which is why I want to get to the bottom of this by scanning both sites.
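
Since I want the non-www version as the main site, I assume the redirect from the earlier advice would go the other way. This is just a rough sketch of what I'm thinking, using mysite.com as the placeholder I've been using:

```apache
# Rough sketch: 301-redirect www to the bare domain, matching a non-www preference.
# mysite.com is a placeholder; assumes Apache with mod_rewrite enabled.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.mysite\.com$ [NC]
RewriteRule ^(.*)$ http://mysite.com/$1 [R=301,L]
```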



In my config.local.php file, I have:


```php

Example: