Indexed Pages in Google
Identifying Crawling Problems
Start your investigation by typing site:yoursite.com into the Google search bar. Does the number of results roughly match the number of pages your site has? If there’s a large gap between the number of results and the actual number of pages, there might be trouble in paradise. (Note: the number given by Google is only an estimate, not an exact count.) You can use the SEO Quake plugin to extract a list of URLs that Google has indexed.
The very first thing you should have a look at is your Google Webmaster Tools dashboard. Forget about all the other tools available for a second. If Google sees issues with your site, then those are the ones you’ll want to address first. If there are issues, the dashboard will show you the error messages. See below for an example. I don’t have any issues with my sites at the moment, so I had to find someone else’s example screenshot.
The 404 HTTP status code is most likely the one you’ll see the most. It means that the page the link points to cannot be found. Anything other than a status code of 200 (and perhaps a 301) usually means there’s something wrong, and your site might not be working as intended for your visitors. A few great tools for checking your server headers are URIvalet.com, the Screaming Frog SEO Spider and of course the SEOmoz crawl-test tools, although that last one is for Pro members only and limited to two crawls per day.
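If you would rather check a handful of URLs yourself, the idea can be sketched with nothing but Python’s standard library. This is a minimal sketch, not a replacement for the tools above: the function names (classify_status, fetch_status) and the example URL are made up for illustration, and the labels are just my own rough grouping of status codes.

```python
import urllib.error
import urllib.request


def classify_status(code: int) -> str:
    """Map an HTTP status code to a rough crawl-health label (hypothetical labels)."""
    if code == 200:
        return "ok"
    if code in (301, 308):
        return "permanent redirect"
    if code in (302, 303, 307):
        return "temporary redirect"
    if code == 404:
        return "not found"
    if 500 <= code < 600:
        return "server error"
    return "check manually"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects so 3xx codes stay visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError for the 3xx


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the raw status code of the first response."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses arrive here; the code is what we want.
        return err.code


if __name__ == "__main__":
    # Placeholder URL; swap in pages from your own site.
    for page in ["http://yoursite.com/"]:
        code = fetch_status(page)
        print(page, code, classify_status(code))
```

Keep in mind that some servers answer HEAD requests differently from GET requests, so treat an odd result as a hint to look closer rather than as proof of a problem.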
Fixing Crawling Errors
Typically these kinds of issues are caused by one or more of the following reasons:
- Robots.txt – This text file, which sits in the root of your website’s folder, communicates a set of guidelines to search engine crawlers. For instance, if your robots.txt file contains the two lines “User-agent: *” and “Disallow: /”, it’s basically telling every crawler on the web to take a hike and not crawl ANY of your site’s content.
- .htaccess – This is a hidden file which also resides in your WWW or public_html folder. You can toggle the visibility of hidden files in most modern text editors and FTP clients. A badly configured .htaccess can do nasty stuff like cause infinite redirect loops, which will never let your site load.
- Meta Tags – Make sure that the pages that aren’t getting indexed don’t have this meta tag in the source code: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- Sitemaps – Your sitemap isn’t updating for some reason, and you keep feeding the old/broken one to Webmaster Tools. After you have addressed the issues pointed out in the Webmaster Tools dashboard, always generate a fresh sitemap and re-submit it.
- URL Parameters – Within Webmaster Tools there’s a section where you can set URL parameters that tell Google which dynamic links you do not want indexed. However, this comes with a warning from Google: “Incorrectly configuring parameters can result in pages from your site being dropped from our index, so we don’t recommend you use this tool unless necessary.”
- You don’t have enough PageRank – lolwut? Matt Cutts revealed in an interview with Eric Enge that the number of pages Google crawls is roughly proportional to your PageRank.
- Connectivity or DNS issues – It might happen that, for whatever reason, Google’s spiders cannot reach your server when they try to crawl. Perhaps your host is doing maintenance on their network, or you’ve just moved your site to a new home, in which case the DNS delegation can stuff up the crawlers’ access.
- Inherited issues – You might have registered a domain which had a life before you. I’ve had a client who got a new domain (or so they thought) and did everything by the book. They wrote good content, nailed the on-page stuff and had a few nice incoming links, but Google refused to index their pages, even though it accepted their sitemap. After some investigating, it turned out that the domain had been used several years earlier as part of a big linkspam farm. We had to file a reconsideration request with Google.
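The robots.txt and meta tag checks from the list above are easy to automate. Here is a rough sketch using only Python’s standard library. The helper names (robots_txt_blocks, has_noindex) and the yoursite.com URLs are my own placeholders, and Python’s built-in robots.txt parser is not guaranteed to interpret every rule exactly the way Googlebot does, so take the result as a first pass only.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def robots_txt_blocks(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for part in attributes.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the page source carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Placeholder inputs; paste in your own robots.txt and page source.
    blocking_rules = "User-agent: *\nDisallow: /"
    print(robots_txt_blocks(blocking_rules, "http://yoursite.com/some-page"))
    print(has_noindex('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'))
```

Both helpers work on text you pass in, so you can test them against a copy of your robots.txt or a saved page before pointing anything at the live site.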
Some other obvious reasons that your site or pages might not get indexed are that they consist of scraped content, are involved in shady linkfarm tactics, or simply add zero value to the web in Google’s opinion (think thin affiliate landing pages, for example).