Image: The Google Sitemap generator is a free plug-in for Dreamweaver, from dmxzone.com

Hands on: Indices and mapping

How to control the way your site is indexed by search engines

Nigel Whitfield, Personal Computer World 18 Oct 2007

Last time round, I started exploring how search engines index your site, with a look at some of the more basic aspects: adding keyword and meta tags to your pages, and looking at some of the tools available in Google’s webmasters area, which will let you see what queries people are using to reach your site, and where they rank.

This month, I’m following on from that with a look at two techniques that you can use to give yourself a little more control over which parts of your site are indexed.

If you have a busy site, it’s not beyond the realms of possibility to find that you have a huge number of simultaneous connections to your server from assorted search engine crawlers, potentially impacting upon the performance for other users.

While Google’s Dashboard allows you some control over the crawl rate for its own bot, even if other search engines were to provide such systems (and many don’t), it would be tedious to have to visit each one and tweak the settings. The obvious solution, then, would be to have a way that your site can hold information that dictates how search engines will index it, and there are two established methods.

Robots.txt
The first of those is called robots.txt. It’s a simple text file that’s intended to be put at the root level of your site, so if the site is www.nigelwhitfield.com, for example, then the URL of the robots file would be www.nigelwhitfield.com/robots.txt. If the crawler for a search engine doesn’t find such a file, then it will index your site by reading all the pages and following all the links.

If, on the other hand, the file exists and is readable by the crawler, then it will interpret it according to the Robots Exclusion Protocol, about which you can find more information at www.robotstxt.org. Essentially, though, it’s a very simple text file.
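Since the file always lives at the root of the site, a crawler can work out where to look from any page address. Here is a minimal Python sketch of that step; the function name robots_url and the http:// scheme are my own assumptions for illustration, not part of the protocol.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address on the example site used in the text.
print(robots_url("http://www.nigelwhitfield.com/forum/index.html"))
# prints http://www.nigelwhitfield.com/robots.txt
```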

Lines that are comments begin with a # symbol, and you control the reading of your site by robots using User-agent and Disallow couplets. Together, these specify which parts of your site should not be read by the web crawlers. Here’s a simple example that stops Google indexing a site forum:

# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
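If you’d like to check a file like this before uploading it, Python’s standard urllib.robotparser module can parse the text and answer allow-or-deny questions. The following is only a sketch of one parser’s reading of the rules, not a guarantee of how any particular search engine behaves; the page addresses and the second crawler name are made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text, held in a string so nothing is fetched.
ROBOTS_TXT = """\
# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page addresses, used only for this check.
forum_page = "http://www.nigelwhitfield.com/forum/index.html"
home_page = "http://www.nigelwhitfield.com/index.html"

print(parser.can_fetch("googlebot", forum_page))        # False: forum is disallowed
print(parser.can_fetch("googlebot", home_page))         # True: rest of the site is open
print(parser.can_fetch("SuperWebCrawler", forum_page))  # True: no rule names this bot
```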

A well-behaved search engine should be liberal in interpreting its name, so if something is called SuperWebCrawler-3.07 then it should still ignore your site if you refer to it as ‘superwebcrawler’.
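What ‘liberal’ means in practice is up to each crawler, but the description at robotstxt.org recommends a case-insensitive substring match on the name, ignoring version information. Here is a rough Python sketch of that idea; the function rule_applies is hypothetical, and real crawlers may well match differently.

```python
def rule_applies(rule_name: str, crawler_name: str) -> bool:
    """Decide whether a User-agent line in robots.txt applies to a crawler.

    Assumption: a case-insensitive substring match, with anything after a
    slash in the crawler's name treated as version information and dropped.
    """
    if rule_name.strip() == "*":
        return True  # the wildcard applies to every crawler
    name_only = crawler_name.split("/")[0]
    return rule_name.strip().lower() in name_only.lower()


print(rule_applies("superwebcrawler", "SuperWebCrawler-3.07"))  # True
print(rule_applies("googlebot", "SuperWebCrawler-3.07"))        # False
```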
There are various pages around that list the common, and some not-so-common, search engine bot names, so you can exclude them specifically, but in most cases, you’ll probably want to control access by all search engines, unless you’re nursing a grudge, and you can do that simply with:

User-agent: *

And what if you do want to be picky about who looks where? Well, the file should be read from top to bottom, and a bot will use the first matching couplet, so if you only want Google to index your forums, and not anyone else, you could say something like this:

# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /forum/
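The same kind of check works for this two-part file. Again, this is a sketch using Python’s standard urllib.robotparser, with made-up page addresses and SuperWebCrawler standing in for ‘anyone else’. Different parsers may resolve overlapping records in different ways, so treat the result as a sanity check rather than proof of how a given crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means nothing is off limits for that crawler.
ROBOTS_TXT = """\
# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page addresses, used only for this check.
forum_page = "http://www.nigelwhitfield.com/forum/index.html"
home_page = "http://www.nigelwhitfield.com/index.html"

for bot in ("googlebot", "SuperWebCrawler"):
    for page in (forum_page, home_page):
        verdict = "allowed" if parser.can_fetch(bot, page) else "blocked"
        print(f"{bot:16} {page:50} {verdict}")
```

With this parser, only the SuperWebCrawler request for the forum page should come back as blocked.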

