Software Developer Lalit Sharma: May 2011

First, some background on search engine robots

Web indexing robots, also known as spiders, are used by many search engines such as Google, Inktomi, AltaVista and others. They are the tools the engines use to harvest data for their indexes. When you submit your website to a search engine, you are effectively asking it to send its web indexing robot to your website so that the site can be crawled and added to its database.

So why do I need a robots.txt file?

You can control which parts of your site web indexing robots index by placing a simple text file called robots.txt in the root path of the server, with explicit instructions on what the spider is and is not permitted to index on your website.
You can define which paths are off limits to spiders and block them off. This is useful for things such as large directories of information, personal information, and parts of the website containing large amounts of recursive links, among others.
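As a rough illustration of how off-limits paths work, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module and asks whether a crawler may fetch two URLs. The directory names, the ExampleBot user agent and the www.example.com address are made up for this example and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the directory names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /archive/
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot would skip the first URL and crawl the second.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
print(blocked, allowed)
```

Keep in mind that this only models what a well-behaved robot does; the file is a request, not an access control.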
It is also possible to include robots indexing instructions directly in a meta tag, and in some cases this is preferable if only one page needs to be controlled. You can use a meta tag like <meta name="robots" content="INDEX,FOLLOW"> to tell the robot it is OK to index this page and follow the links it finds on it. However, if you have whole directories and multiple pages whose indexing you want to control, then you need a robots.txt file to ease the burden of managing this task.
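To make the meta tag approach concrete, here is a minimal sketch of how a crawler might read the robots meta tag from a page, using Python's standard html.parser module. The class name RobotsMetaParser, the helper robots_directives and the sample page are assumptions made for illustration; real crawlers will differ in the details.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser already lowercases tag and attribute names.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                if part.strip():
                    self.directives.append(part.strip().lower())


def robots_directives(html):
    """Return the robots meta directives of a page as lowercase strings."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = '<html><head><meta name="robots" content="INDEX,FOLLOW"></head><body></body></html>'
print(robots_directives(PAGE))
```

A robot that finds noindex or nofollow in this list would be expected to leave the page out of its index or to ignore its links.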

How accurate does my robots.txt file have to be?

You need the correct paths of the files or directories, as they appear in the web-viewable path of the server.
Example: many servers use htdocs as the web root, but the FTP root will be different. Your robots.txt entries should not include the htdocs directory in front of the file or directory, because the htdocs folder itself is not viewable on the web. The files inside htdocs are what need to be listed if you wish to control the spiders' indexing of them.
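The htdocs point can be sketched in a few lines of Python: strip the web root from the path you see over FTP, and what remains is the path to put in robots.txt. The /home/site/htdocs location and the function names web_path and disallow_line are hypothetical, chosen only for this example.

```python
from pathlib import PurePosixPath


def web_path(file_path, web_root):
    """Return the web-viewable path of a file stored under web_root.

    Raises ValueError if file_path is not inside web_root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(web_root))
    if relative == PurePosixPath("."):
        return "/"  # the web root itself
    return "/" + relative.as_posix()


def disallow_line(file_path, web_root, is_directory=False):
    """Build a robots.txt Disallow line for a file or directory on disk."""
    path = web_path(file_path, web_root)
    if is_directory and not path.endswith("/"):
        path += "/"
    return "Disallow: " + path


# Assumed layout: the web root is /home/site/htdocs on the server.
print(disallow_line("/home/site/htdocs/private", "/home/site/htdocs", is_directory=True))
print(disallow_line("/home/site/htdocs/private/report.html", "/home/site/htdocs"))
```

Note that htdocs never appears in the generated lines, which is exactly what the robot expects.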

Do I have to have a robots.txt file in order to have search engines index my site?

The short answer is no! A web indexing robot will crawl your site unless told not to. However, let's go a little deeper than that. Good web indexing robots such as Googlebot or Slurp (Inktomi) are considered well-behaved web spiders and will attempt to find your robots.txt file before they index your site. Good robots will also look at your meta tags and check for the robots meta tag.
The advanced way of stopping malicious spiders that ignore or disobey your robots.txt file is to block user agents at the server level, and even to go as far as blocking IP addresses where possible. A user agent is a signature attached to the robot's requests (provided its authors added one) which can be used to identify the robot. When a page is requested from your web server, software such as IIS (Windows Server) or Apache (Linux/Unix) will store this user agent information in your log files, which you can review and react to accordingly.
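As one possible way to review the log files, the sketch below counts requests per user agent and lists the agents that never asked for /robots.txt. It assumes log lines in the Apache "combined" format, where the user agent is the last quoted field; the sample lines, bot names and IP addresses are invented. An agent missing from the robots.txt requests in a given log sample is only a hint that it may be ignoring the file, not proof.

```python
import re
from collections import Counter

# Assumes Apache "combined" format: ... "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Invented sample entries, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [26/May/2011:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "GoodBot/1.0"',
    '192.0.2.10 - - [26/May/2011:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "GoodBot/1.0"',
    '198.51.100.7 - - [26/May/2011:10:05:00 +0000] "GET /private/data.html HTTP/1.1" 200 900 "-" "BadBot/0.1"',
]


def summarize(lines):
    """Return (hits per user agent, set of agents that requested /robots.txt)."""
    hits = Counter()
    fetched_robots = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        hits[agent] += 1
        if match.group("path") == "/robots.txt":
            fetched_robots.add(agent)
    return hits, fetched_robots


hits, fetched_robots = summarize(SAMPLE_LOG)
suspects = sorted(agent for agent in hits if agent not in fetched_robots)
print(suspects)
```

The resulting list is a starting point for deciding which user agents or IP addresses to block at the server level.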

Thursday, May 26, 2011

Importance of robots.txt file