Thursday, September 27, 2007

Robot Protocol

Web sites also often have restricted areas that crawlers should not crawl. To address these concerns, many Web sites have adopted the Robot protocol, which establishes guidelines that crawlers should follow. Over time, the protocol has become the unwritten law of the Internet for Web crawlers. The Robot protocol specifies that a Web site wishing to restrict certain areas or pages from being crawled place a file called robots.txt at the root of the Web site. Ethical crawlers will then skip the disallowed areas. Following is an example robots.txt file and an explanation of its format:

# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration # Disallow robots on registration page
Disallow: /login

The first line of the sample file has a comment on it, as denoted by the use of a hash (#) character. Crawlers reading robots.txt files should ignore any comments.

The second line of the sample file specifies the User-agent to which the Disallow rules that follow it apply. User-agent is the term for a program that accesses a Web site. Each browser has a unique User-agent value that it sends along with each request to a Web server. However, Web sites typically want to disallow all robots (or User-agents) access to certain areas, so they use an asterisk (*) as the User-agent value, which specifies that the rules that follow apply to all User-agents.

The lines following the User-agent line are called disallow statements. The disallow statements define the Web site paths that crawlers are not allowed to access. For example, the first disallow statement in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus, the following URLs are both off limits to crawlers according to that line.
http://somehost.com/cgi-bin/
http://somehost.com/cgi-bin/register
(Searching Indexing Robots and Robots.txt)
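
As a quick sketch of how a crawler might honor these rules, the following Python snippet parses the sample robots.txt shown above using the standard library's urllib.robotparser module (called robotparser in older versions of Python) and checks the same hypothetical somehost.com URLs:

from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, held in a string for this sketch.
ROBOTS_TXT = """\
# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Both URLs begin with /cgi-bin/, so the first disallow statement blocks them.
print(parser.can_fetch("*", "http://somehost.com/cgi-bin/"))          # False
print(parser.can_fetch("*", "http://somehost.com/cgi-bin/register"))  # False

# A path not matched by any disallow statement remains crawlable.
print(parser.can_fetch("*", "http://somehost.com/about.html"))        # True

A crawler would run the same check with its own User-agent string before requesting each URL and simply skip any URL for which can_fetch returns False.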

Competing search engines Google, Yahoo!, Microsoft Live, and Ask announced today their support for 'autodiscovery' of sitemaps. The newly announced autodiscovery method allows you to specify in your robots.txt file where your sitemap is located.
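
Autodiscovery takes the form of a single Sitemap line in robots.txt. For example, adding the following line to the sample file above (the sitemap URL is just an illustration) would point crawlers to the site's sitemap:

Sitemap: http://somehost.com/sitemap.xml

The Sitemap line is independent of the User-agent sections, so it can appear anywhere in the file; crawlers that support autodiscovery fetch the referenced sitemap when they read robots.txt.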
