Restricting access by robots

World Wide Web robots are programs that automatically retrieve HTML documents from Web servers, examine the links in those documents, retrieve the documents they reference, and so on recursively. They do this to build up a map or database of the information held on the Web. On the face of it this seems a reasonable thing to do, but there is a downside: robots add to the load on a server, and a robot programmed to traverse every link it encounters will often make many requests to the same server in quick succession. Furthermore, dynamically generated documents can form a kind of black hole for robots, since an effectively infinite number of documents may be generated. For these reasons it is often considered desirable to restrict robot access to a server.

An agreed standard exists for robot exclusion. To control access by robots that adhere to this standard, you need to create a file served at the URL /robots.txt. The file is made up of two kinds of directive: User-agent and Disallow. A User-agent directive names the robot to which the Disallow directives that follow it apply; the name * matches all robots that adhere to the standard. Each Disallow directive specifies a virtual path that those robots should not access. The following example shows a sample robot exclusion file:
    # Sample "robots.txt" robot exclusion file.

Martijn Koster of Nexor in Nottingham, GB, maintains a list of robots.
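As a fuller illustration (the paths shown here are hypothetical, chosen only to match the situations described above), a file that keeps all compliant robots out of a CGI directory and a dynamically generated area could look like this:

    # Keep all compliant robots out of the script area and
    # the dynamically generated part of the server.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /dynamic/

Note that an empty Disallow value places no restriction on the named robots, whereas Disallow: / excludes them from the entire server.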