Restricting access by robots

World Wide Web robots are programs that automatically retrieve HTML documents from Web servers, examine the links in those documents, retrieve the documents they reference, and so on recursively. They do this to build up a map or database of the information held on the Web. On the face of it this seems a reasonable thing to do, but there is a downside: robots add to the load on a server, and a robot programmed to traverse every link it encounters will often make many requests to the same server in quick succession. Furthermore, dynamically generated documents can form a kind of black hole for robots, since an effectively infinite number of documents may be generated. For these reasons it is often considered desirable to restrict robot access to a server.

An agreed standard exists for robot exclusion. To control access by robots that adhere to this standard, you need to create a file served at the URL /robots.txt. The file is made up of two kinds of directive: User-agent and Disallow. A User-agent directive names the robot to which the Disallow directives that follow it apply; the name * matches all robots that adhere to the standard. Each Disallow directive specifies a virtual path that those robots should not access. The following example shows a sample robot exclusion file:
    # Sample "robots.txt" robot exclusion file.

Martijn Koster of Nexor in Nottingham, GB, maintains a list of robots.
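As a fuller illustration (the paths shown here are hypothetical, chosen only to match the situations described above), a file that keeps all compliant robots out of a CGI directory and a dynamically generated area could look like this:

    # Keep all compliant robots out of the script area and
    # the dynamically generated part of the server.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /dynamic/

Note that an empty Disallow value places no restriction on the named robots, whereas Disallow: / excludes them from the entire server.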