This article is no longer maintained and has been replaced by the Robots.txt Tutorial.
The Robots Exclusion Protocol, also known as robots.txt, can be used by webmasters to disallow robots, such as search engine crawlers, from crawling their site.
Most search engines will request a file called robots.txt when they visit a site. The file should be placed in the root of the site and delivered as text/plain.
The server then checks whether the file exists and responds with the appropriate status code.
If the server responds with a success code, in other words, if the file was found, the robot must parse the robots.txt file and follow any instructions listed in it.
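The fetch-then-consult flow described above can be sketched with Python's standard urllib.robotparser module. In a real crawler you would call rp.read() to fetch the file over HTTP; here sample content is parsed directly so the sketch is self-contained, and the site URL and robot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content as it might be served from the site root.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # where a crawler would fetch from
rp.parse(robots_txt.splitlines())             # rp.read() would fetch and parse

# Consult the rules before visiting any URL on the site.
print(rp.can_fetch("MyBot", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyBot", "https://example.com/private/a"))   # False
```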
Behavior on Response Codes
The specification recommends the following behavior when robots encounter access restrictions, temporary failures, or redirects while requesting the file.
If a 401 or 403 response code is returned, the robot should assume the entire site is restricted.
If a temporary failure was encountered, robots should delay visits until the file can be retrieved.
If a redirect (3xx response) is encountered, robots should follow it until a resource is found.
When a robot discovers a URL, it will generally fetch the robots.txt file before it tries to visit that URL.
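The response-code handling above can be summarized as a small decision function. This is an illustrative sketch, not part of any standard library; the policy names are made up for clarity, and the status-code ranges follow the behavior described in this section.

```python
def robots_policy(status: int) -> str:
    """Map the HTTP status of the robots.txt request to a crawl policy."""
    if status in (401, 403):     # access restricted
        return "assume-all-disallowed"
    if 500 <= status < 600:      # temporary failure: delay the visit
        return "retry-later"
    if 300 <= status < 400:      # redirect: follow until a resource is found
        return "follow-redirect"
    if 200 <= status < 300:      # success: parse the file and obey its rules
        return "parse-rules"
    return "assume-all-allowed"  # e.g. 404: no robots.txt present

print(robots_policy(403))  # assume-all-disallowed
print(robots_policy(503))  # retry-later
```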
Disallow Access to the Entire Site
The rules below restrict access to your entire site for all robots.
User-agent: *
Disallow: /
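A well-behaved robot reading these rules will refuse every path on the site. A quick check with Python's standard urllib.robotparser (the robot name is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse the disallow-all rules shown above.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Every robot is blocked from every path.
print(rp.can_fetch("AnyBot", "https://example.com/"))           # False
print(rp.can_fetch("AnyBot", "https://example.com/page.html"))  # False
```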
Disallow Specific Robots or Search Engines
To disallow a particular robot or search engine, you need to know the name it identifies itself with; robots match the User-agent line against their own name. The example below restricts a single robot.
User-agent: Google
Disallow: /
The example below restricts multiple robots.
User-agent: Google
User-agent: Yahoo
Disallow: /
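Consecutive User-agent lines share the Disallow rules that follow them: both listed robots are blocked, while any unlisted robot is unaffected. This can be verified with Python's standard urllib.robotparser (the "Bing" name here is just a stand-in for any robot not listed in the rules):

```python
from urllib.robotparser import RobotFileParser

# Parse the multi-robot block shown above.
rules = [
    "User-agent: Google",
    "User-agent: Yahoo",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Both listed robots are blocked; an unlisted robot is allowed.
print(rp.can_fetch("Google", "https://example.com/"))  # False
print(rp.can_fetch("Yahoo", "https://example.com/"))   # False
print(rp.can_fetch("Bing", "https://example.com/"))    # True
```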
Disallowing a Directory
The rules below prevent Google from crawling your images, as well as any other files located in the images directory.
User-agent: Google
Disallow: /images/
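A quick check of this directory rule with Python's standard urllib.robotparser. Note that robots conventionally use a substring match on the User-agent token, so a rule for "Google" also applies to a robot named "Googlebot"; the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Parse the directory rule shown above.
rules = [
    "User-agent: Google",
    "Disallow: /images/",
]

rp = RobotFileParser()
rp.parse(rules)

# The images directory is blocked for Google's robot, other paths
# are allowed, and other robots are unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/images/logo.png"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about.html"))       # True
print(rp.can_fetch("OtherBot", "https://example.com/images/logo.png"))   # True
```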