
The Robots Text File

Article on the Robots Exclusion Protocol, and how to use the robots.txt file to disallow search engines.

Created: 2013-06-28 00:49

This article is no longer maintained, and has been replaced by the Robots.txt Tutorial.

The Robots Exclusion Protocol, better known as robots.txt, can be used by webmasters to disallow robots, such as search engine crawlers, from crawling their site.

Most search engines will request a file called robots.txt when they visit a site. The file should be placed in the root of the site and delivered with the text/plain content type.

The server will then check whether the file exists, and respond with the appropriate success code if it was found.

If the server responds with a success code, in other words, if the file was found, the robot must parse the robots.txt file and follow any instructions listed in it.
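
To illustrate the client side of this exchange, here is a minimal Python sketch using the standard library's urllib.robotparser to fetch and parse a site's robots.txt; the example.com domain and the Googlebot user agent are placeholders.

 from urllib import robotparser

 # Fetch and parse https://example.com/robots.txt (placeholder domain).
 parser = robotparser.RobotFileParser()
 parser.set_url("https://example.com/robots.txt")
 parser.read()

 # Ask whether a given user agent may fetch a given URL.
 allowed = parser.can_fetch("Googlebot", "https://example.com/images/photo.png")
 print(allowed)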

Behavior on Response Codes

The specification recommends the following behavior when robots encounter access restrictions, temporary failures, and redirects while requesting the file; a code sketch of these rules follows the list.

Restricted Access

When a 401 or 403 response code is sent, the robot should assume that the entire site is restricted.

Temporary Failures

If a temporary failure is encountered, robots should delay their visit until the file can be retrieved.

Redirection

If a redirect is encountered (a 3xx response), robots should follow the redirects until a resource is found.
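
The following is a minimal Python sketch of these recommendations, using only the standard library; the retry policy for temporary failures is left to the caller, and representing "entire site restricted" as a blanket Disallow record is an assumption about how a crawler might encode that state.

 import urllib.error
 import urllib.request

 def fetch_robots_txt(url):
     """Fetch robots.txt, applying the recommended response-code behavior."""
     try:
         # urllib follows 3xx redirects automatically, which matches the
         # recommendation to follow redirects until a resource is found.
         with urllib.request.urlopen(url) as response:
             return response.read().decode("utf-8", errors="replace")
     except urllib.error.HTTPError as error:
         if error.code in (401, 403):
             # Restricted access: treat the entire site as off limits.
             return "User-agent: *\nDisallow: /"
         if error.code >= 500:
             # Temporary failure: the caller should delay and retry later.
             raise
         # Other responses (e.g. 404): no restrictions apply.
         return ""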

Examples

When a robot finds out about a URL, it will generally request the robots.txt file before it tries to visit the URL itself.

Disallow Access to the Entire Site

The record below will restrict access to your site for all robots.

 User-agent: *
 Disallow: /
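
For contrast, an empty Disallow value places no restrictions, so the following record explicitly allows all robots to crawl the entire site.

 User-agent: *
 Disallow: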

Disallow Specific Robots or Search Engines

To disallow a certain robot or search engine, you need to know the name its crawler identifies itself by; Google's crawler, for example, uses the token Googlebot. Below shows how to restrict a single robot.

 User-agent: Googlebot
 Disallow: /

Below shows how to restrict multiple robots at once; Yahoo's crawler identifies itself as Slurp.

 User-agent: Googlebot
 User-agent: Slurp
 Disallow: /
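
Records are separated by blank lines, which makes it possible to combine rules. As a hypothetical example, the following file blocks every robot except Googlebot, since each robot obeys the most specific record matching its own name and falls back to the * record otherwise.

 User-agent: Googlebot
 Disallow:

 User-agent: *
 Disallow: /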

Disallowing a Directory

The record below would prevent Googlebot from crawling your images, as well as any other files located in the /images/ directory.

 User-agent: Googlebot
 Disallow: /images/
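
Disallow rules match by path prefix, so /images/ also covers URLs such as /images/photo.png. A record may contain several Disallow lines; the following hypothetical example blocks two directories for every robot.

 User-agent: *
 Disallow: /images/
 Disallow: /private/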