The Robots Text File

Posted The: 15/06/2009 At: 18:30

The Robots Exclusion Protocol, better known as robots.txt, lets webmasters ask robots, such as search engine crawlers, not to crawl all or part of their site.

How Robots work with Robots.txt

Most search engines request a file called robots.txt when they visit a site. The file should be placed in the root of the site, and it must be served with the "text/plain" media type [3].
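Because the file always lives in the root of the site, a robot can work out its address from any page URL. The snippet below is a minimal sketch of that step in Python; the function name robots_url and the example.com addresses are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/articles/latest?page=2"))
# prints: http://www.example.com/robots.txt
```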

The server then checks whether the file exists and, if it was found, responds with a success code (2xx, typically 200) together with the contents of the file.

If the server responds with a success code, in other words if the file was found, the robot must parse the robots.txt file and follow any instructions listed in it. Those instructions can be as simple as disallowing the robot access to certain areas of the site.
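To illustrate the parse-and-follow step, here is a small sketch that uses Python's standard urllib.robotparser module. The rules, the robot name ExampleBot and the example.com URLs are assumptions chosen for the demonstration, and the file contents are passed in as a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, as a robot might receive them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access checks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot asks before every request.
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))         # True
```

A real robot would download the file first (RobotFileParser also offers set_url and read for that) and run the same check before each visit.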

Furthermore, the specification recommends the following behaviour when robots encounter access restrictions, temporary failures, and redirects.

Restricted Access

When a 401 or 403 response code is sent, the robot should assume that the entire site is restricted.

Temporary Failures

If a temporary failure is encountered, robots should delay their visits until the file can be retrieved.

Redirection

If a redirect is encountered (3xx responses), robots should follow the redirects until a resource is found.
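The recommendations above can be summarised as a small decision function. This is only a sketch: the names RobotsAction and action_for_status are invented here, treating 5xx responses as "temporary failures" is an assumption, and so is treating any other response (such as 404) as "no robots.txt, so no restrictions".

```python
from enum import Enum


class RobotsAction(Enum):
    PARSE_RULES = "parse the file and obey its rules"
    DISALLOW_ALL = "treat the entire site as restricted"
    FOLLOW_REDIRECT = "follow the redirect and check the new response"
    RETRY_LATER = "delay visits until the file can be retrieved"
    ALLOW_ALL = "no rules available, site is not restricted"


def action_for_status(status: int) -> RobotsAction:
    """Map the HTTP status of a robots.txt request to a crawler action."""
    if 200 <= status < 300:
        return RobotsAction.PARSE_RULES
    if status in (401, 403):
        return RobotsAction.DISALLOW_ALL
    if 300 <= status < 400:
        return RobotsAction.FOLLOW_REDIRECT
    if 500 <= status < 600:
        # Assumption: server errors stand in for "temporary failures".
        return RobotsAction.RETRY_LATER
    # Assumption: anything else (for example 404) means there is no file.
    return RobotsAction.ALLOW_ALL


for code in (200, 403, 301, 503, 404):
    print(code, action_for_status(code).value)
```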

Security

Please note that robots can still choose to ignore robots.txt for whatever reason, so it cannot replace a real login system. If you want to keep something private, protect it with a login system rather than relying on robots.txt.

Most major search engines follow the rules that you write in the file, so it still has its uses. For example, you can exclude the Latest Articles or News section, so that the search engine only indexes each article at its permanent location.

Examples

When a robot finds out about a URL, it will generally look for the robots.txt file first, before it tries to visit the URL.

Disallow Access to the Entire Site

The rules below restrict access to your entire site for all robots.

 User-agent: *
 Disallow: /

Disallow Specific Robots or Search Engines

To disallow a certain robot or search engine, you need to know the name it uses to identify itself. Google's crawler, for example, is called Googlebot. The rules below restrict a single robot.

 User-agent: Googlebot
 Disallow: /

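The rules below restrict multiple robots, here the crawlers of Google (Googlebot) and Yahoo (Slurp). Consecutive User-agent lines share the Disallow rules that follow them.

 User-agent: Googlebot
 User-agent: Slurp
 Disallow: /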
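The grouping of several User-agent lines in front of one Disallow rule can be checked with Python's standard urllib.robotparser module. The sketch below uses made-up robot names (FirstBot, SecondBot, UnlistedBot) and an example.com URL, so it only demonstrates the mechanism, not the behaviour of any real search engine.

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical robots share a single Disallow rule.
ROBOTS_TXT = """\
User-agent: FirstBot
User-agent: SecondBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/index.html"
for robot in ("FirstBot", "SecondBot", "UnlistedBot"):
    # Only the robots named in the group are kept out.
    print(robot, parser.can_fetch(robot, url))
```

Robots that are not named in any group, and for which no "User-agent: *" group exists, remain unrestricted.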

Disallowing a Directory

The rules below prevent Google from crawling your images, as well as any other files located in the images directory.

 User-agent: Googlebot
 Disallow: /images/
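A Disallow rule is matched against the beginning of the URL path, which is why the trailing slash matters. The following sketch shows this with Python's standard urllib.robotparser module; the robot names ExampleBot and OtherBot and the example.com URLs are placeholders invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one robot is kept out of the images directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.example.com"
# Everything below /images/ is off limits for ExampleBot.
print(parser.can_fetch("ExampleBot", site + "/images/logo.png"))          # False
# A path that merely looks similar does not start with "/images/".
print(parser.can_fetch("ExampleBot", site + "/images-archive/logo.png"))  # True
print(parser.can_fetch("ExampleBot", site + "/index.html"))               # True
# Robots that the group does not name are not affected.
print(parser.can_fetch("OtherBot", site + "/images/logo.png"))            # True
```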


© Brugbart Webdesign