Posted The: 15/06/2009 At: 18:30
Contents:
The Robots Exclusion Protocol, aka robots.txt, can be used by webmasters to tell robots, such as search engine crawlers, not to crawl their site or parts of it.
When they visit a site, most search engines first request a file called robots.txt, which should be placed in the root of the site. The file must be served with the "text/plain" media type [3].
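Because the file always lives in the root of the site, a robot can work out where to look from any page address. The following is only a small sketch of that step in Python; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url.

    Hypothetical helper, not part of any specification.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path
    # with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.example.com/news/latest.html?page=2"))
```

Note that the host name is kept as-is, so under this sketch each host name is treated as its own site with its own robots.txt.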
The server then checks whether the file exists and, if it was found, responds with a success status code (2xx).
If the server responds with a success code, in other words, if the file was found, the robot must parse the robots.txt file and follow any instructions listed in it. Those instructions can be as simple as disallowing the robot access to certain areas of the site.
Furthermore, the specification recommends the following behaviour when robots encounter access restrictions, temporary failures, and redirects.
When a 401 or 403 response code is sent, the robot should assume that the entire site is restricted.
If a temporary failure is encountered, the robot should delay visits until the file can be retrieved.
If a redirect is encountered (3xx responses), the robot should follow the redirects until a resource is found.
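The recommendations above can be read as a small decision table keyed on the HTTP status code of the robots.txt request. The Python sketch below is one possible reading, not a reference implementation: the function name and the returned labels are invented for illustration, and treating 5xx responses as "temporary failures" is an assumption, since the text above does not tie temporary failures to specific codes.

```python
def robots_txt_action(status: int) -> str:
    """Map the status code of a robots.txt request to a robot action.

    The labels returned here are hypothetical names used for illustration.
    """
    if 200 <= status < 300:
        return "parse"            # file found: parse it and obey the rules
    if 300 <= status < 400:
        return "follow-redirect"  # keep following until a resource is found
    if status in (401, 403):
        return "disallow-all"     # treat the entire site as restricted
    if 500 <= status < 600:
        # Assumption: server errors are treated as temporary failures.
        return "retry-later"      # delay visits until the file is retrievable
    return "not-covered"          # anything else is not described above


if __name__ == "__main__":
    for code in (200, 301, 403, 503):
        print(code, robots_txt_action(code))
```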
Please note that robots can still choose to ignore robots.txt for whatever reason, so it cannot replace a real login system. If you want to make something private, protect it with a login rather than relying on robots.txt.
Most major search engines follow the rules that you write in the file, so it still has its uses. For example, you can exclude a Latest Articles or News section, so that the search engine only indexes each article at its permanent location.
When a robot finds out about a URL, it will generally look for the robots.txt file first, before it tries to visit the URL.
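To illustrate the "check robots.txt first" step, here is a rough sketch using Python's standard urllib.robotparser module. The rules text, the /news/ path, the ExampleBot name and the example.com addresses are all made up for illustration; a real robot would download the rules from the site before checking them.

```python
from urllib.robotparser import RobotFileParser


def may_visit(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules allow `agent` to fetch `url`.

    Hypothetical helper: parses rules that were already downloaded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Assumed rules: keep all robots out of a News section.
RULES = "User-agent: *\nDisallow: /news/\n"

if __name__ == "__main__":
    for url in ("http://www.example.com/news/today.html",
                "http://www.example.com/articles/robots.html"):
        print(url, may_visit(RULES, "ExampleBot", url))
```

This only works for well-behaved robots; nothing in the sketch stops a robot from skipping the check altogether.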
The following restricts access to your site for all robots:
User-agent: *
Disallow: /
To disallow a certain robot or search engine, you need to know the name it identifies itself with; Google's crawler, for example, uses Googlebot. The following restricts a single robot:
User-agent: Googlebot
Disallow: /
The following restricts multiple robots (here Google's Googlebot and Yahoo's Slurp):
User-agent: Googlebot
User-agent: Slurp
Disallow: /
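One way to convince yourself that several User-agent lines can share a single Disallow rule is to feed such a group to a parser. The sketch below uses Python's standard urllib.robotparser and assumes that the crawlers identify themselves as Googlebot and Slurp; OtherBot and the example.com address are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# One group: two User-agent lines followed by a shared Disallow rule.
GROUP = """\
User-agent: Googlebot
User-agent: Slurp
Disallow: /
"""


def allowed_agents(rules: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per agent name, whether the rules allow fetching `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(GROUP, ["Googlebot", "Slurp", "OtherBot"],
                            "http://www.example.com/index.html")
    print(result)
```

Robots that are not named in the group, and for which no "User-agent: *" group exists, are not restricted by these rules.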
The following prevents Google from crawling your images, as well as any other files located in the images directory:
User-agent: Googlebot
Disallow: /images/
© Brugbart Webdesign