Blocking bad bots with robots.txt

Which bad bots can be blocked? And what about bots that disregard robots.txt?

Edited: 2011-03-17 12:30

Bad robots are bots that harvest data from websites or submit spam to blogs and forums. The data they harvest can be anything from pages with form fields to email addresses, and it can then be used in various black-hat SEO and marketing techniques.

Most of these bad robots will simply ignore your robots.txt file entirely, so it's best to deal with them some other way, such as IP blocking.
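
As a rough illustration of IP blocking, here is a minimal Python sketch of the check a site could run before serving a request. The function name is_blocked and the BLOCKED_NETWORKS list are hypothetical, and the addresses are reserved documentation ranges rather than real bot addresses. In practice the block is usually done in the web server or firewall configuration instead of in application code.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# not addresses of any real bot.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # a whole range
    ipaddress.ip_network("198.51.100.7/32"),  # a single address
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("203.0.113.42", "198.51.100.8"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A request handler would call is_blocked with the visitor's address and refuse the request (for example with a 403 response) when it returns True.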

The definition of bad bots

Some website owners also consider otherwise legitimate bots bad, either because they hammer the site with requests and skew the statistics, or because they harvest data for purposes the owner does not want, e.g. advertising.

A lot of robots that do respect your robots.txt mine data from websites for purposes that are irrelevant to the site owner. One of these is ia_archiver, which collects pages to show an archive of how the web looked in the past, something many webmasters might not be interested in.

Below is an example of how to block a bot from accessing your entire site if it is annoying you.

User-agent: ia_archiver
Disallow: /
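
If you want to check that a rule like this does what you expect before uploading it, one option is Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper name can_fetch, the example.com URL and the Googlebot user agent used for comparison are assumptions, not part of the rule itself. It also only shows how a well-behaved parser reads the file; it says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The same rule as in the robots.txt example.
RULES = """\
User-agent: ia_archiver
Disallow: /
"""


def can_fetch(user_agent: str, url: str, rules: str = RULES) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    print("ia_archiver:", can_fetch("ia_archiver", page))
    print("Googlebot:", can_fetch("Googlebot", page))
```

With this rule, ia_archiver is refused for every path, while user agents that are not named in the file remain allowed.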