bot classes

ok, after a couple of days of robots.txt love, i now have much less crap in my logs. a good opportunity to see which bots are well-written. based on what i am seeing with /robots.txt, i am sure glad i blocked most of these festering piles of dung from my site.

not using conditional get while requesting /robots.txt

Only kinjabot, OnetSzukaj/5.0 and Seekbot/1.0 get this right, sending the validators from their last fetch and settling for a 304 when nothing has changed. All other bots, including google and yahoo, pull down the full file every single time. lame.
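for the record, getting it right is not hard. here is a minimal sketch in python of what a conditional GET for /robots.txt looks like; the headers are plain HTTP, but the host and the cache variables are made up (a real bot would persist the validators per host):

```python
import urllib.request
import urllib.error

URL = "http://example.com/robots.txt"  # hypothetical host

# validators remembered from the previous fetch (hypothetical storage)
cached_etag = None
cached_last_modified = None

def fetch_robots():
    global cached_etag, cached_last_modified
    req = urllib.request.Request(URL)
    # conditional GET: ask the server for the body only if it changed
    if cached_etag:
        req.add_header("If-None-Match", cached_etag)
    if cached_last_modified:
        req.add_header("If-Modified-Since", cached_last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            # 200: the file changed, remember the new validators and re-parse
            cached_etag = resp.headers.get("ETag")
            cached_last_modified = resp.headers.get("Last-Modified")
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # 304 Not Modified: keep using the cached copy
        raise
```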

requesting /robots.txt too often

The biggest offender is VoilaBot, which checks /robots.txt every 5 minutes, every day: that is 288 requests a day for a file that rarely changes. you gotta be kidding me. google and yahoo are not much better; you’d think they’d have figured out by now how to share the state of /robots.txt across their different crawlers. Other bots fare better by virtue of being less desperate.
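the fix is not exactly rocket science: cache the parsed file per host and refetch only after a sane TTL. a sketch using python's stdlib robot parser; the 24-hour figure, the cache layout and the bot name are my own assumptions:

```python
import time
import urllib.robotparser

ROBOTS_TTL = 24 * 60 * 60   # assumed: re-check at most once a day
_cache = {}                 # host -> (fetched_at, parser)

def robots_for(host):
    """Return a (possibly cached) robots.txt parser for host."""
    now = time.time()
    hit = _cache.get(host)
    if hit and now - hit[0] < ROBOTS_TTL:
        return hit[1]       # still fresh: zero requests to the site
    parser = urllib.robotparser.RobotFileParser(f"http://{host}/robots.txt")
    parser.read()           # one fetch per host per TTL, not one per crawl
    _cache[host] = (now, parser)
    return parser

# every worker consults the shared cache instead of hammering the server
if robots_for("example.com").can_fetch("MyBot/1.0", "http://example.com/"):
    pass  # ok to crawl
```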

update: problems like this are economic opportunities.

a cog in the crawler

now that google is helping me to surf faster (works as advertised, by the way), i have effectively become a cog in a huge distributed crawling machine. obviously, this is only the first step (alexa-style traffic analysis is naturally already happening). if you control the proxy that people use, annotation and tagging at internet scale suddenly become feasible. ‘tag this’ button in the google toolbar, anyone? this will lead to a repeat of the third voice lawsuits, but these features are too useful to be derailed by those problems for long. years ago at kpmg, i experimented with the office server extensions annotation system, and i am eager to see it return in a cross-platform way.

[update] people have been pointing out the possibilities for adsense (targeted ads based on your surfing history), personalized search and co-browsing.