ok, after a couple of days of robots.txt love, i now have much less crap in my logs. a good opportunity to see which bots are well-written. based on what i am seeing with /robots.txt, i am sure glad i blocked most of these festering piles of dung from my site.
not using conditional get while requesting /robots.txt
Only kinjabot, OnetSzukaj/5.0 and Seekbot/1.0 get this right. All other bots, including google and yahoo, do not. lame.
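for anyone writing a bot: a conditional get is not hard. here is a minimal python sketch, assuming the server sends `ETag` and/or `Last-Modified` headers. the function names and the shape of the `cached` dict are made up for illustration, and `fetch_robots` is untested against any particular server.

```python
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build validator headers from whatever we remembered last time."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def apply_response(cached, status, headers, body):
    """Return the cache entry to keep after a response."""
    if status == 304:
        # not modified: the copy we already have is still good
        return cached
    return {
        "etag": headers.get("ETag"),
        "last_modified": headers.get("Last-Modified"),
        "body": body,
    }


def fetch_robots(url, cached):
    """Fetch robots.txt, sending validators so the server can answer 304."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request) as response:
            return apply_response(
                cached, response.status, response.headers, response.read()
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError, so handle it here
        if err.code == 304:
            return cached
        raise
```

the first request goes out with no validators and stores whatever comes back. every later request costs the server a few header bytes instead of the whole file.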
requesting /robots.txt too often
The biggest offender is VoilaBot, checking /robots.txt every 5 minutes, every day. you gotta be kidding me. google and yahoo are not much better. you’d think they’d have figured out a way by now to share the state of /robots.txt across their different crawlers. Other bots fare better by virtue of being less desperate.
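if you want to check your own logs, something like the following python sketch works. it assumes apache combined log format, and the function name and regex are mine, so adjust both to your setup. it returns the average number of seconds between /robots.txt requests for each user agent.

```python
import re
from collections import defaultdict
from datetime import datetime

# assumes combined log format: ... [time] "METHOD path proto" status size "referer" "agent"
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_intervals(lines):
    """Map each user agent to its mean gap (seconds) between /robots.txt hits."""
    hits = defaultdict(list)
    for line in lines:
        match = LOG_RE.search(line)
        if match and match.group("path") == "/robots.txt":
            stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            hits[match.group("agent")].append(stamp)

    intervals = {}
    for agent, stamps in hits.items():
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if gaps:  # agents with a single hit have no gap to report
            intervals[agent] = sum(gaps) / len(gaps)
    return intervals
```

feed it the lines of your access log (`robots_intervals(open("access.log"))`, with whatever path your server uses) and sort the result by value to see who is the most desperate.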
update: problems like this are economic opportunities.
i went ahead and blocked most crawlers in my robots.txt. there are too many of them, and for most, my ROI is negative anyway. if you had any doubts about how far search still has to go, or how many moronic copycat companies there are in this space, spend some time with your log files.
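a block-most-crawlers robots.txt is easy to get wrong, so it is worth a sanity check. here is a small python sketch using the standard `urllib.robotparser` module. the rules and bot names (`goodbot`, `copycatbot`) are hypothetical placeholders, not the actual file on this site.

```python
from urllib.robotparser import RobotFileParser

# hypothetical policy: let one named crawler in, keep everyone else out
ROBOTS_TXT = """\
User-agent: goodbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent, path="/", robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch the path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("goodbot/1.0", "/index.html"))
print(allowed("copycatbot/0.1", "/index.html"))
```

an empty `Disallow:` means "nothing is off limits" for that agent, while `Disallow: /` under `*` shuts out every crawler not named explicitly. of course this only helps with bots that actually honor the file.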
now that google is helping me to surf faster (works as advertised, by the way), i have effectively become a cog in a huge distributed crawling machine. obviously, this is only the first step (alexa-style traffic analysis is naturally already happening). if you control the proxy that people use, annotation and tagging at internet scale suddenly become feasible. ‘tag this’ button in the google toolbar, anyone? this will lead to a repeat of the third voice lawsuits, but these features are too useful to be derailed by these problems for long. years ago at kpmg, i experimented with the office server extensions annotation system, and i am eager to see it return in a cross-platform way. [update] people have been pointing out the possibilities for adsense (targeted ads based on your surfing history), personalized search and cobrowsing.
google (newly updated) now has over 60’000 entries on my name (results may differ depending on your location). the new msn search limps along with about 14’000. please tell me there is more to MSN search than meets the eye. i’d like some competition, not because i would necessarily switch engines, but because it would kick google into high gear.
mnot: xpath2rss. more xpath search tools are always welcome to make the case for more semistructured data. full disclosure: i am a doc-head.
i am just listening to clayton christensen on “capturing the upside”. strongly recommended. which made me wonder if audio search is there yet. it is:
courtesy of HP Speechbot. unfortunately it doesn’t allow you to specify external audio sources just yet.
sooz recently linked to a piece of mine and is now getting lots of “how to write a bio” traffic. instant authority.
Ubiquity Breeds Utility
It won’t be long until people Google-scan each other in real time, as they meet.
are you ready?
pondering my spankin-new pure-css layout, i wondered if google pagerank rewards pages with a better content / pagesize ratio, which pages without superfluous nested tables and spacer gifs would have. just a guess, but well within the realm of the possible (it makes google’s job easier). here is a little pagerank guide
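to make "content / pagesize ratio" concrete, here is a rough python sketch: visible text length divided by total html length. this is just my definition for illustration. nothing here says google computes anything like it, and the class and function names are made up.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def content_ratio(html):
    """Visible text characters divided by total page characters (0.0 if empty)."""
    if not html:
        return 0.0
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # collapse runs of whitespace so indentation does not count as content
    text = " ".join("".join(collector.chunks).split())
    return len(text) / len(html)
```

run it over the same paragraph wrapped once in nested tables with spacer gifs and once in a single div, and the div version scores higher, which is the whole point of the css layout.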