it has long been argued that the google os, particularly MapReduce and GFS, is google’s real competitive strength. yahoo, meanwhile, is paying developers to build clones of both. with seeming consolidation on a common computing platform, and ever-rising data center expenses, you gotta wonder how much sense it makes for the big three to each duplicate all that CAPEX. they might be better off outsourcing their datacenters and sharing some base datasets, such as a crawler cache (kinda like the feedmesh network).
the company taking on those datacenters, for its part, would end up running a grid of several million nodes, and could optimize running costs across the board, say by using very low-power servers built on an open-sourced processor architecture.
ok, after a couple days of robots.txt love, i now have much less crap in my logs. a good opportunity to see which bots are well-written. based on what i am seeing with /robots.txt, i am sure glad i blocked most of these festering piles of dung from my site. the most common offenses:
- not using conditional GET when requesting /robots.txt
- requesting /robots.txt too often
the biggest offender is VoilaBot, checking /robots.txt every 5 minutes, every day. you gotta be kidding me. google and yahoo are not much better; you’d think they’d have figured out by now how to share the state of /robots.txt across their different crawlers. other bots fare better, by virtue of being less desperate.
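for the curious, a conditional GET is not rocket science. a minimal sketch of what a polite bot could do, using python’s standard library (the function names and the `polite-bot` user-agent are mine, not from any real crawler): remember the Last-Modified stamp from the previous fetch, send it back as If-Modified-Since, and treat a 304 as “use the cached copy, no body transferred”.

```python
import urllib.request
import urllib.error

def build_request(url, last_modified=None):
    """Build a conditional GET: send If-Modified-Since when we already
    have a cached copy, so the server can answer 304 with no body."""
    req = urllib.request.Request(url, headers={"User-Agent": "polite-bot/0.1"})
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

def fetch_robots(url, cached_body=None, last_modified=None):
    """Return (body, last_modified); on 304 Not Modified, reuse the cache."""
    try:
        with urllib.request.urlopen(build_request(url, last_modified)) as resp:
            # 200: the file changed; keep the new Last-Modified stamp
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            # not modified: our cached copy is still good
            return cached_body, last_modified
        raise
```

a bot that does this still hits the server, but the 304 round trip is a few hundred bytes instead of the whole file, every single time.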
update: problems like this are economic opportunities.
i went ahead and blocked most crawlers in my robots.txt. there are too many of them, and for most of them, my ROI is negative anyway. if you have any doubts about how far search still has to go, or how many moronic copycat companies there are in this space, spend some time with your log files.
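the whitelist approach is the simple way to do this: disallow everyone by default, then open the door for the few crawlers that actually send traffic. a sketch (the bots named here are just examples, pick your own):

```
# block everyone by default...
User-agent: *
Disallow: /

# ...but let the crawlers that earn their keep through
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:
```

per the robots exclusion convention, a crawler obeys the most specific User-agent group that matches it, so the empty Disallow for Googlebot and Slurp overrides the wildcard block.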
if you want gmail, send me a note.
Hooray, no spam here! says my shiny new gmail account. i wonder how many hours that will last... btw, gregorj@ and rothfuss@
Ever since Atom first popped up, I’ve been interested in it, and even attempted to join a small sprint/discussion at Seybold last year to talk about WebDAV. The bomb threat shut that down, but we simply moved locations for drinks rather than hacking. So while I’ve been tracking it generally, my specific current interest is through my work at Google. I’m the engineering manager for the Blogger group, so I’ve gotta pay some attention to what we’re signing up for.
certainly of interest to open source cms projects (some of which have horrible urls) is this article by Brice Dunwoodie.
More specifically, Google parses an underscore literally, but parses a dash as a token separator that represents white space. So if you construct a URL that contains “enterprise_content_management”, Google literally sees the single word “enterprise_content_management”, which is not a word at all.
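the practical upshot for cms authors: generate slugs with dashes, not underscores. a minimal sketch of such a slugifier (the `slugify` helper is my own illustration, not from any particular cms):

```python
import re

def slugify(title):
    """Build a URL slug with dashes: a search engine treats a dash as a
    word separator, while an underscore glues the words into one token."""
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # collapse non-alphanumerics to dashes
    return slug.strip("-")
```

with this, “Enterprise Content Management” becomes “enterprise-content-management”, which the crawler sees as three words; the underscore version would be indexed as one unrecognizable token.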
- googledork (goo′gəl-dôrk) noun 1. Slang. An inept or foolish person as revealed by Google.
this is now a pagerank 7 site.