opensourcing the google os

it has long been argued that the google os, particularly MapReduce and GFS, is google’s real competitive strength. yahoo, meanwhile, is paying developers to build clones of these. with the industry seemingly consolidating on a common computing platform, and data center expenses ever rising, you gotta wonder how much sense it makes for the big three to duplicate all that CAPEX. they might be better off outsourcing their datacenters and sharing some base datasets, such as a crawler cache (kinda like the feedmesh network).

the company running those outsourced datacenters, on the other hand, would end up operating a grid of several million nodes and could optimize overall running costs, for instance by using very low-power servers built on an open-sourced processor architecture.

bot classes

ok, after a couple of days of robots.txt love, i now have much less crap in my logs. a good opportunity to see which bots are well written. based on what i am seeing with /robots.txt, i am sure glad i blocked most of these festering piles of dung from my site.

not using conditional get while requesting /robots.txt

only kinjabot, OnetSzukaj/5.0 and Seekbot/1.0 get this right. all other bots, including google’s and yahoo’s, do not. lame.
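for the curious: a conditional GET just means replaying the validators from the last response, so the server can answer with a body-free 304 when robots.txt hasn’t changed. a minimal python sketch (the bot name and the shape of the cache dict are my invention):

```python
import urllib.request
import urllib.error

def conditional_headers(last_modified=None, etag=None):
    """build the validator headers for a conditional GET."""
    headers = {"User-Agent": "examplebot/1.0"}  # hypothetical bot name
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers

def fetch_robots(url, cached=None):
    """fetch robots.txt, reusing the cached copy on a 304 not modified."""
    cached = cached or {}
    req = urllib.request.Request(url, headers=conditional_headers(
        cached.get("last_modified"), cached.get("etag")))
    try:
        with urllib.request.urlopen(req) as resp:
            return {"last_modified": resp.headers.get("Last-Modified"),
                    "etag": resp.headers.get("ETag"),
                    "body": resp.read()}
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: the old body is still good
            return cached
        raise
```

a well-behaved bot would persist the returned dict and pass it back in on the next poll, paying only a header round-trip when nothing has changed.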

requesting /robots.txt too often

the biggest offender is VoilaBot, checking /robots.txt every 5 minutes, every day. you gotta be kidding me. google and yahoo are not much better; you’d think they’d have figured out a way by now to communicate the state of /robots.txt across their different crawlers. other bots fare better, mostly by virtue of being less desperate.
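sharing robots.txt state across crawlers doesn’t have to be fancy: one cache per shop, consulted by every bot, with a minimum refetch interval. a sketch of that idea (the class, api and interval are assumptions, not any real crawler’s policy):

```python
import time

class RobotsCache:
    """a shared robots.txt cache with a minimum refetch interval, so
    every crawler at the same shop consults one copy instead of each
    hammering the origin every few minutes."""

    def __init__(self, min_interval=24 * 3600, clock=time.time):
        self.min_interval = min_interval  # e.g. refetch at most daily
        self.clock = clock                # injectable for testing
        self._entries = {}                # host -> (fetched_at, body)

    def get(self, host, fetch):
        """return the cached robots.txt for host, calling fetch(host)
        only when the cached copy is older than min_interval."""
        entry = self._entries.get(host)
        now = self.clock()
        if entry and now - entry[0] < self.min_interval:
            return entry[1]
        body = fetch(host)
        self._entries[host] = (now, body)
        return body
```

combine this with the conditional GET above and even the refetches become cheap.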

update: problems like this are economic opportunities.

greg’s new gig

gstein on atom-syntax:

Ever since Atom first popped up, I’ve been interested in it, and even attempted to join a small sprint/discussion at Seybold last year to talk about WebDAV. The bomb threat shut that down, but we simply moved locations for drinks rather than hacking :-) So while I’ve been tracking it generally, my specific current interest is through my work at Google. I’m the engineering manager for the Blogger group, so I’ve gotta pay some attention to what we’re signing up for :-)

so that is what greg has been up to... the sprint was much fun, as were the drinks.

optimizing urls for google

certainly of interest to open source cms projects (some of which have horrible urls) is this article by Brice Dunwoodie:

More specifically, Google will parse an underscore literally and will parse a dash as a “token” that represents white space. So if you construct a URL that contains “enterprise_content_management” in it, Google literally sees the word “enterprise_content_management”, which is really not a word at all.
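the practical upshot for a cms: when generating slugs, map both spaces and underscores to dashes so each word stays a separate token. a hypothetical helper:

```python
import re

def slugify(title):
    """turn a page title into a url slug using dashes, which google
    treats as word separators, instead of underscores, which it reads
    literally, gluing the words into one non-word token."""
    slug = title.lower()
    slug = re.sub(r"[_\s]+", "-", slug)     # underscores and whitespace -> dashes
    slug = re.sub(r"[^a-z0-9-]", "", slug)  # drop anything else
    return slug.strip("-")
```

so “Enterprise_Content_Management” becomes “enterprise-content-management”, three words google can actually match.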