I got a cold recruiting email a few days ago from a notorious SMS company that's oddly registered in Ascension Island. I normally ignore this kind of recruiting mail, but this time around I wrote back and explained that because of what I'd heard about their unethical business practices I wasn't interested in talking to them. (1, 2, 3.)

The recruiter must not have gotten my email because he called me. I explained again why I wasn't interested in talking with him and was shocked to hear an immediate response to the criticism that his company was unethical. Not "we're misunderstood" or "we've changed". No, he told me they were in a competitive market and couldn't worry about stepping on toes while building a business.

I don't know what was sadder: the answer or the fact that he had it at the ready. I'm reminded of this crazy story about what it's like to work at a malware provider. Google has taken some knocks for its "don't be evil" policy. But the good thing is it immediately sets the internal debate at the right point. Not "can we get away with this" or "do we need to do this to build our business" but rather "is this right?" That's the place to start.

  2006-08-30 01:04 Z
Amazon's new Elastic Compute Cloud looks good. So many signups they've closed the beta for now. Between EC2 for compute and S3 for storage, Amazon is offering most of a utility computing service. Store and process data for very cheap, pay as you go. As a developer who recently lost access to a massive compute farm I find these Amazon services very exciting. Think of the startups it could enable!

But as the founder of a failed utility computing company I just don't get how Amazon's going to benefit from offering these services. The problem is the commodity pricing. If Amazon completely sells a server they gross about $900 a year. Say it costs $300 / year to buy and run a server and they fully sell out 10,000 machines, then Amazon nets $6M / year. Amazon made $360M last year, so this best case scenario would be a 2% earnings improvement. Nice, but worth the distraction to their primary business?
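The arithmetic in that paragraph is easy to check. Here's a minimal sketch in Python; every figure is the post's own rough estimate, not a number from Amazon, and the function name is made up for illustration.

```python
# Back-of-the-envelope figures from the paragraph above.
# These are the post's estimates, not Amazon's published numbers.
GROSS_PER_SERVER = 900          # dollars per year if a server is fully sold
COST_PER_SERVER = 300           # dollars per year to buy and run it (a guess)
SERVERS = 10_000
AMAZON_EARNINGS = 360_000_000   # last year's earnings, per the post


def net_profit(gross, cost, servers):
    """Yearly net across the fleet, assuming every server is fully sold."""
    return (gross - cost) * servers


net = net_profit(GROSS_PER_SERVER, COST_PER_SERVER, SERVERS)
share = net / AMAZON_EARNINGS

print(f"net: ${net:,} / year")
print(f"share of earnings: {share:.1%}")
```

The share works out to a bit under 2%, which is where the "2% earnings improvement" figure comes from.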

I love that Amazon is offering this service; it seems very valuable and useful to people like me. But unless it's a long term viable business for Amazon it'd be foolish to build a product that relies on their service. I guess we'll see how it plays out. Sun's version of this was looking really bad last year, but then again the signup process at Sun is 1000 times harder.

Thanks to waxy for some chat on this question
  2006-08-29 18:41 Z
Congratulations to the Flickr team for their new geotagging interface. I think this is the first time it's truly easy for normal people to indicate where they took their photographs in a way others can benefit from.

It's a bit late for a pony request, but I sure wish geocoding interfaces had a notion of uncertainty along with location. Ie: latitude, longitude, and radius. A photo geocoded to "San Francisco, CA" is different from a photo geocoded to "659 Merchant St. San Francisco, CA". The former has a radius of about four miles. The latter, 20 feet.

Tagging your photos via a map interface gives a natural radius: a few pixels' width in the map the user clicked on. GPS devices have a natural uncertainty measurement, too. So the data's available, why not store it?
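To make the latitude, longitude, and radius idea concrete, here's a rough sketch of what a fuzzy location record could look like. The class and function names are hypothetical, not anything from Flickr's API, and the radius is stored in metres as an arbitrary choice.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass(frozen=True)
class FuzzyLocation:
    lat: float       # degrees
    lon: float       # degrees
    radius_m: float  # uncertainty radius in metres


def distance_m(a, b):
    """Great-circle distance between two locations (haversine formula)."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp = p2 - p1
    dl = math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def could_overlap(a, b):
    """True if the two uncertainty circles touch or intersect."""
    return distance_m(a, b) <= a.radius_m + b.radius_m


def precise_enough(loc, max_radius_m):
    """Filter for queries that only want tightly located photos."""
    return loc.radius_m <= max_radius_m


# Coordinates are approximate placeholders for illustration only.
city = FuzzyLocation(37.77, -122.42, 6400)      # "San Francisco, CA", about four miles
address = FuzzyLocation(37.795, -122.405, 6)    # a street address, about 20 feet
```

With records like these, a search for photos at a precise spot can drop the city-level one via `precise_enough(city, 100)` while still treating the two as possibly the same place via `could_overlap(city, address)`.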

Update: A Flickr developer wrote to tell me that their geotagging does store and use accuracy data. If you ask for photos in a precise location you don't see photos that are too general. Cool! Looks like it's in the API too.
Thanks to pb for the term fuzzy geocoding
  2006-08-29 15:33 Z
I posed a question about embedding a subject URL in a request URL using percent encoding. Thank you for all the helpful replies; here's what I learned.

First, on the existential question, it seems in practice percent encoding really can create distinct names. Try:

In reality this is just a quirk of Apache's handling of %2F, but since it's the default behaviour for the #1 web server out there that's a strong example. As for the theory, this Wikipedia article claims that percent-encoding a reserved character does create a distinct name, whereas percent-encoding an unreserved character just aliases the same name. So escaping / would make a new name but escaping something like 0 wouldn't. Confusing, huh?

As to the practical problem of PATH_INFO being unescaped, basically everyone told me "yeah, CGI's a hack like that". So going with the hack I'll just use the REQUEST_URI variable Apache sets. It's not documented anywhere I can find but it seems to be an unadulterated literal copy of what the client requested, from which I can do careful parsing. For my service clients will need to know to percent-escape any / or ? in their URLs. And I'll just hope nothing else in the network decides it's OK to unescape things on me.
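Here's a minimal sketch of what that careful parsing could look like in Python. The URL layout `/reverse/<escaped-subject-url>?<query>` and the function name are assumptions made up for illustration; in a CGI script the raw string would come from `os.environ["REQUEST_URI"]`.

```python
from urllib.parse import unquote


def parse_request_uri(request_uri, prefix="/reverse/"):
    """Split a raw, still-escaped request URI into (subject_url, query).

    The layout /reverse/<escaped-subject-url>?<query> is hypothetical.
    """
    if not request_uri.startswith(prefix):
        raise ValueError("unexpected prefix: %r" % request_uri)
    rest = request_uri[len(prefix):]
    # Split on the literal '?' first. Any '?' that belongs to the subject
    # URL must have been sent as %3F, so it is still escaped at this point.
    escaped_subject, _, query = rest.partition("?")
    if "/" in escaped_subject:
        raise ValueError("subject URL contains an unescaped '/'")
    # Only now is it safe to decode the subject URL.
    return unquote(escaped_subject), query
```

The point of the ordering is that structural characters get interpreted while the subject URL is still escaped; decoding happens last, exactly once.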

Some other suggested workarounds: length-delimit the subject URL so you know where it ends, have a magic string delimiting the end of the subject URL that you hope doesn't appear in any legitimate subject URL, or put the subject URL at the end of the request URL so that it ends where it ends. Any of these solutions could be made to work, I was just looking for the principle.

Thanks to SethG, RyanB, MikeB, GregW, GordonM, and SamR
  2006-08-23 01:25 Z
A central reward in most MMOGs is leveling up. You play for a few hours and then ding! you've gained a level. And a common design feature in most games is that the higher your level, the more time it will take to get to the next level.

The PlayOn folks have a nice graph of time to level in World of Warcraft as a function of level. I redrew it as level as a function of time. This view highlights the fact that your reward frequency diminishes over time. Early on it takes 2 hours of play to level up. But the rewards come more and more slowly to where you're playing 10 hours per "ding!".
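The redraw is just a running sum: add up the hours each level takes and you get the time at which each "ding!" happens. A small sketch, with hypothetical function names; it expects a list of hours per level such as the PlayOn data, which isn't reproduced here.

```python
from itertools import accumulate


def cumulative_hours(hours_per_level):
    """hours_per_level[i] is the play time needed to finish level i + 1.

    Returns the total hours played at the moment each level-up happens.
    """
    return list(accumulate(hours_per_level))


def level_at(hours_played, hours_per_level, start_level=1):
    """Level reached after hours_played: the level-as-a-function-of-time view."""
    level = start_level
    for total in cumulative_hours(hours_per_level):
        if hours_played < total:
            break
        level += 1
    return level
```

Plotting `level_at` against hours played gives the flattening curve: the same stretch of play time buys fewer and fewer levels.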

It's weird that rewards in the game so predictably slow down. The rewards themselves don't get more valuable; going from level 44 to 45 doesn't really get you more than going from 4 to 5.

Maybe the time spent makes the rewards just seem more valuable? Level 60 feels like a big accomplishment given the time involved. But then, that's totally artificial. Or maybe the game developers cynically know they can get away with less frequent rewards because there's so much sunk cost. Players spend so much time getting to 44 that they'll spend longer for 45 because they're already invested.

  2006-08-22 03:09 Z
The French for raccoon is raton laveur. Literally, the "washing rat".

The French for pie chart is camembert.

  2006-08-22 02:47 Z
I need help. I'm building a REST-style web service. Its URLs specify operations on other URLs, so I need to pass URLs as parameters to my REST service. Let's say my service reverses the text of whatever URL is specified.
The above request to my service means "return the contents of someurl reversed". To make things a bit more complicated, there may be other stuff in my URLs after someurl: the argument a=b in my example above.

So my question is, how do I correctly encode someurl? Let's say it's http://google.com/ I'm reversing. I'd think the request would contain a percent-quoted URL, something like

However, the CGI standard (and Apache2's implementation thereof) seems to be decoding the URL before it gets to my application. Ie: PATH_INFO contains
I can work around this (REQUEST_URI isn't decoded), but something about all this makes me uneasy. I'm relying on all web software between me and my client to not mess with my carefully encoded URL. If the CGI standard itself seems to think decoding is OK, who's to say some proxy or browser won't too? Or to ask the question more existentially, do these two URLs name the same resource? Or can they name different things?
http://example.com/foo/bar http://example.com/foo%2Fbar
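For what it's worth, here's how the encoding side looks with Python's standard library. This is only a sketch: the service base URL and the function name are made up for illustration.

```python
from urllib.parse import quote, unquote


def embed_url(service_base, subject_url):
    """Append subject_url to service_base as one fully escaped path segment."""
    # safe="" makes quote() escape '/' as well; by default it leaves '/' alone.
    return service_base + quote(subject_url, safe="")


request = embed_url("http://example.com/reverse/", "http://google.com/")
print(request)

# The two URLs from the question differ only in the escaped slash,
# and naive decoding collapses them into the same string.
print(unquote("http://example.com/foo%2Fbar") == "http://example.com/foo/bar")
```

The second check is the whole worry in one line: any layer that decodes eagerly erases the difference between the two names.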

Is there a good way to talk about URLs inside URLs? If you know the answer, mail me. I promise to share.

  2006-08-21 00:11 Z
One of the highlights of our trip to Australia was Tasmania. Tassie is quite different from mainland Australia; it's very fertile, lush, mountainous, and cool. It's also wild and largely unsettled. The whole west half of the island is covered in rugged mountains and rainforests; it reminded me of the Olympic Peninsula.

Hobart is the main town in Tasmania, but honestly it doesn't offer much to the tourist. But the nearby Port Arthur is fascinating; it's the ruins of the last and most notorious prison of the Victorian era.

We also took the five hour drive across the island to Strahan, a small tourist town (and former prison) on the west coast. The drive across was amazing: wild and empty roads with enormous forests. Absolutely beautiful. Strahan is the terminus of the West Coast Wilderness Railway to Queenstown, a mining town fascinating for its hideousness. The destination's not the point though; the trip itself is on a fantastically restored rack steam railway with beautiful engines that puff up the rugged mountains through beautiful rainforests. Strahan is definitely worth a special trip, for either the first class package on the train or a boat trip up the rivers into the rainforests. You're at the end of nowhere and it's beautiful.

  2006-08-20 00:24 Z
XML is bad for data transfer for lots of reasons. It's slow to parse, awfully wordy, doesn't support 8 bit data, etc etc. But my favourite thing is how most XML consumers can be induced to open arbitrary URLs.

There's a 4+ year old security hole in many XML parsers called XXE, the Xml eXternal Entity attack. Take a look:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ENTITY passwd SYSTEM "file:///etc/passwd">
  <!ENTITY http SYSTEM "http://example.com/delAll">
]>
<doc>
  &passwd; &http;
</doc>
Don't let all the pointy-parentheses confuse you (although they're a good reason to hate XML too). What's going on there is my document has defined two new entities called passwd and http and then used them. They're defined as expanding to the contents of URIs.

About 3/4 of the XML applications I've encountered out there will blindly do as ordered and load the URIs. The app will load the password file, at which point a clever hacker can usually induce the application to send it back to them (as an error message, for instance). Even better it will load the HTTP URL. Yes, many XML applications will load any URL you tell them to. From the app server. Nice, huh?

XXE is an old bug, but it keeps coming back because most people using XML would never think their little XML parser can be instructed to start opening network sockets. Acrobat had this bug just last year. XML parsers usually have obscure options to secure them, but many have them off by default. Why are we using a data format where this is possible at all?
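As one example of those obscure options, here's a sketch using the SAX parser in Python's standard library, with external entity loading explicitly switched off. The helper names are made up for illustration. As far as I know, recent Python versions already default to not loading external general entities, but setting the features explicitly makes the intent obvious.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges, feature_external_pes


class TextCollector(ContentHandler):
    """Collects all character data in the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text_safely(xml_bytes):
    """Parse XML with external entity loading switched off; return its text."""
    parser = xml.sax.make_parser()
    # Refuse to fetch external general and parameter entities.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.chunks)
```

Fed a document like the one above, the parser skips the `&passwd;` and `&http;` references instead of opening the file or the URL.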

  2006-08-17 00:48 Z
I'm working on a little webapp project. Usually I just use plain old Python CGI scripts for these things but this time I actually care about performance a bit so I did some delving into the modern world of Python webapps. Things are a lot better than they used to be. WSGI is the standard Servlet API, web.py (small) and Django (big) are good app frameworks, and Zope is mercifully sliding to retirement. But I'm stubborn and want to do things my own way, just use FastCGI to avoid invocation overhead.

Turns out it's easy on a Debian system. Install the packages python-flup and libapache2-mod-fastcgi. Then write a python2.4 program like this:


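from flup.server.fcgi import WSGIServer

# Module-level counter: it survives between requests because the
# FastCGI process stays alive instead of being re-invoked each time.
g = 0

def app(environ, start_response):
  global g
  status = '200 OK'
  response_headers = [('Content-type','text/plain')]
  start_response(status, response_headers)
  g += 1
  return ['Hello world! %d\n' % g]

if __name__ == '__main__':
  # Hand the WSGI app to flup's FastCGI server and start serving requests.
  WSGIServer(app).run()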

The one badness is that modifying your Python code won't force a code reload, which sucks for development. If you name your program "foo.cgi" Apache will invoke it as normal CGI and Flup supports that too, so that's what I'll do for now. Even if it behaves differently.

  2006-08-15 18:44 Z
I have about 300 photos from Australia I intend to edit and put online before I leave for France. But as anyone who takes a lot of digital photos will tell you, workflow is still a huge problem. Here's what I do now:
  • Stick flash card in a USB device, copy photos as files
  • Use Photoshop to convert raw to JPEG previews
  • Sort through JPEGs in ExifPro to find the good ones
  • Symlink (by hand) the raw files for the good photos
  • Lightly edit raw files in Photoshop, correcting exposure and cropping
  • Upload edited photos to Flickr
None of the automatic tools for photo workflow work for me. Either the image edit tools are bad, the database features are too limited, the software is too slow, or the UI is awful. And all of the fancy tools want to store my data in some proprietary database where I can never get at it. The workflow above is manual, but at least the data is transparent.
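The by-hand symlink step in the list above is the easiest one to script. Here's a rough sketch that assumes the good JPEG previews have been copied into their own directory and that each raw file shares its preview's base name. The directory layout, the `.CR2` extension, and the function name are all assumptions for illustration.

```python
import os


def link_raw_files(jpeg_dir, raw_dir, picks_dir, raw_ext=".CR2"):
    """Symlink the raw file matching each selected JPEG into picks_dir."""
    os.makedirs(picks_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(jpeg_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in (".jpg", ".jpeg"):
            continue  # ignore anything that isn't a JPEG preview
        raw_path = os.path.join(raw_dir, stem + raw_ext)
        link_path = os.path.join(picks_dir, stem + raw_ext)
        # Skip previews with no raw file and links that already exist.
        if os.path.exists(raw_path) and not os.path.lexists(link_path):
            os.symlink(os.path.abspath(raw_path), link_path)
            linked.append(link_path)
    return linked
```

The data stays transparent: the picks directory holds nothing but symlinks back to the untouched raw files.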

A couple of folks wrote to suggest I try the Adobe Lightroom beta. It's a lightweight photo workflow tool that combines an image database with simple editing and printing tools. I like it, particularly as a lightweight tool, and think it does 90% of what I do with Photoshop only more easily. I'm looking forward to seeing this become a real product, although it's hard to see how it will fit in with the rest of Adobe's products.
  2006-08-14 22:24 Z
About eight months ago I moved my blog to a new domain and set up 301 redirects to point everyone to it. 301 means "moved permanently": all bots should eventually stop hassling the old URL. Does it work? Mostly; here's a list of hits in last week's traffic from various bots.
808 hits, Rojobot
A feed aggregator with several old RSS URLs they apparently have no way of updating.
424 hits, TailRank robot
A service that claims to "find hot stories", in my case by looking at eight-month-old URLs and not following the "hot lead" that the site has moved. Grabs both RSS feeds and some HTML.
390 hits, Feedfetcher-Google
RSS crawler for Google Reader. I'm pretty sure I know the guy responsible for this bot; tsk tsk, B.
254 hits, Yahoo! Slurp and 160 Googlebot
Basic crawlers trying to grab some HTML. Arguably legitimate: they're verifying the 301 redirect is still live.
136 hits, BecomeBot
Some shopping site bot. Guys, I got nothing to sell.
108 hits, AppleSyndication
There's one person who uses the Safari RSS reader! Too bad about the bug.
None of this matters much; the 301 is cheap to serve and other than the Rojo and Safari examples I think no one would care that I stopped serving it. At least the big consumers like BlogLines have all switched over.
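A tally like the one above can be pulled from an access log with a few lines of Python. This sketch assumes Apache's combined log format, where the user-agent is the last quoted field; the function name is hypothetical.

```python
import re
from collections import Counter

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def count_redirect_agents(lines):
    """Count 301 responses per user-agent string."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(1) == "301":
            hits[m.group(2)] += 1
    return hits
```

Calling `count_redirect_agents(open("access.log")).most_common()` on a week's log would list the most persistent bots first. Grouping the full user-agent strings into named bots still takes some manual squinting.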
  2006-08-13 16:45 Z
A big part of online games is accumulating virtual wealth. If you're lazy you can buy virtual money with real money, paying to not play the game that you pay for. I did a little study of the price of 1000 gold on 168 US Warcraft servers and came up with some results.
1000 gold costs about $150, although it's over $200 on a fair number of servers. 1000 gold is a lot of gold. It's the price of an epic mount, the most expensive thing many people ever buy. The $150 to buy it is the cost of a year's subscription to WoW. A normal player at level 60 can make 1000 gold in about 50 hours, or about $3 / hour.
There's a lot of variation in the price of gold from server to server. But I can't find any correlation between the price of gold and anything else. Gold costs the same no matter if you're on an old server or new one, have a lot of players or few, or whether you're on a normal or PvP server. I was surprised to find no correlation; maybe the gold-for-dollars market is just really efficient?
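For anyone who wants to repeat the check, the simplest test is a Pearson correlation between the per-server gold price and each server attribute. The sketch below is a plain implementation of that coefficient; it is not necessarily the exact method used for the study above, and the function name is made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```

Feeding it the 168 prices alongside, say, server age in days would give a value near zero if there really is no linear relationship. It says nothing about non-linear ones.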

I got the price data from IGE and realm ages and populations from the WoW census.

Update: one thing I did not appreciate is how volatile gold prices are. Since writing the above two days ago IGE's median price for 1000 gold went from $153 to $177! I'm collecting more data both in time and from different gold merchants to understand this better.
  2006-08-11 01:19 Z
Every six months or so I do a project that requires drawing graphs from Python code. I usually use pygdchart because it's easy and I know it but it's awfully primitive and ugly. I've admired the output of matplotlib so I thought I'd give it a try.

The 2D rendering is slow. I wrote a quick program to generate a trivial graph with 4 points and then render that plot to an image 10 times in a row. Depending on the backend it can take almost a second to render a graph! Here are the milliseconds it takes to render a single plot using each of the various matplotlib backends:

GD: 440
Agg: 650
Cairo: 850

SVG: 60
PS: 90

There's also the no-op backend Template, 50ms. That's as fast as matplotlib is ever going to be. I'm disappointed by GD performance; it's not doing antialiasing or anything. Also surprised just how fast the SVG rendering is. I hope that's a viable publication mechanism some day; right now MSIE doesn't support it and Firefox misrenders it.
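A timing test along these lines can be sketched as follows. This is not the original program: the helper names are made up, only the Agg backend is shown, and the numbers it produces on current hardware and matplotlib versions will differ from the table above.

```python
import io
import time


def average_ms(fn, repeats=10):
    """Average wall-clock milliseconds per call of fn over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000.0 / repeats


def render_png_with_agg():
    """Render a trivial 4-point plot to an in-memory PNG with the Agg backend."""
    # Imported lazily so the timing helper above works without matplotlib.
    from matplotlib.figure import Figure
    from matplotlib.backends.backend_agg import FigureCanvasAgg

    fig = Figure()
    FigureCanvasAgg(fig)  # attach the Agg canvas to the figure
    ax = fig.add_subplot(111)
    ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    return buf.getvalue()
```

Running `average_ms(render_png_with_agg)` gives a per-plot figure comparable in spirit to the list above. The first call also pays for importing matplotlib, so it's worth calling the render function once before timing.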

  2006-08-10 21:57 Z
The AOL guys who published the search logs also wrote a paper. The data download includes a readme which says
Please reference the following publication when using this collection: G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.
That paper is readily online: Google search, ACM citation, author PDF download.

The first author is Greg Pass, an AOL employee. Can't find a web page for him but here are some papers he's published. The second author is Dr. Abdur Chowdhury, "AOL Chief Architect for Research at AOL". Ouch. Third author is Cayley Torgeson of Raybeam Solutions.

I gave the paper a quick read. It's an analysis of usage patterns of search engines: query frequencies, user behaviour, scalability requirements, etc. I didn't see any particularly surprising analysis but it's a summary of a lot of interesting hard-to-come-by data. I have a feeling that the AOL employees who released the search logs were honestly just trying to be good researchers and share their data. Only in this case they blew it. Now I feel a little bad for them.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
  2006-08-09 18:21 Z
AOL released a bunch of its users' search session data. A bit of looking (1, 2, 3, data, NYT, commentary) is revealing just how much is in these logs. Why hasn't AOL yet announced that the person responsible has been fired?
cannot sleep with snoring husband
god will fulfill your hearts desires
online friendships can be very special
friends online can be different in person
men need encouragement
men need a womans love
I'm glad the leak happened; now everyone can see just how sensitive search data is. And valuable, too. Search logs are the private corporate property of companies like AOL to use (or misuse) as they want. Or for the US government to subpoena. Is that the best world for us?
men like women with curvy bodies
recipe for coconut cream pie
how to flirt with a man
signs of skin cancer
a pimple that wont heal
insomnia after hysterectomy
We're just at the beginning of the technology to mine the database of intentions. Thanks to AOL, now everyone can play along at home and try their hand at analysis.
  2006-08-09 16:58 Z
My partner Ken and I will be living in Paris this fall. We've rented an apartment for September through November and will be spending most of our time in Paris itself, with occasional weekend trips around France.

Ken and I have talked often about living in Europe, either full time or part time. We really enjoyed being in Zürich last fall but in the end decided it wasn't quite the city for us. We've spent enough time in France as tourists to know we like it, so time to take the next step of actually living there for a while.

If you're in Paris or think you may be visiting there sometime this fall, drop me a note!

  2006-08-08 18:35 Z
One of the more odious-yet-legal businesses on the Internet is domain squatting. Companies grab unused domain names and run pages of ads on them. The less scrupulous ones also SEO the hell out of their crap ad sites to get them up in search results. It's a hugely profitable business; even Google has a product for domain squatters. But it's ugly bottom-feeding.

However, sometimes it's funny. I present to you kitenwar.com. It's "for resources and information on Kitchenware and Iraq Iran War". The top 5 ads are labeled kitchenware, Iraq Iran war, cookware, war on terrorism, and cutlery. Some other related searches: depleted uranium and hairless cat.

These particular ads seem to be coming from Yahoo/Overture. Not sure who owns the domain. What I was really looking for was kittenwar.com.

  2006-08-06 16:21 Z
The Internet used to be a silent experience. Now thanks to embedded audio, streaming video, podcasts, MySpace music, YouTube, etc. my daily Web trawl is interrupted every three minutes by some noise. All I want to do is peacefully listen to my Erik Satie and read some stuff. Is there any escape from noise?

Link to annoying sound I briefly embedded.

  2006-08-06 16:01 Z
IZArc is still good software. I wrote a while back about this compression utility that replaces WinZip, 7-Zip, WinRAR, and any other file archiver you may use. It's free, fast, and has great Windows shell integration so you almost never need to actually use the GUI. All you could want.

For some reason no one seems to know about this program, so I'm writing about it again. The homepage is currently featuring a heartbreaking plea for donations so he can afford to buy a copy of Visual Studio 2005 and release a Unicode version of IZArc. If you happen to work in Microsoft dev relations maybe you can find a free copy to ship him?

  2006-08-04 18:21 Z
I just got a laptop, so now I have the hell of synchronizing multiple machines. Yes, 20 years after the Network is the Computer I still can't transparently get from anywhere to my data and software. Google Browser Sync is one tool to solve this problem. It syncs up your Firefox settings, cookies, passwords, etc to a Google server. No matter where you launch Firefox it will sync your data, in theory giving you transparent access to browser state.

It mostly works, but it's got rough edges. Syncing is transparent and effective although I have mysteriously lost a couple of cookies. It syncs most of the data you care about except extensions and toolbar layout. Data is encrypted on the client, an essential feature. It's kind of ugly: the browser button is too big and it pops up annoying alerts if you run simultaneously on two machines.

But there's one crippling flaw: it adds about five seconds of startup time every time you launch Firefox. They must be getting a lot of complaints because it says right on the download page "we're working on it". Seems simple enough to fix, just don't sync every time.

What I find most interesting about this is if you squint, Google Browser Sync looks a whole lot like another piece of Microsoft's Passport/Hailstorm vision. I have no idea what Google's planning in this area, but it seems obvious to me that people want to centralize a lot of personal data. Google's proven they can build products that scale and I'd trust them with my data more than any other company. More like this please!

A couple of days after I put this on my weblog, Google fixed the startup time problem.
  2006-08-02 20:21 Z
In the early 1990s when Internet culture was coming together there was a lot of excitement that we were creating a new space free from the stupider rules and laws of our home countries. We even declared independence. But now the Internet is far too vital to our world economy and culture to remain free of traditional jurisdiction. But the question remains: how do countries enforce rules on a space that is both everywhere and nowhere?

The American approach is to extend its jurisdiction throughout the world. In the case of online gambling, by arresting people when they change planes in your country. David Carruthers (former CEO of BetOnSports) was arrested, held without bail, and charged with racketeering along with several of his staff. His company seems to have folded completely, promptly firing him and not appearing to support his not guilty plea.

Countries need to enforce laws on the Internet. But of all the things the US could worry about, why online gambling? And nabbing people while they're in transit through your country seems awfully authoritarian.

  2006-08-01 16:00 Z