Who knew that Microsoft Windows had so many different encodings?

There's CP1252, the almost-but-not-quite ISO-Latin-1 that is responsible for the evil breakage of "smart quotes" by encouraging web publishers to act like 0x93 is a valid way to represent a left double quote. (In real Latin-1, 0x93 is an invisible control character.) At least it encodes É in a sensible place, 0xc9.

But why stop at one codepage? There's also CP437, an ancient DOS codepage that is nothing like Latin-1 but contains Latin-1 characters like É at 0x90. Yes, that's a different place than CP1252.
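
To make the byte positions concrete, here is a small sketch that asks Python's built-in codecs where each codepage puts É and what each makes of the 0x93 "smart quote" byte. It assumes Python 3, and the helper name byte_for is made up for this illustration.

```python
# Python 3 sketch; byte_for is a hypothetical helper, not a library function.
E_ACUTE = "\u00c9"  # É, LATIN CAPITAL LETTER E WITH ACUTE

def byte_for(char, codec):
    """Return the single byte value that codec uses for char."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError("%r is not a single byte in %s" % (char, codec))
    return encoded[0]

# Where does each codepage put É?
positions = {codec: byte_for(E_ACUTE, codec)
             for codec in ("latin-1", "cp1252", "cp437")}
# positions == {"latin-1": 0xC9, "cp1252": 0xC9, "cp437": 0x90}

# The "smart quote" byte: a curly quote in CP1252, a control code in Latin-1.
quote_cp1252 = b"\x93".decode("cp1252")   # U+201C LEFT DOUBLE QUOTATION MARK
quote_latin1 = b"\x93".decode("latin-1")  # U+0093, a C1 control character

# Reading a Latin-1 É (0xC9) as CP437 yields a box-drawing character instead.
misread = b"\xc9".decode("cp437")         # U+2554, a double-line box corner
```

The last line is the filename garbling in miniature: the byte is the same, only the codepage used to interpret it changes.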

Apparently both of these evil 20th century codepages are still coexisting on my 21st century Windows XP system. I just dumped a bunch of MP3 files from my WinXP box to Linux and found the filenames hopelessly garbled. I finally guessed they're in CP437. I'm a bit surprised Samba didn't take care of it for me.

Python to the rescue:

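def cp437ToLatin1(s):
  # s is a byte string in CP437. Decode it to Unicode, then re-encode as Latin-1.
  # Raises UnicodeEncodeError for CP437 characters Latin-1 lacks (box drawing, etc).
  return s.decode('cp437').encode('latin-1')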
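
For the actual cleanup, something like the following sketch could apply the conversion to every name in a directory. It assumes Python 3 on Linux, where passing a bytes path to os.listdir returns the raw filename bytes, and it assumes Latin-1 is the encoding you want on disk. The name rename_cp437_files and its dry_run parameter are invented for this example, and the conversion function is repeated so the block stands alone.

```python
import os

def cp437ToLatin1(s):
    """Reinterpret CP437 bytes as text and re-encode them as Latin-1."""
    return s.decode("cp437").encode("latin-1")

def rename_cp437_files(directory, dry_run=True):
    """Convert filenames in directory (a bytes path) from CP437 to Latin-1.

    Returns the list of (old, new) name pairs. Nothing is renamed on disk
    unless dry_run is False. This is a hypothetical helper, not a library call.
    """
    changes = []
    for old in os.listdir(directory):  # bytes path in, bytes names out
        try:
            new = cp437ToLatin1(old)
        except UnicodeEncodeError:
            continue  # name uses a CP437 character that Latin-1 lacks
        if new != old:
            changes.append((old, new))
            if not dry_run:
                os.rename(os.path.join(directory, old),
                          os.path.join(directory, new))
    return changes
```

Calling rename_cp437_files(b"/path/to/mp3s") with the default dry_run=True only reports what would change, which seems wise given that "they're in CP437" was a guess.
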
2003-05-11 22:26 Z