The Internet mostly survived the leap second two days ago. I’ve seen three confirmed problems. Cloudflare DNS had degraded service; they have an excellent postmortem. Some Cisco routers crashed. And about 10% of NTP pool servers failed to process the leap second correctly.

We’ve had a leap second roughly every two years. They often cause havoc. The big problem was in 2012 when a bunch of Java and MySQL servers died because of a Linux kernel bug. Linux kernels died in 2009 too. There are presumably a lot of smaller user application failures too, most unnoticed. Leap second bugs will keep reoccurring. Partly because no one thinks to test their systems carefully against weird and rare events. But also time is complicated.

Cloudflare blamed a bug in their code that assumed time never runs backwards. But the real problem is POSIX defines a day as containing exactly 86,400 seconds. But every 700 days or so that’s not true and a lot of systems jump time backwards one second to squeeze in the leap second. Time shouldn’t run backwards in a leap second, it’s just a bad kludge. There are some other options available, like the leap smear used by Google. The drawback is your clock is off by as much as 500ms during that day.

The NTP pool problem is particularly galling; NTP is a service whose sole purpose is telling time. Some of the pool servers are running openntpd which does not handle leap seconds. IMHO those servers aren’t suitable for public use. Not clear what else went wrong but leap second handling has been awkward for years and isn’t getting better.

techbad
  2017-01-03 00:38 Z