Machine learning is becoming a mainstream technology any journeyman software engineer can apply. We expect engineers to know how to take an average and standard deviation of data. Perhaps it’s now reasonable to expect a non-expert to be able to train a learning model to predict data, or apply PCA or k-means clustering to better understand data.
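As an illustration of how accessible this has become, here’s a minimal k-means sketch with scikit-learn. The data is made up for the example; real work would start from a dataset, not six hand-placed points.

```python
# Minimal sketch: k-means clustering with scikit-learn on toy 2D data.
import numpy as np
from sklearn.cluster import KMeans

# Six points forming two obvious clusters
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                 [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the two learned centroids
```

That’s the whole thing: no gradient math, no distance metrics by hand. The library handles the details.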
The key change that’s enabling high-end machine learning like Siri or self-driving cars is the availability of very large computing clusters. Machine learning works better the more data you have, so being able to easily harness 10,000 CPUs to process a petabyte of data really makes a difference. For civilians with fewer resources, libraries like scikit-learn and cloud services make it possible to, say, train up a neural network without knowing much about the details of backpropagation.
The danger of inexpert machine learning is misapplication. The algorithms are complex to tune and apply well. A particular worry is overfitting: it looks like your system is predicting the data well, but really it has learned the training data too precisely and won’t generalize to new inputs. Being able to measure and improve machine learning systems is an art that I suspect can only be learned with lots of practice.
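The standard way to catch overfitting is to hold out part of the data and compare accuracy on the training set against accuracy on the held-out set; a big gap is the warning sign. A sketch with scikit-learn, where everything (the synthetic data, the noisy labels) is made up for illustration:

```python
# Sketch: detecting overfitting by comparing training vs. held-out accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
# Noisy labels: mostly determined by the first feature, plus random noise
# the model cannot possibly learn
y = (X[:, 0] + 0.3 * rng.rand(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree will memorize the training set
model = DecisionTreeClassifier().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)  # a large gap here is the classic sign of overfitting
```

The memorizing tree scores perfectly on data it has seen and noticeably worse on data it hasn’t; constraining the model (say, limiting tree depth) narrows the gap.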
I just finished an online machine learning course that was my first formal introduction. It was pretty good and worth my time; you can see my detailed blog posts if you want to know a lot more about the class. Now I’m working on applying what I’ve learned to real data, mostly using IPython and scikit-learn. It’s challenging to get good results, but it’s also fun and productive.
The addition of ad blocking capability to iOS has brought on a lot of hand-wringing about whether it’s ethical to block ads in your web browser. Of course it is! Blocking ads is self preservation.
Ad networks act unethically. They inject huge amounts of garbage that makes pages load slowly and computers run poorly. They use aggressive display tricks to get between you and the content. Sometimes negligent ad networks serve outright malware. They violate your privacy without informed consent and have rejected a modest opt-out technology. Ad systems are so byzantine that content providers pay third parties to tell them what crap they’re embedding in their own websites.
Advertising itself can be unethical. Ads are mind viruses, tricking your brain into wanting a product or service you would not otherwise desire. Ads are often designed to work subconsciously, sometimes subliminally. Filtering ads out is one way to preserve clarity of thought.
I feel bad for publishers whose only revenue is ads. But they and the ad networks brought it on themselves by escalating ad serving with no thought for consumers. The solution is for the ad industry to rein itself way in, to set some industry standards limiting technologies and display techniques. Perhaps blockers should permit ethical ads, although that leads to conflicts of interest. Right now Internet advertisers are predators and we are the prey. We must do whatever we can to defend ourselves.
The Ubiquiti NanoStation loco 5M is good hardware. It’s specialty gear for setting up long distance wireless network links. All of Ubiquiti’s networking gear is worth knowing about if you’re a prosumer-type networking person. I will probably buy their wifi access points next time I need one.
I’m using two NanoStations as a wireless ethernet bridge. My Internet up in Grass Valley terminates 200’ from my house. I couldn’t run a cable, but a hacky wireless thing I set up was sort of working. So I asked on Metafilter how to do a wireless solution right and got a clear consensus on using Ubiquiti equipment. $150 later and it works great! It’s kind of overkill; the firmware can do a lot more than just bridging and the radios are good for 5+ miles. But it’s reliable and good.
The key thing about Ubiquiti gear is the high quality radios and antennas. It just seems much more reliable than most consumer WiFi gear. Their airOS firmware is good too; it’s a bit complicated to set up but very capable and flexible. And in addition to normal 802.11n or 802.11ac they also have an optional proprietary TDMA protocol called airMax that’s designed for serving several long haul links from a single basestation. They’re mostly marketing to business customers, but the equipment is sold retail and well documented for ordinary nerds to figure out.
I still wish I just had a simple wire, but I’ve now made my peace with wireless networking. It works well with good gear in a noncongested environment. I wrote up some technical notes on modern wifi so I understood the details better. Starting with 802.11n and MIMO there was a significant improvement in wireless networking protocols; it’s really pretty amazing technology.
The CyberPower CP350SLG is a good small uninterruptible power supply. It’s only rated for 250W and it only has a few minutes of battery life. Not suitable for a big computer. But it’s perfect for backup power for network gear, like a router or a modem or the like. And it’s pretty small, just 7x4x3 inches. I made a mistake and bought APC’s small UPS first and the damn thing is ungrounded, which is ridiculous and dumb. I’ve had better luck with CyberPower UPSes anyway and this small one is exactly what I needed.
I’m a big fan of small UPSes. I don’t need something to carry me through a 30 minute power outage, I just want some backup that will keep my equipment running if the power drops for a couple of seconds. Because PG&E, you know? It’s a shame there’s no DC power standard, I bet you could make a DC-only UPS 1/4th the size with a lithium battery. But instead it’s all lead-acid batteries and producing 110V AC just to be transformed back to DC by all the equipment it’s powering. (That APC UPS does have powered USB ports, a small step towards DC UPS.)
Some day I should look into whole-house UPS units. A quick look suggests it’s about $2500 for 2.7kW, plus installation. This discussion suggests $10k is more realistic if you really mean a whole house.
Jupyter Notebooks (née IPython Notebooks) feel like an important technology to me. It’s a way to interactively build up a computer program, then save the output and share it with other people. You can see a sample notebook I made here or check out this gallery of fancy notebooks. It’s particularly popular with data scientists. If you’re an old Python fogey like me, it’s kind of a new thing and it’s exciting and worth learning about. I’m focusing on Python here, but Jupyter is now language-agnostic and supports lots of languages like R, Go, Java, even C++.
The notebook is basically a REPL hosted in a web browser. You type a snippet of code into the web page, run it, and it shows you the output and saves it to the notebook. Because the output is in the browser it can display HTML and images; see cells 6 and 7 in my sample notebook. There’s excellent matplotlib support for quick data visualization and lots of fancier plugins for D3, Leaflet, etc if you need.
Notebooks are made to be shareable. My sample is basically a static snapshot of a program’s output, there’s no Python running when you view it on GitHub. But you can also download that file, run the Python code on your own computer, and modify it however you want. That makes it an incredibly useful pedagogical tool. There are complex notebooks you can download that are effectively whole college courses in computing topics. And you can keep your own notebooks around as documentation of the work you’ve done. It’s a very powerful tool.
Behind the scenes what’s going on is the browser window acts like an attachable debugger. There’s a headless IPython server process running somewhere and the browser connects to it. It’s easy to run IPython yourself on your own machine, or there are other options including cloud hosted live notebooks. Most of the display magic works by having objects define their own _repr_html_ methods; that’s what lets Pandas show that nice HTML table in cell 6.
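As a sketch of that display mechanism, here’s a made-up toy class (not from any real library) that renders itself as HTML the way Pandas does:

```python
# Toy illustration of the notebook display protocol: when a cell's result
# has a _repr_html_ method, Jupyter calls it and renders the returned
# string as HTML instead of using plain repr().
class Fraction:
    def __init__(self, num, den):
        self.num, self.den = num, den

    def _repr_html_(self):
        return "<sup>%d</sup>&frasl;<sub>%d</sub>" % (self.num, self.den)

f = Fraction(3, 4)
# In a notebook you'd just evaluate `f` and see a rendered fraction;
# here we print the markup it would hand to the browser.
print(f._repr_html_())
```

Any object can opt in this way, which is how libraries get rich tables, images, and maps into cell output without the notebook knowing anything about them.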
Installing and starting IPython is pretty simple: just install it with pip and run ipython notebook. You’ll also want %matplotlib inline for inline plots. If you want a full Python data environment with Pandas, scikit-learn, etc., then the Anaconda distribution is an easy way to get going.
I just upgraded my Internet in Grass Valley from 1 Mbps to 12 Mbps. And it is so good. I no longer think about scheduling Internet access, about starting a download when I go to sleep or having the iPad download the newspaper while the coffee is brewing so it doesn’t interrupt my email. And I no longer worry about software upgrades killing my network.
1 Mbps is just not fast enough for the modern Internet. That’s 450 MBytes/hour, or about 30 minutes to update a typical medium-size program. It’s just about fast enough to stream a 360p video from Youtube, but you better not be doing anything else. It takes a little over a minute to download a crappy modern web page. I’d never click on video links, and those cute animated cat GIFs everyone likes were just pain. Slow is annoying but at least predictable; the worst thing is trying to do two things at once online. Once a week I’d be wandering around the house unplugging devices because some autoupdate decided to run and ruined whatever I was trying to do on my desktop computer.
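The arithmetic behind those numbers is easy to check (using decimal megabytes, and assuming a 225 MB update as the “typical medium-size program”):

```python
# Back-of-the-envelope bandwidth arithmetic: 8 bits per byte,
# "M" meaning decimal millions throughout.
def transfer_minutes(megabytes, mbps):
    """Minutes to move a file of the given size over a link of the given speed."""
    return megabytes * 8 / mbps / 60

print(1 * 1e6 / 8 * 3600 / 1e6)    # MB per hour at 1 Mbps -> 450.0
print(transfer_minutes(225, 1))    # a 225 MB update at 1 Mbps -> 30.0 minutes
print(transfer_minutes(225, 12))   # the same update at 12 Mbps -> 2.5 minutes
```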
Right now I think there’s a usability inflection point at 6 Mbps. That’s fast enough to watch a 1080p video stream while leaving some headroom to do something else; casual web browsing or the like. Faster is better of course; I have 100 Mbps in San Francisco and it is amazingly good. The FCC recently defined broadband at 25 Mbps, seems like a good goal.
Why was my Internet so slow? My house is in a rural area. Only a mile out of town, but all hills and trees and only a few houses. The Comcast/AT&T duopoly refuses to provide service to houses like mine and the toothless FCC won’t compel them. Wired service does not exist. Our local wireless ISP SmarterBroadband is pretty good, but we were limited to 900 MHz radio links because we didn’t have a clear view to one of the other sites in their peer-to-peer network. I finally paid someone to climb 70’ up a tree to get a good view and install a 5 GHz antenna. Works great, at least until we have to repair it.
I don’t know for sure, but I suspect SmarterBroadband is only able to provide my service thanks to federal subsidies. They upgraded a lot of their internal network in the last year with a federal grant. They were also nicely proactive in getting us to upgrade, which I suspect is tied to some sort of bounty or benchmark for customer bandwidth.
One of my first email addresses (in 1989) was tektronix!ogicse!reed!minar. I’m feeling old today and I’m guessing half my readers have never seen an email address like that. It was from the long long ago, in the time that was before the Internet, when UUCP was the main Unix mail system.
My unique email address was reed!minar. But there was no ubiquitous routing infrastructure for mail, no global addressing. Unix network email was store-and-forward based on scheduled phone calls and modem transfers via uucico. Each host only talked to a few other hosts. Reed talked to OGICSE regularly, so my address suggested mail be forwarded through there. Other mail hosts might or might not know how to get mail to OGI but they certainly knew how to get to Tektronix, so that sufficed as a global route. UUNET was a hub that knew how to talk to everyone; often addresses began uunet!.
The essential idea is that a UUCP email address encoded not just the destination but the route to reach it. It’s a powerful idea. But modern Internet systems don’t do that. Instead we rely on global address lookup systems like DNS and global routing systems like BGP. (If anyone can think of a modern system that includes routes in names, please email me via SMTP)
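The source routing can be sketched in a few lines: each host peels its neighbor off the front of the bang path and hands the rest along. (This function is my own illustration of the addressing scheme, not real UUCP code; the hostnames are the ones from my old address.)

```python
# Sketch of bang-path source routing: the address itself is the route,
# and each hop consumes one hostname from the front.
def next_hop(bang_path):
    """Split a bang path into (next host to dial, remaining route)."""
    hop, _, rest = bang_path.partition("!")
    return hop, rest

route = "tektronix!ogicse!reed!minar"
while "!" in route:
    hop, route = next_hop(route)
    print("forward via", hop)   # tektronix, then ogicse, then reed
print("deliver to user", route)  # minar
```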
UUCP users did build a routing system: pathalias. It relied on UUCP maps published to comp.mail.maps. Those maps were discontinued in December 2000. I haven’t found a modern view onto this data; it’d be fascinating to see the history of the growth of UUCPnet. telehack has a usable snapshot of the data, try uumap reed for instance.
The arrogant candidate says “I’m smart so I know my code is good”. That’s certainly a bad sign, although sometimes they’re right. Slightly wiser responses are “I run it and look closely” or “I trace the code and make sure it works like I expect”. Better, but too manual. The truly enlightened say “I have an automated test suite” and then you’re off to the real questions about how to test code properly.
I have a deep distrust of code. Software is organic, unpredictable, chaotically complex. It’s difficult enough to understand what the code you write now is likely to do right now with expected inputs. But hostile inputs, or a weird environment, or the same code a year from now, or the slightly modified open source contribution in some fork somewhere? Forget it. That’s why automated tests are so valuable. It’s a way to demonstrate the code is doing what you expect it to.
Writing good tests is hard, almost as hard as writing good code. Modern environments have a lot of testing tools you should learn. From language unit test frameworks to mock objects for servers to fuzz testing to various continuous integration systems for functional tests. GitHub projects have the miracle which is Travis CI, free no-fuss continuous build and test for any open source project. It’s amazing.
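At its simplest an automated test is just a few assertions you can re-run every time the code changes. A tiny sketch with Python’s built-in unittest; the function under test is made up for the example:

```python
# A minimal automated test suite with the standard library's unittest.
import unittest

def slug(title):
    """Toy function under test: turn a title into a URL slug."""
    return "-".join(title.lower().split())

class TestSlug(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slug("Hello World"), "hello-world")

    def test_extra_whitespace(self):
        # split() with no argument collapses runs of whitespace
        self.assertEqual(slug("  spaced   out "), "spaced-out")

if __name__ == "__main__":
    # exit=False so the test run also works inside a larger script
    unittest.main(exit=False)
```

Cheap as this looks, it’s exactly the artifact a CI system like Travis runs on every commit: the assertions are the documented, repeatable answer to “how do you know your code works?”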
So until software correctness proofs become a real tool we can use in real production code, ask yourself how you know your code is going to work. If you’re honest, you probably don’t. But some testing will certainly help give you at least a little confidence.