John Myles White
Statistics hacker, PhD student
Who are you, and what do you do?
I’m John Myles White. I’m one of the authors of Machine Learning for Hackers, and I’m a Ph.D. student in psychology. I spend most of my time trying to help people use statistics to understand the world around us.
What hardware do you use?
I mostly use either my 2008 MacBook, which has 4 GB of RAM and a 2.4 GHz Intel Core 2 Duo processor, or my first-generation iPad. I’ve got a 2007 iMac that I mostly use to watch movies, although it’s also acted as a server at times.
And what software?
I use Hazel to keep my files organized and I keep my PDF library organized using Papers 2. My only complaint about Papers 2 is that it doesn’t do the best job of figuring out when I have duplicates in my library, but I have the same complaint about iTunes. I also use the Kindle app on my iPad to do most of my pleasure reading. I’m trying to move away from paper books to save space and the environment.
I write all my code using TextMate 2. The alpha release is a little brittle, but I’ve been a huge fan of TextMate ever since I first started building sites using Ruby on Rails, so I was really excited to see TextMate 2 leave the world of vaporware. If I’m working only at the command line, I’ll use Emacs.
Most of my work involves programming, so programming languages and their libraries are the bulk of the software I use. I primarily program in R, but, if the situation calls for it, I’ll use MATLAB, Ruby or Python. Lately I’ve been programming a lot in a new language called Julia. It hasn’t even reached a 0.1 release yet, but it’s often nearly as fast as C while still being as readable as Ruby. I’m hoping that, in the near future, we’ll have Julia ready to replace R for the computationally heavy programming that R isn’t well suited for.
That said, for me the specific language I use is much less important than the libraries available for that language. In R, I do most of my graphics using ggplot2, and I clean my data using plyr, reshape, lubridate and stringr. I do most of my analysis using rjags, which interfaces with JAGS, and I’ll sometimes use glmnet for regression modeling. And, of course, I use ProjectTemplate to organize all of my statistical modeling work. To do text analysis, I’ll use the tm and lda packages.
To keep all of my files synchronized across the machines I own, I use Unison, a great program that seems to have never gained the traction it deserves. I also use Dropbox to collaborate with other people, although I’ve started to use Google Docs for a bunch of collaborative editing of Word documents. When I need to write something mathematical, I use MacTeX.
And, for version control, I use Git. I’m moving increasingly towards keeping all the work I do in Git, including all the text I write.
What would be your dream setup?
For me, the main limiting factor in my work is always memory: either RAM or hard disk space. I’m working at MSR this summer and there’s an urban legend that there’s a machine with 2 TB of RAM here that some of the researchers have access to. Having a machine with that kind of power is really my ideal, although I could also benefit from more hard disk space. My other dream situation would be to have a small Hadoop cluster running at home that’s only for my own work.