As a relatively new firm doing “magical things” with computers and math (as one client put it), we wanted to highlight some of the super smart people who are really the backbone of 0ptimus. The truth is that each team member plays a role and brings their own unique background and skill set to the group.
From time to time we’ll be posting closer looks at what the members of our analytics and digital analysis departments use on a daily basis. Be sure to stop by often to connect and learn more about the team.
This week we are highlighting our talented Analyst, Aaron Duke, Ph.D.
My work usually falls into one of four main areas: predictive modeling, conducting surveys, designing/interpreting randomized experiments, and writing reports. I also do a lot of other things that don’t clearly fall into these categories, such as working with vendors, consulting with clients, etc., but all in all, I am usually very busy. Right now I’m devoting most of my time to revamping our modeling pipeline, which involves being able to process data on the gigabyte/terabyte scale.
Hmm, a knight’s armor is an interesting analogy. However, I’m going to invoke a completely different analogy in answering your question. The ancient Greek warrior poet Archilochus wrote that “the fox knows many things; the hedgehog one great thing.” This idea was expanded upon by one of my favorite psychologists, Philip Tetlock, who wrote the book Expert Political Judgment, which summarizes his research into the predictive accuracy of political experts. It turns out that “foxes,” or individuals who entertain multiple theories simultaneously, are much better predictors than “hedgehogs,” or individuals who subscribe to a single, overarching theory. I like to think that I am much more of a fox than a hedgehog when it comes to my approach to data analysis. I prefer to use a variety of tools and methods, and I try to cross-validate everything.
A “data analytics stack” is a set of software and technologies that work together to facilitate storing, processing, analyzing, and visualizing data. Regarding my stack, right now I would say that I spend most of my time using the Python scientific stack (NumPy, SciPy, scikit-learn, pandas, Numba, NLTK, etc.). Luigi is my favorite package for managing complex data processing pipelines (ruffus and snakemake are not bad though). I’m also a huge fan of R, which is what I got started on in grad school. Some of my favorite packages for R are caret, data.table, ggplot2, dplyr, glmnet, randomForest, gbm, and SuperLearner. I could go on for a while. I’ve also played around with Julia, Scala, and Go, but I’m not nearly as proficient in these three. In terms of editors and IDEs, I mostly use Vim and Sublime Text, and RStudio for R of course. I’m also a huge fan of Linux and like to use shell commands such as awk, sed, and grep when I can.
Well, where does the story start? I suppose in graduate school. Looking back now I can see the signs were there: the desire to automate everything, the hours I spent writing manuscripts in LaTeX, my conversion to Bayesian statistics... all signs of a soon-to-be data scientist? Probably. At some point on my way to getting my Ph.D. in clinical psychology, I realized that I enjoyed research more than I enjoyed clinical work. Later, I realized that I enjoyed the tools, methods, and processes of research so much that the subject matter was secondary. I was doing a fellowship at Yale and became friends with a couple of folks in the Political Science department: Adam Dynes, who was finishing up his Ph.D. at the time, and Luke Thompson, who was doing a post-doc and had a shared interest in Bayesian methods. I credit them for getting me interested in this area, and it was ultimately through their connections that I ended up at 0ptimus.
Learn by doing. Classes can be helpful, but I learned a long time ago that classes can actually slow down learning when there is a strong intrinsic interest. Don’t get used to working with clean data. 80% of predictive modeling is getting the right data in the right format. Learn functional programming (e.g., map, reduce, filter, etc.), as it will allow you to scale your work when working with big data. Don’t limit yourself to one language. Start with Python if you don’t have any other background, but expand out from there. Finally, don’t neglect data visualization: matplotlib, ggplot2, and d3.js are all great libraries.
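To make the map/filter/reduce suggestion a bit more concrete, here is a minimal Python sketch. The records, the field names, and the `parse_age` helper are all made up for illustration. It is not taken from any real pipeline, just one way the pattern can look on deliberately messy data.

```python
from functools import reduce

# Hypothetical raw survey records; messy on purpose (blank and malformed ages).
records = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": ""},
    {"id": 3, "age": "51"},
    {"id": 4, "age": "n/a"},
]


def parse_age(record):
    """Return the age as an int, or None when the field cannot be parsed."""
    try:
        return int(record["age"])
    except ValueError:
        return None


# map: transform each record independently.
ages = map(parse_age, records)

# filter: drop the records whose age could not be parsed.
valid_ages = filter(lambda age: age is not None, ages)

# reduce: fold the remaining values into a running (sum, count) pair.
total, count = reduce(
    lambda acc, age: (acc[0] + age, acc[1] + 1),
    valid_ages,
    (0, 0),
)

mean_age = total / count if count else None
print(mean_age)
```

Because `map` and `filter` are lazy in Python 3 and each step looks at only one record at a time, nothing here needs the whole dataset in memory. That is roughly why the same style tends to carry over to chunked or distributed processing, although the details depend on the framework.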