Drew Conway describes data science as the combination of domain expertise, statistics,
and hacker skills. If you are an IT programmer, you likely already have the last
requirement. If you are a good IT programmer, you probably also understand something
about the business data, which covers the domain expertise requirement. In this post I
describe how I gained knowledge of statistics and machine learning through a path of open
source involvement and publication.
There are quite a few articles that discuss how to become a data scientist. Some of them
are even quite good! Most speak in very general terms. I wrote such a summary a while
back that gives a very general description of what a data scientist is. In this post,
I will describe my own path to becoming a data scientist. I started out as a Java
programmer in a typical IT job.
My publications were some of my earliest credentials. I started publishing before I had
my bachelor’s degree. My publications and side programming jobs were the major factors
that helped me obtain my first “real” programming job, working for a Fortune 500
manufacturing company, back in 1995. I did not yet have a degree at that point; I was
working on a bachelor’s degree part-time.
Back in the day, I wrote for publications such as C/C++ Users Journal, Java Developer’s
Journal, and Windows/DOS Developer’s Journal. These were all paper-based magazines,
often on the racks at book stores. The world has really changed since then! These days
I publish code on sites like GitHub and CodeProject. A great way to gain experience is
to find interesting projects to work on, using open source tools, and then post your
projects to GitHub, CodeProject and other such sites.
I’ve always enjoyed programming and have applied it to many individual projects. Back in
the 80’s I was writing BBS software so that I could run a board on a C64, since my high
school jobs did not pay enough to purchase a RAM expander. In the 90’s I was hooking up
web cams and writing CGI, and later ASP/JSP code, to build websites. I wrote web servers
and spiders from the socket up in C++. Around that time I wrote my first neural network.
Always publish! A hard drive full of cool project code sitting in your desk tells the
world nothing about what you’ve done. Support open source; a nice set of independent
projects on GitHub looks really good.
Artificial intelligence is closely related to data science. In many ways data science
is the application of certain AI techniques to potentially large amounts of data. AI is
also closely linked with statistics, an integral part of data science. I started with AI
because it was fun. I never envisioned using it in my “day job”. As soon as I finished my
first neural network, I wrote an article about it for Java Developer’s Journal. I quickly
discovered that AI had a coolness factor that could help me convince editors to publish
my software.
I also published my first book on AI.
Writing code for a book is very different from writing code for a corporate or open
source project:
- Book code: Readability and understandability are paramount. Second to none.
- Corporate/open source code: Readability and understandability are important. However, real-world necessity often forces scalability and performance to take the front seat.
For example, if my book’s main goal is to show how to use JSP to build a simple blog,
do I really care if the blog can scale to the traffic seen by a top-100 website?
Likewise, if my goal is to show how a backpropagation neural network trains, do I really
want to muddy the water with concurrency?
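To make that concrete, here is a minimal sketch of the kind of book-style backpropagation example I mean: a tiny 2-2-1 network learning XOR in plain, single-threaded Python. This is illustrative only, not code from my books or from Encog; the structure, names, and hyperparameters are all my own invention.

```python
import math
import random

# XOR: the classic minimal training set for a backpropagation example.
DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
# A 2-2-1 network; the last weight in each row is the neuron's bias.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_output = [random.uniform(-1, 1) for _ in range(3)]
LR = 0.5  # learning rate

def forward(inputs):
    hidden = [sigmoid(w[0] * inputs[0] + w[1] * inputs[1] + w[2])
              for w in w_hidden]
    out = sigmoid(w_output[0] * hidden[0] + w_output[1] * hidden[1] + w_output[2])
    return hidden, out

def total_error():
    return sum((target - forward(inputs)[1]) ** 2 for inputs, target in DATA)

initial_error = total_error()

for _ in range(5000):
    for inputs, target in DATA:
        hidden, out = forward(inputs)
        # Output delta: sigmoid derivative times the output error.
        delta_out = out * (1 - out) * (target - out)
        # Hidden deltas: the output delta propagated back through w_output.
        delta_hidden = [h * (1 - h) * w_output[i] * delta_out
                        for i, h in enumerate(hidden)]
        # Gradient-descent weight updates (bias input is an implicit 1).
        for i in range(2):
            w_output[i] += LR * delta_out * hidden[i]
        w_output[2] += LR * delta_out
        for i in range(2):
            for j in range(2):
                w_hidden[i][j] += LR * delta_hidden[i] * inputs[j]
            w_hidden[i][2] += LR * delta_hidden[i]

final_error = total_error()
print(f"error: {initial_error:.3f} -> {final_error:.3f}")
for inputs, _ in DATA:
    print(inputs, round(forward(inputs)[1], 2))
```

Note what is missing: no threads, no batching, no vectorization. For teaching, that absence is the point; for production, it is exactly what an open source project like Encog has to add.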
The neural network code in my books is meant to be example code: a clear starting point
for something. This code is not meant to be “industrial strength”. However, when people
start asking you questions that indicate they are using your example code for real
projects, it is time to start (or join) an open source project! This is why I started
the Encog project. It might be a path to an open source project for you as well!
I’ve often heard that neural networks are the gateway drug to greater artificial
intelligence. Neural networks are interesting creatures. They have risen and fallen
from grace several times. Currently they are back, and with a vengeance: most
implementations of deep learning are based on neural networks. If you would like to
learn more about deep learning, I ran a successful Kickstarter campaign on that very
topic.
I took several really good classes from Udacity, just as they were introduced. These
classes have since been somewhat re-branded. However, Udacity still offers several great
AI and machine learning courses. I also recommend (and have taken) the Johns Hopkins
Coursera Data Science specialization. It’s not perfect, but it will expose you to many
concepts in AI. You can read my summary of it here.
Also, learn statistics. At least the basics of classical statistics. You should
understand concepts like mean, mode, median, linear regression, ANOVA, MANOVA, Tukey HSD,
p-values, etc. A simple undergraduate course in statistics will give you the foundation.
You can build on more complex topics, such as Bayesian networks, belief networks, and
others, later. Udacity has a nice introductory statistics course.
Public projects are always a good thing. My projects have brought me speaking
opportunities and book opportunities (though I mostly self-publish now). Kickstarter
has been great for this. I launched my Artificial Intelligence for Humans series of
books through Kickstarter.
When data science first started to enter the public scene, I was working as a Java
programmer and writing AI books as a hobby. A data science position later opened up at
my current company. I did not even realize that the opportunity was available; I really
was not looking. However, during the recruiting process they discovered that someone
with knowledge of the needed areas lived right here in town: they had found my project
pages. This led to some good opportunities right in my current company.
The point is, get your projects out there! If you don’t have an idea for a project, then
enter Kaggle. You probably won’t win. Try to become a Kaggle master; that will be hard,
but you will learn quite a bit trying. Write about your efforts. Post code to GitHub.
If you use open source tools, write to their creators and send links to your efforts.
Open source creators love to post links to people who are actually using their code.
For bigger projects (with many or institutional contributors), post to their communities.
Kaggle gives you a problem to solve. You don’t have to win; it will give you something
to talk about during an interview.
I try to always be learning. You will always hear terminology that you feel you should
know, but do not. This happens to me every day. Keep a list of what you don’t know, and
keep prioritizing and tackling that list (dare I say, backlog grooming). Keep learning!
Get involved in projects like Kaggle and read the discussion boards. They will show you
what you do not know really quickly. Write tutorials on your efforts. If something was
hard for you, it was hard for others, who will appreciate a tutorial.
I’ve seen a number of articles that question “Do you need a PhD to work as a data
scientist?” The answer is that it will help, but is not necessary. I know numerous
data scientists with varying levels of academic credentials. A PhD demonstrates that
someone can follow the rigors of formal academic research and extend human knowledge.
When I became a data scientist I was not a PhD student.
At this point, I am a PhD student in computer science; you can read more about that here.
I want to learn the process of academic research because I am starting to look at
algorithms and techniques that would qualify as original research. Additionally, I’ve
given advice to several other PhD students who were using my open source projects in
their dissertations. It was time for me to take the leap.
Data science is described, by Drew Conway, as the intersection of hacker skills,
statistics and domain knowledge. As an “IT programmer” you most likely already have
two of these skills. Hacker skills are the ability to write programs that wrangle data
into many different formats and automate processes. Domain knowledge is knowing
something about the business that you are programming for. Is your business data just
a bunch of columns to you? An effective IT programmer learns about the business and its
data. So does an effective data scientist.
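If “wrangle data into many different formats” sounds abstract, here is a tiny, hypothetical example of what I mean, using only the standard library: read a CSV export, aggregate it, and re-emit it as JSON. The data and field names are invented.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw export: the kind of file an IT programmer sees daily.
raw = """region,product,units
East,Widget,10
West,Widget,4
East,Gadget,7
West,Gadget,2
East,Widget,5
"""

# Wrangle: total the units sold per region, then re-emit as JSON.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["region"]] += int(row["units"])

print(json.dumps(dict(totals), sort_keys=True))  # {"East": 22, "West": 6}
```

Trivial on its own, but chain a few dozen steps like this together, schedule them, and handle the messy edge cases, and you have the data-pipeline half of a data scientist’s job.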
This leaves only statistics (and machine learning/AI). You can learn that from books,
MOOCs, and other sources; some were mentioned earlier in this article. I have a list of
some of my favorites here. I also have a few books to teach you about AI. Most
importantly, tinker and learn. Build and publish projects, blog, and contribute to open
source. When you talk to someone interested in hiring you as a data scientist, you will
have experience to talk about. Also have a GitHub profile, linked from LinkedIn, that
shows you do in fact have something to talk about.