Heaton Research

So you want to be a data scientist?

Harvard Business Review calls it the sexiest job of the 21st century. But what skills
are needed to become a data scientist, and how can you get these skills? I began as an
advanced computer programmer with business knowledge. Open Source involvement in
Artificial Intelligence gave me the foundation to move into a data science role.

There are really three critical skills that a data scientist must possess. A data scientist
must be a statistician, domain expert and hacker - not necessarily in that order. There
are different types of data scientists. Each type will be stronger in one of these three
skills. Let's take a look at each of these three skills and see how you might build up
your knowledge.

Statistician

Data scientists examine data and see insights and patterns in data. Seeing patterns in
data is nothing new. Sir Ronald Fisher was doing this back in 1936. Fisher was interested
in determining what sort of flower he was looking at. You might think it is easy to
determine a flower type. For humans this is an easy task. However, researchers are still
determining exactly how humans perform this seemingly simple task.

How does a human likely recognize a flower? Most likely we are considering features about
the flower. What color is it? How big is it? What does it smell like? Fisher wanted to
determine the iris species using a specified set of features about each iris. To do this
he collected four numeric measurements for 150 iris flowers. This included 50 flowers
each from three different iris species. Using these irises he was able to make a
statistical model that would tell him the iris species for a new flower using just these
four measurements.
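
To make the idea concrete, here is a minimal sketch in Python of classifying an iris from its four measurements. It uses a nearest-centroid rule, which is a simpler stand-in and not Fisher's own discriminant analysis. The nine training rows are taken from the classic iris data set; the names TRAINING, CENTROIDS, centroid and classify are my own choices for this illustration.

```python
import math

# A few rows from the classic iris data set, in the order
# (sepal length, sepal width, petal length, petal width), in centimeters.
TRAINING = {
    "setosa": [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2)],
    "versicolor": [(7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5)],
    "virginica": [(6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1)],
}


def centroid(rows):
    """Average each of the four measurements across one species' rows."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# One "typical flower" per species. This is a simplification, not Fisher's model.
CENTROIDS = {species: centroid(rows) for species, rows in TRAINING.items()}


def classify(flower):
    """Return the species whose centroid is closest to the new flower."""
    return min(CENTROIDS, key=lambda species: math.dist(flower, CENTROIDS[species]))


print(classify((5.0, 3.4, 1.5, 0.2)))
```

With only three rows per species this is a toy, but the shape is the same as the real problem: measurements go in, a species name comes out.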

There are many different ways to acquire statistical skills. If you are not familiar with
basic statistics terms such as z-score, p-value, ANOVA and Normal Distribution, you should
start with a Statistics 101 type class. Udacity offers several good choices for this.
As your statistical skills grow, you will find Khan Academy to be indispensable. I’ve
learned a great deal poring over Khan Academy and Wikipedia pages for several statistical
models. As you advance, Artificial Intelligence and specifically Machine Learning will
also become important. I’ve written several books in this space that might be useful for you.
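
If a term like z-score is new to you, a few lines of Python can make it less abstract. The sketch below uses only the standard library. The sample numbers are made up for illustration, and z_score is a helper name I chose, not a library function.

```python
import statistics

# Illustrative sample only; these numbers do not come from any real study.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)
spread = statistics.pstdev(sample)  # population standard deviation


def z_score(value):
    """How many standard deviations a value sits above or below the mean."""
    return (value - mean) / spread


for value in (2, 5, 9):
    print(value, z_score(value))
```

A z-score of 0 means the value is exactly average. The further the score is from 0, the more unusual the value is for this sample.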

Domain Expert

A domain expert is someone who has real-world knowledge about the data that they are
analyzing. While Fisher was a statistician, he was also an evolutionary biologist and a
geneticist. Fisher was a domain expert. The collected iris data was not just a sheet of
numbers to him. He knew something about the iris flowers he was analyzing.

This domain knowledge allowed him to know what flower measurements to consider. Fisher
measured the length and width of both the petal and sepal of each iris flower. He did not
measure the roots, stem thickness or chemical makeup of each flower. Because Fisher knew
something about flowers, he had an idea of which measurements to consider. He also knew
if his results made any sense.

Being able to determine if your model makes any sense is critical. Dogs of the Dow was a
popular investing strategy from the early 1990s that would seemingly pick winning portfolios
based on very simple data. Just plug in the dividend yields of the DJIA-30 stocks to gain
a portfolio that beats the overall stock market average. Analysis of historic data created
this model. The problem is that this model fit the historic data much better than it did
the future data. “Dogs of the Dow” found a mostly coincidental pattern in the historic
data. While there are some holdouts, the Dogs of the Dow is now a largely discredited
model.
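
You can watch this kind of coincidental pattern appear in a small simulation. The sketch below is not the Dogs of the Dow and uses no market data. It assumes a hypothetical market that is a pure coin flip, and 200 hypothetical "strategies" that are also coin flips. It then picks whichever strategy looked best on the historic periods and checks it on the future periods. All names and sizes here (PAST, FUTURE, RULES, hit_rate) are my own assumptions.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

PAST, FUTURE, RULES = 40, 400, 200

# Hypothetical market: each period is a coin flip (True = beat the average).
market = [rng.random() < 0.5 for _ in range(PAST + FUTURE)]

# Hypothetical strategies: each one is just as random as the market itself.
rules = [[rng.random() < 0.5 for _ in range(PAST + FUTURE)] for _ in range(RULES)]


def hit_rate(rule, start, stop):
    """Fraction of periods in [start, stop) where the rule matched the market."""
    hits = sum(rule[i] == market[i] for i in range(start, stop))
    return hits / (stop - start)


# Choose the rule that looks best on historic data only.
best = max(rules, key=lambda rule: hit_rate(rule, 0, PAST))

in_sample = hit_rate(best, 0, PAST)
out_of_sample = hit_rate(best, PAST, PAST + FUTURE)
print(f"historic: {in_sample:.0%}, future: {out_of_sample:.0%}")
```

The winning rule should look impressive on the historic periods and land close to a coin flip on the future ones, even though nothing real was ever discovered. Search enough candidate models and one of them will fit the past by luck alone.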

Becoming a domain expert is somewhat more elusive. The first question you should be
asking is “what domain?” You might choose a domain such as finance, marketing, biology,
or any other common business field. It will be helpful if you already have experience in
a particular industry. If not, try to take some courses that will expose you to the data
of a particular industry. Economics, marketing and finance classes are always good choices.

Hacker

Finally, a data scientist must be a hacker. At first this idea might seem strange. By
hacker, I do not mean someone who attempts to circumvent computer security. For this
definition, a hacker is a programmer. However, not every programmer is a hacker. A hacker
is a programmer who will hack at a problem until that problem is solved. The hacker is not
intimidated by hitting a brick wall. The hacker will come up with a very creative way
around the brick wall, even if earlier attempts have all failed.

Fisher did not need to be a hacker. Fisher had 150 flowers to analyze. He measured each
one by hand and made sure his data was clean and accurate. Suppose instead that Fisher had
150 million flowers, and that a mechanical process with a 90% accuracy rate measured each of
them. That leaves a huge amount of somewhat inaccurate data. We now have a
“Big Data” problem. Big Data is any data set that is so large that it is difficult to
work with. Typically “Big Data” starts at the point where a data set can no longer fit in
the memory of a single computer. Not long ago everything over 640k (the original useable
memory size of a PC) was “Big Data”.

The hacker can wrangle “Big Data” and get it into a form that a statistical model can
handle. This wrangling might mean merging data from many sources, or writing automated
programs to harvest data from the Internet. The hacker might need to clean the data in
some way. The data might need further wrangling to even get it into a statistical model.
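
Here is a small sketch of what that wrangling can look like in Python. It assumes two hypothetical sources, one with machine readings and one with species labels. The field names, the records and the helpers parse_length and clean_and_merge are all invented for this example.

```python
# Hypothetical records from two sources; every value here is made up.
measurements = [
    {"id": 1, "petal_length": "1.4"},
    {"id": 2, "petal_length": "n/a"},   # unreadable value from the machine
    {"id": 3, "petal_length": "-5.1"},  # physically impossible reading
    {"id": 4, "petal_length": "4.7"},
]
labels = [
    {"id": 1, "species": "setosa"},
    {"id": 3, "species": "virginica"},
    {"id": 4, "species": "versicolor"},
]


def parse_length(text):
    """Return a positive float, or None if the reading is unusable."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if value > 0 else None


def clean_and_merge(measurements, labels):
    """Drop bad readings, then join the two sources on their shared id."""
    species_by_id = {row["id"]: row["species"] for row in labels}
    merged = []
    for row in measurements:
        length = parse_length(row["petal_length"])
        if length is None or row["id"] not in species_by_id:
            continue  # skip rows that are unusable or have no label
        merged.append({"id": row["id"], "petal_length": length,
                       "species": species_by_id[row["id"]]})
    return merged


rows = clean_and_merge(measurements, labels)
print(rows)
```

Dropping bad rows is only one possible choice. Depending on the problem, you might instead flag them, repair them, or estimate the missing values.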

If you are a computer programmer already, then you already have some of the hacker skills.
If you are a computer programmer who spends their free time learning new programming
skills and perhaps contributing to open source, then you might be a hacker! If not, try
something new. Two of the most widely used data science languages are Python and R. Java,
C# and C/C++ are also choices. Udacity and Coursera (https://www.coursera.org/) both have
several courses that let you sharpen the hacker skills that data science requires. The best
way to learn to be a hacker
is to hack. Practice examples and then experiment with data that interests you.

Acquiring Data Science Skills

To summarize, a data scientist must have three primary skills.

  • Mathematics & Statistics (statistician)
  • Business Domain Knowledge (real world knowledge)
  • Hacker (creative computer programmer)

There are also a number of programs available to teach data science.

There are also many great data science blogs. I personally read the following.

For me, the road to data science started with programming. I worked for many years as a
computer programmer in the life insurance industry. As a result, I learned quite a bit
about life insurance data. I also learned how to crunch data and provide reports to
present data. Long before I had ever heard the term “data science,” I developed an
interest in Artificial Intelligence. What does a hacker do when they are interested in
something? I started experimenting and programming. I learned more about AI. This
knowledge ultimately opened doors and I moved into more of a data science role.

If you are interested in AI, you might find some of my projects interesting. I maintain
open source machine learning projects, write a blog and write books.