
My Experience with the Coursera Johns Hopkins Data Science Certification

This post is a summary of several posts that I had on my old blog about the Johns Hopkins Data Science certification offered by Coursera.

I am not sure how typical of a student I was for this program. I currently work as a data scientist, have a decent background in AI, have a number of publications, and was completing a PhD in computer science at the time of this program. So, a logical question is what did I want from this program?

At the time I took this certification I had not done a great deal of R programming, and this program focuses heavily on R. I view this as both a strength and a weakness of the program. I am mostly a Java/Python/C#/C++ guy, so I found the R instruction very useful. Since I've focused mainly on AI/machine learning, I hoped this program would fill in some gaps.

I really liked this program. Courses 1-9 provide a great introduction to the predictive modeling side of data science. Both machine learning and traditional regression models were covered. R can be a slow and painful language at times, but I was able to get through. In my opinion, R is primarily useful for ferrying data between models and visualizations; it is not good for heavy lifting and data wrangling, and its syntax is somewhat appalling. However, it is a domain-specific language (DSL), not a general-purpose language like Python. Don't get me wrong: I like R for setting up models and graphics, just not for performing tasks better suited to a general-purpose language.

In a nutshell, here are my opinions.

  • Pros: With the exception of the capstone, very practical real-world data sets. Experience with both black-box (machine learning) and more explainable (regression) models. An introduction to Slidify and Shiny; I've already used both in my day-to-day job. It takes some real work and understanding to make it through this program. The last three courses rocked!

  • Cons: Peer review is really hit or miss (more on this later). Some lecture material (statistical inference) was sub-par compared to Khan Academy; only reinvent the wheel if you are going to make a better wheel.

Course Breakdown

Here are my quick opinions on some of these courses.

  1. The Data Scientist's Toolbox: Basically: can you install R and RStudio, and use GitHub? I had already done all three, so I got little from this course. If you have not dealt with R, RStudio, and GitHub, this class will be a nice, slow intro to the program.
  2. R Programming: I enjoyed this course! It was hard, considering I was taking classes #1 and #3 at the same time. If you have no programming experience, this course will be really hard! Be ready to supplement the instruction with lots of Google searching.
  3. Getting and Cleaning Data: Data wrangling is an important part of data science, and getting data into a tidy format is important. This course used quite a bit of R programming. For me, not being an R programmer and taking course #2 at the same time meant extra work. If you are NOT already an advanced programmer, DO NOT take #2 and #3 at the same time.
  4. Exploratory Data Analysis: This was a valuable class; it taught you all about the R graphing packages.
  5. Reproducible Research: Valuable course! Learning about R Markdown was very useful. I am already using it in one of my books, providing an RMD script that regenerates all the charts so several of my examples are reproducible. (A minimal sketch of the idea appears after this list.)
  6. Statistical Inference: This was an odd class. I already knew statistical inference and did quite well despite hardly watching any lectures. I don't believe this course made many people happy: either you already knew the topic and were bored, or you were completely lost trying to learn statistics for the first time. There are several Khan Academy videos that cover all the material in this course, so why does Hopkins need to reproduce it? Is this not the point of MOOCs? Why not link to the Khan Academy videos and test the students? Best of both worlds! Also, 90% of the material was not used in the rest of the program, so I suspect many students were left wondering what this course was for.
  7. Regression Models: Great course; this is the explainable counterpart to machine learning. You are introduced to linear regression and GLMs, and the course is set up as the perfect counterpart to #8 (see the sketch contrasting the two after this list). My only beef with this course was that I got screwed by peer review. More on this later.
  8. Practical Machine Learning: Great course. This course showed some of the most current model types in data science: Gradient Boosting Machines (GBMs) and Random Forests. It also gave a great description of boosting, along with an awesome Kaggle-like assignment where you submitted results from your model to see if you could match a "hidden data set".
  9. Developing Data Products: Great course. I really enjoyed playing with Shiny, and even used it for one of the examples in my upcoming book. You can see my Shiny project [here](https://jeffheaton.shinyapps.io/shiny/). They encouraged us to post these to GitHub and public sites, so I assume I am not violating anything by posting it here. Don't plagiarize me!!! (A bare-bones Shiny sketch also appears after this list.)
  10. Capstone Project: Bad ending to an otherwise great program.
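
To give a sense of what course #5 covers, here is a minimal R Markdown sketch (the file contents and data set are my own invention, not a course assignment). Knitting the file re-runs the analysis and regenerates the chart, which is what makes the report reproducible:

````
---
title: "Reproducible Report Sketch"
output: html_document
---

```{r mpg-summary}
# Any reader who knits this file re-runs the exact same analysis.
data(mtcars)
summary(mtcars$mpg)
```

```{r mpg-plot}
# The chart is rebuilt from the raw data on every knit,
# so the figure can never drift out of sync with the text.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "MPG")
```
````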
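Courses #7 and #8 pair nicely, so here is a hedged sketch of the contrast (the data set and predictors are my choice, not the course assignments): the regression models explain themselves through coefficients, while the random forest trades that transparency for flexibility.

```r
library(randomForest)  # assumes install.packages("randomForest")

data(mtcars)

# Course #7 style: explainable models with interpretable coefficients.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)  # each coefficient has a direct reading

# A GLM: logistic regression on a binary outcome.
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Course #8 style: a black-box random forest on the same relationship.
set.seed(42)
fit_rf <- randomForest(mpg ~ wt + hp, data = mtcars, ntree = 500)
importance(fit_rf)  # variable importance, but no simple equation

# Both predict; only the regression explains itself.
predict(fit_lm, newdata = mtcars[1:3, ])
predict(fit_rf, newdata = mtcars[1:3, ])
```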
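And here is the Shiny pattern from course #9 boiled down to its core (a generic toy, not my actual project): a UI input feeds a reactive server function that re-renders the output whenever the input changes.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Minimal Shiny sketch"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # Re-renders automatically whenever the slider moves.
    hist(mtcars$mpg, breaks = input$bins, main = "", xlab = "MPG")
  })
}

shinyApp(ui = ui, server = server)
```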

Peer Reviewed Grading

If you are not familiar with peer-reviewed grading, here is how it works. For each project, four peers review and grade your assignment. This is mostly double-blind, as neither the student nor the reviewer knows who the other is. (I used my regular GitHub account on all assignments, so it was pretty obvious who I was; I was even emailed by a grader once who recognized me from my open source projects.) Your grade is the average of what those four people gave you. At $49 a course, maybe this is the only way they can afford to grade. I currently spend nearly 100 times that for each of my PhD courses. :(

Overall, peer review grading worked well for me in all courses but one. Here are some of my concerns about peer grading.

  • You probably have many graders who are pressed for time and just give high marks without much thought (just a guess/opinion).
  • You are going to be graded by people who may not have gotten the question correct in the first place.
  • You are instructed NOT to run the R program. So now I am being graded on someone’s ability to mentally compile and execute my program?
  • Each peer is going to apply different standards. You could get radically different marks depending on who your four peers were.

So here is my story about the one case where peer review did not work for me. I scored in the upper 98-99% range on most of these courses, except for course #7. I had good scores going into the final project. However, two of my peers knocked me for these reasons:

  • Two of my peers could not download my file from Coursera, yet the other two had no problem. Fine, so I get a zero because someone's ISP was flaking out.
  • Two of my peers did not give me credit because they felt I had not used RMD for my report (which I had). Fine, so I lose a fair amount of points because two random peers did not know what RMD output looks like.

This took a toll on my grade, though I still passed. But this is the one course for which I did not earn "with distinction" credit. Yeah, big deal; in the grand scheme of things I don't really care. It is just mildly annoying. However, if you are hovering near 70% and you get one or two bad reviewers, you are probably toast.

Capstone Project

The capstone project was to produce a program similar to SwiftKey's, the company that was the partner/sponsor for the capstone. If you are not familiar with SwiftKey, it attempts to speed up mobile text input by predicting the next word you are going to type. For example, you might type "to be or not to ____", and the application should fill in "be". The end program had to be written in R and deployed to a Shiny server.
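To make the task concrete, here is a toy sketch of the core idea (my own simplification in base R, nowhere near a competitive solution): count word bigrams in a training corpus, then predict the most frequent follower of the last word typed.

```r
# Tiny training corpus; the real capstone supplied gigabytes of
# blog, news, and Twitter text.
corpus <- c("to be or not to be that is the question",
            "whether tis nobler in the mind to suffer")

# Build (word, next-word) pairs within each sentence.
pair_list <- lapply(strsplit(corpus, "\\s+"), function(w) {
  data.frame(w1 = head(w, -1), w2 = tail(w, -1))
})
pairs <- do.call(rbind, pair_list)

# Predict the most frequent follower of the given word.
predict_next <- function(word) {
  followers <- pairs$w2[pairs$w1 == word]
  if (length(followers) == 0) return(NA)
  names(sort(table(followers), decreasing = TRUE))[1]
}

predict_next("to")  # "be"
```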

This project was somewhat flawed in several regards.

  • Natural Language Processing was not covered in the course. Neither was unstructured data. The only material provided on NLP was a handful of links to sites such as Wikipedia.
  • The first 9 courses had a clear direction. However, less than half of them had anything to do with the capstone.
  • The project is not typical of what you would see in most businesses as a data scientist. It would have been better to do something similar to Kaggle or one of the KDD cups.
  • In my opinion, R is a bad choice for this sort of project. During the meetup with SwiftKey, they were asked what tools they used; R was not among them. R is great at many things, so why not pick a project that showcases those strengths?
  • Student peer review is bad, bad, bad… but it might be the only choice. The problem with peer review is that you have three random reviewers. They might be easy; they might be hard. They might penalize you for the fact that they don't know how to load your program! (This happened to me on a previous Coursera course.)
  • Perfect scores on the quizzes were not really possible with a model. We were given several sample sentences to predict, but the sentences were so specialized that no model would predict them correctly; SwiftKey surely did not. Using my own human intuition and several text mining apps I wrote in Java, I did get 100% on the quizzes, even though the instructions clearly said to use your final model. Knowing I might draw a short straw on peer review, I opted to do what I could to get maximum points. I don't care about my grade, but falling below the passing cutoff because of a bad peer review would not be cool!
  • Marketing-based rubric for the final project. One of the grading criteria posed the question, "Would you hire this person?" Seriously? I participate in the hiring process for data scientists, and I would never hire someone without meeting them, performing a tech interview, and giving a small coding challenge. I hope this stat is not used in marketing material: "xx% of our graduates produced programs that might land them a job."

After spending several days writing very slow model-building code in R, I eventually dropped it and used Java and OpenNLP to write code that would build my model in under 20 minutes; based on forum comments, others ran into the same issues and took the same approach. There are somewhat kludgy interfaces between R and OpenNLP, and between R and Weka, but these are native Java apps. I just skipped the kludge, built my model in Java, and wrote a Shiny app to use the model in R. This was enough to pass the program.

Okay, I will just say it: I thought this was a bad capstone. This was just my experience on the first run of the certification; hopefully, they've improved it since. The rest of the program was really good! If I could make a suggestion, I would let the students choose a Kaggle competition to compete in. The Kaggle competitions are closer to the sort of data real data scientists will see. I am still proud of the certificate that I earned.

If I were interviewing someone who had this certificate, I would consider it a positive. The candidate would still need to go through a standard interview/evaluation process.

Conclusions

Great program. It won't make you a star data scientist, but it will give you a great foundation to build from. Kaggle might be a good next step. Another might be starting a blog and doing some real, interesting data science to showcase your skills! That is somewhat how I got into data science.

A question that I am often asked is what I would think of this certification if I saw it on the resume of a new data scientist I was interviewing. In isolation, I would not give a hire recommendation based solely on this certification. However, it would show me that the candidate has mastered the basics of data science: they know what format data needs to be in for predictive modeling, they know their way around the R programming language, and they took the initiative to undertake something that required a decent amount of effort. So yes, it is important, particularly if your resume is lacking in the analytics area.