AWS EC2 Data Science: My Jupyter Workspaces

Lately I’ve been using AWS more for data science tasks. In this series of posts I describe how to set up JupyterHub in a variety of configurations. The series is aimed mainly at a manually controlled Jupyter notebook where the data scientist runs ad hoc tasks remotely; for longer batch-oriented tasks, there are likely less expensive alternatives. This setup gives me a full JupyterHub-based environment that I can access from anywhere, so I always have a known configuration and need only a web browser on whatever machine I am currently using. Additionally, I can quickly switch from a low-end AWS EC2 instance to a very high-end one.
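
As a concrete illustration of that last point, here is a minimal sketch (not from the post itself) of resizing an EC2 instance with boto3. The region, instance ID, and target instance type are placeholder values, and the instance must be stopped before its type can be changed.

```python
# Minimal boto3 sketch: resize an existing EC2 instance to a larger type.
# The region, instance ID, and target type below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID
NEW_TYPE = "m5.2xlarge"              # hypothetical high-end target type

# An instance type can only be changed while the instance is stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Apply the new instance type, then bring the instance back up.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": NEW_TYPE},
)
ec2.start_instances(InstanceIds=[INSTANCE_ID])
```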

For my own work I primarily use the Python programming language, so I rely on JupyterHub. This lets me stand up a Jupyter notebook that I might use to experiment with a Kaggle competition, lecture for a class, or present at a conference. I can choose from a variety of AWS instance types that give me access to 8, 16, 32 or more GB of RAM, and I can also create instances that are capable of using GPUs. This is somewhat similar to the free Data Science Workbench service, except that I control the AWS instances and can provision them at whatever size I need.
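
Once a notebook is running, it is easy to confirm what the underlying instance actually provides. The snippet below is an illustrative check (not part of the original post) that reports total RAM and any NVIDIA GPUs; it assumes a Linux instance and uses nvidia-smi only when it is installed.

```python
# Report the RAM and NVIDIA GPUs available on the current (Linux) instance.
import os
import shutil
import subprocess

# Total physical RAM in GB, from the POSIX sysconf values.
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3
print(f"Total RAM: {ram_gb:.1f} GB")

# If the NVIDIA driver is installed, nvidia-smi lists the available GPUs.
if shutil.which("nvidia-smi"):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)
else:
    print("No NVIDIA GPU driver detected on this instance.")
```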

As I create additional blog posts describing how to implement these systems, I will update this post with links.