Zero to JupyterHub with Ansible

tl;dr: In this post we’re going to show you how to deploy your own self-healing, Kubernetes-managed JupyterHub cluster on Amazon Web Services using the infrastructure automation tool Ansible.

Jupyter Notebooks make it easy to capture data-driven workflows that combine code, equations, text, and visualizations. JupyterHub is an analysis platform that allows astronomers to run notebooks, access a terminal, install custom software and much more. The JupyterHub environment is rapidly becoming a core part of the science platforms being developed by LSST, NOAO, SciServer, and here at STScI.

If you want to create a new JupyterHub environment, there’s a popular community guide, ‘Zero to JupyterHub’, which walks you through, step by step, how to create a JupyterHub deployment (managed by Kubernetes) on a variety of different platforms.

As great as this guide is, there’s a lot of copying and pasting required, and even for someone experienced, creating a new cluster takes a good hour or two.

In this blog post we’re going to walk you through a new resource we’ve recently released to the community, which is essentially the Zero to JupyterHub guide as an executable Ansible playbook. By following this guide, building a new JupyterHub cluster can be done with a single command and takes about 5 minutes.

At the end of this playbook you’ll have a basic JupyterHub installation suitable for use within a research group or collaboration.

Some definitions

Before we get stuck into the actual guide, it’s worth spending a moment defining some of the terms and technologies we’re going to be using here today.

Jupyter Notebooks: The file format for capturing data-driven workflows that combine code, equations, text, and visualizations.

JupyterHub: The analysis platform that allows astronomers to run notebooks, access a terminal, and install custom software.

Ansible: An infrastructure automation engine for automating repetitive infrastructure tasks.

Kubernetes: A platform for managing deployments of services (the service in this case being JupyterHub).

EFS: A scalable, cloud-hosted file-system. In this example, the home directories of our users are stored on an EFS volume.

Getting set up

Sign up for Amazon Web Services

We’re going to start by assuming that you already have an Amazon Web Services account, and are logged in to the AWS console at http://console.aws.amazon.com. If you don’t have an account, you can sign up here.

Install Ansible (2.5 or higher) on your local machine

You’ll also need to have Ansible version 2.5 or higher installed on your local machine to execute the playbook later on – 2.4 will almost certainly not work with this tutorial. This guide walks you through how to install Ansible on macOS, and guides for other platforms can be found here.
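If you already use Python, one quick way to get a suitable version is via pip (just one of the options the guides above cover):

$ pip install 'ansible>=2.5'
$ ansible --version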

Deploying our JupyterHub cluster

Before we can deploy our JupyterHub cluster, we need to configure some pieces on the AWS console:

Configure IAM role

First, we need to visit the Identity and Access Management (IAM) console and create a new IAM role that can act on our behalf when setting up the JupyterHub cluster.

Specifically, we need to create an IAM role with the following permissions (an equivalent AWS CLI sketch follows the screenshot):

  • AmazonEC2FullAccess
  • IAMFullAccess
  • AmazonS3FullAccess
  • AmazonVPCFullAccess
  • AmazonElasticFileSystemFullAccess
[Image: IAM console showing the required permissions]
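If you prefer the command line to the console, the same role can be sketched with the AWS CLI roughly as follows. The role/profile name z2jh-ci and the trust policy file (which must allow EC2 to assume the role) are illustrative:

$ aws iam create-role --role-name z2jh-ci \
    --assume-role-policy-document file://ec2-trust-policy.json
$ aws iam attach-role-policy --role-name z2jh-ci \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
# ...repeat attach-role-policy for each of the policies listed above...
$ aws iam create-instance-profile --instance-profile-name z2jh-ci
$ aws iam add-role-to-instance-profile --instance-profile-name z2jh-ci \
    --role-name z2jh-ci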

Create a CI node

Next we need to start up a small (micro) instance which will act on our behalf within the AWS environment when the Ansible commands are executed (a CLI alternative is sketched after the screenshots below):

  • Select the Amazon Linux 2 AMI (ami-009d6802948d06e52) and instance type t2.micro
  • (Optionally) give your instance a name such as [custom namespace]-ci so it’s easier to find
  • Configure it with the IAM role we’ve just created
  • Configure it to authenticate with an SSH key you own (or create one here)
  • Start the instance and note the node’s public DNS (IPv4) from the description tab
[Image: Configuring the instance with the IAM role we just created]

[Image: Running CI instance, with the description tab showing the IPv4 address]
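For reference, the same instance can be launched from the AWS CLI; this is a rough equivalent of the console steps above, where the key pair name, profile name, and Name tag are placeholders for your own values:

$ aws ec2 run-instances \
    --image-id ami-009d6802948d06e52 \
    --instance-type t2.micro \
    --key-name my-ssh-key \
    --iam-instance-profile Name=z2jh-ci \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=myhub-ci}]'
$ aws ec2 describe-instances \
    --filters 'Name=tag:Name,Values=myhub-ci' \
    --query 'Reservations[].Instances[].PublicDnsName'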

Download and configure the Z2JH Ansible repository

Next we need to clone the Z2JH Ansible repository and make some configuration changes based on the instance we’ve just deployed on AWS.

$ git clone https://github.com/spacetelescope/z2jh-aws-ansible

Configuring our deployments

Edit the hosts file: replace its contents with the CI node’s public DNS (IPv4) as the only line.

Edit group_vars/all to configure your deployment:

namespace: Choose a custom name or leave the default; many resources are named based on this for consistency.
aws_region: Here we’re using us-east-1, but others are available.
ansible_ssh_private_key_file: The absolute local path of the key file (.pem) you use to ssh into the CI node.
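After these edits, the two files look something like this (the DNS name, namespace, and key path are placeholders for your own values):

# hosts
ec2-34-207-xx-xx.compute-1.amazonaws.com

# group_vars/all (excerpt)
namespace: myhub
aws_region: us-east-1
ansible_ssh_private_key_file: /Users/you/.ssh/my-ssh-key.pem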

Deploy!

Now we’re ready to deploy our JupyterHub cluster. We do this with the following command:

$ ansible-playbook -i hosts z2jh.yml

Allow the ‘Verify that kops setup is complete’ step to retry as many as 30 times; it’s simply polling for a successfully set up Kubernetes cluster, which can take a few minutes to complete.¹

FAILED - RETRYING: Verify that kops setup is complete (30 retries left).
FAILED - RETRYING: Verify that kops setup is complete (29 retries left).
FAILED - RETRYING: Verify that kops setup is complete (28 retries left).
FAILED - RETRYING: Verify that kops setup is complete (27 retries left).
FAILED - RETRYING: Verify that kops setup is complete (26 retries left).
FAILED - RETRYING: Verify that kops setup is complete (25 retries left).
FAILED - RETRYING: Verify that kops setup is complete (24 retries left).

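Under the hood this is just Ansible’s standard retry loop: the task polls and only succeeds once the cluster validates. Conceptually it looks something like the sketch below (not necessarily the exact task in the repository):

- name: Verify that kops setup is complete
  command: kops validate cluster
  register: kops_result
  until: kops_result.rc == 0
  retries: 30
  delay: 30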
Once the play is successfully complete, the URL of your new JupyterHub environment will be printed out in the terminal.

[Image: Completed play, with the JupyterHub URL printed in the terminal]

Note: you may have to allow a few minutes for the public proxy to become available after the script finishes running.

Configure GitHub Authentication

The default JupyterHub authenticator uses PAM and, by default, accepts any username and password. While this is useful for testing, it’s really not secure enough for real-world usage.

Fortunately, it’s very easy to configure authentication via GitHub, and we’ve created a second Ansible playbook to help set this up:

Copy your LoadBalancer Ingress

We’re going to create a GitHub OAuth application that points to our new JupyterHub deployment. To set this up, we need to copy the LoadBalancer Ingress value printed out at the end of the original Ansible play. (We like to use Route 53 to create a CNAME record that points to this; see below.) For this example, our Ingress value is:

a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com
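If you didn’t capture the play output, the same value can be recovered with kubectl (here jhub stands in for whatever Kubernetes namespace your deployment uses; the ingress shows up in the EXTERNAL-IP column):

$ kubectl --namespace=jhub get svc proxy-public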

Register a new application on GitHub

Visit the GitHub new application page at https://github.com/settings/applications/new and fill in the following fields:

  • Application Name: Something users will recognize and trust
  • Homepage URL: http://[your ingress]
  • Authorization callback URL: http://[your ingress]/hub/oauth_callback

So, for our example, we used the following values:

- Application name: MAST Labs JupyterHub
- Homepage URL: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com
- Authorization callback URL: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com/hub/oauth_callback
[Image: Completed GitHub OAuth application page]

Copy the OAuth tokens for your GitHub application

GitHub will give you a client ID and client secret that you’ll need for your group_vars/all file:

github_client_id: YOUR_GITHUB_CLIENT_ID
github_client_secret: YOUR_GITHUB_CLIENT_SECRET
ingress: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com

Note: the ingress value should include the http:// or https:// scheme but not a trailing slash.
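For context, what the playbook is ultimately wiring up is JupyterHub’s GitHub authenticator. In Zero to JupyterHub’s config.yaml terms, this corresponds roughly to the following (values abbreviated):

auth:
  type: github
  github:
    clientId: "YOUR_GITHUB_CLIENT_ID"
    clientSecret: "YOUR_GITHUB_CLIENT_SECRET"
    callbackUrl: "http://<your ingress>/hub/oauth_callback"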

Apply the GitHub Auth

Finally, let’s apply the updated config for the GitHub OAuth application:

$ ansible-playbook -i hosts apply_github_auth.yml -v

Once this playbook completes, visit your Ingress URL again and you should be presented with a login screen prompting you to ‘Sign in with GitHub’.

[Image: Signing in to our JupyterHub environment with GitHub]

[Image: OAuth prompt at GitHub]

To log out, simply navigate to [your ingress]/hub/home and click logout in the top right corner.

Now what?

At this point, we have a JupyterHub deployment on AWS, configured with GitHub authentication.

From here you can further customize your deployment by:

  • Setting up a CNAME record via the AWS Route 53 console (or another DNS service) that points to your proxy-public ingress URL; this then becomes your new ingress URL.
  • Restricting access to your deployment to a specific set of GitHub organizations.
  • Modifying the parameters in group_vars/all to choose a different instance type with more RAM, CPU, etc. (see the sketch below).
  • Changing the machine environment that spawns in the JupyterHub environment by modifying singleuser_image_repo.
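For example, bumping the node size is typically a one-line edit to group_vars/all; the variable name below is illustrative, so check the file for the key the repository actually uses:

# group_vars/all (excerpt; key name illustrative)
node_instance_type: m5.xlarge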

Idempotency

Idempotence is a term originally from mathematics, commonly used in computer science and engineering to describe an operation that, applied multiple times, doesn’t change the result beyond the initial application. This means that if we re-ran the ansible-playbook -i hosts z2jh.yml -v command immediately after the initial Ansible run, the playbook is smart enough to realize that everything we’re asking it to do has already been done.
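Ansible achieves this by being declarative: a task describes a desired state, and the underlying module only acts (and reports ‘changed’) when the current state differs; otherwise it reports ‘ok’ and moves on. A toy example:

- name: Ensure the deployment directory exists
  file:
    path: /opt/jupyterhub
    state: directory

Run this twice and the second run changes nothing.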

We can leverage the idempotency built into this Ansible playbook to modify, update, or upgrade our infrastructure with ease. That said, sometimes it’s easier to simply start from a fresh install when you see lots of errors and CrashLoopBackoffs. If kubectl or helm upgrade operations are fine-tuning with a screwdriver, you can think of teardown.yml as a sledgehammer. For example, let’s say we wanted to swap out the base image defined in our group_vars/all that is used to spawn the JupyterHub environment, but we want to keep the home directories of our users on the same EFS volume. This can be done as follows:

  1. Tear down the Kubernetes-managed JupyterHub cluster: ansible-playbook -i hosts teardown.yml -t kubernetes
  2. Modify the Docker container image specified in our group_vars/all file
  3. Re-run the Ansible playbook: ansible-playbook -i hosts z2jh.yml. Ansible will rebuild the JupyterHub environment based on this change and connect it back up to the original EFS volume.
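Put together, the whole cycle is just the following (the image name is only an example):

$ ansible-playbook -i hosts teardown.yml -t kubernetes
# ...edit group_vars/all, e.g. singleuser_image_repo: jupyter/datascience-notebook...
$ ansible-playbook -i hosts z2jh.yml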

Delete play (JupyterHub to Zero)

Finally, if you want to delete the JupyterHub environment you’ve created (JupyterHub to Zero), you can use the teardown.yml playbook:

ansible-playbook -i hosts teardown.yml -t [tags]

This can tear down everything behind the JupyterHub environment, but it only removes the parts of the infrastructure you specify. The playbook is intended to work with a cluster that was set up with z2jh.yml, and it keys off of the names generated therein. A number of different tags can be passed to it:

# Default: remove JupyterHub release
$ ansible-playbook -i hosts teardown.yml

# kubernetes: remove k8s namespace and tear down kubernetes
$ ansible-playbook -i hosts teardown.yml -t kubernetes

# all-fixtures: terminate EFS volume, S3, and the EC2 CI node (implies kubernetes and default)
$ ansible-playbook -i hosts teardown.yml -t all-fixtures

Note that you’ll need to wait a few minutes for these operations to resolve before attempting to rebuild anything that was deleted.

Wrapping up

This set of Ansible playbooks provides an easy way to create a simple, load-balanced JupyterHub deployment on Amazon Web Services that can be configured for your needs.

While the deployment offered by these playbooks is not quite a ‘reference architecture’ for JupyterHub, as it lacks some important security pieces (such as configuring HTTPS) and system monitoring (observability), we think it is useful for a small team or research group to begin using JupyterHub in their collaboration.

As we further extend these Ansible deployment scripts and develop our science platform and JupyterHub deployment at STScI, we’ll be updating the Z2JH Ansible repository.

Any problems, triumphs, or suggestions following this tutorial? We would love to hear about your experience using these plays. Drop us a line at dsmo@stsci.edu.

Brought to you by Jacob Matuskey & Arfon Smith

  1. Time to make a cup of tea?