The Planet Hunters Analysis Database

The Planet Hunters citizen science project has had volunteers searching for exoplanets since 2011. The latest version of the project uses data from the Transiting Exoplanet Survey Satellite (TESS), a mission we archive here at MAST. Every month, new TESS data are released at MAST, and the Zooniverse team ingests the ~20,000 TESS light curves and displays them on the Planet Hunters site for the public to help analyze. Results from initial tests with the Zooniverse community and TESS data are promising.

One of the most exciting aspects of Planet Hunters is the rapid pace of discovery: every day, thousands of volunteers from the Zooniverse community analyze the latest data, identifying interesting candidates and discussing them in the Talk discussion forums. Popular projects such as this regularly run out of data as the community rapidly analyzes new data posted to the site. While the Zooniverse community can be very fast at the initial analysis/classification of data, there’s often a significant delay between the initial identification of a possible transit by volunteers and any confirmation of a new exoplanet candidate. Additionally, crediting the individuals who help identify a planetary candidate and tracking the provenance of any discovery have often been hard for astronomers, and have generally had to wait until a paper was published in the astronomical literature.

With Planet Hunters TESS, the team here at MAST has been working with the Zooniverse team to address some of these challenges and try something a little new…

What we’re doing

Together with Planet Hunters TESS, we’re launching an experimental new service at MAST, the Planet Hunters Analysis Database (PHAD). PHAD receives data in real time from the Planet Hunters project as members of the Zooniverse community analyze the data from TESS and as potential new exoplanet candidates are discovered.

Planet Hunters interface

Transits marked on the Planet Hunters interface

Planet Hunters + PHAD works as follows:

  1. At Planet Hunters, each time a volunteer marks potential transits (see figure above), a program called Caesar listens in, and updates our calculation of the best consensus results. Caesar takes the multiple analyses for each light curve shown and attempts to generate a consensus result in real time. This classifier is the work of Nora Eisner, a PhD student at the University of Oxford.
  2. As Caesar generates results, these are posted to an API at the Planet Hunters Analysis Database (PHAD). PHAD collects these analyses and displays them in a searchable user interface, making it possible for members of the Zooniverse community to see which analyses their classifications have contributed to. For more information on what the displayed values mean, take a look at the PHAD about page.
  3. Once per month, a complete set of the raw classification data from the Planet Hunters interface will be provided, thereby giving an opportunity for the wider research community to reanalyze the raw classification data from Planet Hunters.
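To make the idea of a consensus result a little more concrete, here is a minimal Python sketch. It is not Caesar's actual algorithm, which this post does not describe. It simply assumes that each volunteer's mark is a (start, end) time interval on the light curve, that one volunteer's marks do not overlap each other, and that a candidate transit is any stretch of time marked by at least `min_votes` volunteers. The function name and the threshold are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) of one marked transit


def consensus_transits(marks_per_volunteer: List[List[Interval]],
                       min_votes: int = 3) -> List[Interval]:
    """Return the time ranges marked by at least `min_votes` volunteers.

    Illustrative only: assumes each volunteer's own marks do not overlap.
    """
    # Turn every interval into a +1 event at its start and a -1 event at its end.
    events = []
    for marks in marks_per_volunteer:
        for start, end in marks:
            events.append((start, +1))
            events.append((end, -1))

    # Sort by time; at equal times, process ends (-1) before starts (+1)
    # so that intervals which merely touch are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    consensus = []
    votes = 0
    region_start = None
    for time, delta in events:
        votes += delta
        if votes >= min_votes and region_start is None:
            region_start = time          # enough volunteers agree: open a region
        elif votes < min_votes and region_start is not None:
            consensus.append((region_start, time))  # agreement lost: close it
            region_start = None
    return consensus
```

With three volunteers marking roughly the same dip and a fourth marking something nobody else saw, only the stretch that all three agree on survives a threshold of three votes.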

Planet Hunters + real time analyses archived @ MAST

This combination of real time analysis of the Planet Hunters classifications, and live posting of these results back to the archive, is something completely new for both teams. While MAST hosts static community-contributed data used in the scientific literature in the form of High Level Science Products, PHAD is something different. PHAD is a rapidly changing table of the Planet Hunters consensus analyses. The data should be considered pretty ‘raw’, and are not based on any refereed publications. By providing rapid reporting, the Zooniverse volunteers have the opportunity to see how they are contributing, and everyone can immediately find the most promising transit events discovered by the Planet Hunters project and give credit appropriately.

Brought to you by Arfon Smith & Nora Eisner

Zero to JupyterHub with Ansible

tl;dr: In this post we’re going to show you how to deploy your own self-healing, Kubernetes-managed JupyterHub cluster on Amazon Web Services using the infrastructure automation tool Ansible.

Jupyter Notebooks make it easy to capture data-driven workflows that combine code, equations, text, and visualizations. JupyterHub is an analysis platform that allows astronomers to run notebooks, access a terminal, install custom software and much more. The JupyterHub environment is rapidly becoming a core part of the science platforms being developed by LSST, NOAO, SciServer, and here at STScI.

If you want to create a new JupyterHub environment, there’s a popular community guide, ‘Zero to JupyterHub’, which walks you through, step by step, how to create a JupyterHub deployment (managed by Kubernetes) on a variety of different platforms.

As great as this guide is, there’s a lot of copying and pasting required and even for someone experienced, it takes a good hour or two to create a new cluster.

In this blog post we’re going to walk you through a new resource we’ve recently released to the community, which is essentially the Zero to JupyterHub guide as an executable Ansible playbook. By following this guide, building a new JupyterHub cluster can be done with a single command and takes about 5 minutes.

At the end of this playbook you’ll have a basic JupyterHub installation suitable for use within a research group or collaboration.

Some definitions

Before we get stuck in to the actual guide, it’s worth spending a moment defining some of the terms/technologies we’re going to be using here today.

Jupyter Notebooks: The file format for capturing data-driven workflows that combine code, equations, text, and visualizations.

JupyterHub: The analysis platform that allows astronomers to run notebooks, access a terminal, and install custom software.

Ansible: An infrastructure automation engine for automating repetitive infrastructure tasks.

Kubernetes: A platform for managing deployments of services (the service in this case being JupyterHub).

EFS: A scalable, cloud-hosted file-system. In this example, the home directories of our users are stored on an EFS volume.

Getting set up

Sign up for Amazon Web Services

We’re going to start by assuming that you already have an Amazon Web Services account, and are logged in to the AWS console at http://console.aws.amazon.com. If you don’t have an account, you can sign up here.

Install Ansible (2.5 or higher) on your local machine

You’ll also need to have Ansible version 2.5 or higher installed on your local machine to execute the playbook later on – 2.4 will almost certainly not work with this tutorial. This guide walks you through how to install Ansible on Mac OSX, and guides for other platforms can be found here.

Deploying our JupyterHub cluster

Before we can deploy our JupyterHub cluster, we need to configure some pieces on the AWS console:

Configure IAM role

First, we need to visit the Identity and Access Management (IAM) console and create a new IAM role that can act on our behalf when setting up the JupyterHub cluster.

Specifically, we need to create an IAM role with the following permissions:

  • AmazonEC2FullAccess
  • IAMFullAccess
  • AmazonS3FullAccess
  • AmazonVPCFullAccess
  • AmazonElasticFileSystemFullAccess
IAM roles

IAM console showing required permissions

Create a CI node

Next we need to start up a small (micro) instance which will act on our behalf within the AWS environment when the Ansible commands are executed.

  • Select the Amazon Linux 2 AMI (ami-009d6802948d06e52) and instance type t2.micro
  • (Optionally) give your instance a name such as [custom namespace]-ci so it’s easier to find
  • Configure it with the IAM role we’ve just created
  • Configure it to authenticate with an SSH key you own (or create one here)
  • Start the instance and note the node’s public DNS (IPv4) from the description tab
Configuring CI instance

Configuring instance with IAM role we just created

Instance details

Running CI instance and description tab showing IPv4 address

Download and configure the Z2JH Ansible repository

Next we need to clone the Z2JH Ansible repository and make some configuration changes based on the instance we’ve just deployed on AWS.

$ git clone https://github.com/spacetelescope/z2jh-aws-ansible

Configuring our deployments

Edit the hosts file: replace its contents with a single line containing the CI node’s public DNS (IPv4).

Edit group_vars/all to configure your deployment:

namespace: Choose a custom name or leave default, many things are named based on this for consistency
aws_region: Here we’re using us-east-1 but others are available
ansible_ssh_private_key_file: The absolute local path of key file (.pem) which you use to ssh into the CI node

Deploy!

Now we’re ready to deploy our JupyterHub cluster. We do this with the following command:

$ ansible-playbook -i hosts z2jh.yml

Allow the ‘Verify that kops setup is complete’ step to retry as many as 20 times. It is simply polling for a successfully set up Kubernetes cluster, which can take a few minutes to complete.[1]

FAILED - RETRYING: Verify that kops setup is complete (30 retries left).
FAILED - RETRYING: Verify that kops setup is complete (29 retries left).
FAILED - RETRYING: Verify that kops setup is complete (28 retries left).
FAILED - RETRYING: Verify that kops setup is complete (27 retries left).
FAILED - RETRYING: Verify that kops setup is complete (26 retries left).
FAILED - RETRYING: Verify that kops setup is complete (25 retries left).
FAILED - RETRYING: Verify that kops setup is complete (24 retries left).
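The retry messages above come from a simple poll-and-wait pattern. As a rough illustration of that pattern (not the playbook's actual implementation), the Python sketch below calls a readiness check repeatedly until it passes or the retries run out. The function name, the default of 30 retries, and the 10 second delay are assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     retries: int = 30,
                     delay_seconds: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `check` until it returns True.

    Makes one initial attempt plus up to `retries` retries and returns the
    number of failed attempts before success. Raises TimeoutError otherwise.
    """
    for attempt in range(retries + 1):
        if check():
            return attempt
        retries_left = retries - attempt
        if retries_left == 0:
            break  # no retries remaining: give up
        print(f"FAILED - RETRYING: check not yet passing ({retries_left} retries left).")
        sleep(delay_seconds)
    raise TimeoutError(f"check still failing after {retries} retries")
```

In practice `check` would be whatever tells you the cluster is up; passing `sleep` as a parameter just makes the function easy to try out without actually waiting.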

Once the play is successfully complete, the URL of your new JupyterHub environment will be printed out in the terminal.

Complete terminal

Completed task with JupyterHub URL

Note: you may have to allow for a few minutes for the public proxy to become available after the script finishes running.

Configure GitHub Authentication

The default JupyterHub authenticator uses PAM and, by default, any username and password is accepted. While this is useful for testing, it’s really not secure enough for real-world usage.

Fortunately it’s very easy to configure authentication via GitHub and we’ve created a second Ansible playbook to help set this up:

Copy your LoadBalancer Ingress

We’re going to create a GitHub OAuth application that will point to our new JupyterHub deployment. To set this up, we need to copy the LoadBalancer Ingress value printed out at the end of the original Ansible play. (We like to use Route53 to create a CNAME record that points to this; see below.) For this example, our Ingress value is:

a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com

Register a new application on GitHub

Visit the GitHub new application page at https://github.com/settings/applications/new and fill in the following fields:

  • Application Name: Something users will recognize and trust
  • Homepage URL: http://[your ingress]
  • Authorization callback URL: http://[your ingress]/hub/oauth_callback

So, for our example, we used the following values:

- Application name: MAST Labs JupyterHub
- Homepage URL: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com
- Authorization callback URL: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com/hub/oauth_callback
GitHub OAuth page

Completed GitHub OAuth application page

Copy the OAuth tokens for your GitHub application

GitHub will give you a client id and client secret that you’ll need for your group_vars/all file:

github_client_id: YOUR_GITHUB_CLIENT_ID
github_client_secret: YOUR_GITHUB_CLIENT_SECRET
ingress: http://a8c0ecdeb145f11e98c461202338f981-1590971719.us-east-1.elb.amazonaws.com

Note: the ingress value should include the scheme (http:// or https://) but should not have a trailing slash.
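If you want to double-check the values before pasting them into GitHub and group_vars/all, the following small Python helper derives them from an ingress hostname. It is only a convenience sketch: the function name and the returned dictionary keys are our own invention, while the /hub/oauth_callback path is the one used in the fields above.

```python
def github_oauth_urls(ingress: str) -> dict:
    """Normalize an ingress URL and derive the GitHub OAuth application URLs."""
    ingress = ingress.strip()
    # The ingress value must carry its scheme (http:// or https://)...
    if not ingress.startswith(("http://", "https://")):
        raise ValueError("ingress must start with http:// or https://")
    # ...but must not end with a trailing slash.
    ingress = ingress.rstrip("/")
    return {
        "ingress": ingress,
        "homepage_url": ingress,
        "callback_url": f"{ingress}/hub/oauth_callback",
    }
```

Calling it with the example ingress from this post (with or without a trailing slash) returns the same homepage and callback URLs listed earlier.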

Apply the GitHub Auth

Finally, let’s apply the updated config for the GitHub OAuth application:

$ ansible-playbook -i hosts apply_github_auth.yml -v

Once this playbook completes, visit your Ingress URL again and you should be presented with a login screen with a prompt to ‘Sign in with GitHub’.

Sign in with GitHub

Signing into our JupyterHub environment with GitHub.

Log in with GitHub

OAuth prompt at GitHub.

To log out, simply navigate to [your ingress]/hub/home and click logout in the top right corner.

Now what?

At this point, we have a JupyterHub deployment on AWS, configured with GitHub authentication.

From here you can further customize your deployment by:

  • Setting up a CNAME record via the AWS Route 53 console (or another DNS service) that points to your proxy-public ingress URL; the CNAME then becomes your new ingress URL.
  • Further restricting access to your deployment by limiting it to a small number of GitHub organizations.
  • Modifying the parameters in group_vars/all to choose a different instance type with more RAM, CPU etc.
  • Changing the machine environment that spawns in the JupyterHub environment by modifying the singleuser_image_repo.

Idempotency

Idempotence is a term originally from mathematics, and commonly used in computer science and engineering to describe an operation that doesn’t change the result beyond the initial application if applied multiple times. This means, if we re-ran the ansible-playbook -i hosts z2jh.yml -v command immediately after the initial Ansible run, the playbook is smart enough to realize that all of the things you’re asking it to do have already been done.
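As a toy illustration of the concept (unrelated to the internals of the playbook), here is an idempotent operation in Python: making sure a line is present in a list of lines. The function name is hypothetical. Like an Ansible task, it also reports whether it actually changed anything.

```python
from typing import List, Tuple


def ensure_line(lines: List[str], wanted: str) -> Tuple[List[str], bool]:
    """Return (new_lines, changed) such that `wanted` is present in new_lines.

    Applying this once or many times gives the same final result.
    """
    if wanted in lines:
        return list(lines), False       # already in the desired state: no change
    return list(lines) + [wanted], True  # bring the state in line: one change
```

The first call adds the line and reports a change; every later call leaves the list exactly as it is and reports no change, which is the behaviour we rely on when re-running a playbook.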

We can leverage the idempotency we’ve built in to this Ansible playbook for ease of development to modify/update/upgrade our infrastructure. Sometimes, it’s a little easier to simply start from a fresh install when you see lots of errors and CrashLoopBackoffs. If kubectl or helm upgrade operations are fine tuning with a screwdriver, you can think of teardown.yml as a sledgehammer. For example, let’s say we wanted to swap out the base image defined in our group_vars that is used to spawn the JupyterHub environment, but we want to keep the home directories of our users on the same EFS volume. This can be done as follows:

  1. Tear down the Kubernetes-managed JupyterHub cluster: ansible-playbook -i hosts teardown.yml -t kubernetes
  2. Modify the Docker container image specified in our group_vars/all file
  3. Re-run the Ansible playbook: ansible-playbook -i hosts z2jh.yml. Ansible will rebuild the JupyterHub environment based on this change and connect it back up with the original EFS volume.

Delete play (JupyterHub to Zero)

Finally, if you want to delete the JupyterHub environment you’ve created, (JupyterHub to Zero) you can use the teardown.yml playbook:

$ ansible-playbook -i hosts teardown.yml -t [tags]

This will tear down the JupyterHub environment, but only the parts of the infrastructure you specify with tags. This playbook is intended to work with a cluster that was set up with z2jh.yml, and it keys off of the names that are generated therein. A number of different tags can be passed to this playbook:

# Default: remove JupyterHub release
$ ansible-playbook -i hosts teardown.yml

# kubernetes: remove k8s namespace and tear down kubernetes
$ ansible-playbook -i hosts teardown.yml -t kubernetes

# all-fixtures: terminate EFS volume, S3, and the EC2 CI node (implies kubernetes and default)
$ ansible-playbook -i hosts teardown.yml -t all-fixtures

Note that you’ll need to wait a few minutes for these operations to resolve before attempting to rebuild anything that was deleted.

Wrapping up

This set of Ansible playbooks provides an easy way to create a simple, load-balanced JupyterHub deployment on Amazon Web Services that can be configured for your needs.

The deployment offered by these playbooks is not quite a ‘reference architecture’ for JupyterHub: it lacks some important security pieces (such as configuring https) and system monitoring (observability). Even so, we think it is useful for a small team or research group that wants to begin using JupyterHub in their collaboration.

As we further extend these Ansible deployment scripts and develop our science platform and JupyterHub deployment at STScI, we’ll be updating the Z2JH Ansible repository.

Any problems, triumphs, suggestions following this tutorial? We would love to hear from you about your experience using these plays. Drop us a line at dsmo@stsci.edu

Brought to you by Jacob Matuskey & Arfon Smith

  1. Time to make a cup of tea? 

TESS data available on AWS

tl;dr - Sectors 1 & 2 from TESS are available on Amazon Web Services (AWS). In this first post, we’ll introduce a basic method for accessing the data programmatically through the astroquery.mast client library.

With the release of TESS sectors 1 & 2, we’re making calibrated and uncalibrated full frame images, two-minute cadence target pixel and light curve files, co-trending basis vectors, and FFI cubes (for the Astrocut tool) available in the s3://stpubdata/tess S3 bucket on AWS.

These data are available under the same terms as the public dataset for Hubble, that is, if you compute against the data from the AWS US-East region, then data access is free.

Accessing the data

In what follows, we are going to assume you already have an AWS account, have created AWS secret access keys and are able to create an authenticated session using the boto3 Python package with these keys.

Astroquery & Boto3

The astroquery.mast Python package has built-in support for working with the cloud-hosted data. To retrieve data from the cloud, call enable_cloud_dataset on the Observations class and then use the Boto3 library to download the files.
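As a minimal sketch of that workflow, the Python below looks up the S3 locations of TESS light curve files and downloads one of them. It assumes a recent astroquery release in which Observations.enable_cloud_dataset and Observations.get_cloud_uris are available, and that boto3 can find your AWS credentials. The helper names, the query criteria, and the requester_pays switch are illustrative assumptions rather than part of any official recipe.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def find_tess_light_curve_uris(target_name: str, sector: int) -> List[str]:
    """Query MAST for TESS light curves and return their S3 URIs."""
    from astroquery.mast import Observations  # third-party, imported lazily

    # Ask astroquery to resolve products to their cloud (AWS) locations.
    Observations.enable_cloud_dataset(provider="AWS")
    observations = Observations.query_criteria(
        obs_collection="TESS",
        dataproduct_type="timeseries",
        target_name=target_name,   # for TESS this is the TIC ID, as a string
        sequence_number=sector,    # the TESS sector number
    )
    products = Observations.get_product_list(observations)
    light_curves = Observations.filter_products(
        products, productSubGroupDescription="LC")
    uris = Observations.get_cloud_uris(light_curves)
    # Products without a cloud copy come back as None; drop them.
    return [uri for uri in uris if uri]


def download_from_s3(uri: str, local_path: str,
                     requester_pays: bool = False) -> None:
    """Download one S3 object to `local_path` using boto3."""
    import boto3  # third-party, imported lazily

    bucket, key = split_s3_uri(uri)
    # Assumption: set requester_pays=True only if the bucket requires it.
    extra_args = {"RequestPayer": "requester"} if requester_pays else None
    boto3.client("s3").download_file(bucket, key, local_path, ExtraArgs=extra_args)
```

A typical session would call find_tess_light_curve_uris with a TIC ID and sector of your choice, then pass one of the returned URIs to download_from_s3.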

What data are available, how often will they be updated?

This initial release of sectors 1 & 2 data includes calibrated and uncalibrated full frame images, two-minute cadence target pixel and light curve files, co-trending basis vectors, and the FFI cubes used by MAST TESSCut and the Astrocut library.

The data in this S3 bucket will be automatically updated as further sectors of data are released.

FAQ & Resources

Where are the data?: AWS US East

How can I access the data?: You’ll need an AWS account. See this example of how to use your AWS account with boto3 and Python.

How much does it cost to access the data?: Within the AWS US-East region it’s free. To download outside of US-East standard S3 charges apply.

So now you’re charging for TESS data?: No, TESS data is, and will always be, free from MAST. This copy of the TESS data held at MAST is being provided in a ‘highly available’ environment, next to the significant computational resources of the AWS platform.

I like this idea but I’d rather use a different cloud vendor.: Please get in touch and let us know.

Exploring AWS Lambda with cloud-hosted Hubble public data

tl;dr: In this post we are going to show you how to process every WFC3/IR image on AWS Lambda in about 2 minutes (and for about $2).

In our earlier post, we announced the availability of HST public data for currently active instruments in the AWS Public Dataset Program. In that post we described how to access ~110TB of data (raw and calibrated) from ACS, WFC3, STIS, and COS available in the stpubdata S3 bucket.

In this post we will show how to leverage an AWS cloud service called Lambda to process a set of WFC3/IR data. Using this approach it is possible to process every WFC3/IR image (all ~120,000 of them) on AWS Lambda in about 2 minutes (and for about $2).

A brief introduction to Lambda

Lambda[1] is a serverless[2], cloud-hosted function that can be called on-demand. The basic idea is that a function (some code written by you) can be saved somewhere and used when needed. When the function is not executing there is no cost, but when it is, you just pay for the CPU and memory allocated for the duration of the function executing. This means that services like Lambda are charged in weird units like GBms (Gigabyte milliseconds), a combination of the memory allocated to the function and how long it executes for.

‘Serverless’ computing is an exciting development in modern computing architectures and AWS is not alone in offering a service like this:

Lambda

Getting set up

In this post we are going to be using AWS Lambda together with the HST public data on S3 to demonstrate how serverless computing can be used to process data at scale (and very low cost).

In what follows, we are going to assume you already have an AWS account, have AWS secret access keys and are able to create an authenticated session using the boto3 Python package with these keys. The example in this post can be run locally, and all you need installed are Docker, the Python AWS client library boto3, and the latest version of the astroquery library (install from the master branch or build from source to get the necessary features). You can use an AWS EC2 instance to work through the example, but that is not required because we will not be downloading any of the data.

If you want to try something simpler than this example, take a look at our earlier blog post introducing the AWS Hubble public dataset.

Building our Lambda function

The first thing we need to do is write a Lambda function and build the computational environment (e.g. any software dependencies) required to support the function.

Because Lambda functions spawn in milliseconds, Lambda requires all of our dependencies (Astropy, numpy etc.) to be installed and available as soon as the function is triggered. Lambda does not allow you to specify the machine environment that supports the execution of your function (the underlying machine running a Lambda function is Amazon Linux), so you need to compile all of your dependencies and zip them up as part of the Lambda function.

Generating the Lambda code bundle

Generating Lambda functions with dependencies can be a little tricky, so in this repository we have created an example project that demonstrates how to generate Lambda code bundles using a simple bash script and Docker. Clone this repository on your local machine, then work through the following steps:

  1. Install and start Docker.
  2. Run the following two commands in the repository directory to generate a code bundle called venv.zip that is designed to work on AWS Lambda:
    $ docker pull amazonlinux:2017.09
    $ docker run -v $(pwd):/outputs -it amazonlinux:2017.09 /bin/bash /outputs/build.sh
    
  3. Open the S3 console and create a new bucket to hold your function.
  4. Upload the venv.zip file to the bucket.

To change what the Lambda function actually does, modify process.py, which is the code that is called when our Lambda function executes. If you change the function dependencies, modify the file requirements.txt to reflect the requirements of your function.

Creating a Lambda function

In this worked example, we are going to use the SEP library[3] (which makes the core parts of Source Extractor available as a standalone library) to find sources in a collection of WFC3/IR images.

Note: If you would rather begin with a more straightforward example, we have also written a small Lambda function that downloads a FITS file from a location on S3 and summarizes the content of the opened FITS file using the astropy.io.fits info() function.

Assuming we have run the build.sh script locally, and that your process.py matches the example in this repository, we now need to register a new Lambda function with AWS as follows:
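As a hedged sketch of what such a registration can look like, the Python below builds the arguments for boto3's Lambda create_function call and then makes the call. The venv.zip bundle name and the 1024 MB memory size come from this post; the handler name process.handler, the 60 second timeout, and the helper names are assumptions, and the runtime is left for you to choose so that it matches the Python version your bundle was built with.

```python
from typing import Any, Dict


def lambda_function_config(function_name: str, role_arn: str, code_bucket: str,
                           runtime: str, code_key: str = "venv.zip",
                           handler: str = "process.handler",
                           memory_mb: int = 1024,
                           timeout_seconds: int = 60) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Lambda client's create_function.

    `handler` assumes process.py exposes a function called `handler`;
    change it to match your own code.
    """
    return {
        "FunctionName": function_name,
        "Runtime": runtime,
        "Role": role_arn,  # ARN of an IAM role that Lambda is allowed to assume
        "Handler": handler,
        "Code": {"S3Bucket": code_bucket, "S3Key": code_key},
        "MemorySize": memory_mb,
        "Timeout": timeout_seconds,
    }


def register_lambda_function(config: Dict[str, Any],
                             region: str = "us-east-1") -> Dict[str, Any]:
    """Register the function with AWS Lambda and return the API response."""
    import boto3  # third-party, imported lazily

    # boto3 reads credentials from ~/.aws/credentials or environment variables.
    client = boto3.client("lambda", region_name=region)
    return client.create_function(**config)
```

Keeping the configuration separate from the API call makes it easy to print and inspect exactly what will be sent before anything is created in your account.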

This code and the one below can be run either as a Python script or from a Jupyter notebook. Make sure you follow the inline comments above for creating a ~/.aws/credentials file and a Lambda role, and change the name of the S3 bucket to the one you created in the previous step.

Create an output location (on S3)

In this example, we are going to write the output from our Lambda function to a bucket on S3, but this could be anywhere we can programmatically access from our Lambda function (e.g. a database, another service). Here is our empty output bucket dsmo-lambda-test-outputs:

Empty output S3 bucket

Make yourself an empty S3 bucket too.

Executing the Lambda function

Now we need to write a script that calls our Lambda function. To do this we are going to:

  • Write code that queries MAST using the astroquery.mast package
  • Grab the S3 URLs for a collection of Hubble WFC3/IR FITS files
  • Loop through the array of S3 URLs, each time calling our SEP-powered Lambda function

To run this piece of code (again, either as a Python script or in a notebook) add your own credentials at the top and the name of your empty output bucket as s3_output_bucket.

Because we are triggering Lambda in an asynchronous[4] mode (by passing InvocationType='Event'), the API calls to Lambda to process the 100 files we have queried from MAST fire extremely quickly (~1 second total).
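The asynchronous invocation loop can be sketched in a few lines of Python. This is an illustration rather than the exact script used for this post: the function name and the payload keys (fits_s3_url and s3_output_bucket) are hypothetical and must match whatever your process.py expects. The `lambda_client` argument is a boto3 Lambda client, e.g. boto3.client('lambda', region_name='us-east-1').

```python
import json
from typing import Any, Iterable


def invoke_for_each(lambda_client: Any, function_name: str,
                    s3_urls: Iterable[str], s3_output_bucket: str) -> int:
    """Fire one asynchronous Lambda invocation per S3 URL.

    Returns the number of invocations that AWS accepted for processing.
    """
    accepted = 0
    for url in s3_urls:
        # Hypothetical event structure: adapt the keys to your handler.
        payload = {"fits_s3_url": url, "s3_output_bucket": s3_output_bucket}
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps(payload).encode("utf-8"),
        )
        # Asynchronous invocations are acknowledged with HTTP status 202.
        if response.get("StatusCode") == 202:
            accepted += 1
    return accepted
```

Because each call returns as soon as the request is queued, the loop finishes long before the functions themselves do; the results show up in the output bucket as they complete.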

As soon as our script runs, we can start checking our output bucket for the results of the Lambda SEP function[5]:

Output S3 bucket with FITS tables

How did our script do? The last step of our Lambda function writes out a FITS table with the catalog of sources detected by SEP. Let’s open one of these FITS catalogs and overlay it with the WFC3/IR image:

SEP Sources

Sources detected by SEP. Not too shabby!

Estimating costs

While we could extend our Lambda function further, at this point we have some code that does something reasonably substantial with a WFC3/IR image. Let’s look at how long these functions took to execute, and how much it cost.

CloudWatch logs

When a Lambda function executes, its outputs (e.g. any print statements in your script, as well as a high-level summary of the Lambda execution) are written to a service called CloudWatch.

CloudWatch logs

CloudWatch logs for our SEP function.

These logs also include a summary report stating how long the function took to execute, how many milliseconds we are being charged for, and how much memory was used by our function:

REPORT RequestId: b851a5c3-6450-11e8-adbb-2f24ab6d943e  
Duration: 1060.45 ms  
Billed Duration: 1100 ms Memory Size: 1024 MB	Max Memory Used: 164 MB

Looking at the above report for one of the Lambda function executions, our SEP source-extraction function (for one of the WFC3/IR images) took ~1.1s to download the image, run SEP, and write the FITS catalog back out to S3.
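If you want to pull these numbers out of many log entries, a few regular expressions are enough. The Python sketch below parses the fields shown in the report above; the function and field names are our own, and it assumes the report text uses exactly the labels seen in this example.

```python
import re
from typing import Dict

# One pattern per field. The lookbehind stops "Duration" from matching
# inside "Billed Duration".
REPORT_FIELDS = {
    "duration_ms": r"(?<!Billed )Duration:\s*([\d.]+)\s*ms",
    "billed_duration_ms": r"Billed Duration:\s*([\d.]+)\s*ms",
    "memory_size_mb": r"Memory Size:\s*([\d.]+)\s*MB",
    "max_memory_used_mb": r"Max Memory Used:\s*([\d.]+)\s*MB",
}


def parse_report(report: str) -> Dict[str, float]:
    """Extract the timing and memory fields from a Lambda REPORT log entry."""
    result = {}
    for name, pattern in REPORT_FIELDS.items():
        match = re.search(pattern, report)
        if match is None:
            raise ValueError(f"field not found in report: {name}")
        result[name] = float(match.group(1))
    return result


def billed_gb_seconds(report: str) -> float:
    """Billed duration (s) multiplied by the configured memory (GB)."""
    fields = parse_report(report)
    return (fields["billed_duration_ms"] / 1000.0) * (fields["memory_size_mb"] / 1024.0)
```

For the report above this gives a billed duration of 1100 ms at 1024 MB, i.e. 1.1 GB-seconds for that single invocation.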

Show me the money!

Earlier in the post, we mentioned that Lambda is priced in GBms. While we only used 164MB of RAM with this function, we requested 1024MB (1GB) of memory, and so we are charged for the amount we asked for.[6]

Let’s work out how much it cost to process the 100 WFC3/IR images we queried for:

# Cost per function call:
$0.00001667 * 1.1s = $0.000018337

# Cost for 100 function calls (all the images in our query):
$0.00001667 * 1.1s * 100 = $0.0018337

# Cost to process every WFC3/IR image ever taken (all 122,078 of them):
$0.00001667 * 1.1s * 122,078 = $2.24

So for less than $0.01 ($0.0018337) we have extracted sources for 100 WFC3/IR images. Extrapolating these numbers to every WFC3/IR image ever taken (122,078 at the time of writing) works out at about $2.24 in Lambda charges.
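The same arithmetic is easy to wrap in a small Python function if you want to try other image counts, durations, or memory sizes. Like the calculation above, this sketch only covers the duration charge at $0.00001667 per GB-second and ignores everything else; the function and parameter names are our own.

```python
PRICE_PER_GB_SECOND = 0.00001667  # USD, the rate used in the calculation above


def lambda_compute_cost(n_calls: int, billed_seconds: float = 1.1,
                        memory_gb: float = 1.0,
                        price: float = PRICE_PER_GB_SECOND) -> float:
    """Duration charge in USD for `n_calls` invocations.

    Each call is billed for `billed_seconds` at `memory_gb` of requested memory.
    """
    return n_calls * billed_seconds * memory_gb * price
```

Plugging in 100 calls reproduces the $0.0018337 figure, and 122,078 calls gives roughly $2.24.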

AWS free tier

The above costing ignores the fact that AWS has a free tier available which gives you a limited amount of free compute (400,000 GB-SECONDS to be precise) per month. Quoting from the Lambda pricing page:

The Lambda free tier does not automatically expire at the end of your 12 month AWS Free Tier term, but is available to both existing and new AWS customers indefinitely.
1M REQUESTS FREE
400,000 GB-SECONDS PER MONTH FREE
$0.00001667 FOR EVERY GB-SECOND USED THEREAFTER

With our example of processing 122,078 files we used something like 134,000 GB-seconds, well within the 400,000 free GB-seconds, i.e. it was completely free.
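To check other workloads against the free tier, the comparison can be written as a short Python sketch. It assumes the 400,000 GB-second monthly allowance quoted above and, as before, 1.1 billed seconds at 1 GB per call; the function names are hypothetical.

```python
FREE_GB_SECONDS_PER_MONTH = 400_000  # allowance quoted from the pricing page


def gb_seconds_used(n_calls: int, billed_seconds: float = 1.1,
                    memory_gb: float = 1.0) -> float:
    """Total GB-seconds consumed by `n_calls` invocations."""
    return n_calls * billed_seconds * memory_gb


def billable_gb_seconds(n_calls: int, billed_seconds: float = 1.1,
                        memory_gb: float = 1.0,
                        free: float = FREE_GB_SECONDS_PER_MONTH) -> float:
    """GB-seconds left to pay for after subtracting the monthly free allowance."""
    used = gb_seconds_used(n_calls, billed_seconds, memory_gb)
    return max(0.0, used - free)
```

For 122,078 calls the usage is about 134,000 GB-seconds and nothing is billable; you would need to process the whole WFC3/IR archive roughly three times in one month before the allowance ran out.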

If you want to get a feel for how much Lambda costs, take a look at this simple cost calculator from dashbird.

Caveats with this cost estimate

A few caveats with this cost calculation:

  • We have not included the cost of storing the results (FITS tables) on S3. This is likely about $0.11/month for ~10GB of output data.
  • We are assuming that you are running the script to invoke Lambda on a machine you own (i.e. you are not running a separate machine on AWS).
  • Important: This costing assumes that you are running the Lambda function in US-East Region which is in the same AWS Region as the data in S3. In this mode of operation, there are no charges for downloading the data from S3 to your Lambda environment.

Lots of Lambda

In the cost calculation above, we worked out the theoretical cost of processing all of the WFC3/IR images. How long might this actually take if we were to try it?

Lambda by default has a concurrency limit of 1,000 simultaneously executing functions. Assuming we want to process all 122,078 WFC3/IR images then:

# Assuming we could spread the 122,078 file processing across
# all 1,000 Lambda processes available to us:
122,078 / 1,000 = ~122 images per Lambda concurrency unit

# Assuming average compute time of 1.1s
122 * 1.1 = ~134 seconds

That’s right, it would likely take just over two minutes to process every WFC3/IR image ever taken using Lambda[7] ⚡️⚡️⚡️.
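Here is the same back-of-the-envelope estimate as a Python sketch, so you can vary the concurrency limit or the per-image time. It is idealized: it assumes the work is spread perfectly evenly and ignores start-up and invocation overheads. Unlike the rounded arithmetic above, it rounds the images per worker up (123 rather than ~122), so it returns about 135 seconds instead of ~134. The function name is our own.

```python
import math


def estimated_wall_clock_seconds(n_images: int, concurrency: int = 1000,
                                 seconds_per_image: float = 1.1) -> float:
    """Idealized time to process `n_images` with `concurrency` parallel workers."""
    # The busiest worker handles ceil(n_images / concurrency) images in sequence.
    images_per_worker = math.ceil(n_images / concurrency)
    return images_per_worker * seconds_per_image
```

Either way the conclusion is the same: at the default concurrency limit, the whole archive of WFC3/IR images is a two-minute job.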

Conclusions and next steps

Services such as Lambda (a.k.a. serverless computing) offer a powerful new model for on-demand computing, especially when combined with cloud-hosted astronomical datasets.

We would love to hear about your experiences of using AWS Lambda for astronomical data processing. You can find us on Twitter (@mast_news) or email us on dsmo@stsci.edu.

Finally, as a reminder, if you are interested in doing more with the HST public dataset on AWS then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Brought to you by Arfon Smith & Iva Momcheva

Footnotes

  1. Presumably this is an homage to Anonymous functions 

  2. http://martinfowler.com/articles/serverless.html 

  3. https://github.com/kbarbary/sep 

  4. Other options are available https://docs.aws.amazon.com/lambda/latest/dg/API_Invoke.html#API_Invoke_RequestSyntax 

  5. Note, in this example, we are writing to an S3 bucket but we could write the results out somewhere else convenient. 

  6. We could have requested less RAM but that generally means you get less CPU with your function too so it is not clear the function would have been cheaper to execute. See this blog post for an analysis of this. 

  7. It is currently not easy to actually retrieve all 122,078 WFC3/IR file names from the MAST API using Astroquery. We are working on making this faster… 🔜 

Making HST Public Data Available on AWS

tl;dr - All public data from Hubble’s currently active instruments are now available on Amazon Web Services. In this post, we show you how to access it and announce a new opportunity for funding to make use of the data.

The Hubble Space Telescope has undeniably expanded our understanding of the universe during its 28 years in space so far, but this is not just due to its superior view from space: One of the major advantages to Hubble is that every single image it takes becomes public within six months (and in many cases immediately) after it is beamed back to Earth. The treasure trove that is the Hubble archive has produced just as many discoveries by scientists using the data “second hand” as it has from the original teams who requested the observations. Providing access to archives is at the core of our mission.

For all its richness however, the archive of Hubble observations has been geared to individual astronomers analyzing relatively small sets of data. The data access model has always been that an astronomer first downloads the data and then analyzes it on their own computer. Currently, most astronomers are limited both in the volume of data they can reasonably download, and by their access to large-scale computing resources.

HST public dataset on Amazon Web Services

We’re pleased to announce that as of May 2018, ~110 TB of Hubble’s archival observations are available in cloud storage on Amazon Web Services (AWS), which provides unlimited access to the data right next to the wide variety of computing resources provided by AWS.

These data consist of all raw and processed observations from the currently active instruments: the Advanced Camera for Surveys (ACS), the Wide Field Camera 3 (WFC3), the Cosmic Origins Spectrograph (COS), the Space Telescope Imaging Spectrograph (STIS) and the Fine Guidance Sensors (FGS).

The data on AWS (available at https://registry.opendata.aws/hst/) are kept up to date with the data held in MAST: new and reprocessed data appear on AWS within 20 minutes of being updated at MAST.

So, how do I use it?

To get started you will need:

  • An AWS account. Sign up for an account using the AWS Console.
  • A running EC2 instance in US-East (N. Virginia) (watch this video on starting an instance) with Python 3. We recommend the astroconda Anaconda channel.
  • The astroquery and boto3 Python libraries. These do not come standard with the astroconda distribution and need to be installed separately.
  • An AWS access key ID and a secret access key. These can be generated under User > Your Security Credentials > Access Keys in the AWS console. Remember to save the ID-key combination.
  • Some code to query MAST and download data from the public dataset. In order to view or analyze a file from the archive, you’ll need to transfer it from S3 to your instance. This transfer, however, is free as long as it happens within the same AWS region (US-East N. Virginia).

Alternatively…

To help you get started, we have simplified the process of setting up an EC2 instance by creating an Amazon Machine Image (AMI) with all the necessary software pre-installed (astroconda, boto3, astroquery). To launch a copy of this machine, search the AMI Community Marketplace for “STScI-Hubble-Public-Data” or ami-cfdfb6b0. The README in the home directory of the AMI describes how to set your AWS credentials as environment variables and how to run the example in this post on the instance.
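As a rough illustration of the credentials step, the sketch below checks that the two standard AWS credential environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which boto3 reads automatically) are set before any download is attempted. The helper name `missing_aws_credentials` is made up for this example and is not taken from the AMI's README.

```python
import os

# Environment variable names that boto3 and the AWS CLI read credentials from.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required credential variables that are unset or empty.

    `env` defaults to the process environment; pass a dict to check
    something else (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Set these before running the example:", ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```

Credentials can also live in `~/.aws/credentials` or come from an instance role, so an empty result here is sufficient but not necessary for boto3 to authenticate.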

This example shows you how to grab several drizzled images for the CANDELS WFC3/IR observations of the GOODS-South field:
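A minimal sketch of the S3 transfer step follows. It assumes the MAST query (made with astroquery) has already produced a list of `s3://` URIs for the drizzled files, and that `bucket` is a boto3 bucket resource for the public dataset. The function names, the `requester_pays` switch, and the example URI layout are illustrative assumptions, not the exact code behind the numbers quoted in this post.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/file.fits' into ('bucket', 'path/to/file.fits')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_products(bucket, s3_uris, requester_pays=False):
    """Copy each S3 object into the current directory; return the local file names.

    `bucket` is any object with a boto3-style
    download_file(Key, Filename, ExtraArgs=...) method.
    Set requester_pays=True if the bucket bills requests to the caller
    (an assumption to verify against the dataset's registry page).
    """
    extra = {"RequestPayer": "requester"} if requester_pays else None
    local_names = []
    for uri in s3_uris:
        _, key = split_s3_uri(uri)
        filename = key.rsplit("/", 1)[-1]  # keep only the FITS file name
        bucket.download_file(key, filename, ExtraArgs=extra)
        local_names.append(filename)
    return local_names
```

On an EC2 instance with credentials configured, `bucket` would be something like `boto3.resource("s3").Bucket("stpubdata")`, where the bucket name is an assumption to check against https://registry.opendata.aws/hst/. Taking the bucket as an argument keeps the download loop easy to test with a stand-in object.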

Transferring all 270 images (13 MB each, or > 3 GB total) takes 90 seconds. For comparison, downloading the data over an average network connection (~50 Mbps) would take over eight minutes, roughly five to six times longer. You can now display the images, do source detection on them, mosaic them together, etc.
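The comparison can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the figures quoted in the text and assumes decimal units (1 MB = 8 megabits) and a fully saturated 50 Mbps link with no protocol overhead, so it is an idealized estimate rather than a measurement.

```python
N_IMAGES = 270
IMAGE_SIZE_MB = 13        # megabytes per drizzled image (from the text)
S3_TRANSFER_S = 90        # transfer time within AWS, in seconds (from the text)
HOME_LINK_MBPS = 50       # "average" connection, in megabits per second

total_mb = N_IMAGES * IMAGE_SIZE_MB              # 3510 MB, i.e. about 3.5 GB
home_rate_mb_s = HOME_LINK_MBPS / 8              # 6.25 MB/s
home_transfer_s = total_mb / home_rate_mb_s      # 561.6 s, about 9.4 minutes
slowdown = home_transfer_s / S3_TRANSFER_S       # about 6.2x longer
s3_rate_mbps = total_mb * 8 / S3_TRANSFER_S      # 312 Mbps effective within AWS

print(f"total: {total_mb / 1000:.2f} GB")
print(f"50 Mbps link: {home_transfer_s / 60:.1f} min ({slowdown:.1f}x the S3 time)")
print(f"effective S3 rate: {s3_rate_mbps:.0f} Mbps")
```

Under these assumptions the download takes a little over nine minutes, about six times the 90 seconds measured within AWS, which is in line with the rough figures quoted above.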

A cloud hosted copy of Hubble data

The Hubble AWS Public Dataset is not a substitute for the Mikulski Archive for Space Telescopes (MAST). Data are, and always will be, available free of charge from MAST. Also, while we’re making every effort to keep the data on AWS up to date, if you absolutely definitely want to be sure you’re getting the latest and greatest calibrated data, you should download directly from MAST rather than this copy on AWS.

Using these data from within the US-East (N. Virginia) AWS region does not incur any charges, but downloading data from this copy to other AWS regions or outside of AWS will cost money. Also, note that the copy on AWS only includes public data. Proprietary datasets aren’t available.

By distributing this copy of Hubble data on AWS, we’re exploring a new kind of archive service: one where the data are highly available, i.e., with bulk, high-speed access right next to the vast computational resources of Amazon Web Services.

Astronomers who want to experiment with AWS can take advantage of its free tier. Elastic Compute Cloud (EC2), the AWS service which provides basic compute capacity, offers a one-year free tier for new users, which is ideal for learning, experimenting and testing. In later posts, we’ll show you how you can process significant volumes of data at little or no cost.

If you’re interested in doing more with these data then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Proposals to make use of this dataset should include the phrase ‘Cloud Exploration:’ at the beginning of their proposal title and should include a line item in their budget for AWS costs (limit $10,000 USD). For questions regarding the call for proposals, you can reach us at dsmo@stsci.edu.

Tell us more about how you did this…

The Hubble data is hosted on AWS as a result of an agreement between STScI and AWS to participate in the AWS Open Data Program. There it joins a wide variety of other datasets, including Landsat-8 imaging, 1000 Human Genomes and the subtitles of 32,000 movies. The initial hosting agreement between AWS and STScI is for three years and can be extended based on the data access volume and frequency.

So how do you move 110 TB of data from Baltimore to Virginia? Turns out the best way to transport large quantities of data is still via mail. We used the AWS Snowball service to move data from STScI to AWS. The Snowball is an 80 TB bank of hard drives (larger options are available 😀) which we plugged into our local network and, after some debugging, rsync-ed the data to. Then we mailed it back. Two Snowballs were needed to deliver all the data. Once the initial copy was uploaded to S3, we worked with our internal pipelines team to ensure that, going forward, the files on AWS are updated as soon as there is a change internally. And that is it! The updates happen in near real time: the S3 copy of the data is only 10-20 minutes behind MAST. Proprietary data are not included in the AWS copy; PIs can retrieve their proprietary data only from MAST.

Wrapping up

Whether you’re looking to process large volumes of HST data or train some kind of deep learning algorithm to analyze Hubble images, we think that making Hubble public data available in the cloud is a first step in facilitating new, more sophisticated analyses of archival data.

Teams such as the PHAT survey have already utilized cloud computing to handle their data processing needs, and we cannot wait to see analyses involving machine learning, transient detection, large multi-epoch mosaics, and joint processing with other survey data carried out on these data.

We hope you find this new data availability useful and we look forward to reading your Cycle 26 proposals and papers on the arXiv!

Brought to you by Iva Momcheva, Arfon Smith, Josh Peek, and Mike Fox

FAQ & Resources

Where are the data?: AWS US East

What data have you uploaded?: Currently active instruments: ACS, COS, STIS, WFC3, FGS

How can I access the data?: You’ll need an AWS account. See this example of how to use your AWS account with boto3 and Python.

How much does it cost to access the data?: Within the AWS US-East region it’s free. To download outside of US-East standard S3 charges apply.

So now you’re charging for Hubble data?: No, Hubble data is, and will always be, free from MAST. This copy of the Hubble data held in MAST is being provided in a ‘highly available’ environment next to the significant computational resources of the AWS platform.

How can I get some money to do science with this data?: We’re glad you asked! HST CFP 26 explicitly calls out this dataset as something we’d like you to explore.

I like this idea but I’d rather use a different cloud vendor.: Please get in touch and let us know.