Exploring AWS Lambda with cloud-hosted Hubble public data

tl;dr: In this post we are going to show you how to process every WFC3/IR image on AWS Lambda in about 2 minutes (and for about $2).

In our earlier post, we announced the availability of HST public data for currently active instruments in the AWS Public Dataset Program. In that post we described how to access ~110TB of data (raw and calibrated) from ACS, WFC3, STIS, and COS available in the stpubdata S3 bucket.

In this post we will show how to leverage an AWS cloud service called Lambda to process a set of WFC3/IR data. Using this approach it is possible to process every WFC3/IR image (all ~120,000 of them) on AWS Lambda in about 2 minutes (and for about $2).

A brief introduction to Lambda

Lambda1 is a serverless2, cloud-hosted function that can be called on demand. The basic idea is that a function (some code written by you) is saved somewhere and run when needed. When the function is not executing there is no cost; when it is, you pay only for the CPU and memory used for the duration of the execution. This is why services like Lambda are charged in unusual units like GBms (gigabyte-milliseconds), which combine the memory allocated to the function with how long it runs.

‘Serverless’ computing is an exciting development in modern computing architectures, and AWS is not alone in offering a service like this; other major cloud vendors have comparable function-as-a-service offerings.

Getting set up

In this post we are going to be using AWS Lambda together with the HST public data on S3 to demonstrate how serverless computing can be used to process data at scale (and very low cost).

In what follows, we are going to assume you already have an AWS account, have AWS secret access keys, and are able to create an authenticated session using the boto3 Python package with these keys. The examples in this post can be run locally, and all you need installed are Docker, the Python AWS client library boto3, and the latest version of the astroquery library (install from master or build from source to get the necessary features). You can use an AWS EC2 instance to work through the example, but that is not required because we will not be downloading any of the data.

If you want to try something simpler than this example, take a look at our earlier blog post introducing the AWS Hubble public dataset.

Building our Lambda function

The first thing we need to do, is write a Lambda function and build the computational environment (e.g. any software dependencies) required to support the function.

Because Lambda functions spawn in milliseconds, all of our dependencies (Astropy, numpy, etc.) must be installed and available as soon as the function is triggered. Lambda does not allow you to specify the machine environment that supports the execution of your function (the underlying machine running a Lambda function is Amazon Linux), so you need to compile all of your dependencies and zip them up as part of the Lambda function.

Generating the Lambda code bundle

Generating Lambda functions with dependencies can be a little tricky, so in this repository we have created an example project that demonstrates how to generate Lambda code bundles using a simple bash script and Docker. Clone this repository on your local machine, then build and upload the code bundle as follows:

  1. Install and start Docker.
  2. Run the following two commands in the repository directory to generate a code bundle called venv.zip that is designed to work on AWS Lambda:
    $ docker pull amazonlinux:2017.09
    $ docker run -v $(pwd):/outputs -it amazonlinux:2017.09 /bin/bash /outputs/build.sh
    
  3. Open the S3 console and create a new bucket to hold your function.
  4. Upload the venv.zip file to the bucket.

To change what the Lambda function actually does, modify process.py, which is the code that is called when our Lambda function executes. If you change the function dependencies, modify the file requirements.txt to reflect the requirements of your function.

Creating a Lambda function

In this worked example, we are going to use the SEP library3 (which makes the core parts of Source Extractor available as a standalone library) to find sources in a collection of WFC3/IR images.

Note: If you would rather begin with a more straightforward example, we have also written a small Lambda function that downloads a FITS file from a location on S3 and summarizes the content of the opened FITS file using the astropy.io.fits info() function.
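To give a flavour of what such a handler can look like, here is a minimal sketch of a SEP-based process.py. This is not the repository's exact code: the event keys (fits_s3_key, s3_output_bucket), the assumption of a SCI extension, and the output naming are all illustrative.

    # process.py -- illustrative sketch of a SEP-based Lambda handler.
    # Event keys and output naming are assumptions, not the repository's code.
    import os
    import boto3
    import numpy as np
    import sep
    from astropy.io import fits
    from astropy.table import Table

    s3 = boto3.client('s3')

    def handler(event, context):
        # Hypothetical event payload: the key of a WFC3/IR FITS file in the
        # requester-pays stpubdata bucket, plus a bucket to write results to.
        fits_s3_key = event['fits_s3_key']
        output_bucket = event['s3_output_bucket']

        local_path = '/tmp/' + os.path.basename(fits_s3_key)
        s3.download_file('stpubdata', fits_s3_key, local_path,
                         ExtraArgs={'RequestPayer': 'requester'})

        # Read the science extension (assumed to be named SCI) and cast to
        # native-endian float32 so SEP can work on it.
        with fits.open(local_path) as hdulist:
            data = hdulist['SCI'].data.astype(np.float32)

        # Estimate the background, subtract it, and extract sources.
        bkg = sep.Background(data)
        objects = sep.extract(data - bkg.back(), 1.5, err=bkg.globalrms)

        # Write the source catalog as a FITS table and upload it to our bucket.
        catalog_path = local_path.replace('.fits', '_cat.fits')
        Table(objects).write(catalog_path, format='fits', overwrite=True)
        s3.upload_file(catalog_path, output_bucket, os.path.basename(catalog_path))

        return {'n_sources': len(objects)}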

Assuming you have run the build.sh script locally and your process.py matches the example in this repository, we now need to register a new Lambda function with AWS as follows:

This code and the one below can be run either as a Python script or from a Jupyter notebook. Make sure you follow the inline comments above for creating a ~/.aws/credentials file and a Lambda role, and change the name of the S3 bucket to the one you created in the previous step.
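In outline, the registration call looks something like the sketch below; the function name, IAM role ARN, and code bucket are placeholders, and the handler string assumes your entry point is a function called handler in process.py.

    # Sketch only: register the code bundle as a Lambda function with boto3.
    # FunctionName, the Role ARN and the S3 bucket/key are placeholders.
    import boto3

    lam = boto3.client('lambda', region_name='us-east-1')

    response = lam.create_function(
        FunctionName='wfc3ir-sep',               # any name you like
        Runtime='python3.6',                     # the runtime of the era; newer runtimes also exist
        Role='arn:aws:iam::123456789012:role/lambda_exec_role',  # your Lambda execution role
        Handler='process.handler',               # module.function inside venv.zip
        Code={'S3Bucket': 'my-lambda-code-bucket',  # the bucket holding venv.zip
              'S3Key': 'venv.zip'},
        Timeout=60,                              # seconds
        MemorySize=1024,                         # MB (matches the costing later in the post)
    )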

Create an output location (on S3)

In this example, we are going to write the output from our Lambda function to a bucket on S3, but it could go anywhere our Lambda function can access programmatically (e.g. a database or another service). Here is our empty output bucket, dsmo-lambda-test-outputs:

Empty output S3 bucket

Make yourself an empty S3 bucket too.
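If you would rather create the bucket programmatically than through the console, a couple of lines of boto3 will do it (bucket names must be globally unique):

    # Create an empty output bucket in us-east-1; pick your own, globally unique name.
    import boto3

    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='my-lambda-test-outputs')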

Executing the Lambda function

Now we need to write a script that calls our Lambda function. To do this we are going to:

  • Write code that queries MAST using the astroquery.mast package
  • Grab the S3 URLs for a collection of Hubble WFC3/IR FITS files
  • Loop through the array of S3 URLs, each time calling our SEP-powered Lambda function

To run this piece of code (again, either as a Python script or in a notebook) add your own credentials at the top and the name of your empty output bucket as s3_output_bucket.

Because we are triggering Lambda in an asynchronous4 mode (by passing InvocationType='Event'), the API calls to Lambda to process the 100 files we have queried from MAST fire extremely quickly (~1 second total).
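In outline, the driver script looks something like the sketch below. The astroquery calls use the current cloud-data API (enable_cloud_dataset / get_cloud_uris, which have been renamed since this post was first written), and the function name and event payload follow the illustrative handler sketch above rather than the repository's exact code.

    # Sketch of the driver: query MAST for WFC3/IR products, then fire an
    # asynchronous Lambda invocation per file.
    import json
    import boto3
    from astroquery.mast import Observations

    # Resolve products to their copies in the stpubdata bucket
    # (current astroquery API; older versions used different function names).
    Observations.enable_cloud_dataset(provider='AWS')

    obs = Observations.query_criteria(obs_collection='HST',
                                      instrument_name='WFC3/IR',
                                      dataproduct_type='image')
    products = Observations.get_product_list(obs[:10])   # small slice for illustration
    products = Observations.filter_products(products,
                                            productSubGroupDescription='DRZ')
    s3_uris = Observations.get_cloud_uris(products)       # e.g. 's3://stpubdata/hst/...'

    session = boto3.Session(aws_access_key_id='YOUR_ACCESS_KEY_ID',
                            aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                            region_name='us-east-1')
    lam = session.client('lambda')

    s3_output_bucket = 'my-lambda-test-outputs'   # the bucket created earlier

    for uri in s3_uris:
        if uri is None:
            continue
        event = {'fits_s3_key': uri.replace('s3://stpubdata/', ''),
                 's3_output_bucket': s3_output_bucket}
        # InvocationType='Event' makes the call asynchronous, so this loop
        # returns almost immediately.
        lam.invoke(FunctionName='wfc3ir-sep',
                   InvocationType='Event',
                   Payload=json.dumps(event))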

As soon as our script runs, we can start checking our output bucket for the results of the Lambda SEP function5:

Output S3 bucket with FITS tables

How did our script do? The last step of our Lambda function writes out a FITS table with a catalog of the sources detected by SEP. Let’s open one of these FITS catalogs and overlay it on the WFC3/IR image:

SEP Sources

Sources detected by SEP. Not too shabby!
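An overlay like this only takes a few lines of matplotlib once the image and its catalog have been copied from the output bucket; the file names below are placeholders.

    # Sketch: overlay a SEP catalog (FITS table) on its WFC3/IR image.
    # File names are placeholders for files downloaded from the output bucket.
    import matplotlib.pyplot as plt
    import numpy as np
    from astropy.io import fits
    from astropy.table import Table

    with fits.open('example_drz.fits') as hdulist:
        image = hdulist['SCI'].data.astype(np.float32)
    catalog = Table.read('example_drz_cat.fits')

    vmin, vmax = np.percentile(image, [5, 99])
    plt.imshow(image, cmap='gray', vmin=vmin, vmax=vmax, origin='lower')
    plt.scatter(catalog['x'], catalog['y'], s=20,
                facecolors='none', edgecolors='red')
    plt.savefig('sep_sources.png', dpi=150)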

Estimating costs

While we could extend our Lambda function further, at this point we have some code that does something reasonably substantial with a WFC3/IR image. Let’s look at how long these functions took to execute, and how much it cost.

CloudWatch logs

When a Lambda function executes, its output (e.g. any print statements in your script, as well as a high-level summary of the execution) is written to a service called CloudWatch.

CloudWatch logs

CloudWatch logs for our SEP function.

These logs also include a summary report stating how long the function took to execute, how many milliseconds we are being charged for, and how much memory was used by our function:

REPORT RequestId: b851a5c3-6450-11e8-adbb-2f24ab6d943e
Duration: 1060.45 ms
Billed Duration: 1100 ms
Memory Size: 1024 MB
Max Memory Used: 164 MB

Looking at the above report for one of the Lambda function executions, our SEP source-extraction function (for one of the WFC3/IR images) took ~1.1s to download the image, run SEP, and write the FITS catalog back out to S3.

Show me the money!

Earlier in the post, I mentioned that Lambda is priced in GBms. While we only used 164MB of RAM with this function, we requested 1024MB (1GB) of memory and so we are charged for how much we asked for6.

Let’s work out how much it cost to process the 100 WFC3/IR images we queried for:

# Cost per function call:
$0.00001667 * 1.1s = $0.000018337

# Cost for 100 function calls (all the images in our query):
$0.00001667 * 1.1s * 100 = $0.0018337

# Cost to process every WFC3/IR image ever taken (all 122,078 of them):
$0.00001667 * 1.1s * 122,078 = $2.24

So for less than $0.01 ($0.0018337) we have extracted sources for 100 WFC3/IR images. Extrapolating these numbers to every WFC3/IR image ever taken (122,078 at the time of writing) works out at about $2.24 in Lambda charges.

AWS free tier

The above costing ignores the fact that AWS has a free tier available which gives you a limited amount of free compute (400,000 GB-SECONDS to be precise) per month. Quoting from the Lambda pricing page:

The Lambda free tier does not automatically expire at the end of your 12 month AWS Free Tier term, but is available to both existing and new AWS customers indefinitely.
1M REQUESTS FREE
400,000 GB-SECONDS PER MONTH FREE
$0.00001667 FOR EVERY GB-SECOND USED THEREAFTER

With our example of processing 122,078 files we would use something like 134,000 GB-seconds, i.e., well within the monthly free tier, so it would be completely free.

If you want to get a feel for how much Lambda costs, take a look at this simple cost calculator from dashbird.

Caveats with this cost estimate

A few caveats with this cost calculation:

  • We have not included the cost of storing the results (FITS tables) on S3. This is likely around $0.11/month for ~10GB of output data.
  • We are assuming that you are running the script to invoke Lambda on a machine you own (i.e. you are not running a separate machine on AWS).
  • Important: This costing assumes that you are running the Lambda function in US-East Region which is in the same AWS Region as the data in S3. In this mode of operation, there are no charges for downloading the data from S3 to your Lambda environment.

Lots of Lambda

In the cost calculation above, we worked out the theoretical cost of processing all of the WFC3/IR images. How long might this actually take if we were to try it?

Lambda by default has a concurrency limit of 1,000 simultaneously executing functions. Assuming we want to process all 122,078 WFC3/IR images then:

# Assuming we could spread the 122,078 file processing across
# all 1,000 Lambda processes available to us:
122,078 / 1,000 = ~122 images per Lambda concurrency unit

# Assuming average compute time of 1.1s
122 * 1.1 = ~134 seconds

That’s right, it would likely take just over two minutes to process every WFC3/IR image ever taken using Lambda7 ⚡️⚡️⚡️.

Conclusions and next steps

Services such as Lambda (a.k.a. serverless computing) offer a powerful new model for on-demand computing, especially when combined with cloud-hosted astronomical datasets.

We would love to hear about your experiences of using AWS Lambda for astronomical data processing. You can find us on Twitter (@mast_news) or email us on dsmo@stsci.edu.

Finally, as a reminder, if you are interested in doing more with the HST public dataset on AWS then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Brought to you by Arfon Smith & Iva Momcheva

Footnotes

  1. Presumably this is an homage to Anonymous functions 

  2. http://martinfowler.com/articles/serverless.html 

  3. https://github.com/kbarbary/sep 

  4. Other options are available https://docs.aws.amazon.com/lambda/latest/dg/API_Invoke.html#API_Invoke_RequestSyntax 

  5. Note, in this example, we are writing to an S3 bucket but we could write the results out somewhere else convenient. 

  6. We could have requested less RAM but that generally means you get less CPU with your function too so it is not clear the function would have been cheaper to execute. See this blog post for an analysis of this. 

  7. It is currently not easy to actually retrieve all 122,078 WFC3/IR file names from the MAST API using Astroquery. We are working on making this faster… 🔜 

Making HST Public Data Available on AWS

tl;dr - All public data from Hubble’s currently active instruments are now available on Amazon Web Services. In this post, we show you how to access it and announce a new opportunity for funding to make use of the data.

The Hubble Space Telescope has undeniably expanded our understanding of the universe during its 28 years in space so far, but this is not just due to its superior view from space: one of the major advantages of Hubble is that every single image it takes becomes public within six months (and in many cases immediately) after it is beamed back to Earth. The treasure trove that is the Hubble archive has produced just as many discoveries from scientists using the data “second hand” as from the original teams who requested the observations. Providing access to archives is at the core of our mission.

For all its richness, however, the archive of Hubble observations has been geared towards individual astronomers analyzing relatively small sets of data. The data access model has always been that an astronomer first downloads the data and then analyzes it on their own computer. Currently, most astronomers are limited both by the volume of data they can reasonably download and by their access to large-scale computing resources.

HST public dataset on Amazon Web Services

We’re pleased to announce that as of May 2018, ~110 TB of Hubble’s archival observations are available in cloud storage on Amazon Web Services (AWS), providing unlimited access to the data right next to the wide variety of computing resources AWS offers.

These data consist of all raw and processed observations from the currently active instruments: the Advanced Camera for Surveys (ACS), the Wide Field Camera 3 (WFC3), the Cosmic Origins Spectrograph (COS), the Space Telescope Imaging Spectrograph (STIS) and the Fine Guidance Sensors (FGS).

The data on AWS (available at https://registry.opendata.aws/hst/) are kept up to date with the data held in MAST; new and reprocessed data appear on AWS within 20 minutes of being updated at MAST.

So, how do I use it?

To get started you will need:

  • An AWS account. Sign up for an account using the AWS Console.
  • A running EC2 instance in US-East (N. Virginia) (watch this video on starting an instance) with Python 3. We recommend the astroconda Anaconda channel.
  • The astroquery and boto3 Python libraries. These do not come standard with the astroconda distribution and need to be installed separately.
  • An AWS access key ID and a secret access key. These can be generated under User > Your Security Credentials > Access Keys in the AWS console. Remember to save the ID-key combination.
  • Some code to query MAST and download data from the public dataset (see the sketch just below this list). In order to view or analyze a file from the archive, you’ll need to transfer it from S3 to your instance. This transfer, however, is free as long as it happens within the same AWS region (US-East N. Virginia).
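As a minimal sketch, something like the following copies a single file from the requester-pays stpubdata bucket to your instance; the rootname and the hst/public/... path layout are illustrative assumptions.

    # Minimal sketch: copy one public HST file from the stpubdata bucket to
    # this EC2 instance. Credentials, rootname and path layout are placeholders.
    import boto3

    session = boto3.Session(aws_access_key_id='YOUR_ACCESS_KEY_ID',
                            aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                            region_name='us-east-1')
    s3 = session.client('s3')

    # stpubdata is a requester-pays bucket, so every request must say who pays.
    # Within us-east-1 the transfer itself is free.
    rootname = 'idxxxxxxq'   # hypothetical WFC3 exposure rootname; use a real one
    key = 'hst/public/{}/{}/{}_flt.fits'.format(rootname[:4], rootname, rootname)  # assumed layout
    s3.download_file('stpubdata', key, rootname + '_flt.fits',
                     ExtraArgs={'RequestPayer': 'requester'})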

Alternatively…

To help you get started, we have simplified the process of setting up an EC2 instance by creating an Amazon Machine Image (AMI) with all the necessary software pre-installed (astroconda, boto3, astroquery). To launch a copy of this machine, search the AMI Community Marketplace for “STScI-Hubble-Public-Data” or ami-cfdfb6b0. The README in the home directory of the AMI describes how to set your AWS credentials as environment variables and how to run the example from this post in the instance.

This example shows you how to grab several drizzled images for the CANDELS WFC3/IR observations of the GOODS-South field:
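In outline, the code looks something like the sketch below; the proposal ID is just one of the CANDELS GOODS-South programs, picked for illustration, and the astroquery calls use the current cloud-data API names.

    # Sketch: fetch drizzled CANDELS WFC3/IR images of GOODS-South from the
    # stpubdata bucket. Query parameters are illustrative, not the post's exact code.
    import os
    import boto3
    from astroquery.mast import Observations

    Observations.enable_cloud_dataset(provider='AWS')   # current astroquery API

    obs = Observations.query_criteria(obs_collection='HST',
                                      instrument_name='WFC3/IR',
                                      proposal_id='12062')   # a CANDELS GOODS-S program (illustrative)
    products = Observations.get_product_list(obs)
    drz = Observations.filter_products(products, productSubGroupDescription='DRZ')
    uris = Observations.get_cloud_uris(drz)

    s3 = boto3.client('s3')
    for uri in uris:
        if uri is None:
            continue
        key = uri.replace('s3://stpubdata/', '')
        # Requester-pays: the transfer is free as long as we stay in us-east-1.
        s3.download_file('stpubdata', key, os.path.basename(key),
                         ExtraArgs={'RequestPayer': 'requester'})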

Transferring all 270 images (13 MB each, or > 3GB total) takes 90 seconds. For comparison, downloading the data over an average network connection (~50 Mbps) would take over eight minutes, five to six times slower. You can now display the images, do source detection on them, mosaic them together, etc.

A cloud hosted copy of Hubble data

The Hubble AWS Public Dataset is not a substitute for the Mikulski Archive for Space Telescopes (MAST). Data are, and always will be, available free of charge from MAST. Also, while we’re making every effort to keep the data on AWS up to date, if you absolutely definitely want to be sure you’re getting the latest and greatest calibrated data, you should download directly from MAST rather than this copy on AWS.

Using these data from within the US-East (N. Virginia) AWS region does not incur any charges, but downloading data from this copy to other AWS regions or outside of AWS will cost money. Also, note that the copy on AWS only includes public data. Proprietary datasets aren’t available.

By distributing this copy of Hubble data on AWS, we’re exploring a new kind of archive service: one where the data are highly available, i.e., there is bulk, high-speed access to the data right next to the vast computational resources of Amazon Web Services.

Astronomers who want to experiment with AWS can take advantage of its free tier. In later posts, we’ll show you how you can process significant volumes of data at little or no cost. Elastic Compute Cloud (EC2), the AWS service which provides basic compute capacity, has a one-year free tier for new users which is ideal for learning, experimenting and testing.

If you’re interested in doing more with these data then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Proposals to make use of this dataset should include the phrase ‘Cloud Exploration:’ at the beginning of their proposal title and should include a line item in their budget for AWS costs (limit $10,000 USD). For questions regarding the call for proposal you can reach us at dsmo@stsci.edu.

Tell us more about how you did this…

The Hubble data is hosted on AWS as a result of an agreement between STScI and AWS to participate in the AWS Open Data Program. There it joins a wide variety of other datasets, including Landsat-8 imaging, 1000 Human Genomes and the subtitles of 32,000 movies. The initial hosting agreement between AWS and STScI is for three years and can be extended based on the data access volume and frequency.

So how do you move 110 TB of data from Baltimore to Virginia? It turns out the best way to transport large quantities of data is still via mail. We used the AWS Snowball service to move data from STScI to AWS. The Snowball is an 80 TB bank of hard drives (larger options are available 😀) which we plugged into our local network and, after some debugging, rsync-ed the data to. Then we mailed it back. Two Snowballs were needed to deliver all the data, and once the initial copy was uploaded to S3, we worked with our internal pipelines team to ensure that, going forward, the files on AWS are updated as soon as there is a change internally. And that is it! The updates happen in near real time - the S3 copy of the data is only 10-20 minutes behind MAST. Proprietary data are not included in the AWS copy; PIs of proprietary data can only retrieve those from MAST.

Wrapping up

Whether you’re looking to process large volumes of HST data, or train some kind of deep learning algorithm to analyze Hubble images, we think that making Hubble public data available in the cloud is a first step in facilitating new, more sophisticated analyses of archival data.

Teams such as the PHAT survey have already utilized cloud computing to handle their data processing needs, and we cannot wait to see analyses involving machine learning, transient detection, the creation of large multi-epoch mosaics, and joint processing with other survey data carried out on these data.

We hope you find this new data availability useful and we look forward to reading your Cycle 26 proposals and papers on the arXiv!

Brought to you by Iva Momcheva, Arfon Smith, Josh Peek, and Mike Fox

FAQ & Resources

Where are the data?: AWS US East

What data have you uploaded?: Currently active instruments: ACS, COS, STIS, WFC3, FGS

How can I access the data?: You’ll need an AWS account. See this example of how to use your AWS account with boto3 and Python.

How much does it cost to access the data?: Within the AWS US-East region it’s free. To download outside of US-East standard S3 charges apply.

So now you’re charging for Hubble data?: No, Hubble data is, and will always be, free from MAST. This copy of the Hubble data in MAST is being provided in a ‘highly available’ environment next to the significant computational resources of the AWS platform.

How can I get some money to do science with this data?: We’re glad you asked! HST CFP 26 explicitly calls out this dataset as something we’d like you to explore.

I like this idea but I’d rather use a different cloud vendor.: Please get in touch and let us know.