Exploring AWS Lambda with cloud-hosted Hubble public data

tl;dr: In this post we show how to process every WFC3/IR image on AWS Lambda in about 2 minutes (and for about $2).

In our earlier post, we announced the availability of HST public data for currently active instruments in the AWS Public Dataset Program. In that post we described how to access ~110TB of data (raw and calibrated) from ACS, WFC3, STIS, and COS available in the stpubdata S3 bucket.

In this post we will show how to leverage an AWS cloud service called Lambda to process a set of WFC3/IR data. Using this approach it is possible to process every WFC3/IR image (all ~120,000 of them) on AWS Lambda in about 2 minutes (and for about $2).

A brief introduction to Lambda

Lambda [1] is a serverless [2], cloud-hosted function that can be called on-demand. The basic idea is that a function (some code written by you) can be saved somewhere and used when needed. When the function is not executing there is no cost; when it is, you pay only for the CPU and memory used for the duration of the execution. This means that services like Lambda are charged in unusual units like GBms (gigabyte-milliseconds), a combination of the memory allocated to the function and how long it executes.
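For example (and these happen to be the numbers we will meet again later in this post):

# Billing for a function allocated 1 GB of memory that runs for 1,100 ms:
1 GB * 1,100 ms = 1,100 GBms (= 1.1 GB-seconds)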

‘Serverless’ computing is an exciting development in modern computing architectures and AWS is not alone in offering a service like this:

  • Lambda (AWS)
  • Cloud Functions (Google)
  • Azure Functions (Microsoft)

Getting set up

In this post we are going to be using AWS Lambda together with the HST public data on S3 to demonstrate how serverless computing can be used to process data at scale (and at very low cost).

In what follows, we are going to assume you already have an AWS account, have AWS secret access keys, and are able to create an authenticated session with these keys using the boto3 Python package. The examples below can be run locally; all you need installed are Docker, the Python AWS client library boto3, and the latest version of the astroquery library (install from the master branch or build from source to get the necessary features). You can use an AWS EC2 instance to work through the example, but that is not required because we will not be downloading any of the data.
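As a quick sanity check that your credentials work, something like the following minimal snippet (with placeholder keys) should print your AWS account ID:

import boto3

# Replace the placeholders with your own keys (or use a ~/.aws/credentials profile)
session = boto3.Session(aws_access_key_id='YOUR_ACCESS_KEY_ID',
                        aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
                        region_name='us-east-1')

# Prints the 12-digit account ID associated with the credentials
print(session.client('sts').get_caller_identity()['Account'])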

If you want to try something simpler than this example, take a look at our earlier blog post introducing the AWS Hubble public dataset.

Building our Lambda function

The first thing we need to do is write a Lambda function and build the computational environment (e.g. any software dependencies) required to support it.

Because Lambda functions spawn in milliseconds, Lambda requires all of our dependencies (Astropy, numpy, etc.) to be installed and available as soon as the function is triggered. Lambda does not allow you to specify the machine environment that supports the execution of your function (the underlying machine running a Lambda function is Amazon Linux), so you need to compile all of your dependencies for that platform and zip them up as part of the Lambda function.
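The resulting code bundle is just a zip file with your handler and its compiled dependencies sitting at the top level, something like this (contents illustrative):

venv.zip
├── process.py   # the Lambda handler code
├── astropy/     # compiled dependencies, unpacked at the root of the zip
├── numpy/
└── sep/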

Generating the Lambda code bundle

Generating Lambda functions with dependencies can be a little tricky, so in this repository we have created an example project that demonstrates how to generate Lambda code bundles using a simple bash script and Docker. Clone this repository on your local machine, then build and stage the code bundle as follows:

  1. Install and start Docker.
  2. Run the following two commands in the repository directory to generate a code bundle called venv.zip that is designed to work on AWS Lambda:
    $ docker pull amazonlinux:2017.09
    $ docker run -v $(pwd):/outputs -it amazonlinux:2017.09 /bin/bash /outputs/build.sh
    
  3. Open the S3 console and create a new bucket to hold your function.
  4. Upload the venv.zip file to the bucket (or use the command-line alternative below).
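If you prefer the command line to the S3 console, steps 3 and 4 can also be done with the AWS CLI (the bucket name here is a placeholder):

    $ aws s3 mb s3://my-lambda-bundles
    $ aws s3 cp venv.zip s3://my-lambda-bundles/venv.zip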

To change what the Lambda function actually does, modify process.py, which is the code that is called when our Lambda function executes. If you change the function dependencies, modify the file requirements.txt to reflect the requirements of your function.
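For the SEP-based example that follows, requirements.txt would contain something like this (exact packages and versions depend on your function; this list is illustrative):

# requirements.txt
numpy
astropy
sep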

Creating a Lambda function

In this worked example, we are going to use the SEP library [3] (which makes the core parts of Source Extractor available as a standalone library) to find sources in a collection of WFC3/IR images.
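To give a flavor of what process.py might contain, here is a minimal sketch of a SEP-powered handler. This is our own illustration rather than the exact code in the example repository: the payload keys (fits_s3_key, s3_output_bucket) and the catalog naming convention are assumptions.

import boto3
import numpy as np
import sep
from astropy.io import fits
from astropy.table import Table

def handler(event, context):
    # Download the image from the (requester-pays) stpubdata bucket
    s3 = boto3.client('s3')
    s3.download_file('stpubdata', event['fits_s3_key'], '/tmp/image.fits',
                     ExtraArgs={'RequestPayer': 'requester'})

    # SEP wants native-byte-order floating point data
    data = fits.getdata('/tmp/image.fits').astype(np.float64)

    # Estimate and subtract the background, then extract sources
    bkg = sep.Background(data)
    sources = sep.extract(data - bkg, 1.5, err=bkg.globalrms)

    # Write the source catalog out as a FITS table and upload it to S3
    Table(sources).write('/tmp/catalog.fits', overwrite=True)
    catalog_key = event['fits_s3_key'].split('/')[-1].replace('.fits', '_cat.fits')
    s3.upload_file('/tmp/catalog.fits', event['s3_output_bucket'], catalog_key)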

Note: If you would rather begin with a more straightforward example, we have also written a small Lambda function that downloads a FITS file from a location on S3 and summarizes its contents using the astropy.io.fits info() function.

Assuming you have run the build.sh script locally, and that your process.py matches the example in this repository, we now need to register a new Lambda function with AWS as follows:
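A sketch of that registration step using boto3 is below; the function name, IAM role ARN, and bundle bucket are placeholders to replace with your own:

import boto3

session = boto3.Session(profile_name='default')  # reads ~/.aws/credentials
lambda_client = session.client('lambda', region_name='us-east-1')

lambda_client.create_function(
    FunctionName='process-wfc3ir',                      # placeholder name
    Runtime='python3.6',                                # Python runtime available at the time of writing
    Role='arn:aws:iam::123456789012:role/lambda-role',  # your Lambda execution role
    Handler='process.handler',                          # module.function inside venv.zip
    Code={'S3Bucket': 'my-lambda-bundles', 'S3Key': 'venv.zip'},
    Timeout=60,       # seconds
    MemorySize=1024,  # MB -- this is what we are billed against
)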

This code and the one below can be run either as a Python script or from a Jupyter notebook. Make sure you follow the inline comments above for creating a ~/.aws/credentials file and a Lambda role, and change the name of the S3 bucket to the one you created in the previous step.

Create an output location (on S3)

In this example, we are going to write the output from our Lambda function to a bucket on S3, but this could be anywhere we can programmatically access from our Lambda function (e.g. a database, another service). Here is our empty output bucket, dsmo-lambda-test-outputs:

Empty output S3 bucket

Make yourself an empty S3 bucket too.

Executing the Lambda function

Now we need to write a script that calls our Lambda function. To do this we are going to:

  • Write code that queries MAST using the astroquery.mast package
  • Grab the S3 URLs for a collection of Hubble WFC3/IR FITS files
  • Loop through the array of S3 URLs, each time calling our SEP-powered Lambda function

To run this piece of code (again, either as a Python script or in a notebook), add your own credentials at the top and the name of your empty output bucket as s3_output_bucket; a sketch of what such a script might look like follows.
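Note that the astroquery cloud-access interface has evolved between versions; this sketch uses the enable_cloud_dataset/get_cloud_uris calls, and the function name and payload keys match the placeholders from our earlier sketches:

import json

import boto3
from astroquery.mast import Observations

s3_output_bucket = 'dsmo-lambda-test-outputs'  # replace with your own bucket

# Query MAST for a collection of WFC3/IR observations
obs = Observations.query_criteria(obs_collection='HST',
                                  instrument_name='WFC3/IR',
                                  dataproduct_type='image')
products = Observations.get_product_list(obs[:10])

# Ask astroquery for S3 locations instead of MAST download URLs
Observations.enable_cloud_dataset(provider='AWS')
s3_urls = Observations.get_cloud_uris(products)

# Invoke our Lambda function asynchronously, once per image
session = boto3.Session(profile_name='default')
lambda_client = session.client('lambda', region_name='us-east-1')

for url in s3_urls:
    payload = {'fits_s3_key': url.replace('s3://stpubdata/', ''),
               's3_output_bucket': s3_output_bucket}
    lambda_client.invoke(FunctionName='process-wfc3ir',  # our placeholder name
                         InvocationType='Event',         # asynchronous, fire-and-forget
                         Payload=json.dumps(payload))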

Because we are triggering Lambda in an asynchronous [4] mode (by passing InvocationType='Event'), the API calls to Lambda to process the 100 files we have queried from MAST fire extremely quickly (~1 second total).

As soon as our script runs, we can start checking our output bucket for the results of the Lambda SEP function [5]:

Output S3 bucket with FITS tables

How did our script do? The last step of our Lambda function writes out a FITS table with the catalog of sources detected by SEP. Let’s open one of these FITS catalogs and overlay it on the WFC3/IR image:
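A minimal sketch of that overlay, assuming a catalog produced by the handler sketched earlier (the file names are hypothetical; SEP catalogs carry x and y source positions):

import matplotlib.pyplot as plt
from astropy.io import fits
from astropy.table import Table

image = fits.getdata('icwb01a7q_ima.fits', ext=1)  # hypothetical WFC3/IR image
catalog = Table.read('icwb01a7q_ima_cat.fits')     # catalog written by our Lambda function

plt.imshow(image, cmap='gray', origin='lower', vmin=0, vmax=2)
plt.scatter(catalog['x'], catalog['y'], s=40,
            facecolors='none', edgecolors='red')
plt.show()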

SEP Sources

Sources detected by SEP. Not too shabby!

Estimating costs

While we could extend our Lambda function further, at this point we have some code that does something reasonably substantial with a WFC3/IR image. Let’s look at how long these functions took to execute, and how much it cost.

CloudWatch logs

When a Lambda function executes, the outputs (e.g. any print statements in your script, as well as a high-level summary of the Lambda execution) are sent to a service called CloudWatch.

CloudWatch logs

CloudWatch logs for our SEP function.

These logs also include a summary report stating how long the function took to execute, how many milliseconds we are being charged for, and how much memory was used by our function:

REPORT RequestId: b851a5c3-6450-11e8-adbb-2f24ab6d943e
Duration: 1060.45 ms
Billed Duration: 1100 ms
Memory Size: 1024 MB
Max Memory Used: 164 MB

Looking at the above report for one of the Lambda function executions, our SEP source-extraction function (for one of the WFC3/IR images) took ~1.1s to download the image, run SEP, and write the FITS catalog back out to S3.

Show me the money!

Earlier in the post, we mentioned that Lambda is priced in GBms. While we only used 164 MB of RAM with this function, we requested 1024 MB (1 GB) of memory, and so we are charged for what we asked for [6].

Let’s work out how much it cost to process the 100 WFC3/IR images we queried for:

# Cost per function call:
$0.00001667 * 1.1s = $0.000018337

# Cost for 100 function calls (all the images in our query):
$0.00001667 * 1.1s * 100 = $0.0018337

# Cost to process every WFC3/IR image ever taken (all 122,078 of them):
$0.00001667 * 1.1s * 122,078 = $2.24

So for less than $0.01 ($0.0018337) we have extracted sources for 100 WFC3/IR images. Extrapolating these numbers to every WFC3/IR image ever taken (122,078 at the time of writing) works out at about $2.24 in Lambda charges.

AWS free tier

The above costing ignores the fact that AWS has a free tier available which gives you a limited amount of free compute (400,000 GB-SECONDS to be precise) per month. Quoting from the Lambda pricing page:

The Lambda free tier does not automatically expire at the end of your 12 month AWS Free Tier term, but is available to both existing and new AWS customers indefinitely.
1M REQUESTS FREE
400,000 GB-SECONDS PER MONTH FREE
$0.00001667 FOR EVERY GB-SECOND USED THEREAFTER

With our example of processing 122,078 files we would use something like 134,000 GB-seconds (122,078 invocations * 1.1 s * 1 GB ≈ 134,286 GB-seconds), i.e., well under the monthly allowance, so it would have been completely free.

If you want to get a feel for how much Lambda costs, take a look at this simple cost calculator from dashbird.

Caveats with this cost estimate

A few caveats with this cost calculation:

  • We have not included the cost of storing the results (FITS tables) on S3. This is likely about $0.11/month for ~10GB of output data.
  • We are assuming that you are running the script to invoke Lambda on a machine you own (i.e. you are not running a separate machine on AWS).
  • Important: This costing assumes that you are running the Lambda function in the US-East region, the same AWS region as the data in S3. In this mode of operation, there are no charges for moving the data from S3 into your Lambda environment.

Lots of Lambda

In the cost calculation above, we calculated the theoretical cost of processing all of the WFC3/IR images. How long might this actually take if we were to try it?

Lambda by default has a concurrency limit of 1,000 simultaneously executing functions. Assuming we want to process all 122,078 WFC3/IR images then:

# Assuming we could spread the 122,078 file processing across
# all 1,000 Lambda processes available to us:
122,078 / 1,000 = ~122 images per Lambda concurrency unit

# Assuming average compute time of 1.1s
122 images * 1.1 s/image = ~134 seconds

That’s right, it would likely take just over two minutes to process every WFC3/IR image ever taken using Lambda [7] ⚡️⚡️⚡️.

Conclusions and next steps

Services such as Lambda (a.k.a. serverless computing) offer a powerful new model for on-demand computing, especially when combined with cloud-hosted astronomical datasets.

We would love to hear about your experiences of using AWS Lambda for astronomical data processing. You can find us on Twitter (@mast_news) or email us at dsmo@stsci.edu.

Finally, as a reminder, if you are interested in doing more with the HST public dataset on AWS then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Brought to you by Arfon Smith & Iva Momcheva

Footnotes

  1. Presumably this is an homage to Anonymous functions 

  2. http://martinfowler.com/articles/serverless.html 

  3. https://github.com/kbarbary/sep 

  4. Other options are available: https://docs.aws.amazon.com/lambda/latest/dg/API_Invoke.html#API_Invoke_RequestSyntax 

  5. Note, in this example, we are writing to an S3 bucket but we could write the results out somewhere else convenient. 

  6. We could have requested less RAM but that generally means you get less CPU with your function too so it is not clear the function would have been cheaper to execute. See this blog post for an analysis of this. 

  7. It is currently not easy to actually retrieve all 122,078 WFC3/IR file names from the MAST API using Astroquery. We are working on making this faster… 🔜