Making HST Public Data Available on AWS

tl;dr - All public data from Hubble’s currently active instruments are now available on Amazon Web Services. In this post, we show you how to access it and announce a new opportunity for funding to make use of the data.

The Hubble Space Telescope has undeniably expanded our understanding of the universe during its 28 years in space so far, but this is not just due to its superior view from space: One of the major advantages to Hubble is that every single image it takes becomes public within six months (and in many cases immediately) after it is beamed back to Earth. The treasure trove that is the Hubble archive has produced just as many discoveries by scientists using the data “second hand” as it has from the original teams who requested the observations. Providing access to archives is at the core of our mission.

For all its richness however, the archive of Hubble observations has been geared to individual astronomers analyzing relatively small sets of data. The data access model has always been that an astronomer first downloads the data and then analyzes it on their own computer. Currently, most astronomers are limited both in the volume of data they can reasonably download, and by their access to large-scale computing resources.

HST public dataset on Amazon Web Services

We’re pleased to announce that as of May 2018, ~110 TB of Hubble’s archival observations are available in cloud storage on Amazon Web Services (AWS) which provides unlimited access to the data right next to the wide variety of computing resources provided by AWS.

These data consist of all raw and processed observations from the currently active instruments: the Advanced Camera for Surveys (ACS), the Wide Field Camera 3 (WFC3), the Cosmic Origins Spectrograph (COS), the Space Telescope Imaging Spectrograph (STIS) and the Fine Guidance Sensors (FGS).

The data on AWS (available at https://registry.opendata.aws/hst/) are kept up to date with the data held in MAST and new and reprocessed data are updated on AWS within 20 minutes of them being updated at MAST.

So, how do I use it?

To get started you will need:

  • An AWS account. Sign up for an account using the AWS Console.
  • A running EC2 instance in US-East (N. Virginia) (watch this video on starting an instance) with Python 3. We recommend the astroconda Anaconda channel.
  • The astroquery and boto3 Python libraries. These do not come standard with the astroconda distribution and need to be installed separately.
  • An AWS access key ID and a secret access key. These can be generated under User > Your Security Credentials > Access Keys in the AWS console. Remember to save the ID-key combination.
  • Some code to query MAST and download data from the public dataset. In order to view or analyze a file from the archive, you’ll need to transfer it from S3 to your instance. This transfer however is free, as long as it happens within the same AWS region (US-East N. Virginia).

Alternatively…

To help you get started we have simplified the process of setting up an EC2 instance by creating an Amazon Machine Image (AMI) with all the necessary software pre-installed (astroconda,boto3,astroquery). To launch a copy of this machine, search the AMI Community Marketplace for “STScI-Hubble-Public-Data” or ami-cfdfb6b0. The README in the home directory of the AMI describes how to set your AWS credentials as environmental variables and how to run the example above in the instance.

This example shows you how to grab several drizzled images for the CANDELS WFC3/IR observations of the GOODS-South field:

Transferring all 270 images (13 MB each or > 3GB total) takes 90 seconds. For comparison, downloading the data over an average network connection (~50 mbps) will take over eight minutes or five to six times slower. You can now display the images, do source detection on them, mosaic them together, etc.

A cloud hosted copy of Hubble data

The Hubble AWS Public Dataset is not a substitute for the Mikulski Archive for Space Telescopes (MAST). Data are, and always will be, available free of charge from MAST. Also, while we’re making every effort to keep the data on AWS up to date, if you absolutely definitely want to be sure you’re getting the latest and greatest calibrated data, you should download directly from MAST rather than this copy on AWS.

Using these data from within the US-East (N. Virginia) AWS region does not incur any charges, but downloading data from this copy to other AWS regions or outside of AWS will cost money. Also, note that the copy on AWS only includes public data. Proprietary datasets aren’t available.

By distributing this copy of Hubble data on AWS, we’re exploring a new kind of archive service – one where the data are highly available i.e., bulk, high-speed access to the data next to the vast computational resources of Amazon Web Services.

Astronomers who want to experiment with AWS can take advantage of their free tier. In later posts, we’ll show you how you can process significant volumes of data at little/no cost. Elastic Cloud Computing (EC2), the AWS service which provides basic compute capacity, has a one year free tier to new users which is ideal for learning, experimenting and testing.

If you’re interested in doing more with these data then you might want to take a look at the Cycle 26 Call for Proposals which includes a new type of proposal: Legacy Archival Cloud Computation Studies. This proposal category is specifically aimed at teams that would like to leverage this dataset.

Proposals to make use of this dataset should include the phrase ‘Cloud Exploration:’ at the beginning of their proposal title and should include a line item in their budget for AWS costs (limit $10,000 USD). For questions regarding the call for proposal you can reach us at dsmo@stsci.edu.

Tell us more about how you did this…

The Hubble data is hosted on AWS as a result of an agreement between STScI and AWS to participate in the AWS Open Data Program. There it joins a wide variety of other datasets, including Landsat-8 imaging, 1000 Human Genomes and the subtitles of 32,000 movies. The initial hosting agreement between AWS and STScI is for three years and can be extended based on the data access volume and frequency.

So how do you move 110 TB of data from Baltimore to Virginia? Turns out the best way to transport large quantities of data is still via mail. We used the AWS Snowball service to move data from STScI to AWS. The Snowball is an 80 TB bank of hard drives (larger options are available 😀) which we plugged into our local network and, after some debugging, we rsync-ed the data to. Then we mailed it back. Two Snowballs were needed to deliver all the data and once the initial copy was uploaded to S3, we worked with our internal pipelines team to ensure that going forward, the files on AWS are updated as soon as there is a change internally. And that is it! The updates happen in real time - the S3 copy of the data is only 10-20 minutes behind MAST. Proprietary data is not included in the AWS data. PIs of proprietary data can only retrieve those from MAST.

Wrapping up

Whether you’re looking to process large volumes of HST data, or train some kind of deep learning algorithm to analyzing Hubble images, we think that making Hubble public data available in the cloud is a first step in facilitating new, more sophisticated analyses of archival data.

Teams such as the PHAT survey have already utilized cloud computing to handle their data processing needs and we can not wait to see analyses involving machine learning, transient detection, creating large, multi-epoch mosaics, joint processing with other survey data carried out on these data.

We hope you find this new data availability useful and we look forward to reading your Cycle 26 proposals and papers on the arXiv!

Brought to you by Iva Momcheva, Arfon Smith, Josh Peek, and Mike Fox

FAQ & Resources

Where are the data?: AWS US East

What data have you uploaded?: Currently active instruments: ACS, COS, STIS, WFC3, FGS

How can I access the data?: You’ll need an AWS account. See this example of how to use your AWS account with boto3 and Python.

How much does it cost to access the data?: Within the AWS US-East region it’s free. To download outside of US-East standard S3 charges apply.

So now you’re charging for Hubble data?: No, Hubble data is, and will always be, free from MAST. This copy of the Hubble data in MAST is being provided in a ‘highly available’ environment next to the significant computational resources of the AWS platform.

How can I get some money to do science with this data?: We’re glad you asked! HST CFP 26 explicitly calls out this dataset as something we’d like you to explore.

I like this idea but I’d rather use a different cloud vendor.: Please get in touch and let us know.