Astronomy and Big Data

The following is from G. Jogesh Babu, Professor of Statistics and Director of the Center for Astrostatistics at Penn State University, and one of the organizers of the astrostatistics workshop.

Jogesh Babu at the recent astrostatistics workshop held at SAMSI.

Astronomers were among the first researchers to encounter Big Data. Until a few decades ago, astronomers would typically compete for observation time on telescopes, spending cold nights at distant mountaintop observatories to collect data on a few stars and galaxies. This has changed substantially. Today, they pore over massive datasets delivered over high-speed internet connections to their office computers, devising automated procedures to identify objects. They have become computer scientists, developing algorithms specific to their tasks to read through massive data and make sense of it. In addition, some astrophysicists run massive simulations under assumptions dictated by physical models; for example, the Millennium Simulation calculates the formation of galaxies in an expanding Universe dominated by Dark Matter and Dark Energy. The simulations must then be compared to the massive datasets to see whether the assumed model explains the data.

The Sloan Digital Sky Survey (SDSS), designed in the 1990s and still active today, really brought astronomy into the massive data era. The SDSS data rate for imaging was 17 GB/hour, and much less for spectroscopy. Thus, SDSS produces about 200 GB of data every night, adding to a database that stands at around 50 TB today. The Sloan project has produced several thousand research papers, revolutionizing many fields of astronomy. Massive data in astronomy is thus producing a paradigm shift in the way astronomy research is done. It is bringing information scientists, statisticians and astronomers together to collaborate on scientific investigations. For more on massive data in astronomy, see the recent article 'Big data in astronomy' by Eric D. Feigelson and the author in the August 2012 issue of 'Significance'.
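As a rough sanity check on the figures above, the nightly volume can be recovered from the imaging rate with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the 12 hours of observing per night is an assumption that is not stated in the text, and the "equivalent nights" number simply divides the quoted archive size by the nightly volume, ignoring spectroscopy and processed data products.

```python
# Back-of-the-envelope check of the SDSS figures quoted in the text.
IMAGING_RATE_GB_PER_HOUR = 17   # from the text
ARCHIVE_SIZE_TB = 50            # from the text
HOURS_PER_NIGHT = 12            # ASSUMPTION: not given in the text


def nightly_volume_gb(rate_gb_per_hour, hours_per_night):
    """Imaging data collected in one night, in GB."""
    return rate_gb_per_hour * hours_per_night


def equivalent_nights(archive_tb, nightly_gb):
    """Nights of imaging needed to reach the archive size (1 TB = 1000 GB)."""
    return archive_tb * 1000 / nightly_gb


nightly = nightly_volume_gb(IMAGING_RATE_GB_PER_HOUR, HOURS_PER_NIGHT)
nights = equivalent_nights(ARCHIVE_SIZE_TB, nightly)

print(f"Nightly imaging volume: about {nightly} GB")
print(f"Archive size in equivalent nights of imaging: about {nights:.0f}")
```

Under that assumption the nightly volume comes to roughly 200 GB, in line with the figure quoted above.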

While astronomical data traditionally consists of images and spectra, the time domain is adding a new dimension to astronomical imaging.

Repeated images of the sky reveal a wealth of information about our ever-changing universe: dozens of species of variable stars; thousands of moving asteroids in the Solar System; tens of thousands of quasars, supermassive black holes in distant galaxies; and hundreds of supernova explosions from dying stars. Type Ia supernovae are particularly important, as their numbers shed light on Dark Matter and Dark Energy.

An alphabet soup of time-domain surveys in visible light is underway: SDSS III, PTF, CRTS, SNF, Pan-STARRS, VISTA, and more. The largest of the planned projects based on multi-epoch imaging is the Large Synoptic Survey Telescope (LSST), recently approved by the National Science Foundation as the largest U.S. ground-based project in astronomy. It is expected to start around 2020. The LSST will image half of the sky every 3 nights, producing a video of the sky with hundreds of millions of variable objects. The data flow from this project will be several terabytes each night. The project is pushing hundreds of astronomers, engineers and computer scientists to think through challenging problems, both in the management of massive data streams and in the data mining needed to emerge with strong scientific findings.

Recently, I co-organized a workshop on 'Astrostatistics' with Prajval Shastri of the Indian Institute of Astrophysics, as part of the 2012-13 SAMSI Program on Statistical and Computational Methodology for Massive Datasets.

Ann Lee (L) and son, and Prajval Shastri (R).

The three-day workshop was held at SAMSI during September 19-21. Though it included talks by statisticians such as Jim Berger and David Donoho, the majority of the talks were by astronomers who have ongoing collaborations with statisticians. Each talk ended with a lively, stimulating discussion. The audience consisted of a good mix of statisticians and astronomers. Presentations concentrated on Bayesian methods, faint source detection, learning from massive multidimensional data, exoplanets, time-domain astronomy, sparsity, reproducible research, and more. The workshop concluded with a good discussion on future directions.

It is nice to get back to SAMSI and interact with friends and collaborators; I had organized a very stimulating semester-long Astrostatistics program at SAMSI in Spring 2006. Astronomers were already familiar, via the Virtual Observatory (VO), with the concept of large-scale electronic integration of astronomy data, tools, and services on a global scale in a manner that provides easy access to individuals around the world. They are thus enabling science on massive data. Even in 2006, astronomers were grappling with massive data, well before terms such as 'Big data' or 'megadatasets' came into vogue.
