Astronomy and Big Data

The following is from G. Jogesh Babu, Professor of Statistics and Director of the Center for Astrostatistics at Penn State University, and one of the organizers of the astrostatistics workshop.

Jogesh Babu working at the astrostatistics workshop

Jogesh Babu at the recent astrostatistics workshop held at SAMSI.

Astronomers were among the first researchers to encounter Big Data. Until a few decades ago, astronomers typically competed for observation time on telescopes, spending cold nights at distant mountaintop observatories to collect data on a few stars and galaxies. This has changed substantially. Today, they pore over massive datasets delivered over high-speed internet connections to their office computers, devising automated procedures to identify objects. They have, in effect, become computer scientists, inventing algorithms tailored to their tasks to read through massive data and make sense of it. In addition, some astrophysicists run massive simulations under assumptions dictated by physical models; for example, the Millennium Simulation calculates the formation of galaxies in an expanding Universe dominated by Dark Matter and Dark Energy.
The simulations must then be compared to the massive observational datasets to see whether the assumed model explains the data.

The Sloan Digital Sky Survey (SDSS), designed in the 1990s and still active today, really brought astronomy into the massive data era. The SDSS data rate for imaging was 17 GB/hour, and much less for spectroscopy. Thus, SDSS produces about 200 GB of data every night, adding to a database that stands at around 50 TB today. The Sloan project has produced several thousand research papers, revolutionizing many fields of astronomy. Massive data in astronomy is thus producing a paradigm shift in the way astronomy research is done. It is bringing information scientists, statisticians and astronomers together to collaborate on scientific investigations. For more on massive data in astronomy, see the recent article 'Big data in astronomy' by Eric D. Feigelson and the author in the August 2012 issue of 'Significance'.
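As a quick sanity check of the figures above, the quoted rates are easy to reconcile (the 12-hour observing night is purely an illustrative assumption, not a number from the survey):

```python
# Back-of-envelope check of the SDSS data rates quoted above.
imaging_rate_gb_per_hour = 17
hours_per_night = 12  # assumed length of an imaging night (illustrative)

nightly_gb = imaging_rate_gb_per_hour * hours_per_night
print(f"~{nightly_gb} GB per imaging night")  # ~204 GB, matching the ~200 GB figure

# At that rate, the number of pure-imaging nights needed to reach a 50 TB archive:
archive_tb = 50
nights = archive_tb * 1000 / nightly_gb
print(f"~{nights:.0f} nights of imaging")
```

The actual archive accumulated over many years, of course, since spectroscopy runs at a much lower rate and not every night is an imaging night.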

While astronomical data traditionally consist of images and spectra, the time domain is adding a new dimension to astronomical imaging.

Repeated images of the sky reveal a wealth of information about our ever-changing universe: dozens of species of variable stars; thousands of moving asteroids in the Solar System; tens of thousands of quasars, supermassive black holes in distant galaxies; and hundreds of supernova explosions from dying stars. Type Ia supernovae are particularly important, as they shed light on Dark Matter and Dark Energy.

An alphabet soup of time domain surveys in visible light is underway: SDSS III, PTF, CRTS, SNF, Pan-STARRS, VISTA, and more. The largest of the planned projects based on multi-epoch imaging is the Large Synoptic Survey Telescope (LSST), recently approved by the National Science Foundation as the largest U.S. ground-based project in astronomy. It is expected to start around 2020. LSST will image half of the sky every 3 nights, producing a video of the sky with hundreds of millions of variable objects. The data flow from this project will be several terabytes each night. The challenges it poses are prompting hundreds of astronomers, engineers, and computer scientists to think about hard problems, both in the management of massive data streams and in the data mining needed to emerge with strong scientific findings.

Recently, I co-organized a workshop on 'Astrostatistics' with Prajval Shastri of the Indian Institute of Astrophysics, as part of the 2012-13 SAMSI Program on Statistical and Computational Methodology for Massive Datasets.

Prajval Shastri and Ann Lee

Ann Lee (L) and son, and Prajval Shastri (R).

The three-day workshop was held at SAMSI during September 19-21. Though it had a mix of talks by statisticians like Jim Berger and David Donoho, the majority of the talks were by astronomers who have ongoing collaborations with statisticians. Each talk ended with a lively, stimulating discussion. The audience consisted of a good mix of statisticians and astronomers. Presentations concentrated on Bayesian methods, faint source detection, learning from massive multidimensional data, exoplanets, time-domain astronomy, sparsity, reproducible research, etc. The workshop concluded with a good discussion on future directions.

It is nice to get back to SAMSI and interact with friends and collaborators; I had organized a very stimulating semester-long Astrostatistics program at SAMSI in Spring 2006. Astronomers were already familiar with the concept of large-scale electronic integration of astronomy data, tools, and services in a manner that provides easy access to individuals around the world, via the Virtual Observatory (VO). They are thus enabling science on massive data. Even in 2006, astronomers were grappling with massive data, well before the terms 'Big Data' or 'megadatasets' came into vogue.

Andrea G. Campos Bianchi’s Impressions of the Massive Datasets Opening Workshop

The following are remarks from Andrea G. Campos Bianchi, Visiting Researcher, Lawrence Berkeley National Laboratory, who attended the Massive Datasets Opening Workshop in September.

Andrea Bianchi, visiting researcher, Lawrence Berkeley National Laboratory.

What a delight to attend the 2012 SAMSI Workshop on Massive Datasets. The tutorials and talks were impressive and exposed me to different approaches to massive datasets; the lectures covered a broad spectrum of cutting-edge topics, ranging from randomized methods in statistics to complex high energy physics problems.

As a visiting researcher at Lawrence Berkeley National Laboratory and, originally, a professor at the Federal University of Ouro Preto, Brazil, I want to express my sincere appreciation to SAMSI for sponsoring my trip, which made my attendance possible.
Certainly, I will benefit from this workshop and its ideas for years to come, especially regarding large data analysis, theory and applications. I look forward to participating in the Imaging Working Group and establishing collaborations with researchers from SAMSI.
Big thanks!

Astrophysicists are using Palomar Transient Factory Wide-Field Imaging Data

The following is from Rollin Thomas, Lawrence Berkeley National Laboratory, as he shares a little bit of what he talked about at the opening workshop of the Massive Datasets program at SAMSI.

Rollin Thomas sitting in chair

Rollin Thomas

It was my pleasure to tell the audience at the Opening Workshop for the Massive Datasets Program about how astrophysicists are using wide-field time-domain sky surveys to make discovery of short-lived “transient” objects routine, and much less a result of serendipity.  It was fun to talk about how Palomar Transient Factory (PTF) wide-field imaging data are moved, stored, processed, subtracted, and then scrutinized by robots and humans to identify new targets for triggered follow-up with other specialized instruments and bigger telescopes.  PTF is an instructive case study in how instrumentation, high-performance networking and computing, machine learning, and scientific work-flows combine to help us
do better science faster.  The tale of SN 2011fe shows what can be accomplished when all those things have been lined up.

Pinwheel Galaxy image from Las Cumbres Observatory Global Telescope Network

SN 2011fe in the Pinwheel Galaxy (M101) at maximum brightness, a composite of optical data from the Las Cumbres Observatory Global Telescope Network 0.8m Byrne Observatory Telescope at Sedgwick Reserve and hydrogen emission data from the Palomar Transient Factory (red). Image credit: BJ Fulton (LCOGT) / PTF.

The whole 4-year PTF imaging data set takes up 100+ TB of disk space; not that big a data set when compared to some found in other domains. The challenge is to plow through it all as it streams in, without delay, to find faint signatures of real objects on top of huge backgrounds. Some backgrounds are low-level and dealt with easily in hardware and software: statistical noise, detector artifacts, cosmic rays, sky brightness, etc. But at a higher level, one person's background transient is another person's signal, so you can't just throw away asteroids if you only personally care about supernovae. Hence, another challenge is helping a (probably) geographically distributed team of researchers keep track of both discoveries and each other's follow-up observations.
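The subtract-and-scan idea behind finding those faint signatures can be sketched in miniature: subtract a reference image of a field from a new exposure, so everything static cancels, then flag pixels that stand out against the noise. This toy NumPy sketch is purely illustrative (the synthetic images, the injected source, and the 5-sigma cut are all assumptions, not PTF's actual pipeline, which also aligns images and matches point-spread functions before subtracting):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for a reference image and a new exposure of the
# same patch of sky (assumed already registered/aligned).
reference = rng.normal(100.0, 5.0, size=(64, 64))    # static sky + noise
new_image = reference + rng.normal(0.0, 5.0, size=(64, 64))
new_image[30, 40] += 200.0                           # injected "transient"

# Difference image: static sources cancel; transients remain.
diff = new_image - reference

# Flag pixels well above the difference-image noise (5-sigma cut, assumed).
sigma = diff.std()
candidates = np.argwhere(diff > 5 * sigma)
print(candidates)  # the injected source at row 30, column 40
```

The real work, as the text notes, is in doing this reliably at scale and then deciding which flagged candidates are genuine astrophysical transients rather than artifacts, which is where the robots and humans come in.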

PTF has met these challenges, but what about the future?  It seems very likely that scaling up to the Large Synoptic Survey Telescope in 10 years will require breakthroughs in networking, computing architectures, software paradigms, artificial intelligence, and more automated distributed scientific work-flows.  I encourage participants at the Massive Datasets workshop to think about how their interests match with those challenges now.

Apply Now for the 2-Day Undergraduate Workshop at SAMSI October 26-27

group of undergraduate students from 2011

Last year’s undergraduate workshop group.

SAMSI is accepting applications for the two-day undergraduate workshop that will focus on Statistical and Computational Methodology for Massive Datasets. The workshop will be held October 26-27 at SAMSI in Research Triangle Park, NC. The program begins at 9:30am on Friday, October 26 and ends at noon on Saturday, October 27.

Applications received by Friday, September 28 will receive full consideration. SAMSI will reimburse appropriate travel expenses, including food and lodging. Participants are urged to arrive on Thursday evening.

The Statistical and Computational Methodology for Massive Datasets program focuses on fundamental methodological questions of statistics, mathematics and computer science posed by massive datasets, with applications to astronomy, high energy physics, and the environment. Serious challenges posed by massive datasets have to do with “scalability” and “data streaming.”
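The "data streaming" challenge refers to computations that must handle each observation once, as it arrives, without storing the full dataset. A classic textbook illustration (not specific to the SAMSI program) is Welford's online algorithm for the running mean and variance:

```python
class RunningStats:
    """Welford's online algorithm: one pass over the stream, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined once at least two values have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
print(stats.mean, stats.variance)  # 5.0 and 32/7 ≈ 4.571
```

Scaling this kind of one-pass thinking up to far more sophisticated statistical procedures is exactly the sort of methodological question the program addresses.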