Nuala’s Impressions from the Astrostatistics Workshop

The following post was written by Nuala McCullagh, a graduate student in the Physics & Astronomy department at Johns Hopkins.

Nuala McCullagh sitting on a bench

Nuala McCullagh

I was thrilled to have the opportunity to visit SAMSI for the Massive Datasets program for three weeks in September. One of the most positive aspects of my visit was my exposure to several flavors of diversity, the most salient of which was diversity of expertise and discipline. As a graduate student in the Physics & Astronomy department at Johns Hopkins, I have been active in promoting diversity within my department. Physics and astronomy, along with most math and science fields, have traditionally lacked racial and gender diversity, and while the benefits of diversity are well established and generally accepted, it can still be difficult to convince scientists that it is an issue they should care about. The benefits of the diversity I observed at SAMSI were very clear, and my experience there really reinforced my belief that diversity can inspire creativity and productivity.

At the opening workshop, we heard talks from experts in statistics, computer science, applied math, neuroscience, environment & climate science, high energy physics, and astronomy. While the conference covered a wide range of disciplines, there was a common thread of having to deal with massive datasets. I was surprised to learn about the similarities between my work in cosmology and work in other fields such as climate studies and neuroscience. Hearing about the problems and solutions in those fields has helped me think about my own problems in a different way.

At the astrostatistics workshop, we heard about large galaxy surveys, computer simulations, multi-dimensional datasets, time-domain astronomy, and more. It was helpful to hear about the different statistical problems with massive datasets in the context of astronomy, and interesting to see the similarities and differences between them. For example, just within cosmology, the statistical problems that arise when working with large dark matter simulations are different from those that arise in detecting weak lensing in galaxy surveys. Meanwhile, people who study exoplanets work with large simulations with many parameters, much like the simulations in cosmology. Hearing about the various statistical problems astronomers have encountered allowed me to make connections between different areas in astronomy that I would not have noticed otherwise.

I appreciated the opportunity to learn about a wide variety of problems concerning massive data. It was interesting to note the statistical similarities in seemingly disparate scientific problems. It was also reaffirming to see the positive impact that diversity can have in inspiring creativity and productivity in science.

Ilse Ipsen Speaks at the Science Communicators of North Carolina and the RTP Chapter of Sigma Xi Pizza Lunch

Ilse Ipsen speaking to the SCONC and RTP chapter of Sigma Xi

Ilse Ipsen spoke to the SCONC and RTP chapter of Sigma Xi on October 9.

Ilse Ipsen, Associate Director of SAMSI and professor of mathematics at North Carolina State University, recently spoke at the Sigma Xi pizza lunch. The lunch is a monthly gathering co-sponsored by the Science Communicators of North Carolina (SCONC) and the RTP chapter of Sigma Xi.

Ilse’s talk, “Rolling the Dice on Big Data,” focused on how big data is permeating all aspects of our daily lives, from the grocery store, where supermarkets gather data on our personal buying habits, to the analysis of images from space, to the Internet, where Google receives 2 million inquiries a minute and 347 blog posts are published every minute of the day. Facebook processes 500 terabytes of information each day, and 30 billion pieces of information are shared on Facebook each month.

To give her audience an understanding of how applied mathematicians approach the enormous problem of sifting through data, she used the example of matching an e-mail from an unknown source to a set of e-mails received from known authors. The e-mail from the unknown source, the query, has three key words in it. In her example, she looks at three e-mails from known authors and counts the number of times the key words were used. Then the length of each message is measured to see how many words were used in each e-mail and in the query. Each word in each e-mail is counted and multiplied by the corresponding count in the query to get a number. The words that are found in each e-mail and the query are squared and then divided by the sum. This method helps determine which of the known authors most likely wrote the query.
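The matching procedure described above resembles a standard similarity score between word-count vectors. The following is a minimal sketch, assuming cosine similarity is the score being used; the post does not confirm the exact formula from the talk, and the function names, key words, and sample e-mails are hypothetical.

```python
import math
from collections import Counter


def keyword_counts(text, keywords):
    """Count how often each key word appears in a text (whitespace split, lowercased)."""
    words = Counter(text.lower().split())
    return [words[k] for k in keywords]


def cosine_similarity(u, v):
    """Dot product of two count vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no key words present, so no evidence of a match
    return dot / (norm_u * norm_v)


def best_match(query, emails, keywords):
    """Return the author whose e-mail scores highest against the query, plus all scores."""
    q = keyword_counts(query, keywords)
    scores = {
        author: cosine_similarity(q, keyword_counts(text, keywords))
        for author, text in emails.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical key words and e-mails, for illustration only.
    keywords = ["data", "dice", "matrix"]
    emails = {
        "author_a": "data data dice",
        "author_b": "matrix matrix matrix",
        "author_c": "dice dice matrix",
    }
    author, scores = best_match("data dice data", emails, keywords)
    print(author, scores)
```

Dividing by the vector lengths keeps a long e-mail from winning simply because it contains more words. A real system would also need to strip punctuation and handle far larger vocabularies.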

If one were to look at every e-mail written each day, there would be about 294 billion e-mails to sort through, and there are about 250,000 words in the English language, so it would be an enormous task. Many mathematicians and statisticians therefore use the Monte Carlo method to sample the data and narrow down the search.

She explained that a randomized algorithmic approach is fast, easy to implement, and simple to use, and that it is as good as, or perhaps even better than, a deterministic approach.
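To illustrate the sampling idea described above, here is a minimal sketch of one generic Monte Carlo estimator: it approximates the dot product of two long count vectors by looking at only a random subset of positions. This is an assumption about what such an approach could look like, not the specific algorithm from the talk; the function names are hypothetical.

```python
import random


def exact_dot(u, v):
    """Deterministic dot product: touches every entry."""
    return sum(a * b for a, b in zip(u, v))


def sampled_dot(u, v, num_samples, rng):
    """Estimate the dot product from uniformly sampled positions.

    Positions are drawn with replacement; scaling the sampled sum by
    len(u) / num_samples makes the estimate unbiased.
    """
    n = len(u)
    total = 0.0
    for _ in range(num_samples):
        i = rng.randrange(n)
        total += u[i] * v[i]
    return total * n / num_samples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    u = [rng.randint(0, 5) for _ in range(100_000)]
    v = [rng.randint(0, 5) for _ in range(100_000)]
    print("exact:   ", exact_dot(u, v))
    print("sampled: ", sampled_dot(u, v, 2_000, rng))
```

The estimate reads 2,000 entries instead of 100,000, at the cost of some random error that shrinks as more samples are drawn. Practical randomized methods often sample non-uniformly to reduce that error further.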

Room full of people listening to Ilse Ipsen's talk

The room was at full capacity for Ilse Ipsen’s talk “Rolling the Dice on Big Data.”

Ilse spoke to a packed room, including many science writers from the Triangle region, members of Sigma Xi and a high school physics class from Kestrel Heights, a local charter school.

Astronomy and Big Data

The following is from G. Jogesh Babu, Professor of Statistics and Director of the Center for Astrostatistics at Penn State University, and one of the organizers of the astrostatistics workshop.

Jogesh Babu working at the astrostatistics workshop

Jogesh Babu at the recent astrostatistics workshop held at SAMSI.

Astronomers are among the first researchers to encounter Big Data. Until a few decades ago, astronomers would typically compete for observation time on telescopes, spending cold nights at distant mountaintop observatories to collect data on a few stars and galaxies. This has changed substantially. Today, they pore over massive data through high-speed internet connections to their office computers, devising automated procedures to identify objects. They have become computer scientists, developing algorithms specific to their task to read through massive data and make sense of it. In addition, some astrophysicists run massive simulations under assumptions dictated by physical models; for example, the Millennium Simulation calculates the formation of galaxies in an expanding Universe dominated by Dark Matter and Dark Energy. The simulations must then be compared to the massive datasets to see if the assumed model explains the data.

The Sloan Digital Sky Survey (SDSS), designed in the 1990s and still active today, really brought astronomy into the massive data era. The SDSS data rate for imaging was 17 GB/hour, and much less for spectroscopy. Thus, SDSS produces about 200 GB of data every night, adding to a database that stands at around 50 TB today. The Sloan project has produced several thousand research papers, revolutionizing many fields of astronomy. Massive data in astronomy is thus producing a paradigm shift in the way astronomy research is done. It is bringing information scientists, statisticians, and astronomers together to collaborate on scientific investigations. For more on massive data in astronomy, see the recent article ‘Big data in astronomy’ by Eric D. Feigelson and the author in the August 2012 issue of ‘Significance’.

While astronomical data traditionally consists of images and spectra, the time domain is adding a new dimension to astronomical imaging.

Repeated images of the sky reveal a wealth of information about our ever changing universe: dozens of species of variable stars; thousands of moving asteroids in the Solar System;  tens of thousands of quasars, supermassive black holes in distant galaxies; and hundreds of supernova explosions from dying stars. Type Ia supernovae are particularly important, as their numbers shed light on Dark Matter and Dark Energy.

An alphabet soup of time domain surveys in visible light is underway: SDSS III, PTF, CRTS, SNF, Pan-STARRS, VISTA, and more. The largest of the planned projects based on multi-epoch imaging is the Large Synoptic Survey Telescope (LSST), recently approved by the National Science Foundation as the largest U.S. ground-based project in astronomy. It is expected to start around 2020. The LSST will image half of the sky every 3 nights, producing a video of the sky with hundreds of millions of variable objects. The data flow from this project will be several terabytes each night. The challenges of this project have hundreds of astronomers, engineers, and computer scientists thinking about difficult problems, both in the management of massive data streams and in the data mining needed to emerge with strong scientific findings.

Recently, I co-organized a workshop on ‘Astrostatistics’ with Prajval Shastri of the Indian Institute of Astrophysics, as part of the 2012-13 SAMSI Program on Statistical and Computational Methodology for Massive Datasets.

Prajval Shastri and Ann Lee

Ann Lee (L) and son, and Prajval Shastri (R).

The three-day workshop was held at SAMSI during September 19-21. Though it included talks by statisticians such as Jim Berger and David Donoho, the majority of the talks were by astronomers who have ongoing collaborations with statisticians. Each talk ended with a lively, stimulating discussion. The audience consisted of a good mix of statisticians and astronomers. Presentations concentrated on Bayesian methods, faint source detection, learning from massive multidimensional data, exoplanets, time-domain astronomy, sparsity, and reproducible research, among other topics. The workshop concluded with a good discussion on future directions.

It is nice to get back to SAMSI and interact with friends and collaborators; I had organized a very stimulating semester-long Astrostatistics program at SAMSI in Spring 2006. Astronomers were already familiar with the concept of large-scale electronic integration of astronomy data, tools, and services on a global scale in a manner that provides easy access by individuals around the world, via the Virtual Observatory (VO). They are thus enabling science on massive data. Even in 2006, astronomers were grappling with massive data, long before the terms ‘Big data’ or ‘megadatasets’ came into vogue.