Andrea G. Campos Bianchi’s Impressions of the Massive Datasets Opening Workshop

The following are remarks from Andrea G. Campos Bianchi, Visiting Researcher, Lawrence Berkeley National Laboratory, who attended the Massive Datasets Opening Workshop in September.

Andrea G. Campos Bianchi, visiting researcher, Lawrence Berkeley National Laboratory.

What a delight to attend the 2012 SAMSI Workshop on Massive Datasets. The tutorials and talks were impressive, and they exposed me to many different approaches to massive datasets: the lectures covered a broad spectrum of cutting-edge topics, ranging from randomized methods in statistics to complex high-energy physics problems.

As a visiting researcher at Lawrence Berkeley National Laboratory, and, originally, a professor at the Federal University of Ouro Preto, Brazil, I want to express my sincere appreciation to SAMSI for sponsoring my trip, which made my attendance possible.
Certainly, I will benefit from this workshop and its ideas for years to come, especially regarding large-data analysis, theory, and applications. I look forward to participating in the Imaging Working Group and to establishing collaborations with researchers from SAMSI. Big thanks!

Dimension Reduced Modeling of Space-Time Processes with Application to Statistical Downscaling

Jenny Brynjarsdottir, SAMSI postdoctoral fellow, gave a talk at the postdoc seminar on September 26, titled “Dimension Reduced Modeling of Space-Time Processes with Application to Statistical Downscaling.” The following is the abstract from her talk:

Jenny Brynjarsdottir giving her talk at the postdoc seminar.

The field of spatial and spatio-temporal statistics is increasingly faced with the challenge of very large datasets. The classical approach to spatial and spatio-temporal modeling is extremely computationally expensive when the datasets are large. Dimension-reduced modeling approaches have proved effective in such situations.

In this talk we focus on the problem of modeling two spatio-temporal processes where the primary goal is to predict one process from the other and where the datasets for both processes are large. We outline a general dimension-reduced Bayesian hierarchical approach where the spatial structures of both processes are modeled in terms of a low number of basis vectors, hence reducing the spatial dimension of the problem. The temporal evolution of the spatio-temporal processes and their dependence is then modeled through the coefficients (also called amplitudes) of the basis vectors.

We present a new method of obtaining data-dependent basis vectors that are geared to the goal of predicting one process from the other: (Orthogonal) Maximum Covariance Patterns. We apply these methods to a statistical downscaling example, where surface temperatures on a coarse grid over the Antarctic are downscaled onto a finer grid.
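The abstract does not spell out the computation, but the standard way to obtain maximum covariance patterns is a singular value decomposition of the cross-covariance matrix between the two fields. The sketch below is only an illustration with synthetic data (all names and sizes are made up, and this is not the authors' code), assuming each field is stored as a time-by-space matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, k = 200, 50, 80, 5       # time steps, coarse grid, fine grid, basis size

X = rng.standard_normal((n, p))   # coarse-grid field (e.g., model temperatures)
Y = rng.standard_normal((n, q))   # fine-grid field to be predicted

# Center each grid point in time (work with anomalies).
Xa = X - X.mean(axis=0)
Ya = Y - Y.mean(axis=0)

# Cross-covariance between the two fields, shape (p, q).
C = Xa.T @ Ya / (n - 1)

# SVD of the cross-covariance: the left/right singular vectors are the
# (orthogonal) maximum covariance patterns for each field.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Px, Py = U[:, :k], Vt[:k].T       # leading k spatial patterns

# Amplitudes (basis-vector coefficients): the low-dimensional time
# series that the hierarchical model would then relate over time.
ax = Xa @ Px                      # shape (n, k)
ay = Ya @ Py                      # shape (n, k)
```

The columns of `Px` and `Py` are orthonormal spatial basis vectors, so the spatial dimension of each process drops from p or q to k, and the temporal modeling is done on the amplitudes `ax` and `ay`.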

Thoughts on Bayesian Statistical Inference for Regional Climate Projections in North America

The following blog entry is from Noel Cressie, Professor of Statistics, University of Wollongong, and Director of the Program in Spatial Statistics and Environmental Statistics, The Ohio State University.

Noel Cressie

Noel Cressie, Professor of Statistics, University of Wollongong and Director, Program in Spatial Statistics and Environmental Statistics, The Ohio State University

Two weeks ago I attended and spoke at the opening workshop for Statistical and Computational Methodology for Massive Datasets. It’s always good to get back to SAMSI and see old friends. (I was a visitor there in spring 2010 for a program on Space-Time Analysis for Environmental Mapping, Epidemiology, and Climate Change.) This time I spoke in a session on Environment and Climate, and I had the opportunity to present recent work on statistical inference for regional climate projections in North America.

A few features of the problem: The data are outputs from several regional climate models, and hence they are deterministic. To carry out inference on important questions, such as, “Where and in which season will the temperature increase be most severe?”, we (Emily Kang, U. Cincinnati, and I) used a Bayesian hierarchical modeling approach. The data are spatial, at a 50 km resolution over North America. The dataset is large, about 100,000 values, even after summarization! The problem is important: it involves projecting temperature change in 50 km x 50 km regions of North America by 2070, for the four seasons and over the whole year.

Our results are quite sobering… the website can be consulted for more details.

Image of North America with a wide swath of red to indicate probability of temperature increase.

In this image, the color red indicates regions of North America for which our Bayesian statistical analysis gives a 97.5 percent posterior probability that average temperatures will rise by at least 2 degrees Celsius (3.6 degrees Fahrenheit) by 2070. Image by Noel Cressie and Emily Kang, courtesy of Ohio State University.

In a follow-up to the Bayesian inference we did, I was asked the following questions by a reporter for the popular science magazine Science et Vie:

“I would like to know how central was the role of Bayesian statistics in your work. That is: what is the improvement brought by the use of these statistics when compared to ‘classical’ statistics?

More generally, I’d like to know if Bayesian statistics have emerged only recently in climatology, and if yes, why now? How would you qualify the (current or potential) contribution of Bayesian statistics to climatology?”

I thought that readers of the SAMSI blog might be interested in my responses (lightly edited for the context of this blog):

In our statistical analysis, we are considering output from different climate models at a very regional scale and seasonally. The outputs are deterministic and complete over the North American region, at a 50 km x 50 km scale. No continental-scale or global-scale averaging is being done. At this level, communities and even individuals can see the impact of climate change on their lives. Moreover, because the output can be presented seasonally, the impact of climate change on water storage, agriculture, pest control, and so forth can be considered.

There is model-to-model variability and spatial variability in the climate-model output, but the output is deterministic. That is, if the models were run again with the same boundary conditions, the same values would be obtained. That is where a Bayesian analysis is essential, because without it we could just summarize the data but not do any inference on it. In a paper (Kang and Cressie, 2012) published this year in the International Journal of Applied Earth Observation and Geoinformation, we give examples of inference based on samples from the posterior distribution.
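As a toy illustration of that kind of inference from posterior samples (synthetic numbers only, not the actual model or output from Kang and Cressie, 2012), an exceedance probability such as “will warming reach 2 deg. C?” is simply an average over the samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for MCMC output: posterior samples of the 2070
# temperature change (deg C) in each of a few regions.  In a real BHM
# these would come from a Markov chain Monte Carlo sampler.
n_samples, n_regions = 4000, 6
posterior_means = np.array([1.2, 1.8, 2.1, 2.5, 3.0, 0.9])
samples = posterior_means + 0.4 * rng.standard_normal((n_samples, n_regions))

# Posterior probability that warming meets or exceeds 2 deg C,
# region by region.
p_exceed = (samples >= 2.0).mean(axis=0)

# Regions that would be flagged at the 97.5 percent posterior level.
flagged = p_exceed >= 0.975
```

The same one-line average, computed per grid cell from the fitted model's samples, is the kind of quantity a posterior exceedance map displays.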

There is a generic approach to climate modeling that has emerged relatively recently (the last 15 years) based on hierarchical statistical modeling. This approach uses Bayes’ Theorem and modern computing technology (e.g., Markov chain Monte Carlo algorithms) to allow us to answer climate-related questions (e.g., “Will temperatures increase by 2070 beyond a sustainability threshold of 2 deg. C?”), in the presence of data uncertainty and scientific-model uncertainty. There is a version of this called Bayesian hierarchical modeling (BHM) that we used.
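Schematically, a BHM separates data, process, and parameter levels, and Bayes’ Theorem combines them into a posterior. This is the generic factorization, written in the bracket notation common in this literature, not the specific model in the paper above:

```latex
[\text{process}, \boldsymbol{\theta} \mid \text{data}] \;\propto\;
[\text{data} \mid \text{process}, \boldsymbol{\theta}]\,
[\text{process} \mid \boldsymbol{\theta}]\,
[\boldsymbol{\theta}]
```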

I have a particular interest in remote sensing data, which can be massive in size. The last 5 years of my research program have been directed towards dimension reduction in hierarchical statistical models, with a particular emphasis on climate questions. The dataset analyzed in the paper above is large, about 100,000 values, and we had to use dimension-reduction techniques to solve the problem. Bayesian statistics is computationally intensive, and usually only problems of moderate data size can be solved. Our work, and that of others, in dimension reduction has been a breakthrough that has allowed BHMs to be used in very complex models with massive datasets.

The Bayesian “movement” is growing in science in general. Good scientists are honest about what they know, and they are aware of the uncertainties in their work. There has been a general trend in science towards “Uncertainty Quantification” or “UQ,” and the Bayesian approach allows uncertainties to be expressed through conditional probabilities. Bayes Theorem is a coherent way to combine all these sources of uncertainty.

Many climatologists will have difficulty fitting BHMs because there’s a big statistical investment involved. The savvy ones are partnering with statisticians in research teams to answer parts of grand-challenge questions in the presence of uncertainties. This movement is small but growing, and I expect that within 5-10 (hopefully 5) years the climate community will accept it as essential.

Brian Caffo Shares His Impression of the Massive Datasets Opening Workshop

The following post is from Brian Caffo, Associate Professor, Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. He recently attended the opening workshop for the Massive Datasets Program.

Close up of Brian Caffo

Brian Caffo, Johns Hopkins University. Photo by Jay VanRensselaer.

SAMSI’s opening workshop for its Massive Datasets program was a resounding success. It brought applied mathematicians, computer scientists and statisticians of various stripes together to self-organize into likely fruitful collaborations. The conference included many talks and panels on methods development and applications for analyzing large datasets. The conference was absolutely top notch, and the speakers and organizers should be applauded.

To stop this blog post from being an entirely congratulatory exercise, however, I’d like to raise some points of discussion for the relevant fields. From the talks at SAMSI, it was impossible not to notice that a prevailing style of methodology research exists. In particular, researchers search for the core elements of commonality in massive-dataset problems and go after them with general solutions that apply across settings. The goal is then new methodology that applies broadly. This has been the model for statistics, applied mathematics and computer science for quite some time.

My question is, “Is this model sufficient for the challenge of big data?” An alternative strategy would be to focus on specific big-data problems. The concern is that, when focusing on the general, interesting specific aspects of a unique large dataset get lost. So, for example, instead of worrying about general theory, such as computational orders of magnitude, asymptotics and optimality, one would worry about successfully fitting meaningful models on giant benchmark datasets that serve as rallying points. My guess is that this strategy might serve better for bringing researchers together to work collaboratively, or even competitively, towards common goals.

These points notwithstanding, it was a great conference. My experience probably mirrored others’ in that I learned quite a bit, got a chance to catch up with friends, and got to meet great researchers, which will likely result in collaborations.


Astrophysicists are Using Palomar Transient Factory Wide-Field Imaging Data

The following is from Rollin Thomas, Lawrence Berkeley National Laboratory, as he shares a little bit of what he talked about at the opening workshop of the Massive Datasets program at SAMSI.

Rollin Thomas sitting in chair

Rollin Thomas

It was my pleasure to tell the audience at the Opening Workshop for the Massive Datasets Program about how astrophysicists are using wide-field time-domain sky surveys to make discovery of short-lived “transient” objects routine, and much less a result of serendipity. It was fun to talk about how Palomar Transient Factory (PTF) wide-field imaging data are moved, stored, processed, subtracted, and then scrutinized by robots and humans to identify new targets for triggered follow-up with other specialized instruments and bigger telescopes. PTF is an instructive case study in how instrumentation, high-performance networking and computing, machine learning, and scientific work-flows combine to help us do better science faster. The tale of SN 2011fe shows what can be accomplished when all those things have been lined up.

Pinwheel Galaxy image from Las Cumbres Observatory Global Telescope Network

SN 2011fe in the Pinwheel Galaxy (M101) at maximum brightness, a composite of optical data from the Las Cumbres Observatory Global Telescope Network 0.8m Byrne Observatory Telescope at Sedgwick Reserve and hydrogen emission data from the Palomar Transient Factory (red). Image credit: BJ Fulton (LCOGT) / PTF.

The whole 4-year PTF imaging dataset takes up 100+ TB of disk space; not that big a dataset when compared to some found in other domains. The challenge is to plow through it all as it streams in, without delay, to find faint signatures of real objects on top of huge backgrounds. Some backgrounds are low-level and dealt with easily in hardware and software: statistical noise, detector artifacts, cosmic rays, sky brightness, etc. But at a high level, one person’s background transient is another person’s signal, so you can’t just throw away asteroids if you only personally care about supernovae. Hence, another challenge is helping a (probably) geographically distributed team of researchers keep track of both discovery and each other’s follow-up observations.

PTF has met these challenges, but what about the future?  It seems very likely that scaling up to the Large Synoptic Survey Telescope in 10 years will require breakthroughs in networking, computing architectures, software paradigms, artificial intelligence, and more automated distributed scientific work-flows.  I encourage participants at the Massive Datasets workshop to think about how their interests match with those challenges now.

SAMSI DDDHC Workshop: My Experience

The following blog entry is from John Olaomi, Associate Professor at the University of South Africa. He recently attended the Data-Driven Decisions in Healthcare Opening Workshop.

John Olaomi

John Olaomi, Associate Professor from U. of South Africa and guest blogger.

In my quest for knowledge and international exposure to current trends in statistics and its applications, I first got in contact with SAMSI in 2007, responding to the 2008-2009 postdoc position announcement, thanks to a Google search. I was invited to participate in the 2008-2009 Workshop on Sequential Monte Carlo Methods but could not, due to last-minute funding challenges. Since then, I have always been notified about all SAMSI programs.

The Data-Driven Decisions in Healthcare (DDDHC) opening workshop was another opportunity to participate in a SAMSI program, and this time I succeeded, thanks to funding from my institution, the University of South Africa. From the organization to the execution and the forming of working groups, the workshop was a real success. The tutorials and the technical sessions were informative and practical, and they posed many relevant research challenges that were well worth coming for. Meeting erudite scholars and colleagues (both academic and professional) who can contribute to one’s academic prowess is priceless.

people looking at a poster

There was a poster session and reception held Monday night for the DDDHC opening workshop.

Technically, the expositions on Comparative Effectiveness Research (CER) and Observational Studies, Patient Flow, Personalized Healthcare, and Healthcare Databases really opened my eyes to many applications and to the need for statisticians to be at the forefront of achieving healthcare goals. Although the presentations were good, my expectation was that many methodological issues (from statistical and operations research perspectives) would be raised and tackled. This was lacking, as virtually all presentations were roughly 80 percent healthcare issues and 20 percent (statistical or operations research) methodology, which I think should be vice versa. This is necessary to avoid errors of the third kind: the error committed by giving the right answer to the wrong problem (Kimball, 1957).

In all, I think my trip was really worth it. Thanks to SAMSI for the opportunity and the good facilitation.

Olaomi, John. O. (PhD)
Associate Professor
Department of Statistics,
University of South Africa (UNISA)
P. O. Box 392 UNISA 0003
Pretoria, South Africa