DPDA Workshop: Reinforcing the Importance of Statistics and Applied Mathematics in Distributed Computing


Contributed by: Alexander Terenin, Statistics and Applied Mathematics PhD student, University of California – Santa Cruz

I am a PhD student in Statistics and Applied Mathematics at the University of California – Santa Cruz (UCSC). My research focuses on Bayesian statistics – specifically, Markov chain Monte Carlo methods at scale in parallel and distributed environments for big data applications. I heard about the workshop from a fellow graduate student in my department, and attending was a natural choice given my area of research.

The Workshop…
From September 20 to 23, I had the privilege of attending the four-day Workshop on Distributed and Parallel Data Analysis (DPDA), hosted by the Statistical and Applied Mathematical Sciences Institute (SAMSI) at North Carolina State University in Raleigh, N.C. In this piece, I would like to reflect on what I observed there.

The workshop proceeded as workshops usually do: speakers gave talks on a range of topics, interspersed with breaks that gave participants time to think about the talks and to talk to one another about ideas. I was intrigued to see that the DPDA workshop had no parallel sessions – a format I much prefer because it brings together people who might otherwise never end up in the same room.


Participants at the 2016 DPDA Workshop network during one of the scheduled breaks. Participants used these opportunities to meet one another and collaborate on ideas.

Informative and Engaging Discussions…
A number of these talks and discussions stood out to me – I’ll highlight three of them, in order of occurrence.

Wotao Yin, a faculty member in UCLA’s Mathematics Department, gave a talk on “Asynchronous Parallel Coordinate Update Algorithms.” In it, he described a particular class of parallel optimization algorithms: asynchronous iterative algorithms.

To understand what these are, let’s first back up and consider iterative algorithms: these are algorithms where some sequence of steps is repeated until convergence. To take the next step, we need to have completed the previous one – so how can iterative algorithms be parallelized? It turns out that one way to do so is to make them asynchronous. For example, a set of workers performs iterative steps as fast as it can, with the workers talking to each other as much as possible and with no control over the order in which the steps occur. The question then is whether such processes can converge. Sometimes they can: if the algorithm’s state space forms a box, and if individual steps shrink the box, then the algorithm will converge even when performed asynchronously. After recalling these results, Prof. Yin illustrated that certain coordinate ascent algorithms satisfy these conditions. I found this talk especially engaging because I have written a paper about the asynchronous variant of Gibbs sampling, an algorithm for Bayesian computation whose analysis is complicated but involves the same conditions. Seeing the same ideas used in a different context got me thinking about similarities and differences with my own work.
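To make the box-shrinking idea concrete, here is a minimal sketch in Python. It is not Prof. Yin’s algorithm: it is a toy fixed-point problem x = Ax + b, chosen as an assumption because each row of A has absolute sum below 1, so every update contracts in the max norm, which is one standard setting where the box condition is usually stated. Asynchrony is simulated rather than run on real threads: each update picks a random coordinate and reads a state that may be several steps out of date. The function name, the matrix, and the `max_delay` parameter are all hypothetical.

```python
import random

def async_fixed_point(A, b, steps=5000, max_delay=5, seed=0):
    """Simulate asynchronous coordinate updates for the fixed point x = A x + b."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    history = [list(x)]  # recent states, so an update can read stale values
    for _ in range(steps):
        i = rng.randrange(n)  # coordinate that some worker updates next
        delay = rng.randint(0, min(max_delay, len(history) - 1))
        stale = history[-1 - delay]  # the worker sees an outdated state
        x[i] = sum(A[i][j] * stale[j] for j in range(n)) + b[i]
        history.append(list(x))
        history = history[-(max_delay + 1):]  # keep only states that can still be read
    return x

# Hypothetical example: every row of A has absolute sum 0.5.
A = [[0.0, 0.3, 0.2],
     [0.1, 0.0, 0.4],
     [0.3, 0.2, 0.0]]
b = [1.0, 2.0, 3.0]

x = async_fixed_point(A, b)
# How far x is from satisfying x = A x + b, measured in the max norm.
residual = max(
    abs(sum(A[i][j] * x[j] for j in range(3)) + b[i] - x[i]) for i in range(3)
)
print(x, residual)
```

Despite the random update order and the stale reads, the residual should shrink toward zero. Scaling A so that some row sum exceeds 1 removes the contraction assumption, and with it the guarantee.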

Eric Xing, a faculty member in Carnegie Mellon’s Computer Science Department, gave a talk on “Strategies and Principles for Distributed Machine Learning.” His lecture described a variety of computational software environments used in big data settings, and how different implementation choices can yield vastly different levels of performance. This topic was interesting because it bridged the theory of statistical computation with software engineering considerations, which end up mattering substantially more for performance than might be expected. For example, in a distributed setting, having a master node that manages and coordinates workers can yield different performance characteristics than a peer-to-peer model where all of the workers talk to each other – even if the exact same algorithm is used in both cases. Similar lines of thought have been highly relevant in my own work. Having written papers on running Markov chain Monte Carlo algorithms in two different parallel settings – compute clusters and graphics cards – I have learned that software engineering considerations are an inherent part of parallel computing and that it is important to study them.
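One way to see why the two layouts can behave differently is to count messages. The sketch below is a deliberately simplified model, not something taken from Prof. Xing’s talk: it assumes one synchronization round in which every worker shares its update, and it ignores message size, latency, and network topology. The function name and the topology labels are hypothetical.

```python
def messages_per_round(num_workers, topology):
    """Return (total messages, messages handled by the busiest node) for one round."""
    n = num_workers
    if topology == "master":
        # Each worker sends its update to the master and receives the new state back,
        # so the master touches every one of the 2n messages.
        return 2 * n, 2 * n
    if topology == "peer":
        # Each worker sends its update directly to every other worker,
        # so each peer sends n - 1 messages and receives n - 1.
        return n * (n - 1), 2 * (n - 1)
    raise ValueError(f"unknown topology: {topology}")

for n in (4, 16, 64):
    print(n, messages_per_round(n, "master"), messages_per_round(n, "peer"))
```

Under these assumptions the master layout sends fewer messages in total but routes all of them through a single node, while the peer-to-peer layout spreads the load at the cost of quadratically more traffic. Which effect dominates in practice depends on details this model leaves out.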

I also found the discussion panel toward the end of the workshop particularly memorable. My PhD advisor at UCSC, David Draper, was on the panel, along with a number of distinguished faculty members from several universities; it was moderated by Sujit Ghosh, Deputy Director of SAMSI. Draper made the point that for the field of statistical computation to advance, “statisticians need to become better computer scientists, and computer scientists need to become better statisticians.” This point resonated with me because, as students in a graduate program in statistics, we are largely not taught anything about high performance computing, whether in traditional supercomputer or Silicon Valley style hardware environments. I, however, have been fortunate to work in both settings, through an academic collaboration with Shawfeng Dong, an astrophysicist at UCSC, and through my time at eBay, Inc. Many statisticians have not had a comparable opportunity.

This makes statistical high performance computing a specialty area, which in my view has two discipline-wide consequences: (1) it’s easy for non-specialists to write code and design algorithms that scale poorly, and (2) the typical software stack that statisticians are taught and use in practice is filled with out-of-date tools and programming concepts that make coding and debugging unnecessarily difficult.

It was very interesting to hear similar ideas brought up and discussed by the panel. The experience was valuable because the panel emphasized the implications for statistical education, a topic on which I do not have many opinions, as I am still a student. The discussion gave me the opportunity to think about our field as statisticians and applied mathematicians, and about where our discipline is headed. This insight is important for a young person such as myself, because it tells me what to study and spend my time on throughout my graduate program.

Participants at the 2016 DPDA Workshop discuss various topics on distributed computing during the Workshop Reception and Poster session.



A Good Experience Overall…
Overall, I found the workshop highly memorable. The points highlighted above merely scratch the surface of the topics I wanted to discuss. An honorable mention goes to the lecture by Han Liu, a faculty member at the Statistical Machine Learning Lab at Princeton University. Liu’s talk was called “Blessing of Massive Scale,” and in it he demonstrated that some problems become much easier when they are big. Faming Liang, a faculty member in the University of Florida’s Department of Biostatistics, spoke about “Bayesian Neural Networks for High Dimensional Variable Selection.” I found Liang’s treatment of Bayesian asymptotics interesting.

Finally, Samuel Franklin, of 360i: Digital Marketing Agency, presented a talk called “HDPA Growth Constraints in Digital Marketing.” The subject was surprisingly interesting for a talk that involved no mathematics. He called upon all of us in the room, the next generation of statisticians, engineers and applied mathematicians, to be champions for increased education on high performance computing, which foreshadowed some of what was later said in the panel.


Samuel Franklin, Vice President of Data Science at 360i, lectures on the importance of high speed computing as a resource for digital marketing strategies.

I was thankful that I had the opportunity to attend and listen to all of the wonderful perspectives that were offered on our field of study, as well as the opportunity to try North Carolina BBQ during one of the evenings. I would also like to thank SAMSI for compiling and sharing the approved lectures from this event online. For more information about the DPDA Workshop or simply to review what was presented, visit: www.samsi.info/dpda.