Online Mini-symposium on Probabilistic and Topological Methods for Biological Data

as part of the SIAM Conference on Mathematics of Data Science (MDS20)

Probability session

Date: June 11, 2020.
Time: 3:00 PM - 5:00 PM (EST)


Wasiur R. KhudaBukhsh
The Ohio State University

Dynamic Survival Analysis of Epidemics and How COVID-19 shaped it.

Arindam Fadikar
Argonne National Laboratory

Clustering based Gaussian Process Emulation and Calibration of a Stochastic Agent based Model.

Pragya Sur
Harvard University

Modern Likelihood based Approaches for High-Dimensional Non-Linear Models

Yuekai Sun
University of Michigan

Communication-Efficient Integrative Regression in High-Dimensions.


This talk will introduce the notion of dynamic survival analysis (DSA). We show that solutions to ordinary differential equations (ODEs) describing the large-population limits of Markovian stochastic epidemic models can be interpreted as survival or cumulative hazard functions when analysing data on individuals sampled from the population. We refer to the individual-level survival and hazard functions derived from population-level equations as a survival dynamical system (SDS). To illustrate how population-level dynamics imply probability laws for individual-level infection and recovery times that can be used for statistical inference, we show numerical examples based on synthetic data as well as the COVID-19 outbreak data.

The second part of the talk will focus on developing the DSA methodology for non-Markovian dynamics. Measure-valued processes play a key role in this endeavour. For the non-Markovian set-up, the DSA-likelihood is shown to depend on solutions to partial differential equations instead of ODEs as before. Preliminary numerical results for parameter inference will be shown. Finally, extension to non-Markovian epidemics on configuration model random graphs will be discussed.


Gaussian process (GP) model is an effective tool for emulating complex computer simulations. Heterogeneous gaussian process (Binois et al, 2017) has been shown to be superior in the presence of input dependent noise as in the case of any stochastic computer simulation. However, all GP models impose a gaussian variability assumption in the emulator. In this talk, we propose a new approach based on heterogeneous GP and a clustering based technique to emulate and hence calibrate a stochastic agent based simulation. The basic idea is to relax the normality assumption by borrowing the standard gaussian mixture model and coupling that with a traditional GP. The study is motivated by with an example taken from the 2015 Ebola challenge workshop which simulated an Ebola epidemic to evaluate methodology.


Generalized linear models are a class of widely used non-linear models in statistics. Classical maximum-likelihood theory based statistical inference is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood-estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio-test (LRT) is asymptotically a Chi-Squared. In this talk, I will consider the specific setting of logistic regression models and show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results, and (3) the LRT is not distributed as a Chi-Square. Consequently, p-values and confidence intervals based on classical theory are completely invalid. I will describe a new theory that precisely characterizes the asymptotic behavior of the MLE and the LRT under certain assumptions on the covariate distribution; this in turn yields valid p-values and confidence intervals in such high-dimensional settings. If time permits, I will discuss general techniques that may enable the study of likelihood based estimators in other non-linear models under similar high-dimensional asymptotics. This is based on joint works with Emmanuel Candes, Yuxin Chen and Qian Zhao.


We consider the task of meta-analysis in high-dimensional settings in which the data sources we wish to integrate are similar, but nonidentical. To borrow strength across such heterogeneous data sources, we introduce a global parameter that remains sparse even in the presence of outlier data sources. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset.