The Fields Institute for Research in Mathematical Sciences

Distinguished Lecture Series in Statistical Science
April 9 (3:30 p.m.) & April 10 (11:00 a.m.), 2015
Room 230, Fields Institute

Terry Speed
Walter & Eliza Hall Institute of Medical Research, Melbourne

Terry Speed is currently a Senior Principal Research Scientist at the Walter and Eliza Hall Institute of Medical Research. His lab has a particular focus on molecular data collected by cancer researchers, but also works with scientists studying immune and infectious diseases, and with researchers in basic biomedical science. His research interests are broad, and include the statistical and bioinformatic analysis of microarray, DNA sequence and mass spectrometry data from genetics, genomics, proteomics and metabolomics. The lab works with molecular data at several different levels, from the lowest level, where the data come directly from the instruments that generate them, up to the tasks of data integration and of relating molecular to clinical data. Speed has served on the editorial boards of many publications, including the Journal of Computational Biology, JASA, Bernoulli and the Australian and New Zealand Journal of Statistics, and has been recognized with the Australian Prime Minister's Prize for Science and the Eureka Prize for Scientific Leadership, among other awards.
  General lecture: Epigenetics: A New Frontier 
   
     
Abstract: Scientists have now mapped the human genome; the next frontier is understanding human epigenomes, the 'instructions' that tell the DNA whether to make skin cells, blood cells or other body parts. With a few exceptions, the DNA sequence of an organism is the same whatever cell is considered. So why are blood, nerve, skin and muscle cells so different, and what mechanism creates this difference? The answer lies in epigenetics. If we compare the genome sequence to text, the epigenome is the punctuation: it shows how the DNA should be read. Advances in DNA sequencing over the last 5-8 years have allowed large amounts of DNA sequence data to be compiled. For every single reference human genome, there will be literally hundreds of reference epigenomes, and their analysis will occupy biologists, bioinformaticians and biostatisticians for some time to come. In this talk I will introduce the topic and the data, and outline some of the challenges.
     
   
Specialized lecture: Normalization of omic data after 2007 (joint with Johann Gagnon-Bartsch and Laurent Jacob)
   
   
     
Abstract: For over a decade now, normalization of transcriptomic, genomic and, more recently, metabolomic and proteomic data has been something you do to "raw" data to remove biases, technical artifacts and other systematic non-biological features. These features could be due to sample preparation and storage, reagents, equipment, people and so on. It was a "one-off" fix to what I'm going to call removing unwanted variation. Since around 2007, a more nuanced approach has been available, due to JT Leek and J Storey (SVA) and O Stegle et al (PEER). These new approaches do two things differently. First, they do not assume the sources of unwanted variation are known in advance; they are inferred from the data. Second, they deal with the unwanted variation in a model-based way, not "up front." That is, they do it in a problem-specific manner, where different inference problems warrant different model-based solutions; for example, the solution for removing unwanted variation for estimation is not necessarily the same as for prediction. Over the last few years, I have been working with Johann Gagnon-Bartsch and Laurent Jacob on these same problems, making use of positive and negative controls, a strategy which we think has some advantages. In this talk I'll review the area and highlight some of the advantages of working with controls. Illustrations will be from microarray, mass spectrometry and RNA-seq data.
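
As a rough illustration of how a control-based removal of unwanted variation can look in code, the sketch below (Python with NumPy) estimates factors of unwanted variation from negative-control features by a singular value decomposition and regresses them out of every feature. It is a minimal, hypothetical example broadly in the spirit of the control-based strategy mentioned in the abstract, not the speakers' implementation; the function name, the toy data and the choice of k are illustrative assumptions.

    import numpy as np

    def ruv_like_normalize(Y, control_idx, k=2):
        """Remove k factors of unwanted variation estimated from negative controls.

        Y           : samples x features matrix of log-scale measurements
        control_idx : indices of negative-control features, assumed unaffected
                      by the biology of interest (an illustrative assumption)
        k           : number of unwanted-variation factors to remove
        """
        Y = np.asarray(Y, dtype=float)
        Yc = Y[:, control_idx]                    # negative controls only
        Yc = Yc - Yc.mean(axis=0)                 # centre each control feature
        # Left singular vectors of the controls estimate the unwanted factors W
        U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
        W = U[:, :k]                              # samples x k unwanted factors
        # Regress every feature on W and subtract the fitted unwanted component
        alpha, *_ = np.linalg.lstsq(W, Y, rcond=None)
        return Y - W @ alpha

    # Toy usage: 20 samples x 100 features, features 0-19 serve as negative controls
    rng = np.random.default_rng(0)
    batch = np.repeat([0.0, 1.0], 10)[:, None]    # a hidden batch effect
    Y = rng.normal(size=(20, 100)) + batch * rng.normal(size=(1, 100))
    Y_clean = ruv_like_normalize(Y, control_idx=np.arange(20), k=1)

Because the controls are assumed to carry only unwanted variation, whatever structure they share across samples can be treated as technical and removed from all features.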
     
   
     
   
  
  
     
Distinguished Lecture Series in Statistical Science
April 23 (3:30 p.m.) & April 24 (11:00 a.m.), 2015
Room 230, Fields Institute

Bin Yu
University of California, Berkeley
   
Bin Yu is Chancellor's Professor in the Departments of Statistics and of Electrical Engineering & Computer Science at the University of California at Berkeley. Her current research interests focus on statistics and machine learning theory, methodologies, and algorithms for solving high-dimensional data problems. Her group is engaged in interdisciplinary research with scientists from genomics, neuroscience, and remote sensing. She is a member of the U.S. National Academy of Sciences and a fellow of the American Academy of Arts and Sciences. She was a Guggenheim Fellow in 2006, an Invited Speaker at ICIAM in 2011, and the Tukey Memorial Lecturer of the Bernoulli Society in 2012. She was President of the Institute of Mathematical Statistics (IMS) in 2013-2014.
April 23, 2015 at 3:30 p.m.: Stability
Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to reasonable perturbations of the data and of the model used. Jackknife, bootstrap, and cross-validation are based on perturbations of the data, while robust statistics methods deal with perturbations of the model.
In this talk, a case is made for the importance of stability in statistics. First, we motivate the necessity of stability of interpretable encoding models for movie reconstruction from brain fMRI signals. Second, we find strong evidence in the literature for the central role of stability in statistical inference. Third, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for the Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then used in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across more than 2,000 voxels. Last, a novel stability argument is seen to drive new results that shed light on the intriguing interactions between sample-to-sample variability and heavier-tailed error distributions (e.g., double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n lies in (0.3, 1) and the errors are double-exponential, the least squares (LS) estimator is better than the least absolute deviation (LAD) estimator.
This talk draws on material from papers with S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, J. L. Gallant, with C. Lim, and with N. El Karoui, D. Bean, P. Bickel, and C. Lim.
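
To make the idea of bringing stability to bear on penalty selection concrete, here is a hypothetical Python sketch (NumPy and scikit-learn): for each candidate Lasso penalty it refits the model on random half-samples of the data and scores the penalty by the variability of the resulting fitted values relative to their size, preferring the most stable fit. This is only a schematic of estimation-stability selection under data perturbation, with illustrative toy data and parameters; it is not the ES-CV procedure referred to in the abstract.

    import numpy as np
    from sklearn.linear_model import Lasso

    def stability_select_alpha(X, y, alphas, n_splits=20, seed=0):
        """Pick the Lasso penalty whose fitted mean function is most stable
        across random half-samples of the data (a schematic ES-style criterion)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        scores = []
        for alpha in alphas:
            fits = []
            for _ in range(n_splits):
                idx = rng.choice(n, size=n // 2, replace=False)   # perturb the data
                model = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
                fits.append(model.predict(X))                     # evaluate on full X
            fits = np.array(fits)
            mean_fit = fits.mean(axis=0)
            denom = np.mean(mean_fit ** 2) + 1e-12
            scores.append(np.mean(fits.var(axis=0)) / denom)      # instability score
        return alphas[int(np.argmin(scores))]

    # Toy usage: sparse linear model with n=100 samples and p=200 predictors
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 200))
    beta = np.zeros(200)
    beta[:5] = 2.0
    y = X @ beta + rng.normal(size=100)
    best_alpha = stability_select_alpha(X, y, alphas=np.logspace(-2, 0, 10))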
      
   
April 24, 2015 at 11:00 a.m.: The multi-facets of a data science project to answer: how are organs formed?
   
Genome-wide data reveal an intricate landscape where gene actions and interactions in diverse spatial areas are common, both during development and in normal and abnormal tissues. Understanding local gene networks is thus key to developing treatments for human diseases. Given the size and complexity of recently available systematic spatial data, defining the biologically relevant spatial areas and modeling the corresponding local biological networks present an exciting and ongoing challenge. It requires the integration of biology, statistics and computer science; that is, it requires data science.
In this talk, I present results from a current project co-led by biologist Erwin Frise from Lawrence Berkeley National Lab (LBNL) to answer the fundamental systems biology question in the talk title. My group (Siqi Wu, Antony Joseph, Karl Kumbier) collaborates with Dr. Frise and other biologists (Ann Hommands) of Celniker's lab at LBNL, which generates the Drosophila embryonic spatial expression image data. We leverage our group's prior research experience in computational neuroscience to apply appropriate ideas from statistical machine learning and create a novel image representation that decomposes spatial data into building blocks (or principal patterns). These principal patterns provide an innovative and biologically meaningful approach to the interpretation and analysis of large, complex spatial data. They are the basis for constructing local gene networks, and we have been able to reproduce almost all the links in the Nobel Prize-winning (local) gap-gene network. In fact, Celniker's lab is running knock-out experiments to validate our predictions on gene-gene interactions. Moreover, to understand the decomposition algorithm for images, we have derived sufficient and almost necessary conditions for local identifiability of the algorithm in the noiseless and complete case. Finally, we are collaborating with Dr. Wei Xu from Tsinghua University to devise a scalable, open software package to manage the acquisition and computation of image data, designed in a manner that will be usable by biologists and expandable by developers.
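
As one way to picture what decomposing spatial expression images into principal patterns means, the hypothetical Python sketch below uses non-negative matrix factorization from scikit-learn to factor an images-by-pixels matrix into a small set of non-negative spatial patterns and per-image weights. This is a standard decomposition of this general kind, shown only under illustrative assumptions about the data and dimensions; it is not the group's actual algorithm or pipeline.

    import numpy as np
    from sklearn.decomposition import NMF

    # Toy stand-in for an images x pixels matrix of non-negative expression values
    rng = np.random.default_rng(0)
    n_images, n_pixels, n_patterns = 200, 32 * 16, 6
    images = rng.random((n_images, n_pixels))

    # Factor images ~= weights @ patterns, with both factors non-negative
    nmf = NMF(n_components=n_patterns, init="nndsvda", max_iter=500, random_state=0)
    weights = nmf.fit_transform(images)   # n_images x n_patterns: per-image loadings
    patterns = nmf.components_            # n_patterns x n_pixels: spatial building blocks

    # Each row of `patterns` can be reshaped to the embryo grid and inspected;
    # the `weights` could then be used downstream, e.g. to group genes that load
    # on the same pattern when building local gene networks.
    pattern_images = patterns.reshape(n_patterns, 32, 16)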
    
   
    
   
    
   
   
   