Speaker Abstracts

  Hugh Chipman, Acadia University, 
  Statistical and computational challenges in networks and cybersecurity
   
Networks and cybersecurity are producing varied, rich, complex, and big data, opening up great research opportunities in the statistical, computational, and mathematical sciences. The workshop held at the CRM in early May showcased the latest statistical and machine learning research for social networks (e.g., Facebook) and cybersecurity. I will provide an overview of some of the most interesting problems.
     
Jean-François Plante, HEC Montréal,
Challenges, Tools and Examples for Big Data Inference
   
The Opening Conference and Boot Camp of the Thematic Program on Statistical Inference, Learning, and Models for Big Data was held at the Fields Institute in Toronto from January 12th to January 23rd. A total of 35 scientific talks were presented, providing an overview of the main themes of the Program. Although big data problems from numerous fields were covered, common challenges emerged and some tools appeared in many different contexts. A number of successful applications of big data inference were also presented. In this talk, I will describe the challenges and tools that stood out most frequently and will summarize some of the applications presented during the Opening Conference and Boot Camp. This work is based on a manuscript under preparation by the postdoctoral fellows and long-term visitors of the Fields Institute who participated in the Big Data Program.
     
  Lisa Lix, University of Manitoba, 
  How Big Data and Causal Inference Work Together in Health Policy
   
Population-based administrative, clinical, and survey databases have long been used to conduct policy-relevant research about population health and health service use. In recent years, however, there has been increasing emphasis on person-specific linkages of multiple, complex databases to address novel questions in such areas as drug and medical product safety, chronic disease risk prediction, and comparative effectiveness of medical treatments. Causal inference techniques are routinely applied to observational health databases because of issues of cost, ethics, and selection bias in randomized trials. The March workshop on Big Data in Health Policy explored causal inference methods and applications in the presentation, design, and analysis of health policy research. Related topics were also explored, including enabling and disabling factors in the use of Big Data in health policy contexts, methods to combine or synthesize databases or results, and the interdisciplinary nature of the health policy research environment. This presentation will provide an overview of the sessions, key lessons from participants, and future directions for collaborative research and training.
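
As a concrete illustration of the kind of causal inference technique applied to observational health databases, here is a minimal sketch of inverse probability of treatment weighting (IPTW) built on a propensity score model. The data layout and column names are hypothetical, and the method shown is a generic textbook approach, not anything specific to the workshop.

```python
# Hedged sketch: IPTW estimate of an average treatment effect from an
# observational data set. Column names below are hypothetical examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_effect(df, treatment="treated", outcome="outcome",
                covariates=("age", "sex", "comorbidity_count")):
    """df: pandas DataFrame with a binary treatment column, an outcome
    column, and numeric confounder columns."""
    X = df[list(covariates)].to_numpy()
    t = df[treatment].to_numpy()
    y = df[outcome].to_numpy()

    # Propensity score: estimated P(treated | covariates)
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

    # Stabilized weights reduce variance relative to raw 1/ps weights
    p_t = t.mean()
    w = np.where(t == 1, p_t / ps, (1 - p_t) / (1 - ps))

    # Weighted difference in mean outcomes between treated and control
    mu1 = np.average(y[t == 1], weights=w[t == 1])
    mu0 = np.average(y[t == 0], weights=w[t == 0])
    return mu1 - mu0
```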
     
   Stephanie Shipp, Virginia Tech, 
  Policy meets Social and Decision Informatics
The exponential growth of digital data has created an "all data" revolution that is allowing us to view the world at an unprecedented scale and level of granularity. A similar revolution occurred in the 1930s with the emergence of regularly conducted surveys and probability sampling, primarily by federal statistical agencies. Though these survey data are constrained to a monthly, quarterly, or annual timeframe, the surveys became the primary source of data for large-scale social science research, and over the past 80 years we have developed principles for managing, analyzing, interpreting, and applying these data that have come to feel intuitive. Digital data are now providing a highly detailed window into our lives on a daily and even minute-by-minute basis. The implications for policy are both exciting and challenging. We are offered opportunities to inform social policy development through new insights into (1) how individuals and organizations make choices, using, for example, a combination of credit card transactions, GPS tracking, and demographic data, and (2) how the opinions, preferences, and interests of individuals interact in collective decisions, using social media data. We are challenged to manage the transparency, quality, and representativeness of the data. As the all data revolution provides new sources of data to inform policy, it simultaneously requires policy changes. These policy changes will need strategic statistical thinking and innovation to develop pragmatic solutions for using these data in social policy. We lack the 80 years of principles to guide us in the reasonable (i.e., scientific), objective, and sensitive management and application of these data to social policy development. The opportunities and challenges for developing these principles are outlined.
  
  Stan Matwin, Dalhousie University, 
  Big Data meets Big Water: Mining Ocean Vessel Trajectory Data
In this presentation we will focus on ongoing work in the exploration and analysis of data from ocean vessel movements, using Automatic Identification System (AIS) data. We will discuss some of the challenges and benefits related to the large-scale exploration and analysis of AIS data. We will look at the detection of anomalous ship trajectories in mid-ocean and in port vicinity, and at the ecologically oriented detection and analysis of data related to fishing activities. We will discuss our early results in these selected applications, including the data representation and data modeling techniques, particularly the clustering, classification, and attribute engineering used in our work. We will conclude with a discussion of potential future work with AIS data.
  
  Evangelos Milios, Dalhousie University, 
  Exploiting Semantic Analysis of Documents for the Domain User
   
Many document organization tasks, such as a student writing the related work chapter of a thesis, a professor surveying the state of the art for a proposal or planning a reading course, or a conference chair organizing sessions, would be performed more efficiently through the use of document clustering. In this work, we present (a) interactive document clustering algorithms that allow the user to steer the clustering toward her point of view, including an ensemble algorithm based on Wikipedia concepts; (b) named entity recognition and disambiguation using the multilingual Wikipedia category structure; and (c) a simple but effective computation of semantic relatedness between words and documents based on the Google n-gram corpus, which is competitive with human performance on standard word pair data sets.
This is joint work with H. Nourashraf, D. Arnold, M. Lipczak, A. Koushkestani, A. Islam and V. Keselj.
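
As a rough illustration of corpus-based semantic relatedness of the kind mentioned in (c), here is a minimal sketch using pointwise mutual information (PMI) over co-occurrence counts. The actual Google n-gram measure used in this work may well differ, and the counts in the usage example are made up.

```python
# Hedged sketch: PMI-style word relatedness from co-occurrence counts.
import math

def pmi_relatedness(count_w1, count_w2, count_joint, corpus_size):
    """PMI of two words given their unigram counts, their co-occurrence
    count, and the total number of counting opportunities in the corpus."""
    if count_joint == 0:
        return 0.0
    p1 = count_w1 / corpus_size
    p2 = count_w2 / corpus_size
    p12 = count_joint / corpus_size
    return math.log(p12 / (p1 * p2))

# Made-up counts: a positive score means the words co-occur more often
# than chance would predict, i.e. they are semantically related.
print(pmi_relatedness(50_000, 30_000, 2_000, 10_000_000))
```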
     
Andrew Rau-Chaplin, Dalhousie University,
Scaling up to Big Data: Algorithmic Engineering + HPC
   
Big data analytics projects apply machine learning techniques to the analysis of large data sets to help uncover relationships and predict outcomes and behaviors. From a research perspective, these projects typically start with small data sets and focus on identifying the machine learning techniques best suited to the problem. Once a promising approach has been identified, the next key challenges are performance and scalability: can the method be made to work on truly big data sets in a timely manner?
This talk focuses on the application of algorithmic engineering and high performance computing (HPC) techniques to big data analytics. It draws on practical experience in a range of projects, from text analytics to catastrophic risk analysis, and tries to highlight algorithmic engineering and HPC approaches that are widely applicable and often lead to fast, scalable applications.
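
As a minimal illustration of the simplest scaling pattern in this space, the sketch below partitions a data set, maps an analytic over the chunks in parallel, and reduces the partial results. It is a generic data-parallel template with a placeholder analytic, not code from the projects described.

```python
# Hedged sketch: embarrassingly parallel map-reduce over data chunks.
from multiprocessing import Pool

def analyze_chunk(records):
    # Placeholder analytic: count records matching some predicate.
    return sum(1 for r in records if r % 7 == 0)

def parallel_count(data, n_workers=4, n_chunks=16):
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_workers) as pool:
        return sum(pool.map(analyze_chunk, chunks))  # reduce step

if __name__ == "__main__":
    print(parallel_count(list(range(1_000_000))))
```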
  
Rosane Minghim, Dalhousie University and University of São Paulo,
  Multidimensional Projections and Tree-based Techniques for Visualization 
    and Mining
   
A multidimensional projection is a technique that maps data onto a visual space, usually two- or three-dimensional. Many algorithms for this task have been developed in recent years, aimed at improving user control as well as precision and scalability. Tree-based techniques are also widely used in the visualization of abstract data with or without hierarchical content. The types of data that can be mapped using these strategies vary widely, and are usually represented either by a set of attributes or by a similarity matrix. In this talk, we discuss these two types of algorithms, illustrate their applications to the interpretation of complex data, and discuss their capabilities and drawbacks. Additionally, we show how these types of visual approaches to data analysis can be used to support tasks in data mining, such as clustering and classification. We illustrate most of the concepts by applying them to the visual analysis of text and image collections.
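
To make the idea concrete, here is a minimal sketch of a multidimensional projection onto a two-dimensional visual space, using classical metric MDS from scikit-learn purely as an example of the technique family. It shows both input representations mentioned above: attribute vectors and a precomputed distance (dissimilarity) matrix.

```python
# Hedged sketch: project 50-dimensional points onto 2-D for plotting.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 items with 50 attributes

# Attribute-based input: project directly from the feature vectors.
xy = MDS(n_components=2, random_state=0).fit_transform(X)

# Similarity-matrix input: supply pairwise dissimilarities instead.
D = squareform(pdist(X))
xy_from_dist = MDS(n_components=2, dissimilarity="precomputed",
                   random_state=0).fit_transform(D)
```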
    
    Rob Beiko, Dalhousie University
  Microbial genomics for rapid investigation of infectious disease
   
    In Canada, several agencies carry out surveillance activities to monitor 
      for new infectious disease outbreaks, and coordinate responses to control 
and eliminate them. These activities are time-critical, and delays in infectious
      agent identification and outbreak mapping can have serious public health 
      consequences. Sequencing the DNA of pathogens will accelerate this response, 
      both by providing rapid and complete information about which specific strain 
      is responsible for a clinical case, and by providing a fine-scale view of 
      the origin and spread of an outbreak. The Integrated Rapid Infectious Disease 
      Analysis (IRIDA) project aims to automate genome sequencing, processing, 
      and pattern inference during a potential outbreak. Realizing the potential 
      of these new approaches requires advances on several fronts, and in my presentation 
      I will focus on the bioinformatic challenges of analyzing thousands of genomes 
      to generate the relevant outbreak data as quickly, reliably, and securely 
      as possible.
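
As a toy illustration of one basic building block of genomic outbreak analysis, the sketch below computes a pairwise SNP distance matrix over aligned sequences; small distances between isolates are one signal that they may belong to the same outbreak. The sequences are made up, and real pipelines such as IRIDA's involve far more (assembly, quality control, phylogenetics).

```python
# Hedged sketch: pairwise SNP distances between aligned toy genomes.
import numpy as np

def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ, skipping gaps."""
    return sum(a != b for a, b in zip(seq_a, seq_b)
               if a != "-" and b != "-")

genomes = {"isolateA": "ACGTACGT",
           "isolateB": "ACGTACGA",
           "isolateC": "ACCTTCGA"}
names = list(genomes)
D = np.array([[snp_distance(genomes[x], genomes[y]) for y in names]
              for x in names])
print(names)
print(D)   # e.g. isolateA and isolateB differ at a single site
```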
     
    Roger Grosse, University of Toronto
  Highlights from the deep learning workshop
   
    I will give an overview of some highlights from the Deep Learning Workshop 
in the Big Data Thematic Program. Deep learning has recently seen much success in automatically finding hierarchical representations of complex, high-dimensional
      datasets, and has revolutionized application areas from computer vision 
      to speech recognition. Some topics from the workshop include scalable optimization 
      methods for deep learning, interpretable representations, learning fair 
      representations, and applications to reinforcement learning. I will finish 
      by discussing some recent advances in evaluating restricted Boltzmann machines 
      and other Markov random fields as generative models.
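
To illustrate the evaluation problem mentioned at the end, the sketch below computes the log-likelihood of a binary RBM, log p(v) = -F(v) - log Z, by enumerating the partition function Z exactly. That enumeration is only feasible for toy models; at realistic scale Z is intractable, which is what estimators such as annealed importance sampling address. The model sizes and parameters here are arbitrary.

```python
# Hedged sketch: exact log-likelihood of a tiny binary RBM.
import itertools
import numpy as np

def free_energy(v, W, b, c):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + (v @ W)_j))."""
    return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W))

def exact_log_z(W, b, c):
    """log Z by brute force over all 2^n_visible configurations."""
    n_visible = W.shape[0]
    neg_f = np.array([-free_energy(np.array(v, dtype=float), W, b, c)
                      for v in itertools.product([0, 1], repeat=n_visible)])
    m = neg_f.max()                       # log-sum-exp for stability
    return m + np.log(np.exp(neg_f - m).sum())

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))    # 6 visible, 4 hidden units
b, c = rng.normal(size=6), rng.normal(size=4)
v = rng.integers(0, 2, size=6).astype(float)
print("log p(v) =", -free_energy(v, W, b, c) - exact_log_z(W, b, c))
```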
     
  Einat Gil, University of Toronto
Learning about Big Data among Secondary School Students in a Technology-Supported Collaborative Learning Environment
   
Alongside the thematic program at the Fields Institute, a short unit on learning about Big Data was designed and implemented in a Toronto secondary school. This three-week interdisciplinary informal statistics unit was developed to allow students in a 12th grade Mathematics for Data Management course to explore both small and Big Data using inquiry and collaborative approaches. In one of the activities, the learning trajectory was guided through an Interactive Orchestrated Learning Space (IOLS; Gil & Slotta, 2015), inspired by recent smart classroom and knowledge community approaches (Slotta, 2010; Slotta, Tissenbaum & Lui, 2013). The design and pedagogical approach allowing for the introduction of ideas related to the use of Big Data in secondary school will be discussed, and initial findings about student learning from this mixed-methods study will be presented.