Workshop on Missing Data Problems
          August 5-6, 2004
        
        Speaker Abstracts  
        
        
        Jinbo Chen (NIH)
        
        Semiparametric efficiency and optimal estimation for missing data problems, 
        with application to auxiliary outcomes
        This expository talk emphasizes the link between semiparametric efficient 
        estimation and optimal estimating functions in the sense of Godambe and 
        Heyde. We consider models where the linear span of influence functions 
        for regular, asymptotically linear estimators of a Euclidean parameter 
        may be identified as a space of unbiased estimating functions indexed 
        by another class of functions. Determination of the optimal estimating 
        function within this (largest possible) class, which is facilitated by 
        a theorem of Newey and McFadden, then identifies the efficient influence 
        function. The approach seems particularly useful for missing data problems 
        due to a key result by Robins and colleagues: all influence functions 
        for the missing data problem may be constructed from influence functions 
        for the corresponding full data problem. It yields well known results 
        for the conditional mean model in situations where the covariates, the 
        outcome or both are missing at random. When only the outcome is missing, 
        but a surrogate or auxiliary outcome is always observed, the efficient 
        influence function takes a simple closed form. If the covariates and auxiliary 
          outcomes are all discrete, moreover, a Horvitz-Thompson estimator with 
        empirically estimated weights is semiparametric efficient. 
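          For concreteness, a schematic form of such a Horvitz-Thompson estimating 
          equation with empirically estimated weights is (the symbols R, S, pi-hat 
          and U are illustrative notation, not taken from the talk):

          \[
          \sum_{i=1}^{n} \frac{R_i}{\hat{\pi}(X_i, S_i)}\, U(Y_i, X_i; \beta) = 0,
          \qquad
          \hat{\pi}(x, s) = \frac{\#\{i : R_i = 1,\ X_i = x,\ S_i = s\}}{\#\{i : X_i = x,\ S_i = s\}},
          \]

          where R_i indicates that the outcome Y_i is observed, S_i is the auxiliary 
          outcome, and U is an unbiased full-data estimating function for the 
          conditional mean model.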
        
        Shelley B. Bull
          University of Toronto, Dept of Public Health Sciences and Samuel Lunenfeld 
          Research Institute
          Missing Data in Family-Based Genetic Association Studies
          A standard design for family-based association studies involves genotyping 
          of an affected child and their parents, and is often referred to as 
          the case-parent design. More generally, multiple affected children and 
          their unaffected siblings (i.e., the whole nuclear family) may also be 
          genotyped. Genotyping provides data concerning what allele at a specific 
          genetic locus is transmitted from each parent to a child, and excess 
          transmission of an allele to affected children provides evidence for 
          genetic association. Immunity to the effects of population stratification 
          is generally achieved by conditioning on parental genotypes in the analysis. 
          Knowledge of transmission may be incomplete, however, when one or both 
          parents are unavailable for genotyping, or when parents are genotyped 
          but the genetic marker is less than fully informative, so that transmission 
          is ambiguous. A number of approaches to address this form of missing 
          data have been proposed, ranging from exclusion of families with incomplete 
          data, to reconstruction of parental genotypes using the genotypes of 
          unaffected children, to maximum-likelihood-based missing data methods 
          (such as EM). A conceptually different approach to handling missing 
          data in this setting relies on conditioning on a sufficient statistic 
          for the null hypothesis. Formally, the phenotypes (affected/unaffected) 
          of all family members and the genotypes of the parents constitute a 
          sufficient statistic for the null hypothesis of no excess transmission 
          when parental genotypes are observed. When parental genotypes are missing, 
          a sufficient statistic can still be found, such that under the null, 
          the conditional distribution does not depend on the unknown mode of 
          inheritance or the allele distribution in the population. In this talk 
          we will compare and contrast this alternative approach to some existing 
          methods with respect to test efficiency and robustness to population 
          stratification.
        
        Nilanjan Chatterjee
          National Cancer Institute
          Missing Data Problems in Statistical Genetics
          Missing data problems are ubiquitous in statistical genetics. In 
          this talk, I will review the missing data problems posed in a range 
          of topics including segregation analysis, kin-cohort analysis, haplotype-based 
          association studies and use of genomic controls to account for population 
          stratification. In each area, I will briefly review the scientific problem, 
          the data structure, required assumptions and the computational tools 
          that are currently being used. If time permits, I will further describe 
          some work in progress in the area of gene-environment interaction where 
          it may be desirable to use genomic controls to adjust for population 
          stratification.
        Richard Cook and Grace Yi
          University of Waterloo
        Weighted Generalized Estimating Equations for Incomplete Clustered 
          Longitudinal Data 
        Estimating equations are widely used for the analysis of longitudinal 
          data when interest lies in estimating regression or association parameters 
          based on marginal models. Inverse probability weighted estimating equations 
          (e.g. Robins et al., 1995) have been developed to deal with biases that 
          may result from incomplete data which are missing at random (MAR). We 
          consider the problem in which the longitudinal responses arise in clusters, 
          generating a cross-sectional and longitudinal correlation structure. 
          Structures of this type are quite common and arise, for example, in 
          any cluster randomized study with repeated measurements of the response 
          over time. When data are incomplete in such settings, however, the inverse 
          probability weights must be obtained from a model which allows estimation 
          of the pairwise joint probability of missing data for individuals from 
          the same cluster, conditional on their respective histories. We describe 
          such an approach and consider the importance 
          of modeling this within-cluster association in the missing data process. 
          The methods are applied to data from a motivating cluster-randomized 
          school-based study called the Waterloo Smoking Prevention Project.
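          As a point of reference, the inverse probability weighted estimating 
          equations discussed here take the general form (illustrative notation; 
          the clustered extension is the subject of the talk):

          \[
          \sum_{i=1}^{K} D_i^\top V_i^{-1} \Delta_i \{ Y_i - \mu_i(\beta) \} = 0,
          \qquad
          \Delta_i = \mathrm{diag}\!\left( \frac{R_{i1}}{\hat{\pi}_{i1}}, \ldots, \frac{R_{im}}{\hat{\pi}_{im}} \right),
          \]

          where R_{ij} indicates that response j of unit i is observed and 
          \hat{\pi}_{ij} is its estimated observation probability given the history; 
          in the clustered setting these weights come from a model that also yields 
          the pairwise joint probabilities of missingness within a cluster.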
        
        Joe DiCesare
          University of Waterloo
          Estimating Diffusions with Missing Data
          In this talk the challenges associated with imputation methods for 
          general diffusion processes will be discussed. A method for imputing 
          the values of a square root diffusion process is then presented along 
          with some applications to financial data.
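          For context, the square root diffusion referred to here is usually written 
          as the stochastic differential equation (standard Cox-Ingersoll-Ross 
          notation, not taken from the talk)

          \[
          dX_t = \kappa(\theta - X_t)\, dt + \sigma \sqrt{X_t}\, dW_t ,
          \]

          a process widely used in finance, for example for short-term interest rates.
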
        Grigoris Karakoulas
          University of Toronto, Department of Computer Science
          Mixture-of-Experts Classification under Different Missing Label Mechanisms
          There has been increased interest in devising classification techniques 
          that combine unlabeled data with labeled data for various domains. There 
          are different mechanisms that could explain why labels might be missing. 
          It is possible for the labeling process to be associated with a selection 
          bias such that the distributions of data points in the labeled and unlabeled 
          sets are different. Not correcting for such bias results in biased function 
          approximation with potentially poor performance. In this paper we introduce 
          a mixture-of-experts technique that is a generalization of mixture modeling 
          techniques previously suggested for learning from labeled and unlabeled 
          data. We empirically show how this technique performs under the different 
          missing label mechanisms and compare it with existing techniques. We 
          use the bias-variance decomposition to study the effects from adding 
          unlabeled data when learning a mixture model. Our empirical results 
          indicate that the biggest gain from using unlabeled data comes from 
          the reduction of the model variance, whereas the behavior of the bias 
          error term heavily depends on the correctness of the underlying model 
          assumptions and the missing label mechanism.
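          The bias-variance decomposition referred to above is, for squared error 
          (standard form, illustrative notation),

          \[
          E\bigl[ \{\hat{f}(x) - y\}^2 \bigr]
          = \bigl\{ E[\hat{f}(x)] - f(x) \bigr\}^2
          + E\bigl[ \{\hat{f}(x) - E[\hat{f}(x)]\}^2 \bigr]
          + \sigma^2 ,
          \]

          i.e. squared bias plus variance plus irreducible noise; the finding above 
          is that unlabeled data mainly reduce the variance term.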
        Jerry Lawless
          Department of Statistics and Act. Sci., University of Waterloo
          Some Problems Concerning Missing Data in Survival and Event History 
          Analysis
          There has been considerable recent development of estimation methodology 
          for incomplete survival and event history data. This talk will discuss 
          some areas which deserve attention. They include (i) the assessment 
          and treatment of non-independent losses to followup in longitudinal 
          surveys and other studies with widely spaced inspection times, (ii) 
          the treatment of censoring, loss to followup, and delayed ascertainment 
          in observational cohort studies based on clinic databases, and (iii) 
          simulation-based model assessment, requiring simulation of the observation 
          process.
        
        Alan Lee
          Department of Statistics, University of Auckland 
          Asymptotic Efficiency Bounds in Semi-Parametric Regression Models
          We outline an extension of the Bickel, Klaassen, Ritov and Wellner theory 
          of semi-parametric efficiency bounds to the multi-sample case. The theory 
          is then applied to derive efficient scores and information bounds for 
          several standard choice-based sampling situations, including case-control 
          and two-phase outcome-dependent sampling designs.
        Roderick Little
          University of Michigan 
          Robust likelihood-based analysis of multivariate data with missing 
          values
          The model-based approach to inference from multivariate data with missing 
          values is reviewed. Regression prediction is most useful when the covariates 
          are predictive of the missing values and the probability of being missing, 
          and in these circumstances predictions are particularly sensitive to 
          model misspecification. The use of penalized splines of the propensity 
          score is proposed to yield robust model-based inference under the missing 
          at random (MAR) assumption, assuming monotone missing data. Simulation 
          comparisons with other methods suggest that the method works well in 
          a wide range of populations, with little loss of efficiency relative 
          to parametric models when the latter are correct. Extensions to more 
          general patterns are outlined. KEYWORDS: double robustness, incomplete 
          data, penalized splines, regression imputation, weighting.
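          A minimal sketch of the kind of model described (illustrative notation; 
          the details in the talk differ): with M the missingness indicator and 
          \hat{P}(X) an estimated propensity score for being observed,

          \[
          E(Y \mid X, M = 0) = s\bigl( \hat{P}(X) \bigr) + g(X; \gamma),
          \]

          where s(.) is a penalized spline in the propensity score and g is a 
          parametric function of the remaining covariates; missing values of Y are 
          then predicted or multiply imputed from this fitted regression under MAR.
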
        Don McLeish and Cynthia Struthers
          Regression with Missing Covariates: Importance Sampling and Imputation
          
          In regression, it is common for one or more covariates to be unobserved 
          for some of the experimental subjects, either by design or by some random 
          censoring mechanism. Specifically, suppose Y is a response variable, 
          possibly multivariate, with a density function f(y|x,v; θ) conditional 
          on the covariates (x,v), where x and v are vectors and θ is a vector of 
          unknown parameters. We consider the problem of estimating the parameters 
          when data on the covariate vector v are available for all observations 
          while data on the covariate x are missing for some of the observations. 
          We assume MAR, i.e., δ_i = 1 or 0 according as the covariate x_i is 
          observed or not, and E(δ_i | Y, X, V) = p(Y, V), where p is a known function 
          depending only on the observable quantities (Y, V). Variations on this problem have been 
          considered by a number of authors, including Chatterjee et al. (2003), 
          Lawless et al. (1999), Reilly and Pepe (1995), Robins et al. (1994, 
          1995), Carroll and Wand (1991), and Pepe and Fleming (1991). We motivate 
          many of these estimators from the point of view of importance sampling 
          and compare estimators and algorithms for bias and efficiency with the 
          profile estimator when the observations and covariates are discrete 
          or continuous. 
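          One simple member of this class of estimators, in the notation above, is 
          the inverse probability weighted score equation (a schematic example, not 
          a full description of the methods compared):

          \[
          \sum_{i=1}^{n} \frac{\delta_i}{p(Y_i, V_i)}\,
          \frac{\partial}{\partial \theta} \log f(Y_i \mid X_i, V_i; \theta) = 0,
          \]

          which weights each complete case by the inverse of its selection probability.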
        
        Bin Nan
          University of Michigan
          A new look at some efficiency results for semiparametric models with 
          missing data
          Missing data problems arise very often in practice. Many useful ad hoc 
          tools have been developed for estimating finite-dimensional parameters 
          in semiparametric regression models with data missing at random. Meanwhile, 
          efficient estimation has received more and more attention, especially 
          after the landmark paper of Robins, Rotnitzky, and Zhao (1994). We review 
          several examples of information bound calculations. Our main purpose is 
          to show how the general result derived by Robins, Rotnitzky, and Zhao (1994) 
          applies to different models.
        Anastesia Nwankwo
          Enugu State University 
          Missing multivariate data in banking computations
          In processing data emanating from multiple files in financial markets, 
          ranking methods are called into play if set probability indices are 
          to be maintained. Horizontal computations yield extensive evidence of missing 
          entries arising from nonresponse and other factors.
        
        James L. Reilly
          Department of Statistics, University of Auckland 
          Multiple Imputation and Complex Survey Data
          Multiple imputation is a powerful and widely used method for handling 
          missing data. Following imputation, analysis results for the imputed 
          datasets can easily be combined to estimate sampling variances that 
          include the effect of imputation. However, situations have been identified 
          where the usual combining rules can overestimate these variances. More 
          recently, variance underestimates have also been shown to occur. A new 
          multiple imputation method based on estimating equations has been developed 
          to address these concerns, although this method requires more information 
          about the imputation model than just the analysis results from each 
          imputed dataset. Furthermore, the new method only handles i.i.d. data, 
          which means it would not be appropriate for many surveys. In this talk, 
          this method is extended to accommodate complex sample designs, and is 
          applied to two complex surveys with substantial amounts of missing data. 
          Results will be compared with those from the traditional multiple imputation 
          variance estimator, and the implications for survey practice will be 
          discussed.
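          The usual combining rules referred to here are Rubin's rules: with m 
          imputed datasets giving point estimates \hat{\theta}_t and within-imputation 
          variance estimates W_t,

          \[
          \bar{\theta} = \frac{1}{m} \sum_{t=1}^{m} \hat{\theta}_t,
          \qquad
          T = \bar{W} + \Bigl( 1 + \frac{1}{m} \Bigr) B,
          \quad
          \bar{W} = \frac{1}{m} \sum_{t=1}^{m} W_t,
          \quad
          B = \frac{1}{m-1} \sum_{t=1}^{m} ( \hat{\theta}_t - \bar{\theta} )^2 ,
          \]

          and it is the total variance T that can over- or under-estimate the true 
          sampling variance in the situations the talk addresses.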
        James Robins, Professor of Epidemiology 
          and Biostatistics
          Harvard School of Public Health 
          (this talk is based on joint work with Aad van der Vaart)
        
          Application of a Unified Theory of Parametric, Semi, and Nonparametric 
          Statistics Based On Higher Dimensional Influence Functions to Coarsened 
          at Random Missing Data Models 
          
          The standard theory of semi-parametric inference provides conditions 
          under which a finite dimensional parameter of interest can be estimated 
          at root-n rates in models with finite or infinite dimensional nuisance 
          parameters. The theory is based on likelihoods, first order scores, 
          and first order influence functions and is very geometric in character 
          often allowing results to be obtained without detailed probabilistic 
          epsilon and delta calculations. 
          The modern theory of non-parametric inference determines optimal rates 
          of convergence and optimal estimators for parameters (whether finite 
          or infinite dimensional) that cannot be estimated at rate root-n or 
          better. This theory is largely based on merging minimax theory 
          with measures of the size of the parameter space, e.g., its metric entropy, 
          and makes little reference to the likelihood function for the data. 
          It often makes great demands on the mathematical and probabilistic skills 
          of its practitioners.
        In this talk I extend earlier work by Small and McLeish (1994) and 
          Waterman and Lindsay (1996) and present a theory based on likelihoods, 
          higher order scores (i.e., derivatives of the likelihood), and higher 
          order influence functions that applies equally to both the root-n and 
          non-root n regimes, reproduces the results previously obtained by the 
          modern theory of non-parametric inference, produces many new non-root-n 
          results, and, most importantly, is very geometric, opening up the ability 
          to perform optimal non-root n inference in complex high dimensional 
          models without detailed probabilistic calculation. 
          The theory is applied to estimation of functionals of the full data 
          distribution in coarsened at random missing data models. 
        Andrea Rotnitzky
          Doubly-robust estimation of the area under the receiver operating characteristic 
          curve in the presence of non-ignorable verification bias. 
          The area under the receiver operating characteristic curve (AUC) is 
          a popular summary measure of the efficacy of a medical diagnostic test 
          to discriminate between healthy and diseased subjects. A frequently 
          encountered problem in studies that evaluate a new diagnostic test is 
          that not all patients undergo disease verification because the verification 
          test is expensive, invasive or both. Furthermore, the decision to send 
          patients to verification often depends on the new test and on other 
          predictors of true disease status. In such cases, the usual estimators of 
          the AUC based on verified patients only are biased. In this talk we 
          develop estimators of the AUC of markers measured on any scale that 
          adjust for selection to verification that may depend on measured patient 
          covariates and diagnostic test results and additionally adjust for an 
          assumed degree of residual selection bias. Such estimators can then 
          be used in a sensitivity analysis to examine how the AUC estimates change 
          when different plausible degrees of residual association are assumed. 
          As with other missing data problems, due to the curse of dimensionality, 
          a model for disease or a model for selection is needed in order to obtain 
          well behaved estimators of the AUC when the marker and/or the measured 
          covariates are continuous. We describe estimators that are consistent 
          and asymptotically normal (CAN) for the AUC under each model. More interestingly, 
          we describe a doubly robust estimator that has the attractive feature 
          of being CAN if either the disease or the selection model (but not necessarily 
          both) is correct. We illustrate our methods with data from a study 
          run by the Nuclear Imaging Group at Cedars Sinai Medical Center on the 
          efficacy of electron beam computed tomography to detect coronary artery 
          disease. 
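          As a point of reference, a purely inverse-probability-weighted (not doubly 
          robust) version of such an AUC estimator can be sketched as (illustrative 
          notation, not taken from the talk):

          \[
          \widehat{\mathrm{AUC}} =
          \frac{\sum_{i \ne j} \frac{V_i V_j}{\hat{\pi}_i \hat{\pi}_j}\, D_i (1 - D_j)\, I(T_i > T_j)}
               {\sum_{i \ne j} \frac{V_i V_j}{\hat{\pi}_i \hat{\pi}_j}\, D_i (1 - D_j)} ,
          \]

          where T is the diagnostic test result, D the verified disease status, V the 
          verification indicator, and \hat{\pi}_i the estimated probability of being 
          sent to verification given the test and covariates; the doubly robust 
          estimator described above additionally incorporates a model for disease 
          status and a specified degree of residual selection bias.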
        
        Donald B. Rubin
          John L. Loeb Professor of Statistics, Department of Statistics, Harvard University
        
        Multiple Imputation for Item Nonresponse: Some Current Theory and 
          Application to Anthrax Vaccine Experiments at CDC
        Multiple imputation has become, since its proposal a quarter of a century 
          ago (Rubin 1978), a standard tool for dealing with item nonresponse. 
          There is now widely available free and commercial software for both 
          the analysis of multiply-imputed data sets and for their construction. 
          The methods for their analysis are very straightforward and many evaluations 
          of their frequentist properties, both with artificial and real data, 
          have
          supported the broad validity of multiple imputation in practice, at 
          least relative to competing methods. The methods for the construction 
          of a multiply-imputed data set, however, either (1) assume theoretically 
          clean
          situations, such as monotone patterns of missing data or a convenient 
          multivariate distribution, such as the general location model or t-based 
          extensions of it; or (2) use theoretically less well justified, fully 
          conditional "chained equations," which can lead to "incompatible" 
          distributions in theory, although these often seem to be harmless in practice. 
          Thus, there remains the challenge of constructing multiply-imputed data 
          sets in situations where the missing data pattern is not monotone or 
          the distribution of the complete data is complex in the sense of being 
          poorly approximated by standard analytic multivariate distributions. 
          An example that illustrates current work on this issue involves 
          the multiple imputation of missing immunogenicity and reactogenicity 
          measurements in ongoing randomized trials at the US CDC, which compare 
          different versions of vaccinations for protection against lethal doses 
          of inhalation anthrax. The method used to create the imputations involves 
          capitalizing on approximately monotone patterns of missingness to help 
          implement the chained equation approach, thereby attempting to minimize 
          incompatibility; this method extends the approach in Rubin (2003) used 
          to multiply impute
          the US National Medical Expenditure Survey.
         
        
        Daniel Scharfstein
          Johns Hopkins Bloomberg School of Public Health
          Sensitivity Analysis for Informatively Interval-Censored Discrete 
          Time-to-Event Data 
        
        Coauthors: Michelle Shardell, Noya Galai, David Vlahov, Samuel A. Bozzette
        In many prospective studies, subjects are evaluated for the occurrence 
          of an absorbing event of interest (e.g., HIV infection) at baseline 
          and at a common set of pre-specified visit times after enrollment. Since 
          subjects often miss scheduled visits, the underlying visit of first 
          detection may be interval censored, or more generally, coarsened. Interval-censored 
          data are usually analyzed using the non-identifiable coarsening at random 
          (CAR) assumption. In some settings, the visit compliance and underlying 
          event time processes may be associated, in which case CAR is violated. 
          To examine the sensitivity of inference, we posit a class of models 
          that express deviations from CAR. These models are indexed by nonidentifiable, 
          interpretable parameters, which describe the relationship between visit 
          compliance and event times. Plausible ranges for these parameters require 
          eliciting information from scientific experts. For each model, we use 
          the EM algorithm to estimate marginal distributions and proportional 
          hazards model regression parameters. The performance of our method is 
          assessed via a simulation study. We also present analyses of two studies: 
          AIDS Clinical Trial Group (ACTG) 181, a natural history study of cytomegalovirus 
          shedding among advanced AIDS patients, and AIDS Link to the Intravenous 
          Experience (ALIVE), an observational study of HIV infection among intravenous 
          drug users. A sensitivity analysis of study results is performed using 
          information elicited from substantive experts who worked on ACTG 181 
          and ALIVE.
        
        Alastair Scott
          University of Auckland 
          Fitting family-specific models to retrospective family data
          Case-control studies augmented by the values of responses and covariates 
          from family members allow investigators to study the association of 
          the response with genetics and environment by relating differences in 
          the response directly to within-family differences in the covariates. 
          Most existing approaches to case-control family data parametrize covariate 
          effects in terms of the marginal probability of response, the same effects 
          that one estimates from standard case-control studies. This paper focuses 
          on the estimation of family-specific effects. We note that the profile 
          likelihood approach of Neuhaus, Scott & Wild (2001) can be applied 
          in any setting where one has a fully specified model for the vector 
          of responses in a family and, in particular, to family-specific models 
          such as binary mixed-effects models. We illustrate our approach using 
          data from a case-control family study of brain cancer and consider the 
          use of conditional and weighted likelihood methods as alternatives. 
        
        Tulay Koru-Sengul
          Department of Statistics at the University of Pittsburgh
          The Time-Varying Autoregressive Model With Covariates For Analyzing 
          Longitudinal Data With Missing Values
          Researchers are frequently faced with the problem of analyzing data 
          with missing values. Missing values are practically unavoidable in large 
          longitudinal studies and incomplete data sets make the statistical analyses 
          very difficult. 
          A new composite method for handling missing values on both the outcome 
          and the covariates has been developed by combining multiple imputation 
          and stochastic regression imputation. The composite imputation method 
          also uses a new modeling approach for longitudinal data, the time-varying 
          autoregressive model with time-dependent and/or time-independent covariates. 
          The new model can be thought of 
          as a version of the transition general linear model used to analyze 
          longitudinal data. Simulation results will be discussed to compare the 
          traditional methods to the new composite method of handling missing 
          values on both the outcome and the covariates. 
          Application of the model and the composite method will be studied by 
          using a dataset from a longitudinal epidemiological study of the Maternal 
          Health Practices and Child Development Project that has been conducted 
          at the Magee-Womens Hospital in Pittsburgh and the Western Psychiatric 
          Institute and Clinic at the University of Pittsburgh Medical Center 
          Health System. 
        Jamie Stafford
          Department of Public Health Sciences, University of Toronto 
          ICE: Iterated Conditional Expectations
          The use of local likelihood methods (Loader 1999) in the presence of 
          data that are either interval censored or have been aggregated into bins 
          leads naturally to the consideration of EM-type strategies. We focus 
          primarily on a class of local likelihood density estimates where one 
          member of this class retains the simplicity and interpretive appeal 
          of the usual kernel density estimate for completely observed data. It 
          is computed using a fixed point algorithm that generalizes the self-consistency 
          algorithms of Efron (1967), Turnbull (1976), and Li et al. (1997) by 
          introducing kernel smoothing at each iteration. 
          Numerical integration permits a local EM algorithm to be implemented 
          as a global Newton iteration where the latter's excellent convergence 
          properties can be exploited. The method requires an explicit solution 
          of the local likelihood equations at the M-step and this can always 
          be found through the use of symbolic Newton-Raphson (Andrews and Stafford 
          2000). Iteration is thus rendered on the E-step only where a conditional 
          expectation operator is applied, hence ICE. Other local likelihood classes 
          considered include those for intensity estimation, local regression 
          in the context of a generalized linear model, and so on.
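          As an illustration of the underlying fixed point idea, here is a minimal 
          grid-based sketch in Python of a kernel-smoothed self-consistency iteration 
          for interval-censored data (illustrative only; it is not the symbolic 
          Newton implementation described above, and all names are invented):

          import numpy as np

          def ice_density(intervals, grid, h, n_iter=200, tol=1e-8):
              # Kernel-smoothed self-consistency estimate of a density from
              # interval-censored data, evaluated on an equally spaced grid.
              #   intervals : (n, 2) array of censoring intervals [L_i, R_i]
              #   grid      : equally spaced points covering the support
              #   h         : bandwidth of the Gaussian smoothing kernel
              dx = grid[1] - grid[0]
              # Gaussian kernel matrix K[j, k] = K_h(grid[j] - grid[k])
              K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / h) ** 2)
              K /= h * np.sqrt(2 * np.pi)
              # inside[i, k] = True if grid point k lies in interval i
              inside = (grid[None, :] >= intervals[:, [0]]) & (grid[None, :] <= intervals[:, [1]])
              f = np.full(grid.size, 1.0 / (grid[-1] - grid[0]))   # flat starting value
              for _ in range(n_iter):
                  # E-step: conditional density of the latent time given each interval
                  mass = (inside * f).sum(axis=1) * dx             # P(T in [L_i, R_i])
                  cond = inside * f / mass[:, None]
                  # smoothing step: kernel-smooth each conditional density, then average
                  f_new = (cond @ K.T).mean(axis=0) * dx
                  f_new /= f_new.sum() * dx                        # renormalise on the grid
                  if np.max(np.abs(f_new - f)) < tol:
                      return f_new
                  f = f_new
              return f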
        Mary Thompson
          University of Waterloo 
          Interval censoring of event times in the National Population Health 
          Survey
          The longitudinal nature of the National Population Health Survey allows 
          the use of event history analysis techniques in order to study relationships 
          among events. For example, let T1 and T2 be the times of becoming pregnant 
          and smoking cessation respectively. Thompson and Pantoja Galicia (2002) 
          propose a formal nonparametric test for a partial order relationship. 
          This test involves the estimation of the survivor functions of T1 and 
          T2, as well as the joint distribution of (T1, T2-T1). However, with 
          longitudinal survey data, the times of occurrence of the events are 
          interval censored in general. For example, starting at the second cycle 
          of the NPHS, it is possible to know within an interval of length at 
          most a year whether a smoker has ceased smoking with respect to the 
          previous cycle. Also information about the date of becoming pregnant 
          can be inferred within a time interval from cycle to cycle. Therefore, 
          estimating the joint densities from the interval censored times becomes 
          an important issue and our current problem. We propose a mode of attack 
          based on extending the ideas of Duchesne and Stafford (2001) and Braun, 
          Duchesne and Stafford (2003) to the bivariate case. Our method involves 
          uniform sampling of the censored areas, using a technique described 
          by Tang (1993).
        
        Chris Wild
          Dept of Statistics, University of Auckland
          Some issues of efficiency and robustness
          We investigate some issues of efficiency, robustness and study design 
          affecting semiparametric maximum likelihood and survey-weighted analyses 
          for linear regression under two-phase sampling and bivariate binary 
          regressions, especially those occurring in secondary analyses of case-control 
          data. 
        
        Grace Yi
          Dept. of Stat. and Act. Sci., University of Waterloo
          Median Regression Models for Longitudinal Data with Missing Observations
          Recently median regression models have received increasing attention. 
          The models are attractive because they are robust and easy to interpret. 
          In this talk I will discuss using median regression models to deal with 
          longitudinal data with missing observations. The inverse probability 
          weighted generalized estimating equations (GEE) approach is proposed 
          to estimate the median parameters for incomplete longitudinal data, 
          where the inverse probability weights are estimated from the regression 
          model for the missing data indicators. The consistency and asymptotic 
          distribution of the resultant estimator are established. Numerical studies 
          for the proposed method will be discussed in this talk.
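          A schematic form of such a weighted estimating equation for the median 
          (working independence, illustrative notation) is

          \[
          \sum_{i=1}^{n} \sum_{j=1}^{m_i} \frac{R_{ij}}{\hat{\pi}_{ij}}\,
          x_{ij} \Bigl\{ I\bigl( y_{ij} \le x_{ij}^\top \beta \bigr) - \tfrac{1}{2} \Bigr\} = 0,
          \]

          where R_{ij} indicates that y_{ij} is observed and \hat{\pi}_{ij} is its 
          estimated observation probability from the model for the missing data 
          indicators.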
        
         Yang Zhao
          Maximum Likelihood Methods for Regression Problems with Missing Data
        Parametric regression models are widely used for the analysis of a 
          response given a vector of covariates. However, in many settings certain 
          variable values may go unobserved, either by design or by happenstance. 
          For the case in which some covariates are missing at random (MAR), we discuss 
          maximum likelihood methods for estimating the regression parameters 
          using an EM algorithm. Profile likelihood methods for estimating variances 
          and confidence intervals are also given. The case in which the response is MAR 
          can be treated similarly.
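          In outline, writing v for the always-observed covariates, x for those 
          subject to missingness, and g(x | v) for the covariate distribution 
          (schematic notation, not necessarily that of the talk), the observed-data 
          likelihood maximized by such an EM algorithm is

          \[
          L(\theta, g) =
          \prod_{i:\, x_i\ \mathrm{observed}} f(y_i \mid x_i, v_i; \theta)\, g(x_i \mid v_i)
          \prod_{i:\, x_i\ \mathrm{missing}} \int f(y_i \mid x, v_i; \theta)\, g(x \mid v_i)\, dx ,
          \]

          with the E-step weighting candidate values of the missing covariate by 
          their conditional probability given the observed (y_i, v_i).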
        