Online Program


Informing the design of host-pathogen interaction studies when infectious disease clinical research participants consent to use of their clinical specimens and electronic health records
Brian Agan, Infectious Disease Clinical Research Program 
Grace Macalino, Infectious Disease Clinical Research Program 
Gregory Martin, Infectious Disease Clinical Research Program 
Martin Ottolini, Infectious Disease Clinical Research Program 
David Tribble, Infectious Disease Clinical Research Program 
*Kenneth J. Wilkins, Infectious Disease Clinical Research Program 

Keywords: electronic health records, host genetics, infectious diseases, missing at random, causal inference, personal health data, host-pathogen interaction

We present how the design of studies targeting host-pathogen interactions may be enhanced when research participants consent to provide samples of genetic material (host or pathogen), as well as access to components of their electronic health record (EHR). The tendency of pathogens to develop resistance to prescribed treatment continues to be a vexing problem in the clinical management of infectious diseases. While pathogen-specific factors sometimes predict treatment failure, such predictions vary widely among population subgroups. To explain a portion of this variability, host factors are examined whenever available, whether they be environmental (e.g., adherence to prescribed regimens) or genetic (e.g., genotypes having a putative association with disease). In hopes of more closely examining how host factors interact with pathogen characteristics, cohort studies often prospectively collect specimens within long-term repositories, retaining them for yet-to-be-designed future studies. When planning the study of a given host-pathogen interaction of interest, however, the question remains: how can designs be better informed by available data to make effective use of repository materials? The approach proposed here applies recent advances in statistical methodology to capitalize on available data. Specifically, extant host genetics data on a sub-cohort allow prediction of time-invariant host characteristics using missing data methods, providing a strategy for more targeted sampling from the broader cohort. Under the (likely tenable) assumption that genetic data are missing at random for the remainder of the cohort, a model-based sampling probability can be estimated using host/pathogen characteristics.
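The idea of a model-based sampling probability can be illustrated with a minimal sketch. It is not the authors' method: it assumes a deliberately simple "model" (carrier proportions within strata of observed characteristics), and every name and value in it (`ancestry`, `resistant`, `genotype`, the toy records) is hypothetical. Under missing at random, the carrier proportion seen among genotyped members of a stratum is taken as the predicted carrier probability for ungenotyped members of the same stratum, and sampling probabilities are set proportional to that prediction.

```python
from collections import defaultdict


def estimate_carrier_probability(subcohort, strata_keys):
    """Estimate P(genotype = 1 | stratum) from members whose genotype is observed."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [carriers, members]
    for person in subcohort:
        stratum = tuple(person[k] for k in strata_keys)
        counts[stratum][0] += person["genotype"]
        counts[stratum][1] += 1
    return {s: carriers / n for s, (carriers, n) in counts.items()}


def sampling_probabilities(ungenotyped, carrier_prob, strata_keys, n_to_sample, fallback):
    """Sampling probability proportional to predicted carrier probability.

    Probabilities are scaled so that they sum to n_to_sample and are capped
    at 1, so the expected sample size can fall below n_to_sample when the
    cap binds. Strata unseen in the sub-cohort use the fallback value.
    """
    predicted = [
        carrier_prob.get(tuple(p[k] for k in strata_keys), fallback)
        for p in ungenotyped
    ]
    total = sum(predicted)
    return [min(1.0, n_to_sample * q / total) for q in predicted]


# Hypothetical toy data: stratum = (ancestry group, resistant pathogen isolate).
keys = ("ancestry", "resistant")
subcohort = (
    [{"ancestry": "A", "resistant": 1, "genotype": g} for g in (1, 1, 1, 0)]
    + [{"ancestry": "B", "resistant": 0, "genotype": g} for g in (1, 0, 0, 0)]
)
ungenotyped = (
    [{"ancestry": "A", "resistant": 1} for _ in range(2)]
    + [{"ancestry": "B", "resistant": 0} for _ in range(2)]
)

carrier_prob = estimate_carrier_probability(subcohort, keys)
overall = sum(p["genotype"] for p in subcohort) / len(subcohort)
probs = sampling_probabilities(ungenotyped, carrier_prob, keys, 2, overall)
```

In a real study the stratum proportions would likely be replaced by a regression model fitted to the sub-cohort, and the sampling rule would depend on the study's goal (for example, oversampling predicted carriers among members with resistant isolates).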
Potential confounders such as treatment adherence are handled by applying causal inference techniques to cohort data supplemented with EHR components (e.g., coupling clinic visit notes with pharmacy transaction databases), which relies on the assumption of no unmeasured confounders. Infectious Disease Clinical Research Program (IDCRP) cohort members often obtain healthcare in systems with centralized management of EHR data; members often consent to their EHR components being accessed to support research. Infectious pathogens are also routinely isolated, whether from wounds in a combat trauma cohort or swabs taken from an acute respiratory infection cohort. The approach is illustrated in IDCRP sub-studies of the U.S. Military HIV Cohort.
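The abstract does not say which causal inference technique is applied. As one possible illustration, the sketch below uses inverse probability weighting (IPW) with a single binary confounder, adherence, as might be derived from pharmacy records. The field names (`adherent`, `exposed`, `failure`) and the toy counts are hypothetical, and the estimate is only interpretable under the assumption of no unmeasured confounders.

```python
from collections import defaultdict


def ipw_risk_difference(records, exposure, outcome, confounders):
    """IPW estimate of the risk difference (exposed minus unexposed).

    Propensities are estimated as exposure proportions within confounder
    strata; weighted outcome means are normalized by the sum of weights.
    Assumes both exposure groups are present in the data.
    """
    n_total = defaultdict(int)
    n_exposed = defaultdict(int)
    for r in records:
        stratum = tuple(r[c] for c in confounders)
        n_total[stratum] += 1
        n_exposed[stratum] += r[exposure]

    weighted_outcome = {0: 0.0, 1: 0.0}
    weight_sum = {0: 0.0, 1: 0.0}
    for r in records:
        stratum = tuple(r[c] for c in confounders)
        p_exposed = n_exposed[stratum] / n_total[stratum]
        a = r[exposure]
        # Probability of the exposure level this record actually has;
        # it is never zero because the record itself is in the stratum.
        p_observed = p_exposed if a == 1 else 1.0 - p_exposed
        weight = 1.0 / p_observed
        weighted_outcome[a] += weight * r[outcome]
        weight_sum[a] += weight

    risk_exposed = weighted_outcome[1] / weight_sum[1]
    risk_unexposed = weighted_outcome[0] / weight_sum[0]
    return risk_exposed - risk_unexposed


def make_group(adherent, exposed, n, failures):
    """Build n hypothetical records, the first `failures` of which fail treatment."""
    return [
        {"adherent": adherent, "exposed": exposed, "failure": 1 if i < failures else 0}
        for i in range(n)
    ]


# Hypothetical toy cohort in which non-adherent members are more often exposed.
toy = (
    make_group(1, 1, 2, 1) + make_group(1, 0, 6, 0)
    + make_group(0, 1, 6, 6) + make_group(0, 0, 2, 1)
)

exposed = [r["failure"] for r in toy if r["exposed"] == 1]
unexposed = [r["failure"] for r in toy if r["exposed"] == 0]
crude_rd = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
ipw_rd = ipw_risk_difference(toy, "exposure" if False else "exposed", "failure", ["adherent"])
```

In this toy cohort the crude risk difference overstates the association because adherence is related to both exposure and treatment failure; weighting by the inverse of the stratum-specific exposure probability recovers the within-stratum difference.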