Spatiotemporal Visualization and Modeling of Nosocomial Infections

Abstract
The Centers for Disease Control and Prevention (CDC) estimates that more than two million people contract antibiotic-resistant infections every year, and at least 23,000 die as a result of these infections in the U.S. alone. Traditionally, tracking hospital outbreaks with drug-resistant pathogens focuses on transmission chains of infected or colonized patients as the reservoir for organisms to be transferred to new patients via healthcare workers, but it has become increasingly recognized that non-patient reservoirs within the hospital may play a larger role than previously realized in acting as a niche for the transmission of drug-resistant pathogens. Non-patient sources for pathogen acquisition may require incorporating environmental culture data into existing transmission models. However, the number of risk factors, potential interactions and inherent complexity of the data continue to increase, and thus, exploratory analysis is required to aid in knowledge discovery. Interactive visualization of these data over space and time enables exploration and hypothesis generation to better inform transmission models. This thesis presents an interactive visualization system for the analysis of spatiotemporal environmental and patient data to aid in understanding nosocomial infection. Interactive dashboards allow users to view patient movement through hospital environments while overlaying multivariate environmental microbiological data as it evolves over time. Furthermore, a multivariate logistic regression model is constructed to understand the factors associated with sink contamination. The results show that temporal factors, including the presence of infected patients in the past 14 days and use of interventions in the past 7 days, and spatial factors, including the presence of infected patients in adjacent rooms and the presence of contaminated sinks in adjacent rooms, are significant factors in sink contamination.
List of Abbreviations
Positive Patient in last 7 days
P lag14: Positive Patient in last 14 days
P lag21: Positive Patient in last 21 days
P lag28: Positive Patient in last 28 days
P 1: Patient Status in adjacent rooms that share common plumbing
P 2: Patient Status in adjacent rooms that do not share common plumbing
S 1: Sink Status in adjacent rooms that share common plumbing
S 2: Sink Status in adjacent rooms that do not share common plumbing
I 3: Intervention in the last 3 days

in Bakersfield, California [2].
Patient-to-patient transmission of carbapenem-resistant Enterobacteriaceae (CRE) through the hands of healthcare personnel is the main transmission route in healthcare settings. However, this traditional route of nosocomial infection transmission does not fully account for the observed cases [3]. Environmental reservoirs, particularly sinks, may play a major role in transmission [4][5][6]. It is therefore important to understand the risk factors associated with the spread of this infection and to develop an approach to prevent it.

Environmental Risk Factors
A study from Spain [4] described an outbreak due to multidrug-resistant Klebsiella oxytoca in an ICU where damp environmental reservoirs were linked to bacterial transmission. Of the samples collected from sink drainpipes and traps, only one storage sink, which had its drainpipes connected to two other sinks, was positive. The connecting drainpipes were also positive. Furthermore, the outbreak was completely eradicated after the horizontal drainage system that connected the two impacted sinks was replaced. The study concluded that wet environmental reservoirs should be considered when strictly applied traditional control measures are not efficacious.
A study from France [6] found that sinks in ICUs were frequently contaminated as a result of their use in disposing of patient bodily fluids and were a potential source of extended-spectrum beta-lactamase-producing Enterobacteriaceae (ESBLE), thus increasing the risk in the patient environment through the splash-back effect. Recent research from Wolf et al. [5] demonstrated that sinks acted as a source of infection by verifying that the ESBLEs recovered from patients were identical to those that had previously been recovered from sinks. A Colombian study [9] of an outbreak found that the likely cause of the infections was the improper design of sinks; in particular, the joints between the sinks and the walls were not sealed, which facilitated colonization.

Logistic Regression Modeling in Infection Transmission
A previous study [10] used exposure to specific infected patients in a logistic regression model, treating spatial variables as factors in explaining the efficacy of infection control measures in preventing the transmission of multidrug-resistant tuberculosis. This study found that exposure distance is a significant predictor of nosocomial transmission. The logistic regression approach has also been used in other studies [11,12], which showed that distance to the infected room was one of the significant predictors. One of the models showed that proximity to sinks is important in predicting infection. However, that model was constructed using limited data over a limited time range, and further research into the role of sinks in infection transmission is required. The most recent work in this area included the development of imputation methods [14] that extrapolate the time series of sink status to fill any gaps between two time periods. This work also found that the cumulative presence of positive patients in the same room as a sink, the distance from the bed to the sink, and sink design are significant predictors of sink positivity.
Studies [4][5][6][9] have demonstrated that sinks play a role in infection transmission but did not highlight additional environmental risk factors responsible for sink contamination. This thesis improves the understanding of sink contamination by highlighting important variables. This study is similar to [13,14] in terms of the modeling approach used, but it differs in the level of spatiotemporal variables derived from environmental data. In addition to examining the presence of positive patients in the same room, this study also considers the status of neighboring rooms and sinks as potential risk factors.

Spatiotemporal Data Visualization
Visualization is important when analyzing spatiotemporal data because it can help humans discover complex relationships in the data. Formally, spatiotemporal data are defined as high-dimensional records in which the dimensions can be classified into three components: spatial (geographical coordinates), temporal (time stamps) and attributes (patient status) [15]. The objective of analyzing this type of data is to study a process in the context of both space and time. Visualization techniques use the power of the human visual system to decode visual representations of data, which makes it possible to explore multivariate datasets effectively and to arrive at a list of potential variables that may not otherwise be easily identifiable. While previous studies [16,17] focus on highlighting and summarizing temporal events, this thesis goes further by adding a spatial element to the visualization, such as hospital units, floors and patient rooms. Additionally, previous studies [18][19][20] utilize GIS for spatiotemporal visualization, whereas this thesis uses detailed views of floor plans to project patient and environment status.

Chapter 3
Problem, Objective, and Contribution

Problem Description
There have been ongoing multi-species CRE outbreaks at a major medical center in the United States. The conventional route of infection transmission through patient-to-patient contact does not fully account for all the cases of patient acquisition.
Samples from sinks (drain and p-trap) have been found to be positive for CRE.
Despite aggressive interventions, the bacterial strain continued to reappear, making it unclear whether other environmental or patient-specific risk factors could be involved in continued sink contamination.

Objective
The objective of this research is to identify significant risk factors for understanding sink contamination through interactive visualization incorporating multidimensional patient and environmental data. We aim to accomplish this goal through the use of predictive models that incorporate spatial and temporal risk factors.

Spatiotemporal Visualization Metrics
Evaluating a visualization is a difficult task, but there are a few guidelines that one can follow based on the research of [21]. One of the methods that the authors suggest is a case study approach, in which the visualization is tested on a new dataset that is distinct from the one used to build it. We employ a similar method: the visualization is tested on the most recent data to determine whether it can perform all the animations and display the dashboards without failing.

Infection Risk Modeling
Models will be evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity (the proportion of positive cases correctly identified) and specificity (the proportion of negative cases correctly identified).
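The four evaluation measures can be illustrated with a minimal sketch. The labels and scores below are made-up values for illustration only, the function names are hypothetical, and the AUC is computed with the pairwise (rank-based) definition rather than whatever routine was used in the thesis. The sketch assumes both classes are present in the data.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity as proportions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # positives correctly identified
        "specificity": tn / (tn + fp),  # negatives correctly identified
    }


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half); this equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = positive sink, 0 = negative sink.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy, sensitivity and specificity depend on the chosen cut-off, whereas the AUC summarizes performance across all cut-offs.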

Contribution
This dissertation has analytically investigated risk factors responsible for contamination of non-patient reservoirs in hospital settings using interactive spatiotemporal visualization. We construct a model that shows

Chapter 4
Data Collection and Transformation
The data sources for this study can be broadly divided into three categories: environmental data, patient data, and geospatial data. The data were extracted for the period from September 2013 to February 2016. For our analysis, we selected a minimum observation interval of one day for both the environmental and patient datasets because that is the smallest interval for which records exist.

performed were drains and p-traps. Previous studies [13,14] have focused on using sink status as one of the environmental variables for modeling purposes. We decided to use the two most common environmental sampling sites, namely, drain and p-trap, for our visualization because these sites could provide additional information regarding sink contamination or patient acquisition. As shown in Figure 4.1, the environmental sampling was not performed on a regular basis, and there are noticeable gaps in the sampling over the study period. We used the midpoint imputation method described in [14] to ensure that the series is consistent. However, based on feedback obtained from ICPs, we adjusted the imputation methodology so that a consecutive positive-negative-positive culture sequence found within one month is converted to positive-positive-positive. The rationale for the adjustment is that, because the period is so short, the negative observation between the two positives could have been an environmental sampling error.

It is observed that most positive patients in the study visited the ED or anesthesia facility at least once during their stay, which is why we observe that the number of
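The midpoint imputation and the positive-negative-positive adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from [14] or from this thesis: midpoint imputation is taken to mean that each unsampled day inherits the status of the nearer sampling date (the status switches at the midpoint of the gap), "one month" is approximated as 31 days, and the function names and sample dates are hypothetical.

```python
from datetime import date, timedelta


def smooth_pnp(samples, window_days=31):
    """Turn positive-negative-positive runs into all positives when the
    two positives are at most window_days apart (assumed 'one month').
    samples: date-sorted list of (date, status) for one sampling site."""
    fixed = list(samples)
    for i in range(1, len(samples) - 1):
        d_prev, s_prev = samples[i - 1]
        d_mid, s_mid = samples[i]
        d_next, s_next = samples[i + 1]
        short_span = (d_next - d_prev).days <= window_days
        if s_prev == 1 and s_mid == 0 and s_next == 1 and short_span:
            fixed[i] = (d_mid, 1)
    return fixed


def midpoint_impute(samples):
    """Build a daily series: days up to the midpoint of a gap keep the
    earlier status, days after it take the later status."""
    daily = {}
    for (d0, s0), (d1, s1) in zip(samples, samples[1:]):
        gap = (d1 - d0).days
        for offset in range(gap):
            day = d0 + timedelta(days=offset)
            daily[day] = s0 if offset * 2 <= gap else s1
    last_day, last_status = samples[-1]
    daily[last_day] = last_status
    return daily


# Hypothetical cultures for one sink drain (1 = positive, 0 = negative).
raw = [
    (date(2014, 1, 1), 0),
    (date(2014, 1, 11), 1),
    (date(2014, 1, 15), 0),  # negative between two positives, 10-day span
    (date(2014, 1, 21), 1),
]
cleaned = smooth_pnp(raw)
daily = midpoint_impute(cleaned)
```

In this example the negative culture on 15 January is treated as a sampling error, and the status between 1 and 11 January switches from negative to positive after the midpoint of the gap.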

Data Transformations
To arrive at the variables for our analysis, we created the following transformations based on the existing variables in the dataset:
• CRE after 48 hours: This variable is based on the difference between the first positive date and the admission date. If the difference between the two dates is greater than 2 days, the variable receives a value of 1; otherwise, it is coded as 0.
• Extending patient series to daily: The patient bed transfer table has one record for each time an admitted patient moves into a room. The variables in dttm and out dttm denote a patient's entry into and exit from the room, respectively. Because our modeling method uses daily records for every patient, we extended the series for every patient to a daily interval. The purpose is to test whether the same/adjacent patient or adjacent environmental attributes play a role in sink contamination. The following variables were created: adj dr1 status, adj dr2 status, adj dr3 status, and adj dr4 status, which represent the status of the sink drain in the first adjacent room, the second adjacent room, the room next to the first adjacent room and the room next to the second adjacent room, respectively, during the same time as the patient under consideration. Similarly, adj pt1 status, adj pt2 status, adj pt3 status, and adj pt4 status represent the p-trap status in the same adjoining rooms. We also derived a rolled-up list of variables for adjacent sinks (rather than separate drain and p-trap variables) for use in case we did not find any significant difference between the drain and p-trap attributes.
• Lag variables for sinks: Lag variables for sinks were also created, similar to the room variables, to determine whether such variables play a role in sink contamination. The sink lag variables are as follows: adj sink status lag7, adj sink status lag14, adj sink status lag21 and adj sink status lag28.
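The transformations in the list above can be sketched in a few lines. The function names, the example dates and the choice to exclude the current day from a lag window are assumptions made for illustration; the thesis does not state how its lag windows are bounded, and entry and exit timestamps are reduced to calendar dates here.

```python
from datetime import date, timedelta


def cre_after_48h(admission, first_positive):
    """1 if the first positive date is more than 2 days after admission."""
    return 1 if (first_positive - admission).days > 2 else 0


def expand_to_daily(room, in_date, out_date):
    """One (day, room) record per day of a stay, entry and exit inclusive."""
    n_days = (out_date - in_date).days
    return [(in_date + timedelta(days=k), room) for k in range(n_days + 1)]


def lag_flag(daily_status, day, lag_days):
    """1 if any of the lag_days days before `day` was positive.
    daily_status maps date -> 0/1; missing days count as negative.
    Assumption: the window excludes `day` itself."""
    window = (day - timedelta(days=k) for k in range(1, lag_days + 1))
    return int(any(daily_status.get(d, 0) == 1 for d in window))


# Hypothetical patient admitted 1 Jan with a first positive culture on 4 Jan.
flag = cre_after_48h(date(2014, 1, 1), date(2014, 1, 4))

# A three-day stay in room 3188 becomes three daily records.
stay = expand_to_daily("3188", date(2014, 1, 1), date(2014, 1, 3))

# Hypothetical adjacent-sink series with a single positive day on 1 Jan.
adjacent_sink = {date(2014, 1, 1): 1}
lag7 = lag_flag(adjacent_sink, date(2014, 1, 9), 7)
lag14 = lag_flag(adjacent_sink, date(2014, 1, 9), 14)
```

With these definitions, a positive day eight days earlier falls outside the 7-day window but inside the 14-day window, which is what distinguishes the lag variables from one another.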

Visualization Design Requirements
To be able to effectively explore multivariate data, we identify a set of objectives that the spatiotemporal visualization should be able to achieve. Because our study focuses on understanding how adjacent environmental sites around a patient change over the course of time, we need a design that can integrate patient and environmental microbiological data and be able to show us views and summary statistics over

Visualization Interface Components
The key components that a user is able to interact with in this system are as follows:
• Floor plans: The addition of floor plans to the visualization provides spatial context to the user. It also provides a canvas on which the user is able to see the results of queries.
• Time selection: It allows the user to specify temporal constraints for patient and environmental data. This takes the form of a slider whose two ends the user can drag to indicate the 'start' and 'end' times.
• Unit: It allows the user to specify any unit(s) of the hospital and display the visualization or summary charts.
• Status: The user can choose to see only infected patients/environmental sites.
• Source type: Users can select drain or p-trap source types.
• Species: It allows the user to choose species for any environmental site.

Visualization Dashboards
Dashboards are an important component of spatiotemporal visualization. Dashboards created in Tableau software [23] integrate different views on a single page, tied together by interface components. We show some of the dashboards created in this study for exploring the data and arriving at hypotheses.

For each room, the dashboard shows two rows per sampling site. In the figure, it can be observed that for room 3188, the actual sampling is on the top row, where the green or red squares represent the status of sampling on that date. The bars corresponding to the imputed sampling series are on the second row with the label 'Impute'. The imputed series is calculated using the midpoint method recommended in the most recent work [14].

to be on the dashboard main menu with the understanding that these could be a potential group or cluster that could explain patient risk factors.

Chapter 6
Modeling Methodology and Results

Variable Selection
The environment sampling dataset used for this analysis was converted into a daily time series of all sinks in STBICU and MICU for the period September 2013 through February 2016. Thus, each day represented a sampling datapoint that showed the status of a sink along with the predictor variables. The sink status has a value of "1" for positive and "0" for negative. To arrive at a potential list of variables for modeling consideration, we used the spatiotemporal visualization module (described in Chapter 5). We also considered the variables used in the previous research on the subject [14]. A complete list of the variables considered for the purposes of modeling is presented in Table 6.1.
We performed stepwise regression for variable selection using all the variables from

Logistic Regression Modeling
We performed a logistic regression using the variables from the final model in Section 6.1. The response variable represents sink status, which has a value of 0 (negative) or 1 (positive). We perform the regression using 10-fold cross-validation on the dataset, and the results from the 10 folds are averaged to produce a single estimate. One advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. The parameter estimates are shown in Table 6.3.
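The fold structure of 10-fold cross-validation can be sketched as follows. This is a generic illustration, not the software used in the thesis: `k_fold_indices`, `cross_validate` and the `fit_and_score` callback are hypothetical names, and the callback stands in for fitting the logistic regression on the training rows and scoring it on the validation rows.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.
    Every observation lands in exactly one validation fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


def cross_validate(n, fit_and_score, k=10, seed=0):
    """Average the per-fold scores into a single estimate.
    fit_and_score(training, validation) is a caller-supplied function
    that fits a model on the training rows and returns a score on the
    validation rows."""
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# Check the property stated in the text on a toy dataset of 25 rows.
splits = list(k_fold_indices(25, k=10))
validated = sorted(idx for _, val in splits for idx in val)
```

Here `validated` contains each row index exactly once, which is the sense in which every observation is used for validation once and for training in the remaining nine folds.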
To interpret the estimates from Table 6.3, we express them in terms of the odds ratio with a 95% confidence interval. The odds ratio [24] compares the odds of an event occurring with and without a given exposure. Thus, for the purposes of this study, the odds ratio signifies the multiplicative change in the odds of a sink becoming positive given the presence of a variable. To arrive at the confidence interval associated with

Interventions carried out 0.50 0.45-0.56
Table 6
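As an illustration of how a logistic regression coefficient is converted to an odds ratio with a 95% confidence interval, the sketch below uses the standard Wald construction: exponentiate the coefficient and the endpoints of the coefficient's interval. The function name is hypothetical, and the coefficient and standard error are assumed values chosen so that the result lands near the 0.50 (0.45-0.56) entry for interventions; they are not the fitted values from the thesis.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a logistic
    regression coefficient (beta) and its standard error (se).
    z = 1.96 corresponds to a 95% interval."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper


# Assumed inputs for illustration only (not taken from Table 6.3).
beta = math.log(0.5)  # a coefficient of about -0.693
se = 0.056            # hypothetical standard error

odds_ratio, lower, upper = odds_ratio_ci(beta, se)
```

An odds ratio below 1 with an interval that excludes 1, as in this example, indicates that the variable is associated with lower odds of a sink becoming positive.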

Random Forest Model
To compare the variables and accuracy obtained through logistic regression, we construct a random forest model. The model is run with 10-fold cross-validation, using 500 trees with a maximum node size of 5. The results of the model are shown in Figure 6

The confusion matrix based on the random forest model is shown in Table 6.8.

with the odds of just 1.13-fold for rooms that do not share common plumbing. Previous research [25] has shown that biofilms found in sinks were linked to outbreaks.
Some studies [26,27] have also shown that these biofilms are resistant to traditional disinfection methods. Our results indicating the significance of adjacent rooms that share common plumbing can be explained by the probable presence of biofilms in the sink drain walls, p-traps or drainpipes connecting the two room sinks. Despite the timely implementation of intervention strategies, it is likely that the presence of biofilms contributed to the adjacent sinks becoming positive.
Another spatial factor that we found to be significant in the model is the presence of a positive patient in the adjacent room. The model results show that the odds of sink positivity increase by 1.13-1.70 fold when the adjacent rooms have a positive patient. This is expected, given that positive patients in the adjacent room would use that room's sink and are therefore likely to contaminate the sink of the room in which they are staying.
Results from the random forest model show a slight increase in accuracy (81% vs. 80%) compared to logistic regression. Both models show an identical true positive rate, which is important in our case because the cost of misclassifying a positive sink as negative is higher than that of falsely classifying a negative sink as positive.
Additionally, the random forest model has a higher specificity than logistic regression. As a result, the area under the curve (AUC) is higher for the random forest (shown in Figure 6.3).

Limitations and Future Work
A fair amount of work has been devoted to recording cultures from environmental sampling. However, much of this work has been manual and maintained in Excel spreadsheets. Any transcription errors made when recording the actual environmental sampling results in the spreadsheet could shift the modeling results. Any changes to the outcome of the sampling made at a later stage and not updated in the environment spreadsheet, or anything missing from the environment database, could also affect the results.
Furthermore, we found that some of the individual patient records provided to us mixed patient identifiers (PIDs) and medical record numbers (MRNs) for the control population. We have tried our best to reconcile the data at our end to ensure that the PIDs were converted to MRNs, but further research should ensure that such cases are addressed.
Our analysis considered only sinks as the environmental reservoirs for modeling purposes. Other environmental attributes, such as toilets, hoppers, and showers, were not considered. Future modeling attempts should consider these attributes to show the extent of infection transmission contribution through such reservoirs.
Our modeling includes interventions as one of the key variables. The information provided to us regarding interventions was at the unit level for specific dates. Hence, we performed this analysis with the assumption that interventions were implemented for all rooms within a unit. However, we later learned that some of the room sinks acted as 'control' sinks and were not subjected to the interventions.
For feature selection, we used the Akaike information criterion (AIC) as the sole criterion for arriving at the final model. While there are many advantages to using AIC, it also has disadvantages, including that it evaluates only those candidate models that are pre-specified before analysis.
The spatiotemporal visualization tool currently incorporates patient and environment data. We have brought both datasets together to provide a combined view for analysis. However, the tool does not include provider movement and contact information. Future work on improving the tool should integrate the provider dataset. Additionally, we have done our best to provide users with an optimized view of patient transfer history once the user makes a patient selection. However, when a patient has made a large number of visits and transferred through many rooms, the initial view can appear cluttered with a considerable amount of information. This occurs when the patient has made multiple visits over more than 3 years. One approach to improving the display would be to shorten the visit duration shown and focus on a smaller time period.