RESEARCH AND PRACTICE
Karen L. Olson and Kenneth D. Mandl are with the Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Children's Hospital Boston, Boston, Mass; the Division of Emergency Medicine, Children's Hospital Boston; and the Department of Pediatrics, Harvard Medical School, Boston. Shaun J. Grannis is with Regenstrief Institute Inc and the Indiana University School of Medicine, Indianapolis.
Correspondence: Requests for reprints should be sent to Karen L. Olson, PhD, Informatics Program, Children's Hospital Boston, 1 Autumn St, Box 721, Boston, MA 02215 (e-mail: karen.olson{at}childrens.harvard.edu).
ABSTRACT
Objectives. Patient data that include precise locations can reveal patients' identities, whereas data aggregated into administrative regions may preserve privacy and confidentiality. We investigated the effect of varying degrees of address precision (exact latitude and longitude vs the center points of zip code areas or census tracts) on detection of spatial clusters of cases.
Methods. We simulated disease outbreaks by adding supplementary spatially clustered emergency department visits to authentic hospital emergency department syndromic surveillance data. We identified clusters with a spatial scan statistic and evaluated detection rate and accuracy.
Results. More clusters were identified, and clusters were more accurately detected, when exact locations were used. That is, these clusters contained at least half of the simulated points and involved few additional emergency department visits. These results were especially apparent when the synthetic clustered points crossed administrative boundaries and fell into multiple zip code areas or census tracts.
Conclusions. The spatial cluster detection algorithm performed better when addresses were analyzed as exact locations than when they were analyzed as center points of zip code areas or census tracts, particularly when the clustered points crossed administrative boundaries. Use of precise addresses offers improved performance, but this practice must be weighed against privacy concerns in the establishment of public health data exchange policies.
INTRODUCTION
Although there is compelling justification to accurately monitor clinical data for public health purposes, it is important to protect identifiable patient information. The Privacy Rule of the Health Insurance Portability and Accountability Act14 requires that disclosed health information be restricted to the minimum necessary to satisfy its intended purpose. The minimum amount of information necessary for effective syndromic surveillance has not been well investigated. However, the issue has been explored in the context of cancer surveillance. A recent study revealed few differences when late-stage breast and prostate cancer results were compared for different area-specific units (town, census tract, block group) and exact coordinates (the study's objective was not to search for small area clusters).15 An earlier study showed that small clusters did not characterize breast cancer incidence rates in the region assessed.16
The current practice in syndromic surveillance, in which there is great interest in detecting small, localized clusters, is to store patient locations as either latitude and longitude coordinates of home addresses or, more commonly, as points within administrative regions such as zip code areas or census tracts. The latter practice presumably results in patients being less identifiable as individuals, although the extent of anonymity is certain to vary.17,18 A recent study using simulated risk data showed that, even when anonymity is ensured, assigning individuals to census tracts results in maps that do not accurately portray disease risk.19
The goal of this study was to investigate the effects of blurring identifiable patient data by converting a patient's home address from an exact location to a regional centroid. We assessed outbreak detection by adding synthetic, spatially clustered emergency department visits to authentic background hospital emergency department surveillance data, creating semisynthetic data.20 The clusters were placed in a region densely populated by patients. In previous work, we found that small clusters near hospitals were difficult to detect.21 Yet, one goal of a real-time surveillance system is to detect unusual events early, possibly when only a few individuals have been affected. Depending on the nature of the outbreak, early detection may be critical in minimizing morbidity and mortality.
We used a spatial scan statistic22,23 to determine whether the simulated clusters could be detected. Pilot work indicated that this metric would detect relatively small, compact clusters in the present data. We examined 2 dimensions of cluster detection. One was detection rate, defined simply as the percentage of the semisynthetic data sets containing clusters detected by the spatial scan statistic. The other was accuracy, which we assessed by comparing characteristics of detected clusters with characteristics of simulated clusters. Transferring addresses to the centroids of administrative regions might increase detection rates by essentially amplifying clusters when many cases are concentrated at a single point. By contrast, detection might be more difficult in this case because not only would the simulated cluster points be concentrated, so would points from the background emergency department data.
METHODS
ArcGIS 9.0 (Environmental Systems Research Institute Inc, Redlands, Calif) was used to geocode (convert to latitude and longitude coordinates) the home address of each patient, and addresses were mapped to census tract and zip code regions defined by the US Census Bureau. We used XTools Pro (Data East LLC, Novosibirsk, Russia) to calculate the centroid of each region included in the study. The final data set included visits made by patients living within 80 km (50 mi) of the hospital (38 122 visits; 90% of all cases meeting the criteria for respiratory illness). Patient densities were higher closer to the hospital.21 Among the patients included, 3806 (10%) lived 0 to 2 km from the hospital, and 18 634 (49%) lived within 2 to 8 km. Simulated cluster points were inserted into the 2- to 8-km band.
When addresses were converted from their exact locations to centroids, the distance from the original location to a zip code centroid (mean = 1.37 km, SD = 1.03, maximum = 12.39) was greater than the distance to a census tract centroid (mean = 0.64 km, SD = 0.68, maximum = 7.80). The same was true of the band containing the simulated cluster points; the average distance to a zip code centroid was 0.96 km (SD = 0.49), and the average distance to a census tract centroid was 0.39 km (SD = 0.30).
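The displacement introduced by moving an address to a centroid can be illustrated with a short sketch. This is not the ArcGIS or XTools Pro workflow used in the study; it assumes a haversine great-circle distance with a mean Earth radius of 6371 km, and the function names and the input layout (points tagged with a region identifier) are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def displacement_summary(points, centroids):
    """Summarize how far each address moves when replaced by its centroid.

    points: list of (lat, lon, region_id) tuples (hypothetical layout)
    centroids: dict mapping region_id -> (lat, lon)
    Returns (mean, SD, maximum) of the displacement in km.
    """
    distances = [haversine_km(lat, lon, *centroids[region])
                 for lat, lon, region in points]
    return (statistics.mean(distances),
            statistics.stdev(distances),
            max(distances))
```

Running `displacement_summary` once with zip code centroids and once with census tract centroids would yield the kind of mean, SD, and maximum values reported above, given suitable input data.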
Simulated Clusters
We created 2 sets of simulated disease clusters, one for zip code analyses and one for census tract analyses. We added these clusters to the baseline data to test the effect of moving a point from its exact location to the center of each respective administrative region. We selected cluster parameters that would mimic an early signal of an outbreak first appearing as a small geographic cluster. All simulated clusters contained 10 points and were located along the edge of a circle with a radius of 5 km centered at the hospital, as illustrated in Figure 1.
Points from a single cluster may reside in more than 1 administrative region. As a means of testing the effects of cluster points crossing administrative boundaries when these points were analyzed as centroids, the 10 points that made up each cluster were selected so that they fell into a total of 1, 2, 3, or 4 administrative regions. By design, these points were distributed as evenly as possible when they fell into more than 1 region, because this pattern was considered most difficult to detect. For example, when points were dispersed into 2 regions, 5 points were included in each region.
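The "as evenly as possible" allocation can be made concrete with a small sketch. The helper name `even_split` is hypothetical and is not taken from the cluster creation tool used in the study; it simply spreads a point count across regions so that no two regions differ by more than 1 point.

```python
def even_split(n_points, n_regions):
    """Split n_points across n_regions so counts differ by at most 1."""
    base, extra = divmod(n_points, n_regions)
    # The first `extra` regions receive one additional point.
    return [base + 1] * extra + [base] * (n_regions - extra)


for k in (1, 2, 3, 4):
    print(k, even_split(10, k))
```

For 10 points this gives 10; 5 and 5; 4, 3, and 3; and 3, 3, 2, and 2 for 1 to 4 regions, respectively.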
Simulated clusters varied on 2 parameters: radius size and dispersion across administrative boundaries. To allow selection of 5 samples for each radius size and dispersion value, we initially created 17 280 simulated clusters. The underlying geography of the study region affected the range of these parameters. For example, simulated clusters with a 0.5-km radius that included all 10 points in a single zip code area could be readily obtained. There were 1603 initial clusters with these characteristics, and 5 were randomly sampled. However, no initial clusters with a radius of 2 or 3 km included 10 points within a single census tract. Therefore, these 2 radius sizes were not analyzed for census tract regions.
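A minimal sketch of generating one candidate cluster follows. It is not the software tool used in the study; it assumes planar coordinates in km relative to the hospital and assumes that the 10 points are spread uniformly within the cluster radius, neither of which is stated in the text. The function name and parameters are hypothetical.

```python
import math
import random


def simulate_cluster(radius_km, angle_deg, rng, n_points=10, ring_km=5.0):
    """Place a cluster center on a ring around the hospital and scatter points.

    Coordinates are km east/north of the hospital (planar approximation).
    Points are drawn uniformly over the disk of the given radius (assumption).
    """
    theta = math.radians(angle_deg)
    cx, cy = ring_km * math.cos(theta), ring_km * math.sin(theta)
    points = []
    for _ in range(n_points):
        # sqrt of a uniform draw gives uniform density over the disk area
        r = radius_km * math.sqrt(rng.random())
        phi = 2 * math.pi * rng.random()
        points.append((cx + r * math.cos(phi), cy + r * math.sin(phi)))
    return (cx, cy), points


rng = random.Random(0)  # fixed seed so the sketch is repeatable
center, pts = simulate_cluster(radius_km=0.5, angle_deg=30.0, rng=rng)
```

Candidate clusters generated this way would then be assigned to administrative regions and kept only if their points fell into the required number of regions, which is why some radius and region combinations could not be sampled.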
Cluster Detection Test Sets
We added simulated cluster points to authentic emergency department data for the initial target date, June 23, 2002, and the preceding 6 days. This single week of data, which contained the simulated outbreak, was compared with the previous 6 weeks of baseline data. The target date then increased by 5 days. This procedure was repeated until the final target date, June 22, 2005, yielding 220 data sets with which to test each simulated cluster.
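The schedule of target dates can be checked with a brief sketch. The function names are hypothetical, and the window boundaries encode one reading of the text: the case week is the target date plus the preceding 6 days, and the baseline is the 42 days immediately before that week.

```python
from datetime import date, timedelta


def target_dates(first, last, step_days=5):
    """List target dates from first to last (inclusive) in fixed steps."""
    dates = []
    current = first
    while current <= last:
        dates.append(current)
        current += timedelta(days=step_days)
    return dates


def windows(target):
    """Return ((baseline_start, baseline_end), (case_start, case_end))."""
    case_start = target - timedelta(days=6)          # 7-day case week
    baseline_end = case_start - timedelta(days=1)
    baseline_start = case_start - timedelta(days=42)  # 6 weeks of baseline
    return (baseline_start, baseline_end), (case_start, target)


dates = target_dates(date(2002, 6, 23), date(2005, 6, 22))
print(len(dates), dates[-1])
```

Stepping by 5 days from June 23, 2002, lands exactly on June 22, 2005, after 219 steps, which matches the 220 data sets described above.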
To evaluate spatial cluster detection rates in actual emergency department data when no simulated clusters were added, we prepared additional data sets to compare encounters from each target date and the previous 6 days with encounters from the preceding 6 weeks. Although these rates may have reflected previously undetected spatial clustering in the background emergency department data, we treated them as false-positive events.
Cluster Detection
We used a spatial scan statistic22 implemented in the SaTScan program26 to detect spatial clustering. SaTScan creates circles of various sizes around each point and evaluates whether location inside as opposed to outside a given circle is associated with a higher risk of classification as a case (as defined subsequently). For each data set, the program identified the most likely clusters and assigned P values on the basis of 999 Monte Carlo replications. When the P value was less than .05, the presence of clustering was assumed. The output from SaTScan included information regarding individual points contained in each cluster. Consequently, it was possible to compare features of the simulated cluster with features of the most likely cluster identified by SaTScan.
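How a P value arises from 999 replications can be sketched with the usual rank-based Monte Carlo rule. This is an illustration under the assumption that the observed test statistic is ranked among the replicate statistics; it is not SaTScan code, and the function name is hypothetical.

```python
def monte_carlo_p(observed, replicates):
    """Rank-based Monte Carlo P value.

    observed: test statistic for the real data
    replicates: statistics from data sets generated under the null hypothesis
    """
    # Rank 1 means the observed statistic exceeds every replicate.
    rank = 1 + sum(1 for value in replicates if value >= observed)
    return rank / (len(replicates) + 1)
```

Under this rule, 999 replications give a smallest attainable P value of .001, and a result falls below .05 only when fewer than 49 replicate statistics equal or exceed the observed one.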
SaTScan was configured to detect purely spatial clusters with a Bernoulli (case-control) model. Cases were defined as all encounters in each data set that occurred during the final (seventh) week assessed. Controls were defined as all encounters in each data set that occurred during the first 6 weeks, during which time it was assumed that no spatial clustering took place. This assumption could not be verified, however, because there was no documentation of known clusters of respiratory cases in the present data. We ran SaTScan 35 200 times (80 simulated clusters × 220 data sets × 2 levels of address precision) to assess the effect of moving a point from its exact location to a zip code centroid and 17 600 times (40 × 220 × 2) to assess the effect of moving a point from its exact location to a census tract centroid.
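The quantity evaluated for each candidate circle under a Bernoulli model can be sketched as a log likelihood ratio, following the general form of the spatial scan statistic. This is an illustrative sketch, not SaTScan's implementation; the function and argument names are hypothetical, with cases being final-week encounters and all other points being controls.

```python
import math


def _xlogy(x, y):
    """x * log(y), defined as 0 when x is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log likelihood ratio for one circle under a Bernoulli model.

    c: cases inside the circle      n: all points inside the circle
    C: cases in the whole data set  N: all points in the whole data set
    Returns 0 unless the case proportion inside exceeds that outside.
    """
    if n == 0 or n == N:
        return 0.0
    p_in = c / n
    p_out = (C - c) / (N - n)
    if p_in <= p_out:
        return 0.0
    p_null = C / N
    log_alt = (_xlogy(c, p_in) + _xlogy(n - c, 1 - p_in)
               + _xlogy(C - c, p_out)
               + _xlogy((N - n) - (C - c), 1 - p_out))
    log_null = _xlogy(C, p_null) + _xlogy(N - C, 1 - p_null)
    return log_alt - log_null
```

The circle with the largest value is the most likely cluster, and its significance is judged against the same maximum computed on randomized data sets.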
Other Statistical Analyses
All other statistical analyses were performed with SAS (SAS Institute, Cary, NC). We conducted separate analyses for zip code areas and census tracts so that we could compare the 2 levels of the independent variable, address precision (exact coordinates vs a regional centroid). One dependent variable was detection rate, which was defined as the percentage of significant spatial clusters, that is, those with SaTScan P values below .05. We assessed accuracy with 2 additional dependent variables: proportion of significant clusters containing at least half of the simulated points and number of additional authentic emergency department visits drawn into the clusters.
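The 2 accuracy measures can be expressed for a single detected cluster as follows. This is a sketch with a hypothetical function name and hypothetical point identifiers; it assumes each visit, simulated or authentic, carries a unique identifier.

```python
def cluster_accuracy(detected_ids, simulated_ids):
    """Compare a detected cluster with the simulated cluster it should match.

    detected_ids: identifiers of points in the significant cluster
    simulated_ids: identifiers of the simulated cluster points
    """
    detected, simulated = set(detected_ids), set(simulated_ids)
    captured = len(detected & simulated)
    return {
        "simulated_points_captured": captured,
        "contains_at_least_half": 2 * captured >= len(simulated),
        "additional_visits": len(detected - simulated),
    }


simulated = [f"sim{i}" for i in range(10)]
detected = simulated[:6] + ["ed101", "ed102", "ed103"]
print(cluster_accuracy(detected, simulated))
```

In this example the detected cluster captures 6 of the 10 simulated points, so it meets the at-least-half criterion, and it draws in 3 additional authentic visits.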
Two other independent variables were radius size of the simulated cluster and number of regions into which simulated cluster points fell. Generalized estimating equations were used to account for the covariance between observations at the 2 levels of address precision. Preliminary analyses revealed significant interactions between the independent variables. Consequently, we conducted separate analyses for each radius size and number of regions, focusing on the comparison of exact coordinates with regional centroids.
RESULTS
Detection of Simulated Clusters by Level of Address Precision
We analyzed overall detection results for exact coordinates and regional centroids. The clusters identified by SaTScan could contain the simulated cluster points, the cluster points from the background emergency department data, or both types of cluster points. Exact coordinates yielded more (12 858; 73%) significant clusters than zip code centroids (7876; 45%; OR = 3.35; 95% CI = 3.20, 3.50). Similarly, exact coordinates yielded more significant clusters (8126; 92%) than census tract centroids (7117; 81%; OR = 2.85; 95% CI = 2.59, 3.13).
As a measure of accuracy, we required that significant clusters contain at least half of the original simulated points. A larger absolute number and a larger proportion of the significant clusters met this requirement when exact coordinates were analyzed. Of the 12 858 significant clusters, 12 016 (93%) contained 5 to 10 simulated points when they were analyzed as exact coordinates; when these clusters were analyzed as zip code centroids, 6842 (87%) contained 5 to 10 simulated points (OR = 2.16; 95% CI = 1.96, 2.37). Results were similar when we compared exact coordinates (n = 7997; 98%) with census tract centroids (n = 6796; OR = 2.93; 95% CI = 2.38, 3.60).
As another measure of accuracy, we calculated the numbers of additional points from the background emergency department data that were drawn into the significant clusters. The clusters contained fewer additional emergency department visit points (i.e., points that were not part of the original simulated cluster) when addresses were analyzed as exact locations (mean = 4, SD = 10, range = 0–111) than when they were analyzed as zip code centroids (mean = 10, SD = 21, range = 0–157). Similarly, fewer additional emergency department visits (mean = 2, SD = 6, range = 0–100) were included in the cluster when these visits were analyzed as exact locations than when they were analyzed as census tract centroids (mean = 4, SD = 11, range = 0–147).
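The reported odds ratios can be cross-checked with a crude 2-by-2 calculation from the counts given in the text. This sketch is not the generalized estimating equation analysis described in the Methods, so it does not reproduce the confidence intervals; the totals of 17 600 and 8800 runs per level of address precision are derived from the run counts stated earlier (35 200 and 17 600 runs, each split across 2 levels). The function name is hypothetical.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Crude odds ratio comparing group a with group b."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Detection: significant clusters out of all runs at each precision level
print(round(odds_ratio(12858, 17600, 7876, 17600), 2))  # exact vs zip code
print(round(odds_ratio(8126, 8800, 7117, 8800), 2))     # exact vs census tract

# Accuracy: clusters with 5 to 10 simulated points out of significant clusters
print(round(odds_ratio(12016, 12858, 6842, 7876), 2))   # exact vs zip code
print(round(odds_ratio(7997, 8126, 6796, 7117), 2))     # exact vs census tract
```

The 4 crude values come to 3.35, 2.85, 2.16, and 2.93, which agree with the published point estimates to 2 decimal places.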
Additional Independent Variables
Effects on detection rates.
The overall results were complicated by interactions between address precision and the other 2 independent variables (simulated cluster radius and number of regions into which the simulated cluster points fell). Therefore, we conducted separate analyses exploring the effects of these variables. Cluster detection rates for precise locations and zip code and census tract centroids are shown in Table 1. The odds ratios indicate that exact coordinates yielded higher rates than centroids.
As can be seen in Figure 2, use of exact locations involved at least 2 advantages over use of centroids. First, a greater proportion of the significant clusters contained 5 to 10 of the original simulated points when they were analyzed as exact locations. Second, relatively few additional emergency department visits were drawn into these clusters. However, there remained some noteworthy portions of clusters with many additional points. Also, when the simulated cluster had a 0.5-km radius and all of its points fell into a single census tract, there appeared to be some advantage to using centroids, given that more of the clusters contained no additional emergency department visits. Nevertheless, the cumulative number of clusters with 0 to 9 additional visits was almost the same for exact coordinates and census tract centroids.
DISCUSSION
Characteristics of the disease cluster itself dictate whether reporting case locations as exact coordinates or as administrative region centroids leads to the higher likelihood of detection. There are clearly circumstances in which knowledge of exact locations yields superior outbreak detection performance. The present results highlight the effects of forgoing address precision and, via scan statistics, using regional centroids for spatial cluster detection.
When a small number of clustered points were dispersed over 1 to 4 regions, the simulated clusters were more accurately detected when they were analyzed as exact locations than when they were analyzed as centroids; that is, more significant clusters were identified, and these clusters were more likely to include at least half of the simulated points. Furthermore, they often contained few surplus points from the background emergency department data.
By contrast, when clusters were analyzed as centroids, it was possible for the detection algorithm to miss all of the simulated points from 1 or more zip code areas or census tracts, resulting in fewer clusters with at least half of the simulated points being identified. This possibility was especially apparent when the simulated points fell into 2 zip code areas. In this situation, until the simulated cluster radius was quite large (3 km), the number of significant clusters for centroids greatly decreased (relative to exact locations) when the points for one of the zip code areas were missed.
Because census tracts are smaller than zip code areas, distances were smaller when a point moved from its exact location to a census tract centroid than when it moved to a zip code centroid. Consequently, the decrease in detection rates for census tract centroids when the simulated cluster points crossed administrative boundaries was not as dramatic as that observed for zip code areas. Nonetheless, such decreases did occur with increasing numbers of boundaries crossed.
There were some limitations associated with our study. For example, the clusters created were simulated, and thus, they represent only one of many possible scenarios for an actual outbreak. In addition, the simulated clusters were limited to a single size and circular shape, and they were placed within a specific band around a single hospital. This approach enabled us to focus on an important cluster parameter, its dispersion across administrative boundaries. However, other parameters may be important, such as population density around the cluster, which will differ from region to region. Furthermore, other spatial analytic techniques may perform differently than the scan statistic used in this study, particularly if the cluster is not circular in shape.
Also, we focused on 1 form of geographic masking, that is, moving a point to a regional centroid. This approach allowed us to evaluate current syndromic surveillance practices. However, other masking techniques exist for moving points either deterministically or stochastically to new locations, and the effects of these transformations on the results of spatial analyses remain important areas of study.3,28–30
In terms of detecting spatial clusters in the present semisynthetic surveillance data, we found that use of exact locations was generally advantageous, although there were some exceptions when cluster points were contained in a single zip code area or census tract. This result illustrates that there are clearly conditions under which the power of spatial cluster detection is improved when exact address information is available. In particular, exact locations yielded improved power when the cluster crossed the artificial, administrative boundaries associated with census tracts and zip code areas. This improved power should be considered and balanced against privacy considerations in determining the level of address precision in public health data exchange policies.
Acknowledgments
We thank Martin Kulldorff for his advice regarding SaTScan and Christopher A. Cassa for modifications that he made to his cluster creation software tool for use in this study.
Human Participant Protection
This study was approved by the Children's Hospital Boston Committee on Clinical Investigation.
Footnotes
Contributors
K. L. Olson contributed to the design of the study, performed all analyses, and was the lead author. S. J. Grannis and K. D. Mandl contributed to the design of the study and interpretation of the results.
Accepted for publication January 22, 2006.
References
2. Henning KJ. What is syndromic surveillance? MMWR Morb Mortal Wkly Rep. 2004;53(suppl):5–11.
3. Armstrong MP, Rushton G, Zimmerman DL. Geographically masking health data to preserve confidentiality. Stat Med. 1999;18:497–525.
4. Drociuk D, Gibson J, Hodge J, Jr. Health information privacy and syndromic surveillance systems. MMWR Morb Mortal Wkly Rep. 2004;53(suppl):221–225.
5. Rushton G, Elmes G, McMaster R. Considerations for improving geographic information system research in public health. URISA J. 2000;12:31–49.
6. Lober WB, Karras BT, Wagner MM, et al. Roundtable on bioterrorism detection: information system-based surveillance. J Am Med Inform Assoc. 2002;9:105–115.
7. Heffernan R, Mostashari F, Das D, Karpati A, Kulldorff M, Weiss D. Syndromic surveillance in public health practice, New York City. Emerg Infect Dis. 2004;10:858–864.
8. Tsui F-C, Espino JU, Dato VM, Gesteland PH, Hutman J, Wagner MM. Technical description of RODS: a real-time public health surveillance system. J Am Med Inform Assoc. 2003;10:399–408.
9. Lombardo JS, Burkom H, Pavlin J. ESSENCE II and the framework for evaluating syndromic surveillance systems. MMWR Morb Mortal Wkly Rep. 2004;53(suppl):159–165.
10. Platt R, Bocchino C, Caldwell B, et al. Syndromic surveillance using minimum transfer of identifiable data: the example of the National Bioterrorism Syndromic Surveillance Demonstration Program. J Urban Health. 2003;80(suppl 1):i25–i31.
11. Bradley CA, Rolka H, Walker D, Loonsk J. BioSense: implementation of a national early event detection and situational awareness system. MMWR Morb Mortal Wkly Rep. 2005;54(suppl):11–19.
12. Jacquez GM, Jacquez JA. Disease clustering for uncertain locations. In: Lawson A, Biggeri A, Böhning D, Lesaffre E, Viel J-F, Bertollini R, eds. Disease Mapping and Risk Assessment for Public Health. London, England: John Wiley & Sons Inc; 1999:151–168.
13. Jacquez G. Current practices in the spatial analysis of cancer: flies in the ointment. Int J Health Geogr. 2004;3:22.
14. Centers for Disease Control and Prevention. HIPAA privacy rule and public health: guidance from CDC and the US Department of Health and Human Services. MMWR Morb Mortal Wkly Rep. 2003;52(suppl):1–20.
15. Gregorio DI, Dechello LM, Samociuk H, Kulldorff M. Lumping or splitting: seeking the preferred areal unit for health geography studies. Int J Health Geogr. 2005;4:6.
16. Gregorio DI, Kulldorff M, Barry L, Samociuk H. Geographic differences in invasive and in situ breast cancer incidence according to precise geographic coordinates, Connecticut, 1991–95. Int J Cancer. 2002;100:194–198.
17. Sweeney L. k-anonymity: A model for protecting privacy. Int J Uncertain Fuzziness Knowledge Based Syst. 2002;10:557–570.
18. Malin B, Sweeney L. How (not) to protect genomic data privacy in a distributed network: using trail reidentification to evaluate and design anonymity protection systems. J Biomed Inform. 2004;37:179–192.
19. Kamel Boulos MN, Cai Q, Padget JA, Rushton G. Using software agents to preserve individual health data confidentiality in microscale geographical analyses. J Biomed Inform. 2006;39:160–170.
20. Mandl KD, Reis BY, Cassa C. Measuring outbreak-detection performance by using controlled feature set simulations. MMWR Morb Mortal Wkly Rep. 2004;53(suppl):130–136.
21. Olson KL, Bonetti M, Pagano M, Mandl KD. Real time spatial cluster detection using interpoint distances among precise patient locations. BMC Med Inform Decis Mak. 2005;5:19.
22. Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26:1481–1496.
23. Kulldorff M, Heffernan R, Hartman J, Assuncao R, Mostashari F. A space-time permutation scan statistic for disease outbreak detection. PLoS Med. 2005;2:e59.
24. Beitel AJ, Olson KL, Reis BY, Mandl KD. Use of emergency department chief complaint and diagnostic codes for identifying respiratory illness in a pediatric population. Pediatr Emerg Care. 2004;20:355–360.
25. Cassa C, Olson KL, Mandl KD. A software tool for creating simulated outbreaks to benchmark surveillance systems. BMC Med Inform Decis Mak. 2005;5:22.
26. Kulldorff M. SaTScan Version 5.0: Software for the Spatial and Space-Time Scan Statistics. Silver Spring, Md: Information Management Services; 2004.
27. Rushton G. Public health, GIS, and spatial analytic tools. Annu Rev Public Health. 2003;24:43–56.
28. Cassa CA, Grannis SJ, Overhage JM, Mandl KD. A context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. J Am Med Inform Assoc. 2006;13:160–165.
29. Kwan M-P, Casas I, Schmitz BC. Protection of geoprivacy and accuracy of spatial information: How effective are geographical masks? Cartographica. 2004;39:15–28.
30. Leitner M, Curtis A. Cartographic guidelines for geographically masking the locations of confidential point data. Cartographic Perspect. 2004;49:22–39.