Copyright 2004 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
Despite concerns in some scientific fields, data sharing has come of age. In 1985, the National Academy of Sciences' Committee on National Statistics endorsed the goal of wide access to research data and proffered recommendations for how and when data should be shared.1 Nearly 2 decades later, after numerous similar reports,2-4 a clear mandate is emerging. The National Science Foundation,5 the National Institutes of Health,6 and numerous professional societies and scientific journals have all taken steps to encourage, if not require, research scientists to share their research data and materials with other scientists. Some questions remain, however, as scientists struggle to learn how to carry out this mandate in ways that are effective, efficient, and ethically sound.
The article in this issue of the ARCHIVES by van Dyck et al7 uses data from a Maternal and Child Health Bureau study conducted by the National Center for Health Statistics, a pioneer in data sharing. van Dyck and colleagues use data from the National Survey of Children with Special Health Care Needs to provide a descriptive portrait of children with special health care needs and of disparities among these children with respect to access to care, satisfaction with care, and family impact. The publication in the journal Pediatrics of another article8 using the same data just months before the van Dyck article went to press raised the concern that data sharing may create inefficiencies in science by fostering duplication of effort. We examine this concern and what might be done to address it.
The idea that data sharing could lead to duplication was raised in the 1985 National Academy of Sciences report but has not been widely discussed. However, it is closely related to the often-expressed concern that data sharing could enable secondary analysts to publish key results from a data set before the producers of the data are able to do so.1,3 This concern has generally been addressed by allowing a limited period of exclusive use for the original study investigators, usually a year or until the first major publication. This practice does not address the more general question of duplication, however, because in most cases the potential uses of a data set far exceed what can be accomplished within this limited period.
Before considering fixes to the problem of potential duplication, it is useful to assess whether and to what extent there is a problem. Consider the article by van Dyck et al.7 Although this analysis uses the same data as the article by Mayer et al8 in Pediatrics, and although both consider some of the same outcomes, the articles are in fact very different. The van Dyck article estimates the percent of US children who have special health care needs across demographic and economic groups, and it describes a wide range of outcomes for these populations. The Mayer article develops and tests a theory-based predictive model of unmet need for specific types of medical service among children with special health care needs. It provides no descriptive data regarding prevalence and no information on most of the outcomes examined by van Dyck and colleagues. In short, rather than providing an example of duplication, this pair of articles illustrates the value of having data in the public domain: researchers with different ideas, different approaches, and different goals can use the same data to develop different products with very different values and applications for science, practice, and public policy.
Does data sharing generally lead to duplication? Because the uses of shared data are not uniformly tracked, quantitative measures of duplication are out of reach. To address this question in a limited way, we undertook a survey of the published and forthcoming articles emerging from analyses of the National Longitudinal Study of Adolescent Health (Add Health). In this study, J. Richard Udry, PhD, and colleagues collected comprehensive information on the health and health-related behaviors of a large national sample of adolescents as well as information on their social contexts.9 The data were made available to both study and nonstudy scientists at the same time, and currently over 2000 individuals nationwide are analyzing the data. If duplication is a problem, it should be evident here. Our analysis of the first 180 publications from this project (assembled through a query to all users of the data) found none that exactly duplicated each other and 17 pairs of authors who addressed similar questions. Some of these pairs differed in their theoretical and statistical approach, whereas others used similar approaches but elaborated the questions in complementary ways.
In one instance, 2 sets of researchers tackled the question of whether parents encouraged sexual activity in their teenage children by recommending to them a method of contraception. One analysis found a positive effect10; the other, no effect.11 An analysis of the discrepancy showed that it was the result of different ways of defining the study population, defining the outcome, and constructing the model. The fragility of the result in the face of these differences produced knowledge that each analysis standing alone could not have produced. In fact, we don't know for sure whether parental advice on contraception encourages teenage sexual activity. This is important for parents, practitioners, and policy makers to understand because of a widespread assumption that information about sex and contraception is best left to parents.
Our analysis of Add Health data suggests that, when valuable data sets are widely available, researchers will occasionally use the same data to address similar questions. In some cases, this may have costs for the researchers themselves and for science in general. The researchers may suffer if they are unable to publish their work; science may suffer if the time spent in conducting similar analyses could have been spent in conducting analyses that produced greater gains in knowledge. On the other hand, these similar analyses often advance science significantly by highlighting the sensitivity of results to an analytic approach. They reveal what we do not know and stimulate debate about how to advance our scientific theories and methods. They reinforce accountability among researchers and provide an incentive for researchers to communicate with each other.
Could potential duplication be effectively managed in the context of data sharing? Theoretically, data providers might monitor ongoing uses of a data set and prevent new users from undertaking duplicative work. This would require timely, complete, and specific information from secondary users about the analyses they are conducting. Some data providers achieve this by restricting secondary users to specific analyses of the data that are agreed upon in advance. Although this is feasible, it is hardly efficient. The requirements imposed on providers to document and approve specific uses, the risk that productive lines of research will be disallowed, and the inflexibility placed on users create high costs for the research enterprise.
Even in the absence of limits on the uses of shared data, tracking ongoing work would remain difficult. Secondary users may be reluctant to provide detailed analysis plans to data providers: doing so on a continuous basis would divert time and resources from their work, and sharing their original ideas could be seen as risky. The system would impose a substantial burden on data sharers, who would have to maintain elaborate information systems and screen for related interests on an ongoing basis. The costs of a truly effective system would greatly outweigh the benefits, and the requirements it would demand of data users would likely undermine its effectiveness and discourage use of the data.
Although systems to prevent duplication are problematic, steps can be taken (and often are) to help potential users identify whether other scientists have already used a data set to address their questions. In many cases, data sharers maintain lists of publications, presentations, working papers, and dissertations that have used the data set. Some studies maintain listservs that can help data users identify co-users with similar interests. Others hold user conferences at which scientists can network. A further step that, to our knowledge, has not been adopted is asking data users to provide and periodically update keywords or topic sentences summarizing their general research interests and to post these on Web sites available to potential data users. This approach is relatively inexpensive and would provide users a tool for identifying other users with similar interests. It would leave decisions about sharing information concerning specific analyses to the users themselves.
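As a rough illustration of the keyword-posting idea, the sketch below compares the self-reported keywords of each pair of data users and flags pairs whose interests overlap. It is a minimal sketch under stated assumptions, not a description of any existing system: the function names, the example registry, the use of Jaccard overlap as the similarity measure, and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of keywords two users have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar_users(user_keywords, threshold=0.3):
    """Return (user, user, score) for pairs whose keyword overlap meets the threshold.

    `user_keywords` maps a user identifier to the keywords that user posted.
    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    # Normalize case and whitespace so "Sexual Activity" matches "sexual activity".
    normalized = {
        user: {kw.strip().lower() for kw in kws}
        for user, kws in user_keywords.items()
    }
    matches = []
    for u1, u2 in combinations(sorted(normalized), 2):
        score = jaccard(normalized[u1], normalized[u2])
        if score >= threshold:
            matches.append((u1, u2, round(score, 2)))
    # Strongest overlaps first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical registry of keywords posted by three data users.
registry = {
    "user_a": ["contraception", "parent communication", "sexual activity"],
    "user_b": ["Sexual Activity", "contraception", "peer networks"],
    "user_c": ["obesity", "school context"],
}

for u1, u2, score in find_similar_users(registry):
    print(f"{u1} and {u2} share interests (overlap {score})")
```

A tool of this kind would only point users toward one another. Whether to exchange details of specific analyses would remain the users' own decision.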
Data sharing is here to stay, but best practices for data sharing remain a work in progress. As technologies, scientific methods, and scientific cultures change, norms for data-sharing practices will inevitably evolve. The concern about duplication may reflect, in part, lingering proprietary attitudes toward investigator-collected data, but it raises legitimate questions about how to manage the sharing of research resources in ways that are optimally efficient for the scientific enterprise. Certainly, more can be done to reduce the likelihood of unprofitable duplication. As such efforts are undertaken, we must be careful to ensure that they are not more costly than the problem they seek to address.
Correspondence: Dr Bachrach, Demographic and Behavioral Sciences Branch, National Institute of Child Health and Human Development, 6100 Executive Blvd, Rm 8B07, MSC 7510, Bethesda, MD 20892-7510 (cbachrach@nih.gov).