Ownership and Control of Data
Conceptual application of the term property to scientific knowledge is not new, but advances in science and technology and economic factors have fueled disputes and concerns over ownership, control, and access to original data.1-7 Data used in biomedical research, increasingly complex, now include large data sets, software, algorithms, and metadata (data that provide information or characteristics about other data). With the exception of commercially owned information, scientific data are viewed as a public good, allowing others to benefit from knowledge of and access to the information without decreasing the benefit received by the individual who originally developed the data.8 Ideally, scientific data would become a public good, regardless of the source of funding.9 The US National Institutes of Health (NIH) policy on data sharing states that “data should be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data.”10 However, personal, professional, financial, and proprietary interests can often interfere with the altruistic goals of data sharing.5,6,11-13
Ownership of Data.
For purposes herein, data include but are not limited to written and digital laboratory notes, documents, research and project records, experimental materials (eg, reagents, cultures), descriptions of collections of biological specimens (eg, cells, tissue, genetic material), descriptions of methods and processes, patient or research participant records and measurements, results of bibliometric and other database searches, illustrative material and graphics, analyses, surveys, questionnaires, responses, data sets (eg, protein or DNA sequences, microarray or molecular structure data), databases, metadata (data that describe or characterize other data), software, and algorithms. The NIH policy defines final research data as “recorded factual material commonly accepted in the scientific community as necessary to document, support, and validate research findings,” which might include raw data and derived variables.10 The NIH definition does not include summary statistics; rather, it pertains to the data on which summary statistics are based. In scientific research, 3 primary arenas exist for ownership of data: the government, the commercial sector, and academic or private institutions or foundations. Although an infrequent occurrence, when data are developed by a scientist without a relationship to a government agency, a commercial entity, or an academic institution, the data are owned by that scientist.
Any information produced by an office or employee of the US federal government in the course of his or her employment is owned by the government.14 The Freedom of Information Act (FOIA), enacted in 1966, is intended to ensure public access to government-owned information (except trade secrets, financial data, national defense information, and personnel or medical records protected under the Privacy Act).2,15 Access to documents with such data that are otherwise unavailable may be obtained through an FOIA request.
Data produced by employees in the commercial sector (eg, a pharmaceutical, device, or biotechnology company, health insurance company, or for-profit hospital or managed care organization) are most often governed by the legal relationship between the employee and the commercial employer, granting all rights of data ownership and control to the employer. However, if the data have been used to secure a government grant or contract, such data may be obtained by an outside party through an FOIA request or by a court-ordered subpoena.3,15
According to guidelines established by Harvard University in 1988 and subsequently adopted by other US academic institutions, data developed by employees of academic institutions are owned by the institutions.16 This policy allows access to data by university scientists and allows departing scientists to take copies of data with them, but the original data remain at the institution.
Data Sharing and Length of Storage.
The notion that data should be shared with others for review, criticism, and replication is a fundamental tenet of the scientific enterprise. Sharing research data encourages scientific inquiry, permits reanalyses, promotes new research, facilitates education and training of new researchers, permits creation of new data sets when data from multiple sources are combined, and helps maintain the integrity of the scientific record.2,4,10 Yet the practice of data sharing has varied widely, and it was not until relatively recently that guidelines for data sharing were developed.4,7,10
Although data sharing is essential for research, costs and risks may result in restrictions on access to certain data imposed by the owner or initial investigator. Potential costs and risks to the owner or initial investigator include technical and financial obstacles for data storage, reproduction, and transmission; loss of academic or financial reward or commercial profit; unwarranted or unwanted criticism; risk of future discovery or exploitation by a competitor; the discovery of error or fraud; and breaches of confidentiality. The discovery of error or fraud and breaches of confidentiality have important relevance in scientific publishing. Discovery of error or fraud, if corrected or retracted in the literature, is clearly beneficial, and for research involving humans, epidemiologic and statistical procedures are available to maintain confidentiality for individual study participants9,17-19 (see also 5.4, Scientific Misconduct, and 5.8, Protecting Research Participants' and Patients' Rights in Scientific Publication). A number of research sponsors and governmental agencies have developed policies to encourage data sharing. For example, in 2003, the NIH began requiring investigators to include a plan for data sharing in all grant applications requesting $500 000 or more in direct costs.10 The Wellcome Trust encourages its funded investigators to release data to the public from large-scale biological research projects, such as the International Human Genome Sequencing Consortium.20
A number of proposals prescribe the minimum optimal time to keep data (for example, 2–7 years). However, there is no universally accepted standard for data retention by academic and research institutions. For example, the NIH requires its funded scientists to keep data for a minimum of 3 years after the closeout of a grant or contract agreement and recognizes that an investigator’s academic institution may have additional policies regarding the required retention period for data.10 The NIH also gives the right of data management, including the decision to publish, to the principal investigator.10
Data Sharing, Deposit, and Access Requirements of Journals.
In 1985, the US Committee on National Statistics, which is part of the National Research Council (NRC),17 released a report on data sharing that continues to serve as a useful guide for authors and editors. Among the committee’s recommendations, the following have specific relevance for scientific publication.
Data sharing should be a regular practice.
Initial investigators should share their data by the time of the publication of initial major results of analyses of the data except in compelling circumstances, and they should share data relevant to public policy quickly and as widely as possible.
Investigators should keep data available for a reasonable period after publication of results from analyses of the data.
Subsequent analysts who request data from others should bear the associated incremental costs, and they should endeavor to keep the burdens of data sharing to a minimum. They should explicitly acknowledge the contribution of the initial investigators in all subsequent publications.
Journal editors should require authors to provide access to data during the peer review process.
Journals should give more emphasis to reports of secondary analyses and to replications.
Journals should require full credit and appropriate citations to original data collections in reports based on secondary analyses.
Journals should strongly encourage authors to make detailed data accessible to other researchers (although some may view this as outside the purview of a journal’s responsibilities).
Similar to policies on data sharing and storage for academic and research institutions, policies for scientific journals are highly variable and not always available. In 2002, a US NRC review of 56 of the most frequently cited life science and medical journals reported that 39% had policies on data sharing and 45% had no stated policy.4 Of the 18 medical journals in this review, only 22% had policies on data sharing. To address the lack of standard policies for data sharing among scientific journals and recognizing that no standards are expected given the diversity of disciplines in the life sciences, the NRC recommends the following4:
Scientific journals should clearly and prominently state (in their instructions for authors and on their websites) their policies for distribution of publication-related materials, data, and other information.
Policies for sharing materials should include requirements for depositing materials in an appropriate repository.
Policies for data sharing should include requirements for deposition of complex data sets in appropriate databases and for the sharing of software and algorithms integral to the finding being reported.
The policies should also clearly state the consequences for authors who do not adhere to the policies and the procedure for registering complaints about noncompliance.
The NRC also has proposed a set of principles that may be useful to journals developing policies on data sharing4:
Authors should include in their publications data, algorithms, or other information that is central or integral to the publication—that is, whatever is necessary to support the major claims of the paper and would enable one skilled in the art to verify or replicate the claims.
If central or integral information cannot be included in the publication for practical reasons (for example, because a data set is too large), it should be made freely (without restriction of its use for research purposes and at no cost) and readily accessible through other means (for example, online). Moreover, when necessary to enable further research, integral information should be made available in a form that enables it to be manipulated, analyzed, and combined with other scientific data.
If publicly accessible repositories for data have been agreed on by a community of researchers and are in general use, the relevant data should be deposited in one of these repositories by the time of publication.
Authors of scientific publications should anticipate which materials integral to their publications are likely to be requested and should state in the “Materials and Methods” section or elsewhere how to obtain them.
If material integral to a publication is patented, the provider of the material should make the material available under a license for research use.
A number of scientific journals (eg, Science, Nature) require authors to submit large data sets (eg, protein or DNA sequences, microarray or molecular structure data) to approved, accessible databases and to provide accession numbers as a condition of publication. It is appropriate for authors and journals to include links to public repositories for such data in the Acknowledgment sections of articles (see also 2.10.13, Manuscript Preparation, Acknowledgment Section, Additional Information [Miscellaneous Acknowledgments]).
Some journals have other conditions of publication that require authors to deposit specific information about their research in a public repository or archive, although this is not data sharing per se. For example, following the recommendations of the International Committee of Medical Journal Editors (ICMJE),21 biomedical journals that publish clinical trials require authors to have registered their trials in approved, publicly accessible trial registries and to provide registration identifiers as a condition of publication (see also 2.5.1, Manuscript Preparation, Abstract, Structured Abstracts, and 20.4, Study Design and Statistics, Meta-analysis). In addition, a number of funders require authors to post articles describing the results of their funded research in publicly available archives (see also 5.6.2, Open-Access Publication and Scientific Journals).
Some journals require authors to make data available on request for examination by the editors or peer reviewers (see 5.4, Scientific Misconduct). For example, JAMA requires all authors to sign the following as part of their authorship responsibility statement:
If requested, I shall produce the data on which the manuscript is based for examination by the editors or their assignees.
In addition, for reports containing original data (eg, research articles, systematic reviews, and meta-analyses), JAMA requires at least 1 author who is independent of any commercial funder (eg, the principal investigator) to indicate that she or he “had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.” (See also 5.5.4, Conflicts of Interest, Access to Data Requirement.)
Manuscripts Based on the Same Data.
On occasion, an editor may receive 2 or more manuscripts based on the same data (with concordant or contradictory interpretations and conclusions). If the authors of these manuscripts are not collaborators and the data are publicly available, the editor should consider each manuscript on its own merit (perhaps asking reviewers to examine the manuscripts simultaneously). Authors should attempt to resolve disputes over contradictory interpretations of the same data before submitting manuscripts to journals. When more than 1 manuscript is submitted by current or former coworkers or collaborators who disagree on the analysis and interpretation of the same unpublished data, the recipient editors are faced with a difficult dilemma.21 The ICMJE has stated that, since peer review will not necessarily resolve the discrepant interpretations or conclusions, editors should decline to consider competing manuscripts from coworkers until the dispute is resolved by the authors or the institution where the work was done.21 Arguments against publishing both papers include that doing so could confuse readers and waste journal pages. However, publishing the competing manuscripts with an explanatory editorial may allow readers to see and understand both sides of the dispute. Alternatively, publishing the paper deemed of higher quality could result in biasing the literature and postponing publication of legitimate research.
Record Retention Policies for Journals.
Journals should develop and implement consistent policies for retention of records and data related to the content that they publish. Legal documents (eg, copyright transfers, licenses, and permissions) should be kept indefinitely. All other records should be kept for a consistent period. For example, JAMA and the Archives Journals keep print and online copies of rejected manuscripts, correspondence, and reviewer comments up to 1 year to permit consideration of appeals of decisions. Print and digital copies of accepted manuscripts and related correspondence and reviews are kept for 3 years. Journals also should develop consistent policies for the retention of online metadata associated with manuscript submissions, authors, and peer reviewers. (See also 5.7.3, Confidentiality, Confidentiality in Legal Petitions and Claims for Privileged Information.)