Abstract
In all fields of study, as well as government and commerce, high-quality data enables informed decision-making. Linking data from disparate sources multiplies the opportunities for novel insights and evidence-based decision-making for an increasingly large range of administrative, clinical, research, and population health use cases. In recent years, novel methods, including privacy-preserving record linkage methods, have emerged. However, regardless of the method, successful data linkage is highly dependent on data quality and completeness and has to be balanced by the increased risk of re-identification of the subsequently linked data. Opportunities for the future include sharing tools for responsible linkage across silos, enhancing data to improve quality and completeness, and ensuring linkage leverages inclusive and representative datasets to ensure a balance between individual privacy and representation in research and novel discoveries. Here we provide a brief overview of the history and current state of data linkage, highlight the opportunities created by linked population data across critical research sectors, and describe the technology and policies that govern its usage.
Introduction
Enhanced and linked data that are fully integrated and shared, in compliance with privacy regulations, can dramatically multiply research insights1. Prior to the development of data linkage tools, data scientists were restricted to single source datasets, which limited the breadth and depth of research possibilities1. For example, one could ask whether marital status impacted lifespan using civil records of birth, death, and marriage licensure. However, it would be impossible to determine the underlying root cause of marriage-related changes in longevity in a given population from these records. Changes in the female population could have been due to pregnancy or childbirth-related mortality. Alternatively, perhaps alterations in accidental death totals or mental illness were more relevant to long-term survival. To trace the etiology of longevity outcomes over time in a given population, census and health records would need to be linked.
Despite the potential benefits of data linkage, health care data in the US and other countries are often fragmented and siloed. In contrast, countries with universal health care (e.g., the UK and Denmark) and strong incentives to coordinate care employ centralized data systems that can greatly reduce, but not eliminate, data fragmentation. In the US specifically, data fragmentation is a result of the multiple distinct and disconnected clinical settings where patients receive care, a problem worsened by historically paper-based record keeping or use of multiple isolated electronic health record systems2,3. By preventing the flow of essential patient information, fragmented data can negatively affect clinical outcomes in a variety of disease contexts4,5. One tool used to combat fragmentation for health care purposes is the Health Information Exchange (HIE), a system by which health care professionals share patient data to maintain continuity of care6. However, HIEs primarily link data for clinical transactions (i.e., treatment, payment and operations) and are not commonly used for biomedical research or commercial benefit. Robust data linkage approaches with privacy protection and/or patient consent could significantly reduce healthcare fragmentation and improve care coordination. Equally important, data linkage that breaks data silos can greatly multiply potential new research findings across many other health and economic sectors2,3. Obstacles to successful data linkage include inability to match data for the same entity (individual, patient, etc.) due to incomplete or incorrect data, lack of data interoperability, inability to identify duplicate entities, and patient privacy concerns. Linking data for the same individual is often referred to as Record Linkage (RL) and is defined as the process of identifying and combining records from different sources that relate to the same individual.
Here we discuss the origins, utility, and future of record linkage technology for diverse healthcare applications. We also highlight key features of data sharing that require significant regulation to assure privacy protection, as well as current challenges to RL in research and clinical decision making. RL is a relatively new technology for many, if not most, academic health care and biomedical researchers. Our objective is to introduce this powerful tool, along with its capacities and limitations, to biomedical and clinical research professionals.
Brief history of record linkage
The need to overcome data fragmentation around the world has motivated the development of novel approaches to data linkage, and in the United States, it also required adoption of lessons learned from other countries or industries. The underlying concept of population data in the United States as well as its collection, integration, and sharing originated with the US Constitution. Population-based representation in Congress, resulting in the creation of the US Census Bureau, was the nation’s first foray into evidence-based governance and decision-making. Collecting data to classify and quantify the population evolved into collecting information on the health and welfare of Americans as it pertains to public policy7. In 1987, the National Center for Health Statistics emerged under the auspices of the Centers for Disease Control as a core component of the complex decentralized Federal Statistical System. The tools to document patient data have also evolved dramatically since the early days of modern medicine, when physicians kept handwritten notes in a non-standardized format and, in the hospital setting, those notes were dispersed among various wards and individuals. Standardized forms for recording medical history and data developed in Europe and were first adopted at New York Hospital8. Then in 1907 the Mayo brothers together with Dr. Henry Plummer developed a medical record collection and distribution system that revolutionized medical data keeping9,10. To this day, the Mayo Clinic is a leader in healthcare data collection and integration.
The concept of modern linked data was first defined by biostatistician Halbert Dunn. Dunn worked for the US Public Health System in the Census Bureau and other statistical agencies, finishing his career as an assistant Surgeon General. In 1946, Dunn espoused the value of linked data through a publication simply titled “Record Linkage”11. He elegantly outlined his observation that each person represents a “book of life” containing all the relevant records of a person’s existence, beginning with birth and ending with death, and he discussed the utility of those records. Fellegi and Sunter followed in 1969 with a mathematical theory for probabilistic linkage of “matched records”, an approach that applies Dunn’s hypothesis to existing data, using multiple pieces of information to triangulate and identify matching records12. Next, the widespread proliferation of computers precipitated major developments in the field. In 1991, computer scientists took a significant step toward linking disparate sources of data at the University of Minnesota, where US and international census data were consolidated and linked to survey data in the Integrated Public Use Microdata Series (IPUMS)13. Computer technology and widespread application of internet technology slowly transformed hospital record keeping from paper based to electronic. In 2009, as part of the Health Information Technology for Economic and Clinical Health (HITECH) Act, the US federal government spent $27 billion to incentivize hospitals and other medical providers to transition from paper to electronic health record (EHR) systems14,15,16. This pivotal transformation to electronic data opened the door to large scale RL, sharing, and integration. In more recent years the National Science Foundation has invested heavily in cyberinfrastructure development through funded projects at various academic institutions, particularly in software infrastructure to facilitate the development of these tools.
As a result of public and private investments, data sharing and the related technology has improved significantly over the last decade, but just as importantly large datasets have reached the necessary scale and availability such that users benefit substantially from linkage across diverse contexts and analyses (Fig. 1).
Early real-world data (RWD) were limited to commercial claims, which were digitized earlier than hospital and medical records. Federal funding post 2010 incentivized hospitals to switch to electronic record keeping. Simultaneously, genomic tools for identification of genetic mutations and altered gene expression became available from consented patients. These records were digital from their inception and, though most frequently de-identified, have the potential to be linked with other matched records. Recent years have brought an explosion of consumer/patient data from internet applications (e.g., weight loss tools), wearable devices (e.g., Apple Watches, glucose monitors) and other digital records. Together these linked data comprise a massive digital fingerprint for participating individuals.
The importance of data quality and completeness for data linkage
Today, multiple types of data are connected using a variety of linkage methods, with different data types having different levels of data completeness. Data completeness refers to the number of gaps or missing information in the dataset. The presence of these gaps can lead to mistakes and false conclusions. Medical/prescription claims data are the most mature data types to be aggregated and linked currently. Laboratory data, EHR data, mortality data, and other data sources are also approaching the necessary level of “data completeness” to impact a diverse set of research and applications (Fig. 2). Beyond data completeness, another important aspect to consider is the quality of a given dataset, which is a measure of whether the data, by format or accuracy, reflect the true state of a medical concept or condition. Achieving a high-quality dataset from a given source is a major challenge. Lack of high-quality data and the costs surrounding quality control are significant obstacles to building widely utilized datasets17. For example, disease registries are built on highly structured EHR from individual patients. However, the necessary information from medical reports and physician notes is not exclusively found in structured data fields but rather can be hidden within freeform text. Resolving this issue currently requires individualized review of patient records by a record retrieval professional, which is a costly and time-consuming process. However, newer natural language processing and artificial intelligence tools are being used to automate the process where possible18. Although there are several possible methods for data linkage, all of them depend on high-quality and complete data. To improve data quality and facilitate accurate linkage between datasets, efforts are underway to standardize various features of individual records.
For example, the mailing address is a common tool for data linkage but there are many ways that addresses can be reported, which complicates data linkage. Some records will list abbreviations such as Dr. or Ave. rather than full length words such as drive or avenue. Recognizing the importance of data standards for data linkage and interoperability, the Office of the National Coordinator (ONC) of Health Information Technology has defined a United States Core Data for Interoperability (USCDI) and has specifically promoted standardized address record keeping through Project US@. These national standardization efforts are critical for enhancing data quality and linkage accuracy.
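The address problem described above can be illustrated with a minimal normalization sketch. The abbreviation map and the function name `normalize_address` are assumptions made for illustration only; they are not taken from USCDI or Project US@, and a real pipeline would follow a published address standard.

```python
import re

# Hypothetical, deliberately tiny abbreviation map (an assumption for this
# sketch). Real standards cover many more cases, and "st" can also mean "saint".
SUFFIXES = {"dr": "drive", "ave": "avenue", "st": "street", "rd": "road"}

def normalize_address(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and expand known suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(SUFFIXES.get(tok, tok) for tok in tokens)

# Two spellings of the same address collapse to one comparable string.
print(normalize_address("12 Oak Dr."))
print(normalize_address("12 oak drive"))
```

Normalizing both sides before comparison lets an exact string match succeed where the raw values would differ only in formatting.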
Diverse data inputs can be linked together to inform a variety of sectors. Featured here are a series of examples. Mortality data can be linked to insurance claims and medical costs to extract outcomes of clinical trials and rare diseases etc., which can then inform funding and policy decisions. Also, purchasing and behavior data can be linked to facilitate Health Economics and Outcomes Research (HEOR) studies. These are simplified examples of data-driven research, commercial, and policy decision making.
Linking identifiable data
Today data linkage is performed on identified or de-identified data. Personally Identifiable Information (PII) is defined by the US National Institute for Standards and Technology as any information about an individual that can be used to distinguish or trace an individual’s identity. PII is provided by patients themselves or is made available for specific uses with patient consent. Identified data is commonly collected in disease registries, wherein patients register voluntarily. These data are useful for studies wherein the PII is critical for study design or when it’s essential that the patient be contacted to answer survey questions or provide updates. Linkage of data with PII uses deterministic, probabilistic, or referential matching approaches. Matching refers to the identification of the same individual across multiple datasets so that their information can be linked (Fig. 3). Hybrid approaches combining deterministic, probabilistic, or referential matching are also used to facilitate RL19. Specifically, identified matching uses the underlying PII. However, it is also possible to match de-identified records leveraging encrypted hashes/codes, which represent the underlying PII but cannot be traced back to the PII (Fig. 4a). Deterministic matching is a rules-based approach to RL often using one or more unique identity features20,21. In practice, these features may be an administrative number (e.g., in the US, a Social Security Number), an insurance ID number, or any other combination of values that are known to correspond to a single person (such as last name, first name and date of birth). Deterministic algorithms generate matches based on a contextually appropriate combination of partial or full matches between one or more identity features19 (e.g., the last 4 digits of an SSN).
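A rules-based match of the kind just described might be sketched as follows. The `Record` fields and the two rules are hypothetical choices for illustration; real deterministic algorithms select rules to suit the identifiers available in each dataset.

```python
from typing import NamedTuple, Optional

class Record(NamedTuple):
    first: str
    last: str
    dob: str                          # ISO date string, e.g. "1980-02-29"
    ssn_last4: Optional[str] = None   # partial identifier, may be missing

def deterministic_match(a: Record, b: Record) -> bool:
    """Illustrative rules (assumed): last name and date of birth must agree,
    plus either the first name or the last four SSN digits."""
    same_dob = a.dob == b.dob
    same_last = a.last.casefold() == b.last.casefold()
    same_first = a.first.casefold() == b.first.casefold()
    same_ssn4 = a.ssn_last4 is not None and a.ssn_last4 == b.ssn_last4
    return same_dob and same_last and (same_first or same_ssn4)

claims = Record("Ann", "Lee", "1980-02-29", "1234")
ehr = Record("Anne", "Lee", "1980-02-29", "1234")
print(deterministic_match(claims, ehr))  # first names differ, SSN digits agree
```

Because every rule is an exact comparison, a missing or mistyped field simply yields no match, which is why this approach depends so heavily on data quality.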
Probabilistic matching entails using values that may not be exactly the same (e.g., due to alternate spellings), but, in combination, give a high probability that the correct records are matched22. Names, birth dates, and other identifying but non-unique identifiers can be used (typically in combination) to facilitate probabilistic matching. Finally, referential matching leverages additional large demographic datasets such as address change records or state or federal administrative records. Such algorithms use the referential dataset to connect to and enhance an individual’s available records when PII elements, like name and gender, change due to life events. By increasing the available features for matching, referential matching may perform better than traditional probabilistic or deterministic matching approaches22.
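The probabilistic approach can be sketched with agreement weights in the style of Fellegi and Sunter: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it does not. The m and u values, the similarity cutoff, and the use of Python's `difflib` for fuzzy comparison are all assumptions for illustration; in practice these parameters are estimated from data.

```python
import math
from difflib import SequenceMatcher

# Assumed illustrative probabilities, not estimated from any real dataset:
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELDS = {"first": (0.90, 0.01), "last": (0.95, 0.005), "dob": (0.98, 0.001)}

def agrees(x: str, y: str, cutoff: float = 0.85) -> bool:
    """Treat two values as agreeing if they are similar enough (assumed cutoff)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= cutoff

def match_weight(a: dict, b: dict) -> float:
    """Sum per-field log-likelihood-ratio weights; higher means more likely a match."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agrees(a[field], b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

a = {"first": "Katherine", "last": "Smith", "dob": "1975-03-02"}
b = {"first": "Katharine", "last": "Smith", "dob": "1975-03-02"}  # alternate spelling
c = {"first": "Robert", "last": "Jones", "dob": "1990-11-30"}
print(match_weight(a, b), match_weight(a, c))
```

The alternate spelling still scores as an agreement, so the pair receives a positive total weight, whereas the unrelated pair receives a negative one.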
An individual can have their personal data recorded in many different datasets. For example, in the healthcare system a patient may appear in electronic health records (EHR), insurance claims, and mortality records. To be used in a manner that protects personally identifiable information (PII) and maintains privacy, these datasets undergo a process of normalization, identity resolution, and matching. Once records are matched and assigned a unique identifier/hash, these data can be queried in a variety of ways.
a Individual records, even for the same person, may be distinct enough to introduce linkage challenges. Here, multiple unique datasets containing records for one individual are compared. For successful Privacy Preserving Record Linkage (PPRL) of the two datasets followed by de-identification, identity resolution is performed to confirm that the relevant records capture the same individual. b Identical information, captured in slightly different formats, must be normalized to a uniform format, after which the necessary personally identifiable information (PII) is removed for general encryption. Next, the normalized and encrypted data can be linked and further modified for specific uses.
In real-world practice, the choice between matching methods may be more reliant on the availability, quality, and standardization of patient identifiers. Patient identifiers, especially those captured as part of routine clinical care, continue to evolve over time and awareness of these trends may be an important determinant in selecting RL methods23,24. Another key consideration for selection of matching methods is identifying a threshold for the probability of a match, which is driven by specific use cases. For example, certain use cases such as commercial analytics are frequently looking for as many matched individual records as possible. In that case, they would have more tolerance for false matches and would therefore be more willing to use a lower confidence match threshold. For other use cases, patient matching may be used to ascertain which health care profiles satisfy a specific set of criteria (e.g., a narrow research question, or clinical trial inclusion). In these instances, it may be a priority to avoid false positives and exclude them if they cannot be resolved with a high degree of confidence. Limiting data linkage to identified data significantly reduces, and in some cases eliminates, available data to answer important research questions or develop policy, or may not be allowed due to privacy or regulatory concerns. Linkage of records using de-identified data has emerged as a potential alternative to reliance on identified data linkage alone.
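The threshold trade-off described above can be made concrete with a small sketch. The scored pairs below are made-up toy values, not results from any study, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: list of (match_score, is_true_match) tuples.
    Returns (precision, recall) when pairs scoring >= threshold are accepted."""
    accepted = [is_match for score, is_match in scored_pairs if score >= threshold]
    true_pos = sum(accepted)
    total_true = sum(is_match for _, is_match in scored_pairs)
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / total_true if total_true else 1.0
    return precision, recall

# Toy candidate pairs (fabricated for illustration only).
toy = [(0.95, True), (0.90, True), (0.70, True),
       (0.65, False), (0.55, True), (0.40, False)]

print(precision_recall(toy, 0.5))  # lower threshold: more links, some false
print(precision_recall(toy, 0.8))  # higher threshold: fewer links, all true here
```

In this toy data the lower threshold recovers every true match at the cost of one false link, while the higher threshold avoids false links but misses half of the true matches, mirroring the commercial-analytics versus clinical-trial contrast above.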
Linking de-identified data
The 1996 Health Insurance Portability and Accountability Act (HIPAA) defines two methods for de-identifying health data, the “Safe Harbor” and “Expert Determination” methods. Using the “Safe Harbor” (SH) approach, de-identification involves removing all personal identifiers from a dataset (e.g., exact age, complete zip code)25 (Fig. 4b). This method can be limiting, preventing researchers from asking questions that require the information that was dropped during SH, such as full dates of medical service. However, it has proven useful in its simplicity by providing clear guidance on what constitutes personal identifiers and requiring their complete removal. In contrast, Expert Determination (ED) may allow for the preservation of the specific data elements removed under Safe Harbor that are important for addressing a maximum number of specific research questions, if it can be statistically shown that the dataset still has very low privacy risk. Thus, Expert Determination is more advantageous as it employs bespoke de-identification of datasets, only deleting what is necessary to ensure privacy while preserving the necessary variables for the completion of a given study.
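A rule-based de-identification step in the spirit of Safe Harbor might look like the sketch below. It is a simplified assumption covering only a handful of fields, not a compliance tool: the field names are hypothetical, the full identifier list is much longer, and additional conditions (for example, on ZIP code areas with small populations) are omitted.

```python
# Hypothetical field names; only a small subset of identifier types is shown.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "street_address"}

def safe_harbor_like(record: dict) -> dict:
    """Drop direct identifiers and coarsen ZIP code, service date, and very high ages."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:
        out["zip3"] = out.pop("zip")[:3]                   # keep first three digits
    if "service_date" in out:
        out["service_year"] = out.pop("service_date")[:4]  # keep year only
    if "age" in out and out["age"] >= 90:
        out["age"] = "90+"                                 # aggregate the oldest ages
    return out

record = {"name": "Ann Lee", "zip": "55455", "service_date": "2021-03-14",
          "age": 93, "dx": "E11.9"}
print(safe_harbor_like(record))
```

The sketch also shows the limitation noted above: once the full service date is reduced to a year, any analysis that needs exact dates is no longer possible.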
Under the Expert Determination route, HIPAA also allows for de-identification of PII via the use of cryptographic hash functions which uniquely represent the underlying PII, while being irreversible back to the original PII. This approach to data connectivity encompasses several different computational techniques, together known as Privacy Preserving Record Linkage (PPRL). This innovative method of encrypting data has created new opportunities for linkage of de-identified data across domains previously thought impossible due to concerns of sharing PII21,26. Compared with established linkage methods, PPRL performs well for a variety of use cases, including linkage of administrative, clinical, and public health data27,28,29,30. By balancing privacy and linked data utility, real-world use of PPRL has grown and today is used as the linkage method in multiple national clinical research networks including the All of Us Research Program31,32, the National Patient-Centered Clinical Research Network (PCORNET)33,34, National Clinical Cohort Collaborative (N3C)35, and others36,37.
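The hashing idea can be sketched with a keyed hash over normalized PII. This is a minimal illustration under stated assumptions, not the method of any particular PPRL product or research network: the choice of HMAC-SHA256, the fields combined, the demo key, and the name `make_token` are all hypothetical.

```python
import hashlib
import hmac

def make_token(first: str, last: str, dob: str, key: bytes) -> str:
    """Normalize the PII fields, then derive an irreversible keyed hash token."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

KEY = b"demo-key-not-for-production"  # assumed shared secret, illustration only

# Each data holder tokenizes locally and shares tokens plus non-identifying fields.
claims = {make_token("Ann", "Lee", "1980-02-29", KEY): {"claims_cost": 1200}}
ehr = {make_token(" ann", "LEE ", "1980-02-29", KEY): {"a1c": 6.8}}

# Records are linked wherever the same token appears in both datasets.
linked = {tok: {**claims[tok], **ehr[tok]} for tok in claims.keys() & ehr.keys()}
print(list(linked.values()))
```

Because the token is an exact hash, normalization before hashing matters: any formatting difference that survives normalization produces a different token and a missed link.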
PPRL gives researchers and others access to data sources that would otherwise be withheld to protect patient privacy. However, this technology is evolving, and the risk of exposing patient information by inadvertently permitting re-identification is being evaluated and minimized continuously.
Re-identification risk
The risk of inappropriate re-identification of an individual via their personal data is a major challenge in the world of data linkage and sharing. Strict technical and governance rules for protecting these patient records and privacy are essential. The governance of data collection, maintenance, and sharing stems from several federal laws in the US. The Privacy Act of 1974 establishes rules for the disclosure and use of PII contained in systems of records maintained by the federal government. The Act contains 12 exceptions, including one that broadly permits the “routine uses” of the protected data. Subsequently, the HIPAA Privacy Rule introduced detailed standards and rules for the disclosure and use of patients’ protected health information.
As more data, including de-identified data, are connected, these linked data may require additional redaction or obscuration, to ensure that the risk of re-identification remains very small. Expert Determination offers optimal de-identification but is significantly more nuanced as no specific set of rules for removing identifiers is applied and statistical modeling is necessary to ensure low risk of re-identification. Several groups are developing tools to automate components of ED statistical analysis38,39,40, which for many datasets are able to reduce the time required for ED so that processing times for ED and SH are comparable. This need to protect PII while pursuing RL has led to ongoing development of multiple tools designed to maintain patient privacy known as privacy-enhancing technologies41.
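One simple way to see why linked data may need further redaction is to count how many records share each combination of indirectly identifying fields. The sketch below uses this group-size check as an assumed, simplified proxy for re-identification risk; it is not the statistical modeling an Expert Determination would require, and the field names and rows are invented.

```python
from collections import Counter

def smallest_group(rows, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values in the dataset."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Invented rows for illustration only.
rows = [
    {"zip3": "554", "birth_year": 1980, "sex": "F"},
    {"zip3": "554", "birth_year": 1980, "sex": "F"},
    {"zip3": "554", "birth_year": 1951, "sex": "M"},
]

print(smallest_group(rows, ["zip3"]))                       # coarse fields: large groups
print(smallest_group(rows, ["zip3", "birth_year", "sex"]))  # more fields: a unique record
```

Each added field can only split groups further, so linking more datasets tends to shrink the smallest group, which is the intuition behind applying additional redaction after linkage.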
Cross-sector examples of data linkage, integration and sharing dependent on RL
Data linkage tools impact many different areas of healthcare decision making. In countries with centralized health data systems, such as the UK, RL has opened the door to a new level of patient-oriented research. The US is also currently making strides to collect and safely disseminate RWD (real world data) for healthcare purposes. Below are some examples of advances possible using RL.
Basic scientific research to understand disease biology
New developments in RL technology, together with disease-specific registries, have the potential to unlock new discoveries that dramatically impact patient care. A compelling example is a study that evaluated medical records from 10 million military service personnel using the Department of Defense (DoD) registry, which houses longitudinal patient data and tissue samples42. Infection with Epstein Barr Virus (EBV) was found to increase a patient’s risk of developing Multiple Sclerosis (MS) by ~31 fold. The mechanism by which EBV-infected B cells promote MS development was subsequently published, and these findings are now being used to develop novel therapeutic interventions for MS43. The DoD registry was uniquely positioned to facilitate this discovery because it provided both patient samples and longitudinal medical history. Buy-in from academic institutions and federal support for this linked approach could be applied outside the DoD framework to researchers and policy makers.
Clinical decision-making (using current therapies)
Longitudinal clinical studies that intend to capture the natural history of a specific disease are powerful tools. These clinical studies, linked to RWD sources including EHR and insurance claims, can help researchers identify risk factors and biomarkers of disease. The information contained in these studies can have a significant impact on clinical decision-making. A recent example emerged from the amyotrophic lateral sclerosis (ALS) community. A small percentage of ALS patients have a known genetic predisposition arising from familial mutations in the superoxide dismutase (SOD1) enzyme and several other genes44. These patients have been enrolling in the Pre-symptomatic Familial Amyotrophic Lateral Sclerosis (Pre-fALS) Study since April 2006. This landmark prospective observational study revealed increased serum levels of neurofilament light (NfL) as a key biomarker of phenoconversion to symptomatic ALS. Capturing family history and blood samples from Pre-fALS individuals with linked data from multiple healthcare institutions, the investigators determined that NfL levels are elevated in patients as early as 12 months prior to phenoconversion to active ALS. This information led to the design of a randomized, placebo-controlled, phase 3 trial called “A Study of BIIB067 (Tofersen) Initiated in Clinically Presymptomatic Adults With a Confirmed Superoxide Dismutase 1 Mutation (ATLAS)”45. In this study the investigators evaluate the efficacy of tofersen, an antisense oligonucleotide designed to reduce expression of SOD1. The ATLAS study may alter clinical decision-making for Pre-fALS patients in the next few years and shed light on mechanisms of ALS development in general. This example highlights the utility of large linked datasets in the clinical space.
Population health and policy
The COVID-19 pandemic exposed major gaps in population health care infrastructure and related data integration and sharing. Effective data sharing and linkage is essential for pandemic preparedness and response. During emergencies of this scale, it is critical that health care providers be able to accurately link test samples with the correct patients and feed information to regulatory and monitoring groups. Moreover, modern and effective data sharing infrastructure is necessary to communicate immunization plans, hospital bed availability, and mortality to the community. As a result of the knowledge gained during the recent pandemic, the US government, via the Centers for Disease Control and Prevention (CDC), is investing billions of dollars in an effort to enhance the coordination between public health and frontline care. Going forward, these CDC-driven initiatives will support innovative prediction and prevention plans. Specifically, contact tracing for infected persons and identification of high-risk areas are a high priority. Ideally, institutions and agencies would be able to quickly and accurately share travel manifests and outbreak detection to comprehensively mitigate the spread of new viruses and other diseases. During outbreaks, such as COVID-19, it’s also critical that patient data be aggregated, linked, and shared with researchers as quickly as possible to generate treatment options and identify risk factors. For example, the N3C COVID database, one of the largest aggregations of deidentified clinical data in the United States, was also one of the most successful applications of PPRL during the pandemic35,46.
If and when outbreaks do occur, this enhanced data linkage protocol will assist with case follow-up, and with necessary mitigation by gathering hospital and nursing home employment records for multi-employed individuals at risk of cross-institutional spread. Similarly, teacher and school/university employment data could be aggregated to calculate and mitigate risk of spread. In these cases, the technological capabilities around data linkage and sharing will likely guide public policy in this area. State and federal government agencies will need to play a regulatory role in establishing policies that protect communities while preserving individual privacy.
Health economics and outcomes research (HEOR)
HEOR-related studies strive to quantify and describe a patient’s engagement with commercial entities and the healthcare system. For example, a researcher might ask how many hours of clinical care and health-related services a diabetes patient consumes in one year and what the associated costs are. This line of investigation may reveal the factors that increase or decrease these costs. These studies open new areas of investigation into tools that can improve care and reduce costs. Such tools include new devices, technologies, and therapeutic interventions that require extensive testing to determine their safety, quality, and efficacy. These tools require comparison to previous iterations and validation to justify changes to commercial, policy, and clinical decision-making. For example, a medical device company may design a study to determine if a new glucose monitor is more accurate and has more benefits than a previous model. Specifically, it may wish to determine whether enhanced glucose monitoring results in less kidney dialysis or less diabetic neuropathy and surgery. HEOR researchers use linked RWD and medical records to answer these questions and follow a product post FDA approval to ascertain its effect on patient outcomes and the marketplace. There are several additional specific areas where RL and PPRL can support and enhance HEOR studies: cross-institutional collaboration, patient-centered outcomes research and economic modeling.
Cross-institutional collaboration is required because healthcare data is often siloed across different institutions and organizations. RL can facilitate cross-institutional collaboration by allowing researchers to securely link and analyze data from multiple healthcare providers, insurers, and research centers. One example that demonstrates the value of data linkage in HEOR comes from the United Kingdom. Researchers used medical records, prescriptions, and healthcare resource utilization (e.g., physician appointments) to distinguish patients with the epileptic disorder Lennox-Gastaut Syndrome (LGS) from generalized seizure patients without a specific diagnosis47. They were able to determine an accurate prevalence of LGS in the UK and create an algorithm capable of separating LGS patients from those with other seizure disorders. Ultimately, commercial enterprises measure their financial investment against potential profits based on the utility of the product and decide how to proceed. Similarly, government health agencies determine where to spend research funds based on disease incidence, as was done for LGS, and on the utility of available treatments.
Due to US government investment and priorities, many areas of HEOR have become focused on patient-centered outcomes, including patient preferences, quality of life, and treatment satisfaction. RL can facilitate the collection and linkage of patient-reported outcomes data, allowing researchers to better understand the patient experience and preferences while ensuring data privacy.
HEOR often involves economic modeling to estimate the cost-effectiveness of healthcare interventions. RL and PPRL can provide researchers with access to cost and resource utilization data from multiple sources, allowing for more accurate modeling and cost-effectiveness assessments.
The future of data sharing
The future of data sharing and RL will involve a balance between technological innovation, regulatory compliance, ethical considerations, and a heightened focus on data privacy and security. The key to responsibly applying RL to major research endeavors will be balancing the desire for valuable data-driven insights against the need to maintain the trust and privacy of the individuals whose data are being shared or linked, a task made more challenging at a time when trust is shrinking48.
Connecting data across sectors
The COVID-19 pandemic brought the importance of cooperation in research and data sharing to the forefront of science across many sectors. The National COVID Cohort Collaborative (N3C)49 and the COVID-19 Research Database50 are examples of recent data sharing ventures. This type of cooperation between commercial entities, academic institutions, and clinical operations is emblematic of the necessary infrastructure gains that promise progress across sectors. Many such collaborative data sharing ventures are ongoing and have evolved along with RL technology and privacy preservation techniques. However, the success of the approach in each population depends on how well bias is considered. Fifteen years ago, RL studies utilizing Medicaid claims and other databases revealed biases inherent in the datasets51. Underrepresentation of outcomes of high-risk pregnancies in lower income minority groups was observed. Moreover, biases against impoverished groups were found when using private insurance claims databases and, conversely, bias toward higher socio-economic class individuals was reported when performing linkages using Medicaid databases52,53,54. Medicaid specifically provides health care coverage for qualifying individuals with certain disadvantages, including low income and physical disability. As RL technology develops and regulatory agencies approach these studies, appropriate bias identification and prevention will be critical.
Health equity
In recent years the US healthcare industry has begun to emerge from a long history of race- and income-based care exclusion and deficits55,56,57,58. Numerous ethical and moral violations by the medical community, including forced sterilization of indigenous peoples59,60 and experimentation on Black men at Tuskegee58, resulted in fear and resentment among Black, Indigenous, and People of Color (BIPOC) populations. These events, followed by decades of discrimination, left both rural and BIPOC patients wary of clinical research. Thus, very few of these patients have been included in clinical trials, and these populations have therefore not benefited from trials to the same degree as wealthier white populations61,62,63. This discrimination has impacted many aspects of medical care, from the way we design tools and services to the way we run clinical trials. Current efforts, led by academic medical centers, the National Institutes of Health (NIH), and pharmaceutical companies, are aimed at reducing the barriers to care in all populations. Data linkage is one of the areas that can support equitable health care. For example, among cancer patients only 10% of individual tumors are sequenced, and the numbers are even lower for low-income patients and patients of color. Those who cannot access genomic testing or advanced clinical trials will never become aware of new treatments that could help them and their communities. Data linkage to identify gaps in population coverage and/or biases in existing research data can help promote awareness and guide engagement of underserved communities in clinical research.
There are multiple areas in which RL and emerging PPRL technology can reduce the burden of health disparities on underserved populations. RL allows for the integration of health care records with social determinants of health data, such as income, education, and housing status. By analyzing these linked data, healthcare organizations can identify high-risk populations that are more likely to experience health disparities. For example, they can pinpoint neighborhoods with a high prevalence of certain diseases or conditions. With RL, providers can tailor interventions for specific populations and offer targeted outreach and support to improve their health outcomes. RL can also be used to assess disparities in healthcare access. By linking data from different healthcare facilities and comparing patient demographics, researchers and policymakers can identify areas where particular population groups face barriers to healthcare services. This information can inform the allocation of resources to address these disparities. Additionally, healthcare organizations and government agencies can use RL to assess the quality of care provided to different populations and implement quality improvement initiatives targeted at reducing disparities in care delivery.
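The linkage of health care records with social determinants of health data described above can be illustrated with a minimal sketch. It assumes a simple token-based match, in which both data holders compute a keyed hash of normalized identifiers and only the tokens are compared. The records, the key, the field layout, and the function names (`token`, `link_and_summarize`) are all hypothetical and chosen for illustration; production PPRL tools use more robust tokenization and matching.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical shared key; a real deployment would manage keys securely.
SECRET_KEY = b"shared-secret-for-illustration-only"

def token(first, last, dob):
    """Keyed hash of normalized identifiers, so raw PII is never compared."""
    msg = f"{first.strip().lower()}|{last.strip().lower()}|{dob}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

# Made-up clinical records: (first, last, date of birth, has_condition)
clinical = [
    ("Ana", "Diaz", "1980-02-01", True),
    ("Ben", "Okoye ", "1975-07-12", False),
    ("Cara", "Lee", "1990-11-30", True),
    ("Dev", "Shah", "1968-05-05", False),
]
# Made-up social determinants records: (first, last, date of birth, income band)
sdoh = [
    ("ana", "diaz", "1980-02-01", "low"),
    ("BEN", "Okoye", "1975-07-12", "high"),
    ("Cara", "Lee", "1990-11-30", "low"),
]

def link_and_summarize(clinical, sdoh):
    """Link the two sources on tokens and report prevalence per income band."""
    income_by_token = {token(f, l, d): band for f, l, d, band in sdoh}
    counts = defaultdict(lambda: [0, 0])  # band -> [cases, linked patients]
    for f, l, d, has_condition in clinical:
        band = income_by_token.get(token(f, l, d))
        if band is None:
            continue  # unlinked record; in practice this may signal a coverage gap
        counts[band][0] += int(has_condition)
        counts[band][1] += 1
    return {band: cases / n for band, (cases, n) in counts.items()}

prevalence = link_and_summarize(clinical, sdoh)
```

With these toy records, three of the four clinical records link despite differences in capitalization and stray whitespace, and the fourth has no counterpart in the social determinants source. Tracking such unlinked records is one simple way to surface the population coverage gaps discussed in this section.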
Precision medicine
We now live in an era of widespread access to genomics tools and data. Companies such as Ancestry and 23andMe have made genomic testing widely available to consumers. Previously, genetic information pertaining to health risks was limited to those seeking genetic counseling or undergoing treatment for one of several diseases, including cancers and cancer predisposition syndromes. Genomics data has transformed cancer treatment in particular. Some treatment providers/research institutions, such as Memorial Sloan Kettering Cancer Center, perform genomic sequencing of all incoming individual patients to characterize actionable genetic alterations and identify drug treatments that would be appropriate for a given mutation. This type of personalized medicine has had a significant impact on cancer treatment and may be able to do so for other disease states in the future. Finding a way to safely utilize genomics data in combination with other data types is imperative for advances in rare disease research and other areas. Integrating genomic and other omics data with EHR, laboratory results, prescription medicine history, social determinants of health, economic information, and other RWD will open doors to therapeutic innovation across the health care system.
Advances in privacy-enhancing technology
One of the main challenges to successful data linkage and sharing is the risk of patient/consumer exposure arising from gaps in the management of re-identification risk and privacy. As data becomes more complex and richer in information, it will require better protections and risk estimation to prevent re-identification. Some industries are moving away from using personal identifiers (e.g., the financial sector using credit cards with chips). However, in biomedical research applications, retaining but protecting PII is preferable. Privacy must be a primary consideration in any new data endeavor. As a result, in addition to PPRL, new data privacy protection tools are emerging, including synthetic data, federated learning, multi-party computation, and secure data marketplaces.
In health research, synthetic data is artificially created data (often by computer algorithms) that statistically resembles real-world data but does not contain any real PII. The value of synthetic data is that it enables access to reliable and representative insights from sensitive data, while minimizing the risk to privacy and limiting regulatory requirements64. Researchers can access and share synthetic data in circumstances where sharing real data is too challenging, too expensive, or not permissible. For example, pharmaceutical companies wishing to share data from different countries with different regulatory obligations could overcome this obstacle with synthetic datasets, an efficient, cost-saving, and privacy-preserving approach.
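As a rough illustration of the idea, the sketch below generates synthetic records that share the per-column mean and standard deviation of a small real dataset. It is a deliberately simplified assumption rather than a description of any production generator: each column is sampled from an independent normal distribution, so correlations between columns are not preserved, and the example values and function names (`fit_marginals`, `sample_synthetic`) are hypothetical.

```python
import random
import statistics

def fit_marginals(rows):
    """Estimate the mean and standard deviation of each column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, one independent normal draw per column."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sd) for mu, sd in params) for _ in range(n)]

# Made-up "real" records: (age in years, body mass index)
real = [(54, 27.1), (61, 30.4), (47, 24.8), (70, 29.0), (58, 26.3)]

params = fit_marginals(real)
synthetic = sample_synthetic(params, 1000)
```

Practical synthetic data generators model joint distributions and are usually evaluated for both statistical fidelity and residual disclosure risk before release. The sketch only conveys the core idea of sharing samples from a fitted model rather than the underlying records.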
When linking multiple de-identified consumer/patient datasets, the risk of inappropriate re-identification is higher when the datasets are all housed on one server. Federated learning is a machine learning approach in which an algorithm is trained across multiple datasets held on decentralized servers65. Advantages of this technique include the ability to use linked data without necessarily exchanging or purchasing said data. Therefore, federated learning increases privacy protection while expanding data accessibility for linkage. However, federated learning does require participants to share machine learning model parameters and protocols, which leaves a small residual risk of privacy exposure.
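The training pattern described above can be sketched with federated averaging, one common federated learning scheme. The example assumes a one-parameter linear model and simulated site data; the function names (`local_update`, `federated_average`, `make_site`), learning rate, and number of rounds are illustrative choices, not part of any specific platform.

```python
import random

def local_update(w, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one site's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(site_datasets, rounds=50, w=0.0):
    """Federated averaging: only model parameters leave each site."""
    total = sum(len(d) for d in site_datasets)
    for _ in range(rounds):
        # Each site refines the current global parameter on its own records.
        local_ws = [local_update(w, d) for d in site_datasets]
        # The coordinator averages the parameters, weighted by site size.
        w = sum(lw * len(d) for lw, d in zip(local_ws, site_datasets)) / total
    return w

# Simulated data for three sites, generated from y = 3x plus small noise.
rng = random.Random(0)

def make_site(n):
    xs = [rng.uniform(0, 2) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

sites = [make_site(40), make_site(60), make_site(100)]
w_hat = federated_average(sites)  # estimate of the shared slope
```

The coordinator never sees an individual record, only each site's updated parameter. This is also where the residual risk noted above arises, since shared parameters can in principle leak information about the data used to train them.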
As with federated learning, multi-party computation allows users to link data held at remote and decentralized servers. However, multi-party computation is a distinct, cryptographic approach that only requires participants to share end results, and all inputs remain encrypted66. This distinction increases privacy protection in multi-party computation by keeping training sets and protocols isolated and secure, thus reducing the risk of re-identification.
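One building block often used in multi-party computation is additive secret sharing, sketched below for a secure sum across three data holders. This is a minimal illustration under stated assumptions: the modulus, the example counts, and the function names (`share`, `secure_sum`) are hypothetical, all parties are simulated in one process, and real protocols add authenticated channels and protections against dishonest participants.

```python
import secrets

PRIME = 2**61 - 1  # modulus for share arithmetic (illustrative choice)

def share(value, n_parties):
    """Split an integer into n random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values):
    """Compute the total without any party revealing its own value."""
    n = len(private_values)
    # Each party splits its value and sends one share to every party.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received; each share alone looks random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published and combined into the result.
    return sum(partial_sums) % PRIME

# Made-up patient counts held privately by three organizations.
counts = [120, 45, 310]
total = secure_sum(counts)
```

Each organization learns the combined count but not the individual contributions of the others, which mirrors the property described above that only end results are shared.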
One important advancement is the secure data marketplace, hosted within a platform, where data providers can offer their data while ensuring that privacy is protected through technologies such as PPRL and secure data sharing protocols.
By integrating some or all of these privacy-preserving technologies into existing data-sharing platforms, organizations and institutions can appropriately balance the needs of all stakeholders while maximizing data utility and maintaining data privacy. This approach will increase user confidence and, in turn, the number of people willing to contribute their data, while safeguarding their PII. Going forward, it will be critical to enhance the value of these platforms, help participating organizations comply with privacy regulations, and build trust with stakeholders. For example, there are ongoing efforts to establish successors to existing hospital/health care data management systems (e.g., HIEs) in the form of Health Data Utilities (HDUs) and vendor-specific data-sharing networks. Both HIEs and HDUs play important roles in the healthcare data ecosystem. However, HIEs primarily focus on the secure exchange of individual patient health records among healthcare providers, while HDUs, often governed at the state level, are broader platforms that handle a wide range of health-related data for various purposes, including research, analytics, and value-added services. Integrating advanced privacy-preserving technologies into HIEs will allow them to evolve into HDUs, enabling secure, compliant, and efficient data sharing in the healthcare industry while supplying the additional analytic and aggregation tools not otherwise found in HIEs. This advance would support medical research and innovation while ensuring that patient privacy and data protection remain paramount.
Conclusions
In recent years, real-world data sources and linkage have proliferated, bringing together diverse data types and generating insights an order of magnitude greater than narrower, single-domain analyses42. These data, including the diverse forms of RWD (such as health records, insurance claims, consumer preferences, and geolocation), inform research and decision-making in diverse fields of study and policy, as described above.
We predict that with advances in data linkage methods, data integration and sharing will reach into many more areas of investigation, healthcare, and policy making worldwide. Whereas many of the examples discussed here are relevant to the United States, many countries, including the United Kingdom, Australia, Brazil, and Canada, have a much longer history of successful data linkage efforts. The International Population Data Linkage Network (IPDLN) is an established and growing community that promotes cooperation and scholarship in data linkage methods and applications and sponsors a biennial meeting for members. Within IPDLN, successful exemplars of data linkage centers include the UK Biobank, the SAIL Databank, and the Australian Institute of Health and Welfare.
Data linkage tools will empower research, government policies, and commercial decision-making, and safeguards for their use should be pursued responsibly at both domestic and international levels. Therefore, privacy protection and the associated technologies will become even more essential process components. Privacy best practice standards and legislation should be debated and agreed upon by experts in the field and consistently applied in real-world use. Continued enhancement of data quality is critical for the use of large linked datasets and will require stakeholders to agree to consistent documentation standards. A thoughtful balance among the many technical, regulatory, and practical considerations for data linkage will enable the multiplicative impact of data linkage across previously disconnected fields.
References
Weber, G. M., Mandl, K. D. & Kohane, I. S. Finding the missing link for big biomedical data. JAMA 311, 2479–2480 (2014).
Stange, K. C. The problem of fragmentation and the need for integrative solutions. Ann. Fam. Med. 7, 100–103 (2009).
Cebul, R. D., Rebitzer, J. B., Taylor, L. J. & Votruba, M. E. Organizational fragmentation and care quality in the U.S healthcare system. J. Econ. Perspect. 22, 93–113 (2008).
Song, J. et al. Utilization of electronic health record data to evaluate the association of urban environment on systemic lupus erythematosus symptoms. Rheumatology (Oxford). https://doi.org/10.1093/rheumatology/keac647 (2022).
Walunas, T. L. et al. Disease outcomes and care fragmentation among patients with systemic lupus erythematosus. Arthritis Care Res. 69, 1369–1376 (2017).
Adler-Milstein, J., Bates, D. W. & Jha, A. K. A survey of health information exchange organizations in the United States: implications for meaningful use. Ann. Intern Med. 154, 666–671 (2011).
Krieger, N. The US Census and the People’s Health: public health engagement from enslavement and “indians not taxed” to census tracts and health equity (1790-2018). Am. J. Public Health 109, 1092–1100 (2019).
Lorkowski, J. & Pokorski, M. Medical records: a historical narrative. Biomedicines 10, 2594 (2022).
Camp, C. L. et al. Patient records at Mayo Clinic: lessons learned from the first 100 patients in Dr Henry S. Plummer’s dossier model. Mayo Clin. Proc. 83, 1396–1399 (2008).
Castellucci, M. Road to the Mayo Clinic: Plummer’s novel ideas transformed healthcare. Mod. Healthcare 46, H10–H12 (2016).
Dunn, H. L. Record linkage. Am. J. Public Health Nations Health 36, 1412–1416 (1946).
Fellegi, I. P. & Sunter, A. B. A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969).
Ruggles, S., Flood, S., Goeken, R., Schouweiler, M. & Sobek, M. IPUMS USA: Version 15.0 [dataset]. https://doi.org/10.18128/D010.V15.0 (IPUMS, Minneapolis, MN, 2023).
Mennemeyer, S. T., Menachemi, N., Rahurkar, S. & Ford, E. W. Impact of the HITECH Act on physicians’ adoption of electronic health records. J. Am. Med. Inform. Assoc. 23, 375–379 (2016).
Cohen, M. F. Impact of the HITECH financial incentives on EHR adoption in small, physician-owned practices. Int. J. Med. Inform. 94, 143–154 (2016).
Joseph, S., Sow, M., Furukawa, M. F., Posnack, S. & Chaffee, M. A. HITECH spurs EHR vendor competition and innovation, resulting in increased adoption. Am. J. Manag. Care 20, 734–740 (2014).
Szarfman, A. et al. Recommendations for achieving interoperable and shareable medical data in the USA. Commun. Med. 2, 86 (2022).
Wu, S. et al. Deep learning in clinical natural language processing: a methodical review. J. Am. Med. Inf. Assoc. 27, 457–470 (2020).
Ong, T. C., Duca, L. M., Kahn, M. G. & Crume, T. L. A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology. J. Am. Med. Inf. Assoc. 27, 505–513 (2020).
Joffe, E. et al. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J. Am. Med. Inf. Assoc. 21, 97–104 (2014).
Weber, S. C., Lowe, H., Das, A. & Ferris, T. A simple heuristic for blindfolded record linkage. J. Am. Med. Inf. Assoc. 19, e157–e161 (2012).
Grannis, S. J., Williams, J. L., Kasthuri, S., Murray, M. & Xu, H. Evaluation of real-world referential and probabilistic patient matching to advance patient identification strategy. J. Am. Med. Inf. Assoc. 29, 1409–1415 (2022).
Deng, Y. et al. Evolving availability and standardization of patient attributes for matching. Health Aff. Scholar 1, qxad047 (2023).
Culbertson, A. et al. The building blocks of inter-operability: a multisite analysis of patient demographic attributes available for matching. Appl. Clin. Inform. 08, 322–336 (2017).
Krzyzanowski, B. & Manson, S. M. Twenty years of the health insurance portability and accountability act safe harbor provision: unsolved challenges and ways forward. JMIR Med. Inf. 10, e37756 (2022).
Kum, H. C., Krishnamurthy, A., Machanavajjhala, A., Reiter, M. K. & Ahalt, S. Privacy preserving interactive record linkage (PPIRL). J. Am. Med. Inform. Assoc. 21, 212–220 (2014).
Mirel, L. B., Resnick, D. M., Aram, J. & Cox, C. S. A methodological assessment of privacy preserving record linkage using survey and administrative data. Stat. J. IAOS 38, 413–421 (2022).
Nguyen, L. et al. Privacy-preserving record linkage of deidentified records within a public health surveillance system: evaluation study. J. Med. Internet Res. 22, e16757 (2020).
Irvine, K. et al. Real world performance of privacy preserving record linkage. Int. J. Population Data Sci. 3. https://doi.org/10.23889/ijpds.v3i4.990 (2018).
Kho, A. N. et al. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J. Am. Med Inf. Assoc. 22, 1072–1080 (2015).
Kho, A. N. et al. in Machine Learning and Knowledge Discovery in Databases. (eds Peggy Cellier & Kurt Driessens) 79–87 (Springer International Publishing, 2022).
Yang, Y. et al. Ancillary Data Record Linkage to characterize the completeness of data for the All of Us Research Program. Int. J. Popul. Data Sci. 7. https://doi.org/10.23889/ijpds.v7i3.2090 (2022).
Marsolo, K. et al. Assessing the impact of privacy-preserving record linkage on record overlap and patient demographic and clinical characteristics in PCORnet(R), the National Patient-Centered Clinical Research Network. J. Am. Med Inf. Assoc. 30, 447–455 (2023).
Kiernan, D. et al. Establishing a framework for privacy-preserving record linkage among electronic health record and administrative claims databases within PCORnet(R), the National Patient-Centered Clinical Research Network. BMC Res Notes 15, 337 (2022).
Sidky, H. et al. Data quality considerations for evaluating COVID-19 treatments using real world data: learnings from the National COVID Cohort Collaborative (N3C). BMC Med. Res. Methodol. 23, 46 (2023).
Khurshid, A. et al. Social and health information platform: piloting a standards-based, digital platform linking social determinants of health data into clinical workflows for community-wide use. Appl. Clin. Inform. 14, 883–892 (2023).
Graham, R. J. et al. Real-world analysis of healthcare resource utilization by patients with X-linked myotubular myopathy (XLMTM) in the United States. Orphanet J. Rare Dis. 18, 138 (2023).
Benitez, K., Loukides, G. & Malin, B. Beyond safe harbor: automatic discovery of health information de-identification policy alternatives. IHI 2010, 163–172 (2010).
El Emam, K. et al. A globally optimal k-anonymity method for the de-identification of health data. J. Am. Med. Inf. Assoc. 16, 670–682 (2009).
Blackport, J., Moffatt, C., Symmers, P., Bayless, P. & Gray, J. Methods and systems for monitoring a risk of re‐identification in a de‐identified database. U.S. Patent No. 11,741,262 B2 (2023). Filed July 19, 2021; issued August 29, 2023.
Baker, D. B., Kaye, J. & Terry, S. F. Governance through privacy, fairness, and respect for individuals. EGEMS 4, 1207 (2016).
Bjornevik, K. et al. Longitudinal analysis reveals high prevalence of Epstein-Barr virus associated with multiple sclerosis. Science 375, 296–301 (2022).
Lanz, T. V. et al. Clonally expanded B cells in multiple sclerosis bind EBV EBNA1 and GlialCAM. Nature 603, 321–327 (2022).
Opie-Martin, S. et al. The SOD1-mediated ALS phenotype shows a decoupling between age of symptom onset and disease duration. Nat. Commun. 13, 6901 (2022).
Benatar, M. et al. Design of a randomized, Placebo-Controlled, Phase 3 trial of tofersen initiated in clinically presymptomatic SOD1 variant carriers: the ATLAS study. Neurotherapeutics 19, 1248–1258 (2022).
Afshar, M. et al. Creation of a data commons for substance misuse related health research through privacy-preserving patient record linkage between hospitals and state agencies. JAMIA Open 6, ooad092 (2023).
Chin, R. F. M., Pickrell, W. O., Guelfucci, F., Martin, M. & Holland, R. Prevalence, healthcare resource utilization and mortality of Lennox-Gastaut syndrome: retrospective linkage cohort study. Seizure 91, 159–166 (2021).
Pathak, A. et al. Privacy preserving record linkage for public health action: opportunities and challenges. J. Am. Med. Inform. Assoc. 31, 2605–2612 (2024).
Haendel, M. A. et al. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. J. Am. Med. Inf. Assoc. 28, 427–443 (2021).
Ando, W. et al. Impact of overlapping risks of type 2 diabetes and obesity on coronavirus disease severity in the United States. Sci. Rep. 11, 17968 (2021).
Bronstein, J. M. et al. Issues and biases in matching medicaid pregnancy episodes to vital records data: the Arkansas experience. Matern Child Health J. 13, 250–259 (2009).
Cole, J. A. et al. Bupropion in pregnancy and the prevalence of congenital malformations. Pharmacoepidemiol. Drug Saf. 16, 474–484 (2007).
Cole, J. A., Ephross, S. A., Cosmatos, I. S. & Walker, A. M. Paroxetine in the first trimester and the prevalence of congenital malformations. Pharmacoepidemiol. Drug Saf. 16, 1075–1085 (2007).
Grzeskowiak, L. E., Gilbert, A. L. & Morrison, J. L. Methodological challenges in using routinely collected health data to investigate long-term effects of medication use during pregnancy. Ther. Adv. Drug Saf. 4, 27–37 (2013).
Balan, N., Petrie, B. A. & Chen, K. T. Racial disparities in colorectal cancer care for black patients: barriers and solutions. Am. Surg. 88, 2823–2830 (2022).
Hwang, C. S. Black, incarcerated, and dying: reflections on racism and inequities in health care. Ann. Intern Med. 175, 1047–1048 (2022).
Lillard, J. W. Jr., Moses, K. A., Mahal, B. A. & George, D. J. Racial disparities in Black men with prostate cancer: a literature review. Cancer 128, 3787–3795 (2022).
Tobin, M. J. Fiftieth anniversary of uncovering the tuskegee syphilis study: the story and timeless lessons. Am. J. Respir. Crit. Care Med. 205, 1145–1158 (2022).
Jarrell, R. H. Native American women and forced sterilization, 1973-1976. Caduceus 8, 45–58 (1992).
Lawrence, J. The Indian Health Service and the sterilization of Native American women. Am. Indian Q 24, 400–419 (2000).
Swartz, T. H. & Titanji, B. Deconstruct racism in medicine - from training to clinical trials. Nature 583, 202 (2020).
Rai, T., Hinton, L., McManus, R. J. & Pope, C. What would it take to meaningfully attend to ethnicity and race in health research? Learning from a trial intervention development study. Socio. Health Illn. 44, 57–72 (2022).
Shah, S. J. & Essien, U. R. Equitable representation in clinical trials: looking beyond table 1. Circ. Cardiovasc. Qual. Outcomes 15, e008726 (2022).
Azizi, Z. et al. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11, e043497 (2021).
Dasaradharami Reddy, K. & Gadekallu, T. R. A comprehensive survey on federated learning techniques for healthcare informatics. Comput. Intell. Neurosci. 2023, 8393990 (2023).
van Egmond, M. B. et al. Privacy-preserving dataset combination and Lasso regression for healthcare predictions. BMC Med. Inf. Decis. Mak. 21, 266 (2021).
Author information
Contributions
T.S.K.E.M. led the paper writing effort with substantial effort and expertise from corresponding authors V.M. and A.N.K. J.L., V.L., and D.B.F. contributed specific expertise on privacy, regulation, and technology. All authors contributed to the revision and editing of the paper.
Ethics declarations
Competing interests
All authors are employed by or are external consultants at Datavant Inc., a commercial entity that produces data linkage technology.
Peer review
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Eisinger-Mathason, T.S.K., Leshin, J., Lahoti, V. et al. Data linkage multiplies research insights across diverse healthcare sectors. Commun Med 5, 58 (2025). https://doi.org/10.1038/s43856-025-00769-y
DOI: https://doi.org/10.1038/s43856-025-00769-y