Digitally linked records

How good are automated record linkage methods?

9/28/2017 feature story

To support construction of the massive LIFE-M longitudinal dataset, Martha Bailey, Catherine Massey, and Eytan Adar are analyzing how the most common automated linking algorithms perform and how they affect inferences across populations.

Martha J. Bailey

Project Information:

How Does Automated Record Linkage Affect Inferences about Population Health?

This project compares the performance of automated record-linking algorithms, with the goal of improving how well they work. The analysis grows out of a practical need: automated linking methods are required to complete the NSF-funded Longitudinal Intergenerational Family Electronic Micro-dataset (LIFE-M), which will link millions of US vital records to historical decennial census records to create an extensive longitudinal dataset covering individuals born in the US from 1880 to 1930. The project will produce systematic evidence on how the most popular automated linking methods perform in terms of match rates, representativeness of the underlying population, erroneous match rates, and systematic measurement error. It will also examine how phonetic name-cleaning methods affect match quality. Significantly, the project will analyze how match quality metrics vary across underrepresented subgroups, including women, racial/ethnic minorities, and immigrants, to determine how specific linking methods could differentially affect inferences for different populations. Finally, the project will formulate recommended practices for researchers based on the findings.
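For readers unfamiliar with these terms, the following is a minimal Python sketch of two ideas mentioned above: phonetic name cleaning and match-quality metrics broken out by subgroup. It is illustrative only and is not the project's code. It assumes American Soundex as the phonetic method (the summary does not say which methods are studied), it assumes that "match rate" means the share of records that receive a link and that "erroneous match rate" means the share of links that are wrong, and the function names and the toy records are hypothetical.

```python
from collections import defaultdict

# American Soundex digit for each consonant (assumed phonetic method).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def quality_by_group(records):
    """Compute assumed match-quality metrics for each subgroup.

    records: iterable of (group, linked, correct) tuples, where `correct`
    is only meaningful when `linked` is True.
    """
    totals = defaultdict(lambda: {"n": 0, "linked": 0, "wrong": 0})
    for group, linked, correct in records:
        t = totals[group]
        t["n"] += 1
        if linked:
            t["linked"] += 1
            if not correct:
                t["wrong"] += 1
    return {
        g: {
            "match_rate": t["linked"] / t["n"],
            "erroneous_match_rate": (t["wrong"] / t["linked"]
                                     if t["linked"] else None),
        }
        for g, t in totals.items()
    }


# Hypothetical toy records, invented purely for illustration.
toy = [
    ("men", True, True), ("men", True, True),
    ("men", True, False), ("men", False, False),
    ("women", True, True), ("women", True, False),
    ("women", False, False), ("women", False, False),
]
metrics = quality_by_group(toy)
```

In this sketch, spelling variants such as "Smith" and "Smyth" receive the same code and would therefore be treated as a candidate name match. The same coarsening that recovers true matches can also merge different people, which is why comparing both rates across subgroups matters.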

Martha J. Bailey, Eytan Adar
