Creating a Data Quality Control Framework for Producing New Personnel-Based S&E Indicators
We will develop an Automated and Stratified Entity Disambiguation (ASED) framework to resolve name ambiguity in large bibliographic data. First, we will increase disambiguation accuracy by combining stratified segmentation of entity instances with supervised machine learning trained on automatically labeled data. Second, we will demonstrate the value of disambiguated data at scale by examining the involvement of U.S. science and engineering (S&E) researchers in international collaboration and citation networks across the entire Web of Science corpus. We propose counterfactual analyses and impact simulations that compare model validity and research findings obtained from the same data disambiguated with different methods. The proposed approach to disambiguating names and estimating the impact of ambiguity will contribute to sociology and management research, which seeks to understand from ambiguous data what makes scientists and nations innovative and productive, and to computer and information science, by improving entity disambiguation and unstructured record linkage. The tools will be shared for reuse and improvement by scholars and integrated into a data and code platform open to the research community, supporting rigorous knowledge discovery from promising but messy S&E data.
National Science Foundation
Funding Period: 9/1/2019 to 8/31/2021