Luís A. Nunes Amaral

co-Director, Northwestern Institute on Complex Systems
Professor of Chemical & Biological Engineering
Professor of Physics & Astronomy (by courtesy)
Professor of Medicine (by courtesy)

Chemical & Biological Engineering
2145 Sheridan Road (Room E136)
EvanstonIL 60208US
Phone: (847) 491-7850

Abstract

Understanding “how to optimize the production of scientific knowledge” is paramount to those who support scientific research—funders as well as research institutions—to the communities served, and to researchers. Structured archives can help all involved to learn what decisions and processes help or hinder the production of new knowledge. Using artificial intelligence (AI) and large language models (LLMs), we recently created the first structured digital representation of the historic archives of the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health. This work yielded a digital knowledge base of entities, topics, and documents that can be used to probe the inner workings of the Human Genome Project, a massive international public-private effort to sequence the human genome, and several of its offshoots like The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE). The resulting knowledge base will be instrumental in understanding not only how the Human Genome Project and genomics research developed collaboratively, but also how scientific goals come to be formulated and evolve. Given the diverse and rich data used in this project, we evaluated the ethical implications of employing AI and LLMs to process and analyze this valuable archive. As the first computational investigation of the internal archives of a massive collaborative project with multiple funders and institutions, this study will inform future efforts to conduct similar investigations while also considering and minimizing ethical challenges. Our methodology and risk-mitigating measures could also inform future initiatives in developing standards for project planning, policymaking, enhancing transparency, and ensuring ethical utilization of artificial intelligence technologies and large language models in archive exploration.