NIST PREP Opportunity in Machine Learning for Identifying Datasets in Research Papers

The National Institute of Standards and Technology (NIST) Information Technology Laboratory (ITL) has prepared the following project description for recruitment within the NIST Professional Research Experience Program (PREP). Placement priority will be based on the best match between student and project, rather than on the project alone.

For questions regarding NIST ITL recruitment through PREP, please contact:
     Mark Przybocki (mark.przybocki@nist.gov); (301) 337-0767

For questions regarding the individual project descriptions, please contact the identified mentor.

For this PREP opportunity we are seeking students who are U.S. citizens.
Students interested in being considered for a NIST PREP placement in the Information Technology Laboratory should send the following items to Mark Przybocki (mark.przybocki@nist.gov):

  1. Resume/CV that highlights your coursework, experience and interests
  2. Identification of the project(s) for which you wish to be considered
  3. A letter of recommendation from a current Morgan State University faculty member

Project Title: Machine Learning for Identifying Datasets in Research Papers

Position: Graduate Student (MS or PhD)

Mentor: Ian Soboroff (ian.soboroff@nist.gov)

Project Description:

Researchers in AI, information retrieval, natural language processing, and computer vision use datasets to build models, conduct experiments, and measure the effectiveness of their algorithms. The research papers they write refer to the dataset(s) used, but often obliquely, in ways that are familiar to the community but may be mysterious to an outside reader. Most researchers use a dataset in a typical fashion, but others subset it or use only parts of it. NIST is investigating using machine learning to build models that can identify the datasets used in research papers, so that comprehensive catalogs of dataset usage can support use of the datasets themselves.

This project envisions the student: extending a current dataset of research papers, which indicates whether each paper uses TREC collections (trec.nist.gov), with richer labels that indicate which specific datasets are used; building one or more machine learning models that predict whether a paper uses a TREC dataset and, if so, which one; measuring the performance of those models; and developing an interactive tool to help staff identify papers that use NIST datasets.
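
To give applicants a feel for the modeling and measurement steps described above, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's actual method: the sentences in TOY_EXAMPLES are invented for this example and are not drawn from the existing labeled dataset, the function names are hypothetical, and the classifier is a small hand-written logistic regression over bag-of-words counts. It covers only the binary question of whether a passage mentions a TREC collection, not which collection is used.

```python
import math
import re
from collections import Counter

# Invented toy sentences for illustration only; NOT real labeled data.
# Label 1 = mentions a TREC collection, 0 = does not.
TOY_EXAMPLES = [
    ("we evaluate on the trec robust04 collection", 1),
    ("experiments use trec deep learning track topics", 1),
    ("runs were scored with trec relevance judgments", 1),
    ("we train on imagenet and report top-1 accuracy", 0),
    ("results are reported on the squad benchmark", 0),
    ("the model is evaluated on cifar-10 images", 0),
]


def featurize(text):
    """Bag-of-words counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def predict_proba(weights, bias, text):
    """Probability that the text mentions a TREC collection."""
    feats = featurize(text)
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict_proba(weights, bias, text) - label
            for tok, n in featurize(text).items():
                weights[tok] = weights.get(tok, 0.0) - lr * error * n
            bias -= lr * error
    return weights, bias


def precision_recall(gold, predicted):
    """Precision and recall for the positive (label 1) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


weights, bias = train(TOY_EXAMPLES)
gold = [label for _, label in TOY_EXAMPLES]
predicted = [int(predict_proba(weights, bias, text) >= 0.5)
             for text, _ in TOY_EXAMPLES]
print(precision_recall(gold, predicted))
```

Scoring on the training sentences, as this sketch does, only confirms that the model fits the toy data. A meaningful performance measurement would use held-out papers, and identifying which specific dataset a paper uses would call for a multi-class or sequence-labeling approach.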

Desired Candidate Qualifications:

  • Familiarity with Python and with classical machine learning algorithms like logistic regression would be helpful