Job Description
Large Language Models (LLMs) are widely adopted and deployed machine learning models that can address a multitude of advanced text mining tasks at scale and in production. LLMs excel across a wide range of tasks not only because of their huge training datasets and unprecedented number of model parameters, but also because of their ability to memorize large amounts of information contained in the training data. However, in the majority of cases, LLM vendors rarely disclose details of the data used during the pre-training phase of the models, or during their fine-tuning for specific tasks. Taken together, these two facts (that LLMs memorize large portions of their training sets, and that the composition of those training sets is rarely disclosed) lead to the conclusion that the data included in LLMs may be problematic: it might be copyrighted, and/or carry a license that would not allow the LLM to be used for specific applications, e.g., commercial ones.
This necessitates the design and development of technical approaches that, given a text excerpt and black-box access to an LLM, can determine whether the model was trained on, and has memorized, the provided text. The scientific literature contains at least two such approaches, with varying degrees of success, advantages, and limitations: the first uses Cloze queries on a given dataset to determine the degree to which that dataset was used in the training of an LLM [1]; the second, Min-K% Prob, is statistical: it obtains the probability of each token, selects the k% of tokens with the lowest probabilities, and computes their average log likelihood; if the average log likelihood is high, the text is likely in the training data of the LLM [2]. In this project the intern will be given a corpus of Elsevier scientific articles as text input, and will be required to implement and execute the two approaches on the given corpus. In addition, the intern will be asked to design and implement extensions that improve either of the two methodologies and provide better estimates of the contamination of a dataset in the training of an LLM.
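As a rough illustration of the second approach, the Min-K% Prob score can be computed from the per-token log probabilities that the target LLM assigns to the candidate text. The sketch below assumes those log probabilities have already been obtained (the API for extracting them is model-specific and not specified here); the function name and example values are hypothetical.

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Average log likelihood of the k% lowest-probability tokens.

    token_logprobs: per-token log probabilities of the candidate text
    under the target LLM (obtaining these is model-specific).
    A higher (less negative) score suggests the text is more likely
    to have been part of the model's training data.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Number of tokens in the bottom k% (at least one).
    n = max(1, int(len(token_logprobs) * k))
    # Select the n tokens with the lowest log probability.
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log probabilities for a short text:
score = min_k_percent_score([-0.1, -0.2, -5.0, -0.3, -4.0], k=0.4)
# With k=0.4 over 5 tokens, the two lowest values (-5.0, -4.0) are averaged.
```

In practice, scores for known-member and known-non-member texts would be compared, with a threshold calibrated on a held-out set.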
References:
[1] Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4”, 2023
[2] Shi et al., “Detecting Pretraining Data from Large Language Models”, 2024
-----------------------------------------------------------------------
Elsevier is an equal opportunity employer: qualified applicants are considered for and treated during employment without regard to race, color, creed, religion, sex, national origin, citizenship status, disability status, protected veteran status, age, marital status, sexual orientation, gender identity, genetic information, or any other characteristic protected by law. We are committed to providing a fair and accessible hiring process. If you have a disability or other need that requires accommodation or adjustment, please let us know by completing our Applicant Request Support Form: https://forms.office.com/r/eVgFxjLmAK, or contact 1-855-833-5120.
Please read our Candidate Privacy Policy.