Background:
Using standard vocabularies in cancer data repositories is an important way to preserve the long-term scientific value of cancer research data. However, not all studies record their data using standards, and those that do may not use the same ones. A key problem in research data management and distribution is therefore how to translate (or map) terms used in one standard to synonymous or similar terms in another. This tends to be an expensive, largely manual process, supported by computational tools that rely on some form of string matching. Large language models (LLMs) have the potential to automate the term mapping process and to do so with high accuracy. Several methods, with corresponding software, for using LLMs to map terms have been proposed in the literature and by our collaborators.
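To make the string-matching baseline concrete, the following is a minimal sketch of how such a tool might rank candidate mappings. It assumes Python's standard-library `difflib`; the function name `suggest_mappings`, its parameters, and the small target vocabulary are hypothetical and chosen only for illustration, not taken from any existing tool or standard.

```python
from difflib import SequenceMatcher


def suggest_mappings(source_term, target_terms, top_n=3, min_score=0.5):
    """Rank target-vocabulary terms by character-level similarity to source_term.

    Returns up to top_n (term, score) pairs with score >= min_score,
    best match first. Scores range from 0.0 to 1.0.
    """
    def score(candidate):
        # Case-insensitive similarity ratio between the two strings.
        return SequenceMatcher(None, source_term.lower(), candidate.lower()).ratio()

    scored = [(term, score(term)) for term in target_terms]
    kept = [(term, s) for term, s in scored if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical target vocabulary, for illustration only.
target_vocabulary = [
    "Lung Carcinoma",
    "Breast Carcinoma",
    "Melanoma",
    "Malignant Neoplasm of Lung",
]

candidates = suggest_mappings("lung carcinoma", target_vocabulary)
no_match = suggest_mappings("zzzz", target_vocabulary)
```

A matcher of this kind handles differences in case or spelling well, but it compares characters rather than meaning, so true synonyms that share few characters can score poorly. That gap is what LLM-based mapping methods aim to close.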
Project Description:
At a high level, we envision a project that implements various LLM-based mapping methods, along with a user interface built with data subject matter experts (SMEs) in mind. The concept is along the lines of a toolkit: an intuitive frontend lets SMEs work with incoming data, attempt mappings to a selection of standards, evaluate the resulting mappings, and select those that make sense for future use.
- Fall mentor time: Thursday: 4:30 PM Eastern
- Fall lab time: Tuesday: 3:30 PM Eastern
- Industry: Biotechnology
- Topics: llms, ui/ux
- Requirements: Open to all students