AI Data Curation Models and Tools
Data Model Curator – The overall goal of this project is to create AI tools that assist with data curation and harmonization, which are common bottlenecks in data-sharing projects. This particular model is a fine-tuned LoRA layer based on Llama-3.1-8B, optimized for the specific task of generating structured data models in JSON format from a dump of tabular data files.
- Generation of Synthetic Data Models and Contributions
- Creation of Serialized File From AI Model Output
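The data-model generation step described above can be sketched as follows. This is a hedged illustration, not the project's actual code: the prompt format, the `entities` key, and the helper names are assumptions, and the model reply is a stand-in for what the fine-tuned LoRA model would return.

```python
import csv
import io
import json

def build_prompt(tables: dict[str, str]) -> str:
    """Summarize each tabular file's columns into a prompt for the model.
    (Illustrative prompt format, not the project's actual one.)"""
    lines = ["Generate a JSON data model for the following tables:"]
    for name, raw in tables.items():
        header = next(csv.reader(io.StringIO(raw)))
        lines.append(f"- {name}: columns {', '.join(header)}")
    return "\n".join(lines)

def parse_model_output(reply: str) -> dict:
    """Parse and minimally check the model's JSON output."""
    model = json.loads(reply)
    assert "entities" in model, "expected an 'entities' key"
    return model

tables = {"samples.csv": "sample_id,subject_id,tissue\nS1,P1,blood\n"}
prompt = build_prompt(tables)

# Stand-in for the fine-tuned model's reply; a real run would call the model:
reply = ('{"entities": [{"name": "sample", '
         '"attributes": ["sample_id", "subject_id", "tissue"]}]}')
data_model = parse_model_output(reply)
print(json.dumps(data_model, indent=2))
```

The point of the structural check is that a serialized file can only be created from model output that actually parses as a valid data model.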
Data Mesh Automation Tools
For a data commons (or AI commons) to join a data mesh automatically, there must be a machine-readable configuration format that tells the mesh everything it needs to know about the node. These machine-readable configurations, known as node cards, include authentication and authorization information, data APIs, APIs for the analysis environment, and so on. Likewise, a data mesh must have a mesh card: a configuration format that tells nodes and other meshes details about the mesh and how they can join it.
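As a sketch of what a node card carries and how a mesh might check one on join, consider the following. The field names here are assumptions chosen to match the categories named above (authentication/authorization, data APIs, analysis-environment APIs); the real schema is in the linked repository.

```python
import json

# Hypothetical node card; field names are illustrative, not the published schema.
node_card = {
    "node_id": "example-commons",
    "auth": {"type": "oidc", "issuer": "https://auth.example.org"},
    "data_apis": [{"name": "query", "url": "https://example.org/api/query"}],
    "analysis_apis": [{"name": "workspace", "url": "https://example.org/workspace"}],
}

REQUIRED_KEYS = {"node_id", "auth", "data_apis"}

def validate_node_card(card: dict) -> bool:
    """A mesh could run a structural check like this before admitting a node."""
    return REQUIRED_KEYS.issubset(card)

print(validate_node_card(node_card))
print(json.dumps(node_card, indent=2))
```

A mesh card would follow the same pattern, describing the mesh and its join procedure instead of a single node.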
We demonstrate the schema for the node and mesh cards, APIs, and deployment at the links below:
- GitHub
AI Tools for Querying Data and Creating Cohorts
GDC Cohort Copilot
The NCI Genomic Data Commons (GDC) provides access to high-quality, harmonized cancer genomics data. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may find it easier to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally served, open-source GDC Cohort LLM generates GDC cohorts better than prompting GPT-4o.
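To make the natural-language-to-cohort step concrete, the sketch below shows a filter in the nested JSON format that GDC cohorts use, together with a minimal structural check. The user request and the specific filter are illustrative; they are not output from GDC Cohort Copilot itself.

```python
# User request (illustrative): "TCGA breast cancer cases with RNA-Seq data"
# A copilot model would translate this into a GDC-style filter tree:
llm_generated_filter = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "files.experimental_strategy",
                                 "value": ["RNA-Seq"]}},
    ],
}

def is_valid_filter(node: dict) -> bool:
    """Minimal recursive check on a GDC-style filter tree:
    boolean nodes nest children; leaf nodes carry a field and value."""
    if node["op"] in ("and", "or"):
        return all(is_valid_filter(child) for child in node["content"])
    return {"field", "value"} <= set(node["content"])

print(is_valid_filter(llm_generated_filter))
```

Validating generated filters before handing them to the Cohort Builder is one way a copilot can catch malformed model output instead of surfacing it to the user.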
Query augmented generation from a data commons
Query augmented generation (QAG) is an architecture for integrating LLMs with external data sources such as databases, knowledge bases, and data commons. GDC-QAG is an application of QAG to the NCI Genomic Data Commons (GDC), used to obtain accurate LLM responses to queries centered around frequencies of simple somatic mutations, copy number variants, microsatellite instability status, and combinations of variants. GDC-QAG is containerized using Gradio and deployed as a model context protocol (MCP) server on Hugging Face.
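The core QAG pattern can be sketched as follows: rather than asking the LLM to recall a mutation frequency from its training data, the application issues a structured query to the data commons and composes the answer from the returned counts. This is a hedged sketch of the pattern, not GDC-QAG's implementation; the query function is a stub with made-up counts, where a real deployment would call the GDC API.

```python
def query_mutation_counts(gene: str, project: str) -> tuple[int, int]:
    """Stub returning (cases with a simple somatic mutation in `gene`,
    total cases in `project`). Numbers are illustrative, not real GDC data."""
    fake_results = {("TP53", "TCGA-BRCA"): (350, 1098)}
    return fake_results[(gene, project)]

def answer_frequency_query(gene: str, project: str) -> str:
    """Compose an answer from queried counts instead of LLM recall."""
    mutated, total = query_mutation_counts(gene, project)
    freq = 100 * mutated / total
    return (f"In {project}, {mutated} of {total} cases "
            f"({freq:.1f}%) carry a simple somatic mutation in {gene}.")

print(answer_frequency_query("TP53", "TCGA-BRCA"))
```

Grounding the numeric claim in a live query is what lets the architecture return accurate frequencies for mutation classes and combinations the LLM alone would likely hallucinate.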