
PubMed GPT: A domain-specific large language model for biomedical texts

"We are excited to release a new biomedical model trained on PubMed as the first step in building a foundational model that can support biomedical research." - Percy Liang, Director of CRFM

The Stanford Center for Research on Foundation Models (CRFM) and MosaicML have jointly developed PubMed GPT, a large language model trained to interpret biomedical language.


Large language models (LLMs) are now widely used for natural language, image, and speech synthesis, yet few applications in specific industries are well known. PubMed GPT demonstrates what an industry-specific large language model can achieve, in this case in the biomedical field. Using the MosaicML Cloud platform, CRFM researchers trained a generative pre-trained transformer (GPT) on biomedical text from PubMed. The results suggest that domain-specific language generation models have promising prospects in practical applications, and that such LLMs can be both performant and competitive. Note: this model is currently intended for research and development only and is not suitable for production use.

PubMed GPT

Model. PubMed GPT 2.7B is based on the HuggingFace GPT model, with 2.7B parameters and a maximum context length of 1,024 tokens. The design is kept as simple as possible to demonstrate the power of existing LLM training methods.
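As a rough illustration, the sketch below builds a GPT-2-style model of this size with the HuggingFace transformers library. The hidden size, layer count, and head count are assumed values that come out to roughly 2.7B parameters; they are not taken from the released configuration.

    from transformers import GPT2Config, GPT2LMHeadModel

    # Illustrative hyperparameters for a ~2.7B-parameter, GPT-2-style model
    # with a 1,024-token context window (assumed values, not the official config).
    config = GPT2Config(
        vocab_size=50257,   # standard GPT-2 BPE vocabulary
        n_positions=1024,   # maximum context length in tokens
        n_embd=2560,        # hidden size
        n_layer=32,         # transformer blocks
        n_head=20,          # attention heads
    )

    # Instantiating the full model needs roughly 10 GB of RAM in fp32.
    model = GPT2LMHeadModel(config)
    print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")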

Data. The training data is drawn from two subsets of the Pile dataset: PubMed Abstracts and PubMed Central.
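As an illustration, the Pile distributes its documents as JSON lines tagged with a pile_set_name field, so the two biomedical subsets could be pulled out with a filter like the one below; the shard filename is a placeholder.

    import json

    # Keep only the two biomedical subsets of the Pile (sketch only;
    # "pile_shard.jsonl" is a placeholder for a real Pile shard file).
    KEEP = {"PubMed Abstracts", "PubMed Central"}

    def iter_pubmed_docs(path):
        """Yield the text of documents belonging to the PubMed subsets."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                if doc.get("meta", {}).get("pile_set_name") in KEEP:
                    yield doc["text"]

    if __name__ == "__main__":
        n = sum(1 for _ in iter_pubmed_docs("pile_shard.jsonl"))
        print(f"{n} PubMed documents in this shard")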

Compute. The developers chose to train PubMed GPT by making multiple passes over the roughly 50B tokens of available data to reach a larger compute budget of 300B tokens. The results show that an excellent LLM can still be trained when data is limited.
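Concretely, the stated figures imply roughly six passes over the corpus, as in this back-of-the-envelope calculation:

    # Back-of-the-envelope: the compute budget implies multiple epochs.
    dataset_tokens = 50e9      # ~50B tokens of PubMed text available
    training_budget = 300e9    # total tokens processed during training

    print(f"~{training_budget / dataset_tokens:.0f} passes over the dataset")  # ~6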

MosaicML Cloud Platform

MosaicML Cloud. Using the MosaicML Cloud software stack, the developers trained PubMed GPT on a cluster of 128 NVIDIA A100-40GB GPUs with 1,600 Gb/s of network bandwidth between nodes, for a total training time of about 6.25 days.
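From the quoted training budget, cluster size, and wall-clock time, one can derive the rough per-GPU throughput the run implies; this is a back-of-the-envelope estimate rather than a figure reported by the developers.

    # Implied per-GPU throughput, derived only from the numbers quoted above.
    total_tokens = 300e9    # training budget in tokens
    num_gpus = 128          # NVIDIA A100-40GB GPUs
    days = 6.25             # approximate wall-clock training time

    tokens_per_gpu_per_second = total_tokens / (num_gpus * days * 24 * 3600)
    print(f"~{tokens_per_gpu_per_second:,.0f} tokens per GPU per second")  # ~4,300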

Composer Library. For its efficiency and flexibility, the developers used MosaicML's open-source Composer library, together with its FSDP integration, to train the model.
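The snippet below is a minimal, hypothetical sketch of what a Composer training loop with FSDP can look like: the dataset is a random-token placeholder, the model is deliberately tiny, and argument names such as fsdp_config depend on the installed Composer version, so it should be read as an outline rather than the developers' actual training script.

    import torch
    from torch.utils.data import DataLoader, Dataset
    from composer import Trainer
    from composer.models import HuggingFaceModel
    from transformers import GPT2Config, GPT2LMHeadModel

    class DummyTokenDataset(Dataset):
        """Random token ids standing in for tokenized PubMed text (placeholder)."""
        def __len__(self):
            return 64
        def __getitem__(self, idx):
            ids = torch.randint(0, 50257, (1024,))
            return {"input_ids": ids, "labels": ids.clone()}

    # Wrap a (deliberately tiny) HuggingFace GPT-2-style model for Composer.
    hf_model = GPT2LMHeadModel(GPT2Config(n_positions=1024, n_embd=256, n_layer=4, n_head=4))
    model = HuggingFaceModel(hf_model)

    trainer = Trainer(
        model=model,
        train_dataloader=DataLoader(DummyTokenDataset(), batch_size=4),
        max_duration="1ep",  # the real run budgets ~300B tokens rather than one epoch
        optimizers=torch.optim.AdamW(model.parameters(), lr=1e-4),
        # Shard parameters, gradients, and optimizer state across GPUs via FSDP.
        # Requires a multi-GPU launch; omit this argument on a single device.
        fsdp_config={"sharding_strategy": "FULL_SHARD"},
    )
    trainer.fit()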

Streaming datasets. To manage a custom training dataset quickly, flexibly, and cheaply, the developers used MosaicML's new StreamingDataset library to handle the roughly 100GB of text training data.
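The following is a hypothetical sketch of that workflow using the streaming package; the paths are placeholders, and in a real run the shards would sit in object storage and be referenced through the remote argument.

    from torch.utils.data import DataLoader
    from streaming import MDSWriter, StreamingDataset

    # 1) Write documents into the MDS shard format. A local path is used so the
    #    sketch runs without cloud credentials; in practice the output would go
    #    to object storage (e.g. an s3:// bucket).
    samples = [{"text": "Example PubMed abstract ..."}, {"text": "Example PMC article ..."}]
    with MDSWriter(out="/tmp/pubmed-mds", columns={"text": "str"}) as writer:
        for sample in samples:
            writer.write(sample)

    # 2) Stream the shards back at training time through a normal DataLoader.
    dataset = StreamingDataset(local="/tmp/pubmed-mds", shuffle=True)
    loader = DataLoader(dataset, batch_size=2)
    for batch in loader:
        print(batch["text"])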

Evaluation

The developers evaluated PubMed GPT on several question-answering benchmarks. One example is a medical question summarization benchmark:


The task is to take a patient's query, which may contain ambiguities, misspellings, and extraneous details, and restate it in a clear, correctly phrased form for a doctor.
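As a hedged illustration of the mechanics only, the snippet below loads a checkpoint and generates a summary from a prompt. The model id stanford-crfm/pubmedgpt and the prompt wording are assumptions, and in practice the model would first be fine-tuned on question-summarization data rather than prompted zero-shot.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed model id and prompt format; not the exact benchmark setup.
    model_id = "stanford-crfm/pubmedgpt"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    patient_query = (
        "i have realy bad headaches for 2 weeks and my vision gets blurry "
        "somtimes, shuld i be worried??"
    )
    prompt = f"Patient question: {patient_query}\nSummarized question for the doctor:"

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))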


The developers also compared the results with five other models: DRAGON, GPT-Neo 2.7B, Galactica, BioLinkBERT, and PubMedBERT. The comparison showed that:

1. LLMs are highly versatile: trained from scratch on a specific domain, they can match the performance of systems expertly designed for that domain;

2. Pre-training on domain-specific data outperforms pre-training on general-purpose data;

3. Focused models can achieve high-quality results with fewer resources.

Summary

The results of PubMed GPT are only a first step for research on biomedical text and other domains, and more researchers are needed to build more advanced systems on top of them. For now the model is a proof of concept; the ultimate hope is for trustworthy, interactive AI systems that support reliable interactions while keeping human experts in the loop.

Resources

https://www.mosaicml.com/blog/introducing-pubmed-gpt
