DrBERT

DrBERT is a state-of-the-art language model for the French biomedical domain, based on the RoBERTa architecture and pretrained on NACHOS, a French biomedical corpus.

DrBERT was assessed on 11 distinct practical biomedical tasks in French, including named entity recognition (NER), part-of-speech (POS) tagging, binary/multi-class/multi-label classification, and multiple-choice question answering. The results show that DrBERT improves performance on most tasks compared to prior approaches, indicating that pre-training from scratch remains the most effective strategy for BERT-style language models on French biomedical text.
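For token-level tasks such as NER and POS tagging, DrBERT can be fine-tuned with a standard token-classification head from Transformers. The sketch below is illustrative only: the label set is a hypothetical placeholder, not the one used in the paper's experiments.

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for a French clinical NER task; the actual label
# inventories come from the downstream corpora evaluated in the paper.
labels = ["O", "B-DISO", "I-DISO"]

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForTokenClassification.from_pretrained(
    "Dr-BERT/DrBERT-7GB",
    num_labels=len(labels),
)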

DrBERT was trained and evaluated by Yanis Labrak (LIA, Zenidoc), Adrien Bazoge (LS2N), Richard Dufour (LS2N), Mickael Rouvier (LIA), Emmanuel Morin (LS2N), Béatrice Daille (LS2N) and Pierre-Antoine Gourraud (Nantes University).

Models on HuggingFace

Tutorial

Load the model using HuggingFace's Transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the DrBERT tokenizer and masked language model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForMaskedLM.from_pretrained("Dr-BERT/DrBERT-7GB")
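As a quick sanity check, the masked language model can also be queried through the fill-mask pipeline. The sentence below is an illustrative placeholder; the predictions depend on the checkpoint.

from transformers import AutoTokenizer, pipeline

model_name = "Dr-BERT/DrBERT-7GB"
tokenizer = AutoTokenizer.from_pretrained(model_name)
fill_mask = pipeline("fill-mask", model=model_name, tokenizer=model_name)

# Use the tokenizer's own mask token instead of hard-coding "<mask>"
sentence = f"Le patient présente une {tokenizer.mask_token} aiguë."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 4))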

Pre-trained models

All models pre-trained from scratch use the CamemBERT configuration, which is the same as the RoBERTa-base architecture (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters).
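These architectural details can be verified directly from a checkpoint's configuration; a minimal sketch, assuming the DrBERT-7GB checkpoint:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Dr-BERT/DrBERT-7GB")
# Expected values per the description above: 12 layers, 768 hidden dimensions, 12 heads
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)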

Available models

| Model | #params | Architecture | Download | Strategy | Training data |
|---|---|---|---|---|---|
| Dr-BERT / DrBERT-7GB | 110M | Base | drbert-7gb.tar.gz | From scratch | NACHOS_large (7.4 GB of text) |
| Dr-BERT / DrBERT-4GB | 110M | Base | drbert-4gb.tar.gz | From scratch | NACHOS_small (4 GB of text) |
| Dr-BERT / DrBERT-4GB-CP-PubMedBERT | 110M | Base | drbert-4gb-cp-pubmedbert.tar.gz | Continual pre-training | NACHOS_small (4 GB of text) |
| Dr-BERT / DrBERT-4GB-CP-CamemBERT | 110M | Base | drbert-4gb-cp-camembert.tar.gz | Continual pre-training | NACHOS_small (4 GB of text) |

Non-released models

The Nantes Biomedical Data Warehouse (NBDW) corpus was built from the clinical data warehouse of Nantes University Hospital.

| Model | #params | Architecture | Strategy | Training data |
|---|---|---|---|---|
| ChuBERT-8GB | 110M | Base | From scratch | NBDW_mixed (8 GB of text) |
| ChuBERT-4GB | 110M | Base | From scratch | NBDW_small (4 GB of text) |
| ChuBERT-CP-4GB | 110M | Base | Continual pre-training | NBDW_small (4 GB of text) |

Sources of the NACHOS corpus

The NACHOS dataset is available for academic research only. To request access, please contact mickael.rouvier@univ-avignon.fr and include your first and last name, affiliation, contact details, and a brief description of how you intend to use NACHOS.

Recent Publications

Labrak Y., Bazoge A., Dufour R., Rouvier M., Morin E., Daille B., Gourraud P.-A. (2023). DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).


BibTeX Citation

@inproceedings{labrak2023drbert,
    title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
    author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
    month = jul,
    year = 2023,
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics}
}

License

The DrBERT models, as well as the pre-training scripts, have been publicly released online under an MIT open-source license.