DrBERT is a state-of-the-art French biomedical language model based on the RoBERTa architecture and pretrained on NACHOS, a French biomedical corpus.

DrBERT was assessed on 11 distinct practical biomedical tasks for French, including named entity recognition (NER), part-of-speech (POS) tagging, binary/multi-class/multi-label classification, and multiple-choice question answering. The results showed that DrBERT improved performance on most tasks compared to prior techniques, indicating that from-scratch pre-training is still the most effective strategy for BERT language models on French biomedical text.

DrBERT was trained and evaluated by Yanis Labrak (LIA, Zenidoc), Adrien Bazoge (LS2N), Richard Dufour (LS2N), Mickael Rouvier (LIA), Emmanuel Morin (LS2N), Béatrice Daille (LS2N) and Pierre-Antoine Gourraud (Nantes University).

Models on HuggingFace


Load the model using HuggingFace's Transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForMaskedLM.from_pretrained("Dr-BERT/DrBERT-7GB")
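As a quick sanity check, the loaded checkpoint can also be queried through the fill-mask pipeline. A minimal sketch (the French example sentence and the top-k setting are illustrative assumptions, not from the DrBERT paper):

```python
from transformers import pipeline

# Masked-token prediction with DrBERT; the example sentence is an assumption.
fill_mask = pipeline("fill-mask", model="Dr-BERT/DrBERT-7GB")

sentence = f"Le patient souffre d'une {fill_mask.tokenizer.mask_token} chronique."
preds = fill_mask(sentence, top_k=5)
for p in preds:
    # Each prediction carries the filled token and its probability score.
    print(p["token_str"], round(p["score"], 3))
```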

Pre-trained models

All models pre-trained from scratch use the CamemBERT configuration, which follows the RoBERTa-base architecture (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters).
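The 110M figure can be sanity-checked from these hyperparameters with a back-of-the-envelope count. A sketch in plain Python (the ~32k CamemBERT SentencePiece vocabulary and the 514 position embeddings are assumptions taken from the standard CamemBERT configuration):

```python
# Rough parameter count for the RoBERTa-base configuration used by DrBERT
# (12 layers, 768 hidden dimensions, 12 attention heads).
vocab_size = 32_005   # CamemBERT tokenizer vocabulary (assumption)
hidden = 768
layers = 12
ffn = 4 * hidden      # 3072 intermediate feed-forward size
max_pos = 514         # position embeddings (assumption)

# Token, position, and token-type embeddings.
embeddings = vocab_size * hidden + max_pos * hidden + hidden
per_layer = (
    4 * (hidden * hidden + hidden)        # Q, K, V and output projections
    + 2 * (hidden * ffn) + ffn + hidden   # feed-forward up/down projections
    + 4 * hidden                          # two LayerNorms (weight + bias)
)
total = embeddings + layers * per_layer + 2 * hidden  # + embedding LayerNorm
print(f"{total / 1e6:.0f}M parameters")  # ≈ 110M
```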

Available models

| Model | #params | Architecture | Download | Strategy | Training data |
|---|---|---|---|---|---|
| Dr-BERT/DrBERT-7GB | 110M | Base | drbert-7gb.tar.gz | From-scratch | NACHOS_large (7.4 GB text) |
| Dr-BERT/DrBERT-4GB | 110M | Base | drbert-4gb.tar.gz | From-scratch | NACHOS_small (4 GB text) |
| Dr-BERT/DrBERT-4GB-CP-PubMedBERT | 110M | Base | drbert-4gb-cp-pubmedbert.tar.gz | Continual pre-training | NACHOS_small (4 GB text) |
| Dr-BERT/DrBERT-4GB-CP-CamemBERT | 110M | Base | drbert-4gb-cp-camembert.tar.gz | Continual pre-training | NACHOS_small (4 GB text) |

Non-released models

The Nantes Biomedical Data Warehouse (NBDW) corpus was obtained from the data warehouse of Nantes University Hospital.

| Model | #params | Architecture | Strategy | Training data |
|---|---|---|---|---|
| ChuBERT-8GB | 110M | Base | From-scratch | NBDW_mixed (8 GB of text) |
| ChuBERT-4GB | 110M | Base | From-scratch | NBDW_small (4 GB of text) |
| ChuBERT-CP-4GB | 110M | Base | Continual pre-training | NBDW_small (4 GB of text) |

Sources of the NACHOS corpus:

The NACHOS dataset is available for academic research only; please contact mickael.rouvier@univ-avignon.fr. Include your first and last name, affiliation, contact details, and a brief description of how you intend to use NACHOS.

Recent Publications

Labrak et al. (2023). DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains.

Links: arXiv · HAL · Code · Dataset

Citation (BibTeX)

@inproceedings{labrak2023drbert,
    title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
    author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
    month = jul,
    year = {2023},
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics}
}


The DrBERT models and the pre-training scripts have been publicly released online under an MIT open-source license.