DrBERT is a state-of-the-art language model for the French biomedical domain, based on the RoBERTa architecture and pretrained on NACHOS, a French biomedical corpus.
DrBERT was assessed on 11 distinct practical biomedical tasks in French, including named entity recognition (NER), part-of-speech (POS) tagging, binary/multi-class/multi-label classification, and multiple-choice question answering. The results show that DrBERT improved performance on most tasks compared to prior techniques, indicating that pre-training from scratch remains the most effective strategy for BERT-style language models in the French biomedical domain.
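As an illustration of how such a downstream task can be set up with the Hugging Face transformers library (this is a minimal sketch, not the authors' evaluation code; the label set below is hypothetical), DrBERT can be loaded as a token classification model for NER-style fine-tuning:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration; actual NER label sets depend on the benchmark.
labels = ["O", "B-DISO", "I-DISO", "B-CHEM", "I-CHEM"]

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForTokenClassification.from_pretrained(
    "Dr-BERT/DrBERT-7GB",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)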
DrBERT was trained and evaluated by Yanis Labrak (LIA, Zenidoc), Adrien Bazoge (LS2N), Richard Dufour (LS2N), Mickael Rouvier (LIA), Emmanuel Morin (LS2N), Béatrice Daille (LS2N) and Pierre-Antoine Gourraud (Nantes University).
# Load the DrBERT tokenizer and masked language model from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Dr-BERT/DrBERT-7GB")
model = AutoModelForMaskedLM.from_pretrained("Dr-BERT/DrBERT-7GB")
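The loaded checkpoint can then be used for masked token prediction; the sketch below uses the transformers fill-mask pipeline, and the French example sentence is purely illustrative:

from transformers import pipeline

# Fill-mask inference with DrBERT; the example sentence is illustrative only.
fill_mask = pipeline("fill-mask", model="Dr-BERT/DrBERT-7GB")
sentence = f"Le patient souffre d'une {fill_mask.tokenizer.mask_token} chronique."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], prediction["score"])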
All models pre-trained from scratch use the CamemBERT configuration, which is the same as the RoBERTa-base architecture (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters).
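These architecture hyperparameters can be checked directly from the published checkpoint; a small sketch using AutoConfig (attribute names follow the standard RoBERTa/CamemBERT configuration in transformers):

from transformers import AutoConfig

# Inspect the DrBERT architecture hyperparameters from its configuration
config = AutoConfig.from_pretrained("Dr-BERT/DrBERT-7GB")
print(config.num_hidden_layers)    # expected: 12
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12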
The Nantes Biomedical Data Warehouse (NBDW) corpus was obtained from the data warehouse of Nantes University Hospital.
The NACHOS dataset is only available for academic research. To request access, please contact mickael.rouvier@univ-avignon.fr, including your first and last name, affiliation, contact details, and a brief description of how you intend to use NACHOS.
@inproceedings{labrak2023drbert,
  title     = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
  author    = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
  month     = jul,
  year      = 2023,
  address   = {Toronto, Canada},
  publisher = {Association for Computational Linguistics}
}
The DrBERT models, as well as the pre-training scripts, have been publicly released online under an MIT open-source license.