Introduction

Discovering materials and deploying them in practical applications is an extremely time-consuming process that may span decades1,2. To accelerate this process, the knowledge on materials developed over centuries of rigorous scientific procedure must be exploited and harnessed in a cohesive fashion3,4,5,6,7,8. Textbooks, scientific publications, reports, handbooks, websites, etc., serve as a large data repository that can be mined for existing information9,10. However, extracting useful information from these texts is challenging since most scientific data is semi- or un-structured, taking the form of running text, paragraphs with cross-references, image captions, and tables10,11,12. Extracting such information manually is extremely time- and resource-intensive and relies on the interpretation of a domain expert.

Natural language processing (NLP), a sub-domain of artificial intelligence, presents an alternate approach that can automate information extraction from text. Earlier approaches in NLP relied on non-neural methods, such as the n-gram class models of Brown et al. (1992)13, the structural learning framework of Ando and Zhang (2005)14, or the structural correspondence learning of Blitzer et al. (2006)15, but these are no longer state of the art. Neural pre-trained embeddings like word2vec16,17 and GloVe18 are quite popular, but they lack domain-specific knowledge and do not produce contextual embeddings. Recent progress in NLP has led to a computational paradigm in which a large, pre-trained language model (LM) is finetuned for domain-specific tasks. Research has consistently shown that this pretrain-finetune paradigm leads to the best overall task performance19,20,21,22,23. Statistically, an LM is a probability distribution over sequences of words: given a set of words, it assigns a probability to each word24. Recently, owing to the availability of large amounts of text and high computing power, researchers have been able to pre-train such large neural language models. For example, Bidirectional Encoder Representations from Transformers (BERT)25 is trained on BookCorpus26 and English Wikipedia, resulting in state-of-the-art performance on multiple NLP tasks such as question answering and entity recognition, to name a few.

Researchers have used NLP tools to automate database creation for ML applications in the materials science domain. For instance, ChemDataExtractor27, an NLP pipeline, has been used to create databases of battery materials28, Curie and Néel temperatures of magnetic materials29, and inorganic material synthesis routes30. Similarly, NLP has been used to collect the composition and dissolution rate of calcium aluminosilicate glassy materials31, to mine zeolite synthesis routes for germanium-containing zeolites32, and to extract process and testing parameters of oxide glasses, thereby enabling improved prediction of the Vickers hardness11. Researchers have also developed an automated NLP tool to create databases from information extracted from computational materials science research papers33. NLP has likewise been used for other tasks such as topic modeling in glasses, that is, to group the literature into different topics in an unsupervised fashion, and to find images based on specific queries such as elements present, synthesis or characterization techniques, and applications10.

A comprehensive review by Olivetti et al. (2019) describes several ways in which NLP can benefit the materials science community34. Examples include chemical parsing tools such as OSCAR435, capable of identifying entities and chemicals in text; Artificial Chemist36, which takes precursor information as input and generates synthetic routes to manufacture optoelectronic semiconductors with targeted band gaps; a robotic system for making thin films toward cleaner and more sustainable energy solutions37; and the identification of more than 80 million materials science domain-specific named entities. Through such combinations of ML and NLP techniques, researchers have accelerated the discovery of materials for different applications. Researchers have also shown the domain adaptation capability of word2vec and BERT in the biological sciences through BioWordVec38 and BioBERT19. Other domain-specific BERT models include SciBERT21, trained on a scientific and biomedical corpus39, ClinicalBERT40, trained on 2 million clinical notes in the MIMIC-III v1.4 database41, mBERT42 for multilingual machine translation tasks, PatentBERT23 for patent classification, and FinBERT22 for financial tasks. This suggests that a materials-aware LM, further adapted to downstream tasks, can significantly accelerate research in the field9,34. Although no papers on materials-aware language models existed prior to this work43, a recent preprint by Walker et al. (2021)44 emphasizes the impact of domain-specific language models on named entity recognition (NER) tasks in materials science.

In this work, we train a materials science domain-specific BERT, namely MatSciBERT. Figure 1 shows a graphical summary of the methodology adopted in this work, encompassing creation of the materials science corpus, training of MatSciBERT, and evaluation on different downstream tasks. We achieve state-of-the-art results on the domain-specific tasks listed below.

  a. NER on the SOFC and SOFC-Slot datasets by Friedrich et al. (2020)45 and the Matscholar dataset by Weston et al. (2019)9

  b. Glass vs. non-glass classification of paper abstracts10

  c. Relation classification on the MSPT corpus46

Fig. 1: Methodology for training MatSciBERT.

We create the Materials Science Corpus (MSC) through query search followed by selection of relevant research papers. MatSciBERT, pre-trained on MSC, is evaluated on various downstream tasks.

The present work thus bridges the gap in the availability of a materials domain language model, allowing researchers to automate information extraction, knowledge graph completion, and other downstream tasks and hence accelerate the discovery of materials. We have hosted the MatSciBERT pre-trained weights at https://huggingface.co/m3rg-iitd/matscibert and the code for pre-training and finetuning on downstream tasks at https://github.com/M3RG-IITD/MatSciBERT. The code with finetuned models for the downstream tasks is also available at https://doi.org/10.5281/zenodo.6413296.

Results and discussion

Dataset

Textual datasets are an integral part of training an LM. There exist many general-purpose corpora like BookCorpus26 and English Wikipedia, and domain-specific corpora like the biomedical corpus39 and clinical databases41, to name a few. However, none of these corpora is suitable for the materials domain. Therefore, with the aim of providing a materials-specific LM, we first create a corpus spanning four important materials science families: inorganic glasses, metallic glasses, alloys, and cement and concrete. It should be noted that although these broad categories are mentioned, several other categories of materials, including two-dimensional materials, were also present in the corpus. Specifically, we selected ~150 K papers out of ~1 M papers downloaded from the Elsevier ScienceDirect database. The steps to create the corpus are provided in the Methods section. The number of papers and words for each family is given in Supplementary Table 1. We also provide the list of DOIs and PIIs of the papers used to pre-train MatSciBERT in the GitHub repository for this work.

The materials science corpus developed for this work has ~285 M words, which is nearly 9% of the number of words used to pre-train SciBERT (3.17 B words) and BERT (3.3 B words). Since we continue pre-training from SciBERT, MatSciBERT is effectively trained on a corpus of 3.17 + 0.28 = 3.45 B words. From Supplementary Table 1, one can observe that 40% of the words come from research papers on inorganic glasses and ceramics, and 20% each from bulk metallic glasses (BMG), alloys, and cement. Although there are more research papers on “cement and concrete” than on “inorganic glasses and ceramics”, the latter has a higher word count because a greater number of full-text documents were retrieved for that category. Supplementary Table 2 lists the word counts of important strings relevant to the field of materials science. It should be noted that the corpus encompasses the important fields of thermoelectrics, nanomaterials, polymers, and biomaterials. Also, note that the corpus used for training the language model consists of both experimental and computational works, as both approaches play a crucial role in understanding material response. The average paper length in this corpus is ~1848 words, which is two-thirds of the average paper length of 2769 words for the SciBERT corpus. The lower average paper length can be attributed to two factors: (a) in general, materials science papers are shorter than biomedical papers; we verified this by computing the average length of the full-text materials science papers, which came out to be 2366 words; and (b) our corpus also contains papers for which full text was unavailable, in which case only the abstracts were used.

Pre-training of MatSciBERT

For MatSciBERT pre-training, we follow the domain-adaptive pre-training approach proposed by Gururangan et al. (2020). In that work, the authors continued pre-training of an initial LM on a corpus of domain-specific text20. They observed a significant improvement in performance on domain-specific downstream tasks for all four domains considered, despite the overlap between the initial LM vocabulary and the domain-specific vocabulary being less than 54.1%. BioBERT19 and FinBERT22 were also developed using a similar approach, where the vanilla BERT model was further pre-trained on domain-specific text and tokenization was done using the original BERT vocabulary. We initialize MatSciBERT weights with those of a suitable LM and then pre-train it on MSC. To determine the appropriate initial weights for MatSciBERT, we trained an uncased wordpiece47 vocabulary on the MSC using the tokenizers library48. The overlap of the MSC vocabulary is 53.64% with the uncased SciBERT21 vocabulary and 38.90% with the uncased BERT vocabulary. Because of the larger overlap with the SciBERT vocabulary, we tokenize our corpus using the SciBERT vocabulary and initialize the MatSciBERT weights with those of SciBERT, as made publicly available by Beltagy et al. (2019)21. It is worth mentioning that a materials science domain-specific vocabulary would likely represent the corpus with fewer wordpieces and could potentially lead to a better language model. For example, “yttria-stabilized zirconia” is tokenized as [“yt”, “##tri”, “##a”, “-”, “stabilized”, “zircon”, “##ia”] by the SciBERT vocabulary, whereas a domain-specific tokenization might have resulted in [“yttria”, “-”, “stabilized”, “zirconia”]. However, using a domain-specific tokenizer would preclude initializing from the SciBERT weights and thus forgo the scientific knowledge already learned by SciBERT. Further, using the SciBERT vocabulary for the materials domain is not necessarily detrimental, since deep neural language models have the capacity to learn repeating patterns that represent new words using the existing tokenizer. For instance, when the wordpieces “yt”, “##tri”, and “##a” occur consecutively, SciBERT indeed recognizes that some material is being discussed, as demonstrated in the downstream tasks. This is also why most domain-specific BERT-based LMs, like FinBERT22, BioBERT19, and ClinicalBERT40, extend the pre-training instead of using domain-specific tokenizers and learning from scratch.
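As an illustration of the tokenization behavior discussed above, the short sketch below loads the shared SciBERT/MatSciBERT uncased vocabulary through the transformers library and splits a materials term into wordpieces; the checkpoint name is the public Hugging Face identifier for this work, and the printed output mirrors the example given in the text.

```python
# Minimal sketch: how the SciBERT uncased vocabulary (shared by MatSciBERT)
# splits a materials term into wordpieces. Assumes internet access or a
# local cache of the public Hugging Face checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")

print(tokenizer.tokenize("yttria-stabilized zirconia"))
# Expected, per the example above:
# ['yt', '##tri', '##a', '-', 'stabilized', 'zircon', '##ia']
```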

The details of the pre-training procedure are provided in the Methods section. The pre-training was performed for 360 h, after which the model achieved a final perplexity of 2.998 on the validation set (see Supplementary Fig. 1a). Although not directly comparable due to differences in vocabulary and validation corpus, the BERT25 and RoBERTa49 authors report perplexities of 3.99 and 3.68, respectively, which are in the same range. We also provide graphs for other evaluation metrics, such as MLM loss and MLM accuracy, in Supplementary Fig. 1b, c. The final pre-trained LM was then used to evaluate different materials science domain-specific downstream tasks, details of which are described in the subsequent sections. The performance of the LM on the downstream tasks was compared with that of SciBERT, BERT, and other baseline models to evaluate the effectiveness of MatSciBERT in learning materials-specific information.

In order to understand the effect of pre-training on model performance, a materials domain-specific downstream task, NER on SOFC-Slot, was performed using the model at regular intervals of pre-training. To this end, the pre-trained model was finetuned on the training set of the SOFC-Slot dataset. The SOFC-Slot dataset was chosen because it comprises fine-grained materials-specific information and is thus well suited to distinguish the performance of SciBERT from that of a materials-aware LM. The performance of these finetuned models was evaluated on the test set. The LM-CRF architecture was used for this analysis since LM-CRF consistently gives the best performance for the downstream task, as shown later in this work. The Macro-F1 averaged across three seeds exhibited an increasing trend (see Supplementary Fig. 2a), suggesting the importance of training for longer durations. We show a similar graph for the abstract classification task in Supplementary Fig. 2b.

Downstream tasks

Here, we evaluate MatSciBERT on three materials science-specific downstream tasks, namely, Named Entity Recognition (NER), Relation Classification, and Paper Abstract Classification.

We now present the results on the three materials science NER datasets described in the Methods section. To the best of our knowledge, the best Macro-F1 on the solid oxide fuel cells (SOFC) and SOFC-Slot datasets is 81.50% and 62.60%, respectively, as reported by Friedrich et al. (2020), who introduced the datasets45. We run the experiments on the same train-validation-test splits as Friedrich et al. (2020) for a fair comparison of results. Moreover, since the authors reported results averaged over 17 entities (the extra entity being “Thickness”) for the SOFC-Slot dataset, we also report results taking the ‘Thickness’ entity into account.

Table 1 shows the Macro-F1 scores for the NER task on the SOFC-Slot and SOFC datasets by MatSciBERT, SciBERT, and BERT. We observe that LM-CRF always performs better than LM-Linear. This can be attributed to the fact that the CRF layer can model the BIO tags accurately. Also, all SciBERT architectures perform better than the corresponding BERT architecture. We obtained an improvement of ~6.3 Macro F1 and ~3.2 Micro F1 (see Supplementary Table 3) on the SOFC-Slot test set for MatSciBERT vs. SciBERT while using the LM-CRF architecture. For the SOFC test dataset, MatSciBERT-BiLSTM-CRF performs better than SciBERT-BiLSTM-CRF by ~2.1 Macro F1 and ~2.1 Micro F1. Similar improvements can be seen for other architectures as well. These MatSciBERT results also surpass the current best results on SOFC-Slot and SOFC datasets by ~3.35 and ~0.9 Macro-F1, respectively.

Table 1 Macro-F1 scores on the test set for SOFC-Slot and SOFC datasets averaged over three seeds and five cross-validation splits.

It is worth noting that the SOFC-slot dataset consists of 17 entity types and hence has more fine-grained information regarding the materials. On the other hand, SOFC has only four entity types representing coarse-grained information. We notice that the performance of MatSciBERT on SOFC-slot is significantly better than that of SciBERT. To further evaluate this aspect, we analyzed the F1-score of both SciBERT and MatSciBERT on all the 17 entity types of the SOFC-slot data individually, as shown in Fig. 2. Interestingly, we observe that for all the materials related entity types, namely anode material, cathode material, electrolyte material, interlayer material, and support material, MatSciBERT performs better than SciBERT. In addition, for materials related properties such as open circuit voltage and degradation rate, MatSciBERT is able to significantly outperform SciBERT. This suggests that MatSciBERT is indeed able to capitalize on the additional information learned from the MSC to deliver better performance on complex problems specific to the materials domain.

Fig. 2: Comparison of MatSciBERT and SciBERT on validation sets of SOFC-Slot dataset.

The entity-level F1-scores for the MatSciBERT and SciBERT models are shown in blue and red, respectively. The bold colored text indicates the better-performing model’s score.

Now, we present the results for the Matscholar dataset9 in Table 2. For this dataset too, MatSciBERT outperforms SciBERT and BERT, as well as the current best results, as can be seen for the LM-CRF architecture. The dataset authors obtained Macro-F1 scores of 85.41% on the validation set and 85.10% on the test set, and Micro-F1 scores of 87.09% and 87.04% (see Supplementary Table 4). Our best model, MatSciBERT-CRF, achieves Macro-F1 values of 88.66% and 86.38%, both better than the existing state of the art.

Table 2 Macro-F1 scores on the test set for Matscholar averaged over three seeds.

To illustrate the performance of MatSciBERT, we show an example from the validation set of the dataset in Supplementary Figs. 3 and 4. The overall superior performance of MatSciBERT is evident from Table 2.

Table 3 shows the results for the Relation Classification task performed on the Materials Synthesis Procedures dataset46. We also compare the results with two recent baseline models, MaxPool and MaxAtt50, details of which can be found in the Methods section. In this task too, MatSciBERT consistently performs better than SciBERT, BERT, and the baseline models, although by a smaller margin.

Table 3 Test set results for Materials Synthesis Procedures dataset averaged over three seeds.

In the Paper Abstract Classification downstream task, we consider the ability of LMs to classify a manuscript into glass vs. non-glass topics based on an in-house dataset10. This is a binary classification problem, with the input being the abstract of a manuscript. Here too, we use the same baseline models, MaxPool and MaxAtt50. Table 4 shows the comparison of accuracies achieved by MatSciBERT, SciBERT, BERT, and the baselines. It can be clearly seen that MatSciBERT outperforms SciBERT by more than 2.75% in accuracy on the test set.

Table 4 Test set results for glass vs. non-glass dataset averaged over three seeds.

Altogether, we demonstrate that the MatSciBERT, pre-trained on a materials science corpus, can perform better than SciBERT for all the downstream tasks such as NER, abstract classification, and relation classification on materials datasets. These results also suggest that the scientific literature in the materials domain, on which MatSciBERT is pre-trained, is significantly different from the computer science and biomedical domains on which SciBERT is trained. Specifically, each scientific discipline exhibits significant variability in terms of ontology, vocabulary, and domain-specific notations. Thus, the development of a domain-specific language model, even within the scientific literature, can significantly enhance the performance in downstream tasks related to text mining and information extraction from literature.

Applications in materials domain

Now, we discuss some of the potential areas of application of MatSciBERT in materials science. These areas range from simple topic-based classification of research papers to discovering materials or alternate applications for existing materials. We demonstrate some of these applications as follows: (i) Document classification: A large number of manuscripts have been published on materials-related topics, and the numbers are increasing exponentially. Identifying manuscripts related to a given topic is a challenging task. Traditionally, such tasks are carried out employing approaches such as term frequency-inverse document frequency (TF-IDF) or Word2Vec, used along with a classification algorithm. However, these approaches directly vectorize a word and are not context sensitive. For instance, in the phrases “flat glass”, “glass transition temperature”, and “tea glass”, the word “glass” is used in very different senses. MatSciBERT is able to extract the contextual meaning of such words, thereby enabling improved document classification. This is evident from the binary classification results presented earlier in Table 4, where the accuracy obtained using MatSciBERT (96.22%) is significantly higher than that obtained using pooling-based BiLSTM models (91.44%). This approach can be extended to a larger set of abstracts for the accurate classification of documents from the literature.

(ii) Topic modeling: Topic modeling is an unsupervised approach for grouping documents belonging to similar topics. Traditionally, topic modeling employs algorithms such as latent Dirichlet allocation (LDA) along with TF-IDF or Word2Vec to cluster documents having the same or semantically similar words. Note that these approaches rely purely on the frequency of words (in TF-IDF) or the embeddings of words (in Word2Vec) for clustering, without taking the context into account. The use of context-aware embeddings as learned in MatSciBERT could significantly enhance the topic modeling task. As a preliminary study, we perform topic modeling using MatSciBERT on an in-house corpus of abstracts on glasses and ceramics; the same corpus was used in an earlier work10 for topic modeling using LDA. Specifically, we obtain the output embedding of the [CLS] token for each abstract using MatSciBERT. These embeddings are then projected into two dimensions using the UMAP algorithm51 and clustered using the k-means algorithm52. We then concatenate all the abstracts belonging to the same cluster and calculate the most frequent words for each cluster/topic.
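A minimal sketch of this pipeline is given below, assuming `abstracts` is the in-house list of abstract strings; the UMAP (umap-learn) and k-means (scikit-learn) calls follow their public APIs, and the cluster count of ten mirrors the ten topics reported in the Supplementary Tables.

```python
# Minimal sketch of the MatSciBERT-based topic-modeling pipeline described
# above: [CLS] embeddings -> UMAP (2D) -> k-means. `abstracts` is a
# placeholder; in practice it holds thousands of abstracts.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer
import umap

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModel.from_pretrained("m3rg-iitd/matscibert").eval()

def cls_embedding(text: str) -> torch.Tensor:
    """Return the output embedding of the [CLS] token for one abstract."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] is the first token

abstracts = ["Bioactive glasses for bone regeneration ...",
             "Dielectric properties of barium titanate ceramics ..."]  # placeholder corpus

embeddings = torch.stack([cls_embedding(a) for a in abstracts]).numpy()
points = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(points)
```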

Supplementary Tables 5 and 6 show the top ten topics obtained using LDA and MatSciBERT, respectively, along with the top 10 keywords associated with each topic. We observe that the topics and keywords from MatSciBERT-based topic modeling are more coherent than those obtained from LDA. Further, the actual topics associated with the keywords are not very apparent from Supplementary Table 5. For example, Topic 9 from LDA contains French keywords, suggesting that the topic merely represents French publications, while Topics 5 and 3 contain several generic keywords that do not represent a topic clearly. On the other hand, the keywords obtained by MatSciBERT enable a domain expert to identify the topics well. For instance, some of the topics identified based on the keywords by three selected domain experts are dissolution of silicates (9), oxide thin-film synthesis and properties (8, 6), materials for energy (0), electrical behavior of ceramics (1), and luminescence studies (5). Despite their efforts, the same three domain experts were unable to identify coherent topics based on the keywords provided by LDA. Altogether, MatSciBERT can be used for topic modeling, thereby providing a broad overview of the topics covered in the literature considered.

(iii) Information extraction from images: Images hold a large amount of information regarding the structure and properties of materials. A proxy for identifying relevant images is to go through the captions of all the images. However, each caption may contain multiple entities, and identifying the relevant keywords can be challenging. To this end, MatSciBERT finetuned for NER can be an extremely useful tool for extracting information from figure captions.

Here, we extracted entities from the figure captions used by Venugopal et al. (2021)10 using MatSciBERT finetuned on the Matscholar NER dataset. Specifically, entities were extracted from ~110,000 image captions on topics related to inorganic glasses. Using MatSciBERT, we obtained 87,318 entities as DSC (sample descriptor), 10,633 entities under APL (application), 145,324 as MAT (inorganic material), 76,898 as PRO (material property), 73,241 as CMT (characterization method), 33,426 as SMT (synthesis method), and 2,676 as SPL (symmetry/phase label). Figure 3 shows the top 10 extracted entities under the seven categories proposed in the Matscholar dataset. The top entities associated with each of the categories are coating (application), XRD (characterization), glass (sample descriptor, inorganic material), composition (material property), heat (synthesis method), and hexagonal (symmetry/phase). Further details associated with each category can also be obtained from these named entities. It should be noted that each caption may be associated with more than one entity. These entities can then be used to obtain relevant images for specific queries such as “XRD measurements of glasses used for coating” or “emission spectra of doped glasses”, or “SEM images of bioglasses with Ag”, to name a few.
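A sketch of how such caption tagging can be run is shown below. The finetuned checkpoint path is hypothetical (a MatSciBERT model finetuned on the Matscholar dataset, for example with the released finetuning code), and the example caption is invented.

```python
# Minimal sketch of tagging a figure caption with a token-classification
# pipeline. `NER_MODEL_DIR` is a hypothetical path to a MatSciBERT model
# finetuned on the Matscholar NER dataset.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

NER_MODEL_DIR = "path/to/matscibert-matscholar-ner"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(NER_MODEL_DIR)

# aggregation_strategy="simple" merges B-/I- wordpieces into whole entities
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer,
                  aggregation_strategy="simple")

caption = "XRD patterns of Ag-doped bioactive glasses annealed at 600 C."
for entity in tagger(caption):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```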

Fig. 3: Top-10 entities for various categories.

a APL Application, b CMT Characterization method, c DSC Sample descriptor, d MAT Inorganic material, e PRO Material Property, and f SMT Synthesis method.

Further, Fig. 4 shows a selection of image captions along with the corresponding manual annotations by Venugopal et al. (2021)10. The task of assigning tags to each caption was carried out by human experts, and only one label was assigned per image caption in that work. Using the MatSciBERT NER model, we show that multiple entities are extracted for the selected five captions. This illustrates the large amount of information that can be captured using the LM proposed in this work.

Fig. 4: Comparison of MatSciBERT based NER tagging with manually assigned labels.

The MatSciBERT-based NER model extracts multiple entities per caption, compared to the single manually assigned label.

(iv) Materials caption graph: In addition to the queries mentioned earlier, graph representations can provide in-depth insights into the information spread across figure captions. For instance, questions such as “which synthesis and characterization methods are commonly used for a specific material?” or “what are the methods for measuring a specific property?” can be easily answered using knowledge graphs. Here, we demonstrate how the information in figure captions can be represented using materials caption graphs (MCGs). To this end, we first randomly select 10,000 figure captions from glass-related publications. We then extract the entities and their types from the figure captions using MatSciBERT finetuned on the Matscholar NER dataset. For each caption, we create a fully connected graph by connecting all the entities present in that caption. These per-caption graphs are then joined together to form a large MCG; a minimal sketch of this construction is given below. We then demonstrate some insights gained from the MCGs.
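The following sketch illustrates the graph construction, assuming `caption_entities` holds the entities extracted from each caption (for example by the NER sketch above); networkx is used here purely for illustration and is not prescribed by the text.

```python
# Minimal sketch of building a materials caption graph (MCG): every pair of
# entities co-occurring in a caption is connected, and repeated co-occurrence
# increases the edge weight. `caption_entities` is illustrative data.
import itertools
import networkx as nx

caption_entities = [
    ["XRD", "glass", "anneal"],   # entities extracted from caption 1
    ["Tg", "glass", "doped"],     # entities extracted from caption 2
    # ... one list per caption
]

mcg = nx.Graph()
for entities in caption_entities:
    for u, v in itertools.combinations(sorted(set(entities)), 2):
        if mcg.has_edge(u, v):
            mcg[u][v]["weight"] += 1
        else:
            mcg.add_edge(u, v, weight=1)

# Example query: entities that co-occur with both "Tg" and "anneal",
# even when those two never appear in the same caption.
common_neighbors = set(mcg.neighbors("Tg")) & set(mcg.neighbors("anneal"))
```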

Figure 5 shows two subsets of graphs extracted from the MCGs. In Fig. 5a, we identified two entities that are two-hop neighbors, namely, Tg and anneal. Note that these entities do not share an edge. In other words, these two entities are not found simultaneously in any given caption. We then identified the intersection of all the one-hop neighbors of both the nodes and plotted the graph as shown in Fig. 5a. The thickness of the edge represents the strength of the connection in terms of the number of occurrences. We observe that there are four common one-hop neighbors for Tg and anneal, namely, XRD, doped, glass, and amorphous. This means that these four entities occur in captions along with Tg and anneal, even though these two entities are not directly connected in the captions used for generating the graph. Figure 5a suggests that Tg is related to glass, amorphous, and doped materials and that these materials can be synthesized by annealing. Similarly, the structures obtained by annealing can be characterized by XRD. From these results, we can also infer that Tg is affected by annealing, which agrees with the conventional knowledge in glass science.

Fig. 5: Materials caption graph.

a Connecting two unconnected entities, b exploring entities related to characterization method “XRD”.

Similarly, Fig. 5b shows entities connected to the node XRD. To this end, we select all the captions having XRD as CMT. After obtaining all the entities in those captions, we randomly sample 20 pairs and plot them as shown in Fig. 5b. Note that the number of edges is 18 and the number of nodes is 19 because one pair is (XRD, XRD) and two pairs are duplicates of (XRD, glass). The node color represents the entity type, and the edge width represents the frequency of the pair in the entire database of entities extracted from the captions where “XRD” is present. Using the graph, we can obtain the following information:

  1. XRD is used as a characterization method for different material descriptors like glass, doped materials, nanofibers, and films.

  2. Materials prepared using synthesis methods (SMT) like aging, heat-treatment, and annealing are also characterized using XRD.

  3. While studying the property (PRO) glass transition temperature (Tg), XRD was also performed to characterize the samples.

  4. In the case of silica glass ceramics (SGCs), phosphor, and phosphor-in-glass (PiG) applications (APL), XRD is used as CMT.

  5. For different materials like ZnO, glasses, CsPbBr3, and yttria partially stabilized zirconia (YPSZ), XRD is a major CMT, which is evident from the thicker edge widths.

Note that this information covers a wide range of materials and applications in the materials literature. Similar graphs can be generated for different entities and entity types using the MCG to gain insights into the materials literature.

(v) Other applications such as relation classification: MatSciBERT can also be applied to several other problems such as relation classification and question answering. The relation classification task demonstrated in the present manuscript can provide key information regarding sequential aspects of materials science, including synthesis and testing protocols and measurement sequences. This information can further be used to discover an optimal pathway for material synthesis. In addition, such approaches can be used to relate different testing and environmental conditions, along with the relevant parameters, to the measured properties of materials. This is especially important for properties such as hardness or fracture toughness, which are highly sensitive to sample preparation protocols, testing conditions, and the equipment used. Thus, the LM can enable the extraction of information regarding synthesis and testing conditions that is otherwise buried in the text.

At this juncture, it is worth noting that very few annotated datasets are available for the materials corpus. This contrasts with the biomedical domain, where several annotated datasets are available for different downstream tasks such as relation extraction, question answering, and NER. While the development of a materials science-specific language model can significantly accelerate NLP-related applications in materials, the development of annotated datasets is equally important for accelerating materials discovery.

In conclusion, we developed a materials-aware language model, namely MatSciBERT, trained on a materials science corpus derived from journals. The LM, trained from the initial weights of SciBERT, exploits the knowledge of the computer science and biomedical corpora (on which the original SciBERT was pre-trained) along with the additional information from the materials domain. We test the performance of MatSciBERT on several downstream tasks such as document classification, NER, and relation classification. We demonstrate that MatSciBERT exhibits superior performance compared to SciBERT on all the datasets tested. Finally, we discuss some of the applications through which MatSciBERT can enable accelerated information extraction from materials science text corpora. To enable accelerated text mining and information extraction, the pre-trained weights of MatSciBERT are made publicly available at https://huggingface.co/m3rg-iitd/matscibert.

Methods

Dataset collection and preparation

Training an LM in a generalizable way requires a considerable amount of text. For example, BERT25 was pre-trained on BookCorpus26 and English Wikipedia, containing a total of 3.3 billion words. SciBERT21, an LM trained on scientific literature, was pre-trained on a corpus consisting of 82% papers from the broad biomedical domain and 18% papers from the computer science domain. However, none of these LMs includes text related to the materials domain. Here, we consider materials science literature from four broad categories, namely, inorganic glasses and ceramics, metallic glasses, cement and concrete, and alloys, to cover the materials domain in a representative fashion.

The first step in retrieving the research papers is a query search on the Crossref metadata database53, which resulted in a list of more than 1 M articles. Although Crossref returns search results from different journals and publishers, we downloaded papers only from the Elsevier ScienceDirect database using their sanctioned API54. Note that the Elsevier API returns research articles in XML format; hence, we wrote a custom XML parser to extract the text. Depending on the journal and publication date, some papers had only an abstract and no full text. Since abstracts contain concise information about the problem statement and the research contributions, we included them in our corpus; that is, we included all sections of a paper when available and the abstract otherwise. For glass science-related papers, the details are given in our previous work10. For concrete and alloys, we first downloaded a large number of research papers for each material category using several queries such as “cement”, “interfacial transition zone”, “magnesium alloy”, and “magnesium alloy composite materials”, to name a few.

Since not all the downloaded papers belonged to the material classes of interest, we manually annotated 500 papers, based on their abstracts, as relevant or not relevant to the field of interest. We then finetuned SciBERT classifiers21,55, one for each category of material, on these labeled abstracts to identify relevant papers among the downloaded 1 M articles. These selected papers from each category of materials were used for training the language model. A detailed description of the Materials Science Corpus (MSC) is given in the Results and Discussion section. Finally, we divided this corpus into training and validation sets, with 85% used to train the language model and the remaining 15% used as validation to assess the model’s performance on unseen text.

Note that the texts in the scientific literature may have several symbols, including some random characters. Sometimes the same semantic symbol has many Unicode surface forms. To address these anomalies, we also performed Unicode normalization of MSC to:

  a. get rid of random Unicode characters, and

  b. map different Unicode characters having similar meaning and appearance to either a single standard character or a sequence of standard characters.

For example, full-width or stylized variants of characters such as %, >, and = are mapped to their standard forms, and ¾ is mapped to 3/4, to name a few. First, we normalized the corpus using the BertNormalizer from the tokenizers library by Hugging Face56,57. Next, we created a list of mappings for the Unicode characters appearing in the MSC. We mapped spurious characters to a space so that they do not interfere during pre-training. Note that this normalization step is also applied to every dataset before passing it through the MatSciBERT tokenizer.
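A minimal sketch of this two-step normalization is given below; the `custom_map` dictionary is an illustrative stand-in, not the actual mapping table used for the MSC.

```python
# Minimal sketch of the normalization step: BertNormalizer from the
# tokenizers library followed by a custom character-mapping pass.
# `custom_map` is an illustrative stand-in for the full mapping table.
from tokenizers.normalizers import BertNormalizer

bert_normalizer = BertNormalizer(lowercase=False)  # Unicode cleanup; lowercasing happens in the tokenizer

custom_map = {
    "\u00be": "3/4",   # e.g. map '¾' to '3/4'
    "\u2061": " ",     # e.g. map an invisible operator character to a space
}

def normalize(text: str) -> str:
    text = bert_normalizer.normalize_str(text)
    for src, dst in custom_map.items():
        text = text.replace(src, dst)
    return text

print(normalize("Hardness \u00be GPa"))
```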

Pre-training of MatSciBERT

We pre-train MatSciBERT on the MSC detailed in the previous subsection. Pre-training an LM from scratch requires significant computational power and a large dataset. To address this, we initialize MatSciBERT with the weights of SciBERT and perform tokenization using the SciBERT uncased vocabulary. This has the additional advantage that existing models relying on SciBERT, which is pre-trained on biomedical and computer science corpora, can use MatSciBERT interchangeably. Further, the vocabulary of the scientific literature as constructed by SciBERT can reasonably represent the new words in the materials domain.

To pre-train MatSciBERT, we employ the optimized training recipe of RoBERTa49 suggested by Liu et al. (2019), which has been shown to significantly improve the performance of the original BERT. Specifically, the following modifications were adopted for MatSciBERT pre-training:

  1. Dynamic whole-word masking: This involves masking at the word level instead of at the wordpiece level, as discussed in the latest release of the BERT pre-training code by Google58. Each time a sequence is sampled, we randomly mask 15% of the words and let the model predict each masked wordpiece token independently.

  2. Removing the NSP loss from the training objective: BERT was pre-trained using two unsupervised tasks: Masked-LM and Next-Sentence Prediction (NSP). NSP takes as input a pair of sentences and predicts whether the two sentences follow each other or not. The RoBERTa authors report that removing the NSP loss matches or slightly improves downstream task performance.

  3. Training on full-length sequences: BERT was pre-trained with a sequence length of 128 for 90% of the steps and a sequence length of 512 for the remaining 10%. The RoBERTa authors obtained better performance by training only with full-length sequences. Here, input sequences are allowed to contain segments of more than one document, and the [SEP] token is used to separate documents within an input sequence.

  4. Using larger batch sizes: The RoBERTa authors also found that training with larger mini-batches improved the pre-training loss and increased end-task performance.

Following these modifications, we pre-train MatSciBERT on the MSC with a maximum sequence length of 512 tokens for fifteen days on 2 NVIDIA V100 32GB GPUs with a batch size of 256 sequences. We use the AdamW optimizer with β1 = 0.9, β2 = 0.98, ε = 1e–6, weight decay = 1e−2 and linear decay schedule for learning rate with warmup ratio = 4.8% and peak learning rate = 1e−4. Pre-training code is written using PyTorch59 and Transformers57 library and is available at our GitHub repository for this work https://github.com/M3RG-IITD/MatSciBERT.
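The sketch below outlines this setup with the Hugging Face Trainer under the stated hyperparameters. It is a simplified stand-in for the released pre-training script; the toy corpus and the batch-size arithmetic (2 GPUs x 32 per device x 4 accumulation steps = 256 sequences) are assumptions of the sketch.

```python
# Simplified sketch of the MatSciBERT pre-training setup: SciBERT weights,
# dynamic whole-word masking, no NSP, 512-token sequences, AdamW with a
# linear warmup/decay schedule. The toy corpus is a placeholder for the MSC.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

texts = ["Yttria-stabilized zirconia exhibits high ionic conductivity.",
         "Bulk metallic glasses show excellent elastic limits."]   # placeholder for the MSC
encodings = tokenizer(texts, truncation=True, max_length=512)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# dynamic whole-word masking with a 15% masking probability
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="matscibert_pretraining",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,     # 2 GPUs x 32 x 4 = 256-sequence batches
    learning_rate=1e-4,                # peak learning rate
    weight_decay=1e-2,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    warmup_ratio=0.048,
    lr_scheduler_type="linear",
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_dataset)
trainer.train()
```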

Downstream tasks

Once the LM is pre-trained, we finetune it on various supervised downstream tasks. The pre-trained LM is augmented with a task-specific output layer. Finetuning adapts the model to the specific task and learns the randomly initialized task-specific weights of the output layer; all parameters are updated end-to-end. We evaluate the performance of MatSciBERT on the following three downstream NLP tasks:

  1. Named Entity Recognition (NER) involves identifying domain-specific named entities in a given sentence. Entities are encoded using the BIO scheme to account for multi-token entities53; a short illustration of this labeling is given after this list. The dataset for the NER task contains sentences split into tokens, with gold labels provided for each token. More formally, let E = {e1, …, ek} be the set of k entity types for a given dataset. If [x1, …, xn] are the tokens of a sentence and [y1, …, yn] are the labels for these tokens, then each yi ∈ L = {B-e1, I-e1, …, B-ek, I-ek, O}. Here, B-ei and I-ei represent the beginning and inside of entity ei.

  2. The input for the Relation Classification60 task consists of a sentence and an ordered pair of entity spans in that sentence. The output is a label denoting the directed relationship between the two entities. The two entity spans can be represented as s1 = (i, j) and s2 = (k, l), where i and j denote the starting and ending indices of the first entity and, similarly, k and l denote the starting and ending indices of the second entity in the input sentence. Here, i ≤ j, k ≤ l, and (j < k or l < i); the last constraint guarantees that the two entities do not overlap. The output label belongs to L, where L is a fixed set of relation types. An example sentence from the task is given in Fig. 6. The task is to predict labels like “Participant_Material” or “Apparatus_Of” given the sentence and the pair of entities as input.

  3. In the Paper Abstract Classification task, we are given the abstract of a research paper and have to classify whether the abstract is relevant to a given field or not.
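As a brief illustration of the BIO labeling referenced in item 1 above, an invented sentence with Matscholar-style entity types could be labeled as follows:

```python
# Invented example of BIO labeling (Matscholar-style types): the material
# mention spans four tokens, and the characterization method spans one.
tokens = ["Yttria", "-", "stabilized", "zirconia", "was", "characterized", "by", "XRD", "."]
labels = ["B-MAT", "I-MAT", "I-MAT", "I-MAT", "O", "O", "O", "B-CMT", "O"]
```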

Fig. 6: Relation classification task.

The different entities are enclosed in boxes with their respective labels. The related entities are connected using arrows labeled with the relation.

Datasets

We use the following three Materials Science-based NER datasets to evaluate the performance of MatSciBERT against SciBERT:

  1. Matscholar NER dataset9 by Weston et al. (2019): This dataset is publicly available and contains seven different entity types. The training, validation, and test sets consist of 440, 511, and 546 sentences, respectively. The entity types present in this dataset are inorganic material (MAT), symmetry/phase label (SPL), sample descriptor (DSC), material property (PRO), material application (APL), synthesis method (SMT), and characterization method (CMT).

  2. Solid Oxide Fuel Cells – Entity Mention Extraction (SOFC) dataset by Friedrich et al. (2020)45: This dataset consists of 45 open-access scholarly articles annotated by domain experts. Four different entity types have been annotated by the authors, namely Material, Experiment, Value, and Device. There are 611, 92, and 173 sentences in the training, validation, and test sets, respectively.

  3. Solid Oxide Fuel Cells – Slot Filling (SOFC-Slot) dataset by Friedrich et al. (2020)45: This is the same as the above dataset except that the entity types are more fine-grained. There are 16 different entity types, namely Anode Material, Cathode Material, Conductivity, Current Density, Degradation Rate, Device, Electrolyte Material, Fuel Used, Interlayer Material, Open Circuit Voltage, Power Density, Resistance, Support Material, Time of Operation, Voltage, and Working Temperature. Two additional entity types, Experiment Evoking Word and Thickness, are used for training the models.

For relation classification, we use the Materials Synthesis Procedures dataset by Mysore et al. (2019)46. This dataset consists of 230 synthesis procedures annotated as graphs in which nodes represent the participants of synthesis steps and edges specify the relationships between the nodes. The average synthesis procedure is nine sentences long, with each sentence containing 26 tokens on average. The dataset contains 16 relation labels, which the authors divide into three categories:

  a. Operation-Argument relations: Recipe target, Solvent material, Atmospheric material, Recipe precursor, Participant material, Apparatus of, Condition of

  b. Non-Operation Entity relations: Descriptor of, Number of, Amount of, Apparatus-attr-of, Brand of, Core of, Property of, Type of

  c. Operation-Operation relations: Next operation

The train, validation, and test sets consist of 150, 30, and 50 annotated material synthesis procedures, respectively.

The dataset for classifying research papers as related to glass science or not, on the basis of their abstracts, is taken from Venugopal et al. (2021)10. The authors manually labeled 1500 abstracts as glass or non-glass. These abstracts belong to different fields of glass science like bioactive glasses, rare-earth glasses, glass ceramics, thin-film studies, and optical, dielectric, and thermal properties of glasses, to name a few. We divide the abstracts into a train-validation-test split of 3:1:1.

Modeling

For the NER task, we use the BERT contextual output embedding of the first wordpiece of every token to classify the tokens among |L| classes. We model the NER task using three architectures: LM-Linear, LM-CRF, and LM-BiLSTM-CRF. Here, LM can be any BERT-based transformer model; we take LM to be BERT, SciBERT, and MatSciBERT in this work.

  1. LM-Linear: The output embeddings of the wordpieces are passed through a linear layer with softmax activation. We use the BERT Token Classifier implementation of the transformers library57.

  2. LM-CRF: We replace the final softmax activation of the LM-Linear architecture with a CRF layer61 so that the model can learn to label tokens belonging to the same entity mention and also learn the transition scores between different entity types. We use the CRF implementation of the PyTorch-CRF library62. A minimal sketch of this architecture is given after this list.

  3. LM-BiLSTM-CRF: A Bidirectional Long Short-Term Memory (BiLSTM)63 network is added between the LM and the CRF layer. BERT embeddings of all the wordpieces are passed through a stacked BiLSTM, and the output of the BiLSTM is fed to the CRF layer to make predictions.
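The sketch below illustrates the LM-CRF variant referenced above, assuming the pytorch-crf package. For brevity it scores every wordpiece, whereas the actual implementation classifies the first wordpiece of each token, so it should be read as an illustration rather than the exact training code.

```python
# Minimal sketch of an LM-CRF tagger: a BERT-style encoder, a linear
# emission layer, and a CRF (pytorch-crf) on top.
import torch.nn as nn
from torchcrf import CRF
from transformers import AutoModel

class LMCRFTagger(nn.Module):
    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        scores = self.emissions(hidden)            # per-wordpiece label scores
        mask = attention_mask.bool()
        if labels is not None:
            # negative log-likelihood of the gold BIO sequence under the CRF
            return -self.crf(scores, labels, mask=mask, reduction="mean")
        return self.crf.decode(scores, mask=mask)  # best BIO sequence per sentence

# e.g. Matscholar: 7 entity types -> 2 * 7 + 1 = 15 BIO labels
model = LMCRFTagger("m3rg-iitd/matscibert", num_labels=15)
```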

For the Relation Classification task, we use the Entity Markers-Entity Start architecture60 proposed by Soares et al. (2019). Here, we surround the entity spans within the sentence with special wordpieces: we wrap the first and second entities with [E1], [/E1] and [E2], [/E2], respectively. We concatenate the output embeddings of [E1] and [E2] and pass the result through a linear layer with softmax activation. We use the standard cross-entropy loss for training the linear layer and finetuning the language model. A minimal sketch of this setup is given below.
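The sketch below illustrates this marker-and-concatenate scheme; the marker strings follow the text, while the example sentence and the rest of the wiring are illustrative assumptions.

```python
# Minimal sketch of the Entity Markers-Entity Start scheme: wrap the two
# entity spans in marker tokens and classify from the concatenated output
# embeddings at the [E1] and [E2] positions. The example sentence is invented.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})

class RelationClassifier(nn.Module):
    def __init__(self, num_relations: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("m3rg-iitd/matscibert")
        self.encoder.resize_token_embeddings(len(tokenizer))  # account for the marker tokens
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        e1 = hidden[input_ids == tokenizer.convert_tokens_to_ids("[E1]")]  # start of entity 1
        e2 = hidden[input_ids == tokenizer.convert_tokens_to_ids("[E2]")]  # start of entity 2
        return self.classifier(torch.cat([e1, e2], dim=-1))

sentence = "The precursor was [E1] calcined [/E1] in a [E2] muffle furnace [/E2] for 2 h."
inputs = tokenizer(sentence, return_tensors="pt")
model = RelationClassifier(num_relations=16)       # 16 relation labels in the MSPT dataset
logits = model(inputs["input_ids"], inputs["attention_mask"])
```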

For the baseline, we use two recent models, MaxPool and MaxAtt, proposed by Maini et al. (2020)50. In this approach too, the pair of entities is wrapped with the same special tokens. GloVe embeddings18 of the words in the input sentence are then passed through a BiLSTM, an aggregation mechanism over words (different for MaxPool and MaxAtt), and a linear layer with softmax activation.

For the Paper Abstract Classification task, we use the output embedding of the [CLS] token to encode the entire text/abstract. We pass this embedding through a simple classifier to make predictions, using the BERT Sentence Classifier implementation of the transformers library57; a minimal sketch is given below. For the baseline, we use a similar approach as for relation classification, except that there is no pair of input entities.
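A minimal sketch of this classifier is given below; the label mapping and example abstract are assumptions, and the classification head is randomly initialized until finetuned on the labeled abstracts.

```python
# Minimal sketch of the glass vs. non-glass abstract classifier: a sequence
# classification head over the [CLS] embedding. Label mapping and example
# abstract are illustrative; the head must be finetuned before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModelForSequenceClassification.from_pretrained(
    "m3rg-iitd/matscibert", num_labels=2)          # assumed: 0 = non-glass, 1 = glass

abstract = "We study the crystallization kinetics of a sodium borosilicate glass ..."
inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # classification head over [CLS]
print(logits.softmax(dim=-1))
```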

Hyperparameters

We use a linear decay schedule for the learning rate with a warmup ratio of 0.1. To ensure sufficient training of randomly initialized non-BERT layers, we set different learning rates for the BERT part and non-BERT part. We set the peak learning rate of the non-BERT part to 3e-4 and choose the peak learning rate of the BERT part from [2e−5, 3e−5, 5e−5], whichever results in a maximum validation score averaged across three seeds. We use a batch size of 16 and an AdamW optimizer for all the architectures. For LM-BiLSTM-CRF architecture, we use a 2-layer stacked BiLSTM with a hidden dimension of 300 and dropout of 0.2 in between the layers. We perform finetuning for 15, 20, and 40 epochs for Matscholar, SOFC, and SOFC Slot datasets, respectively, as initial experiments exhibited little or no improvement after the specified number of epochs. All the weights of any given architecture are updated during finetuning, i.e., we do not freeze any of the weights. We make the code for finetuning and different architectures publicly available. We refer readers to the code for further details about the hyperparameters.
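A minimal sketch of this two-learning-rate setup is shown below for a token classification model; the parameter-name prefix and the step count are assumptions of the sketch, not details taken from the released code.

```python
# Minimal sketch of separate learning rates for the BERT encoder and the
# randomly initialized task head, with a linear warmup/decay schedule.
# The "bert." prefix and the step count are assumptions of this sketch.
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

model = AutoModelForTokenClassification.from_pretrained("m3rg-iitd/matscibert", num_labels=15)

bert_params = [p for n, p in model.named_parameters() if n.startswith("bert.")]
head_params = [p for n, p in model.named_parameters() if not n.startswith("bert.")]

optimizer = AdamW([
    {"params": bert_params, "lr": 3e-5},   # BERT part: peak LR chosen from {2e-5, 3e-5, 5e-5}
    {"params": head_params, "lr": 3e-4},   # non-BERT part (task-specific layers)
])

num_training_steps = 1000                  # placeholder: epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # warmup ratio of 0.1
    num_training_steps=num_training_steps)
```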

Evaluation metrics

We evaluate the NER task based on entity-level exact matches. We use the CoNLL evaluation script (https://github.com/spyysalo/conlleval.py). For NER and Relation Classification tasks, we use Micro-F1 and Macro-F1 as the primary evaluation metrics. We use accuracy to evaluate the performance of the paper abstract classification task.
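The evaluation itself uses the CoNLL script linked above; purely as an illustration of what entity-level exact matching means, the snippet below computes equivalent micro and macro F1 scores with the seqeval package.

```python
# Illustration of entity-level exact-match scoring using seqeval (the paper
# itself uses the CoNLL script linked above). The gold sequence has two
# entities (MAT, CMT); the prediction recovers only the MAT entity exactly.
from seqeval.metrics import f1_score

gold = [["B-MAT", "I-MAT", "O", "B-CMT", "O"]]
pred = [["B-MAT", "I-MAT", "O", "O", "O"]]

print(f1_score(gold, pred, average="micro"))   # 2/3: precision 1.0, recall 0.5
print(f1_score(gold, pred, average="macro"))   # 0.5: F1(MAT) = 1.0, F1(CMT) = 0.0
```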