Natural Language Processing Datasets in the Materials Science Domain

WORK IN PROGRESS

Materials science is an interdisciplinary field focused on the research and discovery of new materials. During my time at Bosch, I worked closely with several domain experts from this field and came to appreciate how important it is to identify materials with particular characteristics in order to tackle challenges such as climate change (e.g., by building environmentally friendly energy systems or cars). This blog post, however, is not about the benefits of materials science research; it is about how natural language processing (NLP) can support this research. While much prior work in NLP for scientific text has concentrated on the biomedical domain, there is growing interest in building datasets and solutions for the materials science domain as well. In this blog post, I collect corpora and benchmarks that address this domain and that form the basis for building solutions that can meaningfully help materials science researchers.

If you have or know of a dataset in this category that is not listed here yet, please e-mail me!

Dataset | Reference | Data / Annotations | Materials science subdomains
SOFC-Exp  

SOFC-Exp

https://aclanthology.org/2020.acl-main.116.pdf

Materials Science Procedural Text Corpus (MSPT)

https://aclanthology.org/W19-4007/

Recipes

https://www.nature.com/articles/s41597-019-0224-1

Weston / NER

https://pubs.acs.org/doi/10.1021/acs.jcim.9b00470

MuLMS

Text: https://github.com/boschresearch/mulms-wiesp2023 https://github.com/boschresearch/sciol-wacv-2024 https://github.com/boschresearch/mulms-az-codi2023

https://openaccess.thecvf.com/content/WACV2024/papers/Tarsi_SciOL_and_MuLMS-Img_Introducing_a_Large-Scale_Multimodal_Scientific_Dataset_and_WACV_2024_paper.pdf

PolyNERE

https://aclanthology.org/2024.lrec-main.1126.pdf

maybe: https://aclanthology.org/2024.findings-acl.779.pdf

MS-Mentions

https://aclanthology.org/2021.emnlp-main.101/

SC-CoMIcs

https://aclanthology.org/2020.lrec-1.834.pdf

PolyIE

PcMSP

https://aclanthology.org/2022.findings-emnlp.446/

https://aclanthology.org/2021.emnlp-main.438.pdf https://aclanthology.org/2023.acl-long.753.pdf

Note: this work uses data from several datasets: https://aclanthology.org/2023.acl-long.201.pdf
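Since this list is still a work in progress, it may help to keep it in a machine-readable form so that gaps are easy to spot. The following is only a minimal sketch: the `DatasetEntry` class, its field names (modelled on the table columns), and the `missing_reference` helper are my own hypothetical choices, and the name/URL pairings simply follow the order of the list above. Only a few entries are included for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetEntry:
    """One catalogue row; fields mirror the table columns (hypothetical schema)."""
    name: str
    references: List[str] = field(default_factory=list)
    annotations: str = ""  # "Data / Annotations" column, not filled in yet
    subdomains: List[str] = field(default_factory=list)


# A few entries copied from the list above; unfilled columns stay empty.
CATALOGUE = [
    DatasetEntry("SOFC-Exp", ["https://aclanthology.org/2020.acl-main.116.pdf"]),
    DatasetEntry(
        "Materials Science Procedural Text Corpus (MSPT)",
        ["https://aclanthology.org/W19-4007/"],
    ),
    DatasetEntry("MS-Mentions", ["https://aclanthology.org/2021.emnlp-main.101/"]),
    DatasetEntry("PolyIE"),  # no reference listed yet
]


def missing_reference(entries: List[DatasetEntry]) -> List[str]:
    """Return the names of entries that still lack a reference link."""
    return [entry.name for entry in entries if not entry.references]


if __name__ == "__main__":
    for name in missing_reference(CATALOGUE):
        print(f"TODO: add a reference for {name}")
```

Keeping the columns as explicit fields makes it straightforward to check which entries still need a reference, an annotation description, or a subdomain before the table is complete.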