Natural Language Processing Datasets in the Materials Science Domain
WORK IN PROGRESS
Materials science is an interdisciplinary field focused on the research and discovery of new materials. During my time at Bosch, I worked closely with several domain experts from this field and learned to appreciate how important it is to identify materials with particular characteristics in order to tackle challenges such as climate change (e.g., by building environmentally friendly energy systems or cars). But this blog post is not about the benefits of materials science research; it is about how natural language processing (NLP) can support this research. While much prior work in NLP for scientific text has concentrated on the biomedical domain, there is growing interest in building datasets and solutions for the materials science domain as well. In this blog post, I collect corpora and benchmarks for this domain, which form the basis for building solutions that can meaningfully help materials science researchers.
If you have or know of a dataset that falls into this category and is not listed here yet, please e-mail me!
Dataset | Reference | Data / Annotations | Materials science subdomains |
SOFC-Exp | … |
SOFC-Exp
https://aclanthology.org/2020.acl-main.116.pdf
Materials Science Procedural Text Corpus (MSPT)
https://aclanthology.org/W19-4007/
Recipes
https://www.nature.com/articles/s41597-019-0224-1
Weston / NER
https://pubs.acs.org/doi/10.1021/acs.jcim.9b00470
MuLMS
Text: https://github.com/boschresearch/mulms-wiesp2023 https://github.com/boschresearch/sciol-wacv-2024 https://github.com/boschresearch/mulms-az-codi2023
https://openaccess.thecvf.com/content/WACV2024/papers/Tarsi_SciOL_and_MuLMS-Img_Introducing_a_Large-Scale_Multimodal_Scientific_Dataset_and_WACV_2024_paper.pdf
PolyNERE
https://aclanthology.org/2024.lrec-main.1126.pdf
maybe: https://aclanthology.org/2024.findings-acl.779.pdf
MS-Mentions
https://aclanthology.org/2021.emnlp-main.101/
SC-CoMIcs
https://aclanthology.org/2020.lrec-1.834.pdf
PolyIE
PcMSP
https://aclanthology.org/2022.findings-emnlp.446/
Other related works
https://aclanthology.org/2021.emnlp-main.438.pdf https://aclanthology.org/2023.acl-long.753.pdf
This work uses data from several datasets: https://aclanthology.org/2023.acl-long.201.pdf