ReactXT:

Understanding Molecular “Reaction-ship” via Reaction-Contextualized Molecule-Text Pretraining

Zhiyuan Liu1*, Yaorui Shi2*, An Zhang1, Sihang Li2, Enzhi Zhang3, Xiang Wang2†, Kenji Kawaguchi1, Tat-Seng Chua1
1National University of Singapore
2University of Science and Technology of China
3Hokkaido University
*Equal Contribution †Correspondence

Abstract

Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling -- experimental procedure prediction -- is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis.

To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis.


Framework

Reaction-Contextualized Molecule-Text Pretraining


We propose Reaction-Contextualized Molecule-Text Pretraining (ReactXT), a new pretraining method for reaction-text modeling.

  • ReactXT incorporates chemical reactions, instead of only single molecules, into the pretraining process.
  • ReactXT performs well on downstream tasks in both reaction-text generation and molecule-text generation.

Context


ReactXT aims to improve reaction-text modeling by introducing three types of input contexts.

  • Forward reaction: The forward reaction context contains molecule roles (Reactant/Catalyst/Solvent/Product), molecule SMILES, and 2D molecular graph embeddings.
  • Backward reaction: Similar to the forward context, but with the order of molecular roles reversed. If the forward context trains the model to predict the product from the reactants, the backward context trains it to predict the reactants from the product.
  • Random molecule: A small number of random molecules are also included to ensure the LM retains the capability to describe individual molecules outside chemical reactions.
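The three context types above can be illustrated with a minimal Python sketch. It is not the ReactXT implementation: the `Molecule` class, the `<graph>` placeholder standing in for a 2D molecular graph embedding, the line-per-molecule text layout, and all function names are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder marking where a 2D graph embedding would be injected.
GRAPH_TOKEN = "<graph>"


@dataclass
class Molecule:
    role: str    # "Reactant", "Catalyst", "Solvent", or "Product"
    smiles: str  # SMILES string of the molecule


def format_molecule(mol: Molecule) -> str:
    """Render one molecule as role, SMILES, and a graph placeholder."""
    return f"{mol.role}: {mol.smiles} {GRAPH_TOKEN}"


def forward_context(mols: List[Molecule]) -> str:
    """Forward context: molecules in reaction order, product last."""
    return "\n".join(format_molecule(m) for m in mols)


def backward_context(mols: List[Molecule]) -> str:
    """Backward context: same molecules with the role order reversed."""
    return "\n".join(format_molecule(m) for m in reversed(mols))


def random_molecule_context(pool: List[str], k: int, rng: random.Random) -> str:
    """Random-molecule context: k SMILES sampled without a shared reaction."""
    sampled = rng.sample(pool, k)
    return "\n".join(f"Molecule: {s} {GRAPH_TOKEN}" for s in sampled)


# Example reaction (esterification), listed reactants -> catalyst -> product.
reaction = [
    Molecule("Reactant", "CCO"),
    Molecule("Reactant", "CC(=O)O"),
    Molecule("Catalyst", "OS(=O)(=O)O"),
    Molecule("Product", "CCOC(C)=O"),
]

fwd = forward_context(reaction)
bwd = backward_context(reaction)
rnd = random_molecule_context(["C", "CC", "CCC", "c1ccccc1"], 2, random.Random(0))
```

In this sketch the forward context ends with the product, so a language model trained on it learns to continue from reactants toward the product, while the backward context starts with the product and ends with the reactants.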

OpenExp


We collected 2.2M chemical reactions and their associated experimental procedures from the USPTO dataset and the Open Reaction Database. The data then undergo the following steps to form the OpenExp dataset.

  • Translation: We apply a pretrained LM to translate the unstructured experimental procedure records into structured descriptions.
  • NER: We perform named entity recognition to extract the chemical materials from the structured descriptions.
  • Filtering: Some invalid samples are filtered out. More details are in Table 3 of our paper.
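The flow of these three steps can be sketched as a small Python pipeline. This is a toy illustration under stated assumptions, not the OpenExp processing code: the `Sample` class, the `toy_lm` stand-in for the pretrained LM, the substring matching used in place of real named entity recognition, and the validity rule in `is_valid` are all hypothetical. The actual filtering criteria are those listed in Table 3 of the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    reaction_smiles: str
    raw_procedure: str                                   # unstructured record
    actions: List[str] = field(default_factory=list)    # structured steps
    materials: List[str] = field(default_factory=list)  # extracted entities


def translate(raw: str, lm: Callable[[str], List[str]]) -> List[str]:
    """Step 1: turn an unstructured record into structured steps via an LM."""
    return lm(raw)


def extract_materials(actions: List[str], vocabulary: List[str]) -> List[str]:
    """Step 2: toy stand-in for NER, matching known material names."""
    found: List[str] = []
    for action in actions:
        lowered = action.lower()
        for name in vocabulary:
            if name.lower() in lowered and name not in found:
                found.append(name)
    return found


def is_valid(sample: Sample) -> bool:
    """Step 3: hypothetical rule; keep samples with steps and materials."""
    return bool(sample.actions) and bool(sample.materials)


def build_dataset(raw_samples: List[Sample],
                  lm: Callable[[str], List[str]],
                  vocabulary: List[str]) -> List[Sample]:
    kept = []
    for sample in raw_samples:
        sample.actions = translate(sample.raw_procedure, lm)
        sample.materials = extract_materials(sample.actions, vocabulary)
        if is_valid(sample):
            kept.append(sample)
    return kept


def toy_lm(text: str) -> List[str]:
    """Placeholder for the pretrained LM: split the record into sentences."""
    return [s.strip() for s in text.split(".") if s.strip()]


raw = [
    Sample("CCO.CC(=O)O>>CCOC(C)=O", "Add ethanol to acetic acid. Stir for 2 h."),
    Sample("C>>C", ""),  # empty record, dropped by the filter
]
dataset = build_dataset(raw, toy_lm, ["ethanol", "acetic acid"])
```

Passing the LM as a callable keeps the sketch independent of any particular model; swapping `toy_lm` for a real model call would not change the rest of the pipeline.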

Related Links

This work uses MolCA and Galactica as the backbone molecule-text language models.

The processing of OpenExp is inspired by smiles2actions and paragraph2actions. The original data come from USPTO and ORD.

Citation

@inproceedings{liu2024reactxt,
      title={ReactXT: Understanding Molecular “Reaction-ship” via Reaction-Contextualized Molecule-Text Pretraining},
      author={Liu, Zhiyuan and Shi, Yaorui and Zhang, An and Li, Sihang and Zhang, Enzhi and Wang, Xiang and Kawaguchi, Kenji and Chua, Tat-Seng},
      booktitle={Findings of the Association for Computational Linguistics: {ACL} 2024},
      publisher={Association for Computational Linguistics},
      year={2024},
      url={https://openreview.net/forum?id=V-ejDfLiwe}
  }