Scalability and Multi-Task Learning in GraphStorm 0.3 with Intuitive APIs

GraphStorm is a low-code enterprise framework for graph machine learning (GML) that has become an essential tool for building, training, and deploying graph solutions at enterprise scale in days rather than months. With GraphStorm, solutions can directly account for the structure of relationships and interactions among billions of entities, which is crucial in real-world applications such as fraud detection, recommendations, community detection, and search and retrieval.

Today, GraphStorm 0.3 is released, adding native support for multi-task learning on graphs. The new version makes it possible to define multiple training objectives on different node and edge types within a single training loop. It also introduces refactored APIs for customizing GraphStorm pipelines; a custom node classification training loop can now be implemented in just 12 lines of code. To ease getting started with the new APIs, two Jupyter notebook examples have been published: one for node classification and another for link prediction. A comprehensive study of co-training language models (LMs) and graph neural networks (GNNs) on large graphs with rich textual features, using the Microsoft Academic Graph (MAG) dataset, has also been published.

Native support for multi-task learning on graphs reflects an effort to meet the needs of enterprise applications that use graph data for multiple tasks. For example, a retail organization may want to detect fraud among both sellers and buyers, and a scientific publisher may want to match papers to the works they should cite to improve discovery.
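Conceptually, training on several objectives at once means combining the per-task losses into a single value to optimize. The sketch below illustrates that idea in plain Python; the task names, loss values, and weights are hypothetical and do not reflect GraphStorm internals:

```python
# Illustrative only: combine per-task losses into one training objective.
# Task names, losses, and weights are hypothetical, not GraphStorm code.

def combine_task_losses(task_losses, task_weights):
    """Weighted sum of per-task losses, as used in multi-task training.

    task_losses:  dict mapping task name -> scalar loss for the batch
    task_weights: dict mapping task name -> importance weight
    """
    return sum(task_weights[name] * loss for name, loss in task_losses.items())

# Example: a node classification loss and a link prediction loss.
losses = {"paper_topic_classification": 1.0, "citation_link_prediction": 2.0}
weights = {"paper_topic_classification": 1.0, "citation_link_prediction": 0.5}
total = combine_task_losses(losses, weights)
print(total)  # 1.0 * 1.0 + 0.5 * 2.0 = 2.0
```

Weighting lets one task (here, classification) dominate the gradient while the other acts as an auxiliary signal.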

GraphStorm 0.3 supports six common tasks for multi-task learning on graphs: node classification, node regression, edge classification, edge regression, link prediction, and node feature reconstruction. Training objectives are specified through a YAML configuration file; for example, a topic classification task on nodes of type “paper” can be defined alongside a link prediction task on edges of type “paper-citing-paper.”
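As an illustration, a multi-task section of the training YAML might look like the sketch below. The field names follow GraphStorm's configuration conventions but should be verified against the official documentation, and all values are placeholders:

```yaml
# Illustrative sketch of a multi-task configuration; consult the
# GraphStorm documentation for the authoritative field names.
multi_task_learning:
  - node_classification:
      target_ntype: "paper"        # classify paper nodes by topic
      label_field: "label"
      num_classes: 10              # placeholder value
      task_weight: 1.0
  - link_prediction:
      train_etype:
        - "paper,citing,paper"     # predict paper-citing-paper edges
      num_negative_edges: 4        # placeholder value
      task_weight: 0.5
```

Each list entry defines one objective, and the per-task weights control how much each loss contributes to the combined training objective.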

Since its launch in early 2023, customers have mostly used the GraphStorm command-line interface (CLI), which simplifies building, training, and deploying models with common recipes. However, many customers have requested a more flexible interface for customizing training and inference pipelines to their specific requirements. In response, GraphStorm 0.3 introduces refactored APIs with which a custom node classification training pipeline can be defined in just 12 lines of code.
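The shape of such a pipeline is sketched below. Class and argument names follow GraphStorm's published notebook examples but should be checked against the API reference; the partition-config path, node type, and hyperparameters are placeholders, and the sketch requires a partitioned graph dataset to actually run:

```python
# Sketch of a custom node classification pipeline with the GraphStorm
# Python API; names per the published notebooks, values are placeholders.
import graphstorm as gs

gs.initialize()
data = gs.dataloading.GSgnnData(part_config="./data/graph.json")
train_loader = gs.dataloading.GSgnnNodeDataLoader(
    dataset=data,
    target_idx=data.get_node_train_set(ntypes=["paper"]),
    node_feats="feat", label_field="label",
    fanout=[20, 20], batch_size=64, train_task=True)
model = ...  # a node prediction model, e.g. an RGCN-based classifier
trainer = gs.trainer.GSgnnNodePredictionTrainer(model)
trainer.fit(train_loader=train_loader, num_epochs=5)
```

The published notebooks flesh out the model definition and add validation loaders and evaluators on top of this skeleton.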

In the previous version, GraphStorm introduced built-in techniques for efficiently training language models (LMs) and GNN models together at scale on graphs with rich text features. Since then, users have requested guidance on how best to apply these techniques. GraphStorm 0.3 addresses this demand with an LM+GNN benchmark on the Microsoft Academic Graph (MAG) dataset across two standard GML tasks: node classification and link prediction.

Performance was evaluated for two main methodologies: pre-trained BERT + GNN and fine-tuned BERT + GNN. The fine-tuned BERT + GNN method, introduced by GraphStorm developers in 2022, performed up to 40% better than the pre-trained BERT + GNN method on the MAG link prediction task.

GraphStorm has also been evaluated using large synthetic graphs to demonstrate its scalability, seamlessly handling graphs with up to 100 billion edges in a matter of hours.

GraphStorm 0.3, released under the Apache-2.0 license, is designed to address the challenges of large-scale GML, now offering native support for multitask learning and new APIs for customizing pipelines and other components. To get started, visit the GraphStorm GitHub repository and documentation.

via: MiMub in Spanish
