Amazon has introduced a new cluster creation experience for SageMaker HyperPod. The tool sets up distributed training and inference clusters with a single click, removing many of the errors that commonly arise during manual configuration. Clusters can be orchestrated with Slurm or Amazon Elastic Kubernetes Service (EKS), and the setup also provisions secure networking through Amazon Virtual Private Cloud (VPC) along with high-performance storage.
With SageMaker HyperPod, users can scale demanding workloads such as training generative AI models or optimizing existing models. Clusters can include hundreds or even thousands of AI accelerators. The platform also monitors hardware continuously, automatically resolves issues, and recovers workloads without manual intervention.
Previously, users had to configure several AWS resources by hand, and each manual step was a potential point of failure. The new experience streamlines this process by creating all necessary resources in a single step and automatically applying recommended default settings.
The Amazon SageMaker AI console now offers two deployment options: a quick setup and a custom setup. The quick setup applies defaults for instance groups, networking, orchestration, and permissions, while the custom setup gives more granular control over each of these parameters.
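The difference between the two options can be pictured as a set of defaults that the custom path is allowed to override. The sketch below is only an illustration of that idea: `ClusterSetup`, `quick_setup`, `custom_setup`, and every default value are hypothetical and do not reflect the console's actual defaults or any AWS API.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusterSetup:
    """Hypothetical model of the choices a cluster setup has to make.

    All default values are illustrative placeholders, not the console's
    real defaults.
    """
    orchestrator: str = "eks"                 # "eks" or "slurm"
    instance_type: str = "ml.g5.8xlarge"      # placeholder instance type
    instance_count: int = 1
    vpc_id: Optional[str] = None              # None means "create a new VPC"
    security_group_ids: Tuple[str, ...] = ()  # empty means "create new ones"


def quick_setup() -> ClusterSetup:
    """Quick path: accept every default."""
    return ClusterSetup()


def custom_setup(**overrides) -> ClusterSetup:
    """Custom path: start from the defaults and override selected fields."""
    setup = replace(ClusterSetup(), **overrides)
    if setup.orchestrator not in ("eks", "slurm"):
        raise ValueError(f"unsupported orchestrator: {setup.orchestrator}")
    if setup.instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return setup
```

Calling `custom_setup(orchestrator="slurm", vpc_id="vpc-0123")` would keep the default instance settings while reusing an existing VPC, which mirrors the way the custom option builds on the same defaults as the quick one.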
The automated setup creates a new VPC and subnets, a new EKS cluster running the latest version of Kubernetes, and a new S3 bucket that stores the lifecycle scripts. The custom setup allows the use of existing VPCs or security groups, as well as the installation of specific operators in the EKS cluster.
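For readers who script their infrastructure, the resources listed above are roughly the inputs a programmatic cluster request needs. The following is a minimal sketch that only assembles a request dictionary shaped like the SageMaker `CreateCluster` API as exposed through boto3; the helper name `build_create_cluster_request`, the instance group name, the instance type, and the `on_create.sh` script name are assumptions chosen for illustration, and field names should be checked against the current API reference before use. It does not reproduce what the console does internally.

```python
from typing import Any, Dict, List


def build_create_cluster_request(
    cluster_name: str,
    eks_cluster_arn: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
    lifecycle_s3_uri: str,
    execution_role_arn: str,
    instance_type: str = "ml.g5.8xlarge",  # placeholder, not a recommendation
    instance_count: int = 1,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a HyperPod cluster request (EKS orchestration)."""
    return {
        "ClusterName": cluster_name,
        # Points the HyperPod cluster at an existing EKS cluster.
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",  # hypothetical name
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                # Lifecycle scripts are read from the S3 location given here.
                "LifeCycleConfig": {
                    "SourceS3Uri": lifecycle_s3_uri,
                    "OnCreate": "on_create.sh",  # assumed script name
                },
                "ExecutionRole": execution_role_arn,
            }
        ],
        "VpcConfig": {
            "SecurityGroupIds": security_group_ids,
            "Subnets": subnet_ids,
        },
        # Requests automatic replacement of faulty nodes.
        "NodeRecovery": "Automatic",
    }


# Sending the request requires boto3 and AWS credentials, so it is left commented out:
# import boto3
# boto3.client("sagemaker").create_cluster(**build_create_cluster_request(...))
```

Keeping the request construction separate from the network call makes the payload easy to inspect and review before any resources are created.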
Both modes allow users to add new instance groups, either standard or restricted, and to choose between on-demand capacity and flexible training plans. SageMaker HyperPod also offers advanced health checks and options for customizing lifecycle scripts, which makes it well suited to large-scale machine learning model training.
The new cluster creation experience is designed to make resilient infrastructure easier to set up and to fit deployment into continuous delivery workflows. With this update, Amazon aims to make customized training environments more accessible to the varied users working in artificial intelligence and machine learning.
—
Referrer: MiMub in Spanish