
Lessons Learned from Building Foundation Models on AWS Through Japan’s GENIAC Program


In 2024, Japan’s Ministry of Economy, Trade and Industry announced the Generative AI Accelerator Challenge (GENIAC), a national program aimed at advancing generative artificial intelligence in the country. This ambitious initiative offers companies not only funding but also guidance and access to powerful computational resources for developing foundation models. In this context, Amazon Web Services (AWS) served as the infrastructure provider for the second phase of GENIAC, delivering technical support to 12 participating organizations.

At first glance, the task of providing access to robust hardware infrastructure, including hundreds of GPUs and Trainium chips, seemed straightforward. Experience showed, however, that successfully training foundation models requires much more than advanced hardware. AWS found that having over 1,000 accelerators was only a starting point; the real complexity lay in building a system that operates reliably and in overcoming the challenges of distributed training.

During this phase, the 12 organizations managed to deploy 127 Amazon EC2 P5 instances, servers equipped with NVIDIA H100 Tensor Core GPUs, and 24 Amazon EC2 Trn1 instances in a single day. Over the next six months, several large-scale models were trained, including notable projects such as Stockmark-2-100B-Instruct-beta and Llama 3.1 Shisa V2 405B.

A key takeaway from this experience was the importance of having multidisciplinary teams to carry out large-scale machine learning initiatives. AWS formed a virtual team that integrated account specialists, solution architects, and service teams, creating an environment conducive to effective knowledge sharing and support.

Structured communication also proved crucial. An internal Slack channel was established to coordinate the program, enabling quick problem resolution and creating a space where participants could ask questions and share information. To ensure proper tracking, AWS maintained detailed documentation on each customer, clarifying technical requirements and configurations. Through weekly meetings, the team shared learnings and continuously refined the engagement model.

The creation of reference architectures was also essential in this process. Instead of having each team start from scratch, AWS developed pre-validated templates and automations to facilitate two main approaches: AWS ParallelCluster and SageMaker HyperPod. These technical frameworks covered the entire stack, allowing teams to deploy their environments with minimal friction.
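To make the template-based approach concrete, here is a minimal sketch of what a pre-validated AWS ParallelCluster (v3) configuration for GPU training nodes might look like. This is an illustrative assumption, not the actual GENIAC template: the subnet ID, key name, region, and node counts are hypothetical placeholders.

```yaml
# Hypothetical ParallelCluster v3 config: Slurm cluster with P5 GPU compute nodes
# and a shared FSx for Lustre filesystem for training data and checkpoints.
Region: ap-northeast-1            # placeholder region
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-xxxxxxxx     # placeholder subnet
  Ssh:
    KeyName: my-key               # placeholder key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      ComputeResources:
        - Name: p5-nodes
          InstanceType: p5.48xlarge   # 8x NVIDIA H100 GPUs per instance
          MinCount: 0
          MaxCount: 4                 # placeholder fleet size
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
        PlacementGroup:
          Enabled: true               # keep nodes close for low-latency interconnect
SharedStorage:
  - MountDir: /fsx
    Name: training-fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200           # GiB, placeholder
```

A template like this lets each team launch a working Slurm cluster with one `pcluster create-cluster` command instead of wiring up networking, scheduling, and shared storage by hand, which is the kind of friction reduction the reference architectures were meant to provide.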

The GENIAC program has demonstrated that large-scale training of foundation models is not only a technical challenge but also an organizational one. Thanks to structured and collaborative support, a small group of participants successfully ran large workloads in the cloud. As this second phase concluded, a technical event was held in Tokyo to prepare developers for the next stage of GENIAC, marking a significant step forward in generative artificial intelligence. AWS reaffirms its commitment to advancing these technologies on a global scale.

Source: MiMub (original in Spanish)
