Today, we are announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. You can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.
At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.
SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.
With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.
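To illustrate that flexibility, here is a minimal sketch (not from the original walkthrough) using the SageMaker Python SDK, which is covered in more detail later in this post: the same launch code can target GPU or Trainium capacity by swapping the instance type and the recipe name. The recipe names below are hypothetical placeholders; check the recipes repository for the variants actually available for each accelerator family, and note that other estimator arguments (outputs, session, overrides) are configured as shown later in this post.
from sagemaker.pytorch import PyTorch

# Sketch only (not from the original walkthrough): switch accelerator families
# by changing the instance type and recipe name. Recipe names are hypothetical
# placeholders.
def build_estimator(role: str, use_trainium: bool) -> PyTorch:
    if use_trainium:
        instance_type = "ml.trn1.32xlarge"                   # Trainium-based instance
        recipe = "training/llama/<trainium_recipe_variant>"  # placeholder
    else:
        instance_type = "ml.p5.48xlarge"                     # GPU-based instance
        recipe = "training/llama/<gpu_recipe_variant>"       # placeholder
    return PyTorch(
        role=role,
        instance_type=instance_type,
        training_recipe=recipe,  # the recipe supplies the rest of the training configuration
    )
The same recipe-driven launch also works on SageMaker HyperPod clusters, as shown in the next section.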
SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.
You only need to edit straightforward recipe parameters to specify the instance type and the location of your dataset in the cluster configuration, and then run the recipe with a single line command to achieve state-of-the-art performance.
After cloning the repository, edit the config.yaml file to specify the model and cluster type.
$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml
The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up the cluster type (Slurm orchestrator), the model name (Meta Llama 3.1 405B language model), the instance type (ml.p5.48xlarge), and your data locations, such as where to store training data, results, logs, and so on.
defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.
In this YAML file, you can optionally adjust model-specific training parameters, which define the optimal configuration, including the number of accelerator devices, the instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.
run:
name: llama-405b
results_dir: ${base_results_dir}/${.name}
time_limit: "6-00:00:00"
restore_from_path: null
trainer:
devices: 8
num_nodes: 2
accelerator: gpu
precision: bf16
max_steps: 50
log_every_n_steps: 10
...
exp_manager:
exp_dir: # location for TensorBoard logging
name: helloworld
create_tensorboard_logger: True
create_checkpoint_callback: True
checkpoint_callback_params:
...
auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
model_type: llama_v3
train_batch_size: 4
tensor_model_parallel_degree: 1
expert_model_parallel_degree: 1
# other model-specific params
To run this recipe on SageMaker HyperPod with Slurm, you must first create a SageMaker HyperPod cluster following the cluster setup instructions.
Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to review the content before starting the training job.
$ python3 main.py --config-path recipes_collection --config-name=config
After training is complete, the trained model is automatically saved to your assigned data location.
To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster and use the HyperPod command line interface (CLI) to run the recipe.
$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"cluster": "k8s",
"cluster_type": "k8s",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
}'
You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example launches PyTorch training scripts on SageMaker training jobs, overriding parameters of the training recipe.
...
recipe_overrides = {
"run": {
"results_dir": "/opt/ml/model",
},
"exp_manager": {
"exp_dir": "",
"explicit_log_dir": "/opt/ml/output/tensorboard",
"checkpoint_dir": "/opt/ml/checkpoints",
},
"model": {
"data": {
"train_dir": "/opt/ml/input/data/train",
"val_dir": "/opt/ml/input/data/val",
},
},
}
pytorch_estimator = PyTorch(
output_path=<output_path>,
base_job_name=f"llama-recipe",
role=<role>,
instance_type="p5.48xlarge",
training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
recipe_overrides=recipe_overrides,
sagemaker_session=sagemaker_session,
tensorboard_output_config=tensorboard_output_config,
)
...
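The example above elides how tensorboard_output_config is constructed and how the job is started. Purely as an illustration, and assuming your training and validation datasets already live in Amazon S3 (the URIs below are placeholders), these pieces could look like the following; note that the TensorBoard configuration would need to be defined before the estimator is constructed.
from sagemaker.debugger import TensorBoardOutputConfig

# Sketch only (placeholder S3 URIs): one way to define the TensorBoard output
# configuration referenced above and to start the training job.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://<your_bucket>/tensorboard",           # where TensorBoard logs land in S3
    container_local_output_path="/opt/ml/output/tensorboard",  # matches exp_manager.explicit_log_dir
)

# Each channel name maps to /opt/ml/input/data/<channel> inside the container,
# matching the train_dir and val_dir overrides above.
pytorch_estimator.fit(
    inputs={
        "train": "s3://<your_bucket>/train",
        "val": "s3://<your_bucket>/val",
    },
    wait=True,
)
Each channel name in inputs is mounted at /opt/ml/input/data/&lt;channel&gt; inside the training container, which is why the recipe overrides point train_dir and val_dir to those paths.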
As training progresses, model checkpoints are stored on Amazon Simple Storage Service (Amazon S3) with a fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.
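If you are running on SageMaker training jobs and want that checkpoint-to-S3 mapping wired explicitly, one option (a sketch, assuming the recipe writes checkpoints to /opt/ml/checkpoints as in the overrides above; only the two additional arguments are shown here, the S3 URI is a placeholder) is to pass the standard checkpoint arguments when constructing the estimator:
# Sketch only: sync the container checkpoint directory to S3 so checkpoints
# survive fault recovery and instance restarts.
pytorch_estimator = PyTorch(
    # ...same arguments as in the example above...
    checkpoint_s3_uri="s3://<your_bucket>/checkpoints",  # S3 destination for checkpoints
    checkpoint_local_path="/opt/ml/checkpoints",         # matches exp_manager.checkpoint_dir above
)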
Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.
Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.
– Channy