AWS Batch introduces quota management and preemption for SageMaker Training jobs
AWS Batch has introduced quota management with job preemption for SageMaker Training jobs, enhancing resource allocation and prioritization.
AWS Batch now supports quota management with job preemption for SageMaker Training jobs, so that compute resources can be allocated and shared efficiently across teams and projects. For users running SageMaker Training jobs on GPU capacity, this means capacity can be divided among teams, critical training jobs can be prioritized, and lower-priority workloads are preempted automatically when urgent experiments need resources.
With quota management, users can create up to 20 quota shares per job queue. Quota shares function as virtual queues, each with a dedicated capacity limit and a configurable resource sharing strategy. When a quota share has lent idle capacity to other shares, AWS Batch automatically uses cross-share preemption to reclaim that borrowed capacity once the owning share submits jobs of its own. Within-share preemption is also supported, allowing high-priority jobs to preempt lower-priority jobs in the same quota share.
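The borrowing and preemption behavior described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the AWS Batch scheduler or any AWS API: the names Job, JobQueue, and submit are hypothetical, a larger priority number is assumed to mean a more important job, and the order in which victims are chosen (lowest priority first) is an illustrative choice.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    share: str      # quota share the job belongs to
    priority: int   # assumption: larger number = more important
    gpus: int       # capacity the job needs


@dataclass
class JobQueue:
    """Toy model of one job queue split into quota shares (hypothetical)."""
    limits: dict                      # share name -> capacity limit
    running: list = field(default_factory=list)

    def used(self, share):
        return sum(j.gpus for j in self.running if j.share == share)

    def free(self):
        return sum(self.limits.values()) - sum(j.gpus for j in self.running)

    def submit(self, job):
        """Start the job if capacity can be found.

        Returns the names of preempted jobs, or None if the job must wait.
        """
        free = self.free()
        usage = {s: self.used(s) for s in self.limits}
        victims = []
        by_priority = sorted(self.running, key=lambda j: j.priority)

        # Cross-share preemption: if the job fits inside its own share's
        # limit, reclaim capacity from shares running above their limit.
        if usage[job.share] + job.gpus <= self.limits[job.share]:
            for r in by_priority:
                if free >= job.gpus:
                    break
                if r.share != job.share and usage[r.share] > self.limits[r.share]:
                    victims.append(r)
                    free += r.gpus
                    usage[r.share] -= r.gpus

        # Within-share preemption: evict lower-priority jobs in the same share.
        for r in by_priority:
            if free >= job.gpus:
                break
            if r.share == job.share and r.priority < job.priority:
                victims.append(r)
                free += r.gpus

        if free < job.gpus:
            return None  # nothing is preempted if the job still cannot fit

        self.running = [j for j in self.running if j not in victims]
        self.running.append(job)
        return [v.name for v in victims]


queue = JobQueue(limits={"research": 4, "prod": 4})
print(queue.submit(Job("r1", "research", 5, 4)))  # runs inside its own share
print(queue.submit(Job("r2", "research", 1, 4)))  # borrows idle "prod" capacity
print(queue.submit(Job("p1", "prod", 1, 4)))      # owner reclaims: r2 is preempted
print(queue.submit(Job("p2", "prod", 9, 4)))      # higher priority evicts p1
```

In this sketch, idle capacity is lent freely, and a job is only preempted when the submitting share is reclaiming its own allotment or outranks a job in the same share. The real service exposes configurable sharing strategies that this model does not attempt to reproduce.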
Users can monitor capacity utilization at the queue, quota share, and job level. They can also update a job's priority after the job has been submitted, which can influence preemption decisions, and configure preemption retry limits to control how preempted jobs are handled. The feature integrates directly with the SageMaker Python SDK through the aws_batch module.
Quota management and job preemption for SageMaker Training jobs are available in all AWS Regions where AWS Batch is available. For details, see the Quota Management example notebook on GitHub and the AWS Batch User Guide.