Run an MPIJob
This page shows how to leverage Kueue’s scheduling and resource management capabilities when running MPI Operator MPIJobs.
This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.
Before you begin
See Administer cluster quotas for details on the initial cluster setup.
See the MPI Operator installation guide.
You can modify the Kueue configuration of an installed release to include MPIJobs as an allowed workload.
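As a rough illustration of what "allowed workload" means here, the fragment below sketches the integrations section of a Kueue Configuration with the MPIJob framework enabled. The apiVersion shown and the other entries in the list are assumptions that depend on the installed release; only the kubeflow.org/mpijob entry matters for this guide.

```yaml
# Sketch of the relevant part of the Kueue controller configuration.
# The apiVersion may differ between Kueue releases (assumption).
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "batch/job"            # example of another enabled integration
  - "kubeflow.org/mpijob"  # enables Kueue management of MPIJobs
```

After changing the configuration, the Kueue controller typically needs to be restarted before the new list takes effect.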
Note
To use MPIJob prior to v0.8.1, you need to restart Kueue after the installation. You can do so by running:
kubectl delete pods -l control-plane=controller-manager -n kueue-system
Note
When using both MPI Operator and Training Operator, you must disable Training Operator’s MPIJob option. The Training Operator deployment needs to be modified to enable all Kubeflow jobs except MPIJob, as mentioned here.
MPI Operator definition
a. Queue selection
The target local queue should be specified in the metadata.labels section of the MPIJob configuration.
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
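For context, the label value refers to a LocalQueue in the same namespace as the MPIJob. A minimal sketch of such a queue is shown below; the namespace "default" and the ClusterQueue name "cluster-queue" are placeholders assumed for illustration, and your administrator's setup may use different names.

```yaml
# Hypothetical LocalQueue matching the label above.
# "default" and "cluster-queue" are assumed placeholder names.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: user-queue           # must match the kueue.x-k8s.io/queue-name label
spec:
  clusterQueue: cluster-queue
```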
b. Optionally set the suspend field in MPIJobs
spec:
  runPolicy:
    suspend: true
By default, Kueue sets suspend to true via a webhook and unsuspends the MPIJob when it is admitted.
Sample MPIJob
This example is based on https://github.com/kubeflow/mpi-operator/blob/ccf2756f749336d652fa6b10a732e241a40c7aa6/examples/v2beta1/pi/pi.yaml.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                cpu: 1
                memory: 1Gi
For equivalent instructions for doing this in Python, see Run Python Jobs.