Base Command Manager / Bright Cluster Manager Release Notes
############################################################
Release notes for NVIDIA Base Command Manager (BCM) 11.25.08
############################################################
*Released: 1 October 2025*
General
=======
New Features
------------
* Updated Slurm 25.05 to 25.05.2
* Updated Slurm 24.11 to 24.11.6
* Updated Topograph to 3.2.0
* Updated etcd to 3.5.22
* Updated cm-nvhpc to 25.5
* Added cm-kubeadm-manage-joins helper script
Fixed Issues
------------
* Fixed potential killing of slurmdbd by systemd
NVIDIA Mission Control
======================
New Features
------------
* Support for NVIDIA Mission Control (NMC) 2.0
* Updated run:ai to 2.22
* Bumped Run:ai self-hosted helm chart to 40m timeout (20m was not enough)
* Run:ai cluster installation (helm install) increased max. timeout to 20 minutes (instead of default)
* Do not check if Kyverno is disabled during AHR (Autonomous Hardware Recovery) setup
* Run:ai wizard now includes the correct mpi / kubeflow CRDs
Fixed Issues
------------
* Run:ai cluster installation stage will now install always on the correct kube cluster (in case of multiple kubernetes cluster on a single BCM cluster)
* Long hostnames could visually render incorrect TUI dialogs in the Run:ai configuration dialog.
* Run:ai self-hosted allows for custom version selection and latest patch release versions
CMDaemon
========
New Features
------------
* Standardize on array result for rest/sysinfo
* Fixed an issue with changing cluster UUID not updating in BCM
* Ensure we add the KubernetesCertsExpiration healthcheck to existing deployments
* CMDaemon runs scontrol reconfig once for both slurm.conf and topology.conf updates
* Fixed an issue with labeled entities not being listed in the base-view monitoring view
* BCM can set node-role.kubernetes.io/runai-gpu-worker=true for GPU nodes (and cpu worker variant).
* Opt-in through BCM GlobalConfig to disable writing out root kubeconfig configuration files on kubernetes worker nodes.
Fixed Issues
------------
* Make sure hostname changes are actually host name changes + restore etcd:etcd ownership after restoring pki backup (done before and after kubeadm reset)
* Introduce new KubernetesCertsExpiration healthcheck, that warns if nodes have expiring certificates (within 30 days)
* Fixed regression in KubernetesChildNode healthcheck (and others)
* Extend v1/rest/status to include rack and system name
* cm-kubeadm-manage helper script should never overwrite etcd certificates handled by BCM
* Fixed regressions in metrics collection scripts in Kubernetes
* Only invoke kubeadm kubeconfig user commands when actually needed
COD
===
New Features
------------
* ALL: Change default BCM version to 11.0.
* ALL: Create inbound rules for both protocols (TCP/UDP) if the user didn't explicitly request a protocol.
* AWS: Preserve internal IP address when recovering broken AWS HA head node.
* AWS: Create placement group with spread strategy for better HA head node separation.
* Azure: Added support for NAT gateway for outbound connectivity.
* GCP: Add cloudsetting network_performance_config.
* GCP: Always delete empty subnets in BCM created networks.
* OCI: Ensure the VM architecture is compatible with the requested image.
Fixed Issues
------------
* ALL: Improve error on missing architecture config.
* ALL: Preserve SSH host keys on cm-cloud-ha-setup.
* ALL: HA setup checks that head node IP is within CIDR.
* ALL: Harmonize --cluster-tags and --head-node-tags cluster create flags.
* ALL: Updated image search defaults and validation. By default list the latest image. Flag --all-revisions overrides --latest. When both --latest and --all-revisions are used, an info message is logged.
* Azure: Powering on multiple nodes could raise an error while creating several VHDs at the same time.
* GCP: Fix cryptic invalid refresh token grant errors.
* GCP: Set correct instance hostname on cluster create.
* GCP: Reduce API requests on cluster delete.
* GCP: Fix timeout creating HA shared storage.
* GCP: Optimize cluster list.
* GCP: Improve reliability of cluster delete.
* OCI: Fixed cluster create hang when non-existing region is specified.
cm-kubernetes-setup
===================
New Features
------------
* Started using containerd v2 by default for new Kubernetes setups
* Updated Kubeflow training operator to 1.9.2
* Remove validation for NetQ and Kubernetes version Mapping Only for NetQ version 4.15.0
* The wizard now allows choosing 'toolkit.enabled=true', delegating the install + containerd configuration to the NVIDIA GPU Operator
Fixed Issues
------------
* Fixed kube_get_available_nodes remote_request RPC call to also consider head nodes for Kubernetes workers
* CAPI templates (stored in the kubernetes submode) should not be uninstallable via cm-kubernetes-setup.
* Fixed showing MetalLB/BGP screen in cm-kubernete-setup when Tigera operator is selected
cm-wlm-setup
============
Fixed Issues
------------
* Fixed adding pyxis plugin to secondary software images on multi-arch/multi-distro setups