Base Command Manager / Bright Cluster Manager Release Notes

############################################################
Release notes for NVIDIA Base Command Manager (BCM) 11.25.08
############################################################

*Released: 1 October 2025*

General
=======

New Features
------------

* Updated Slurm 25.05 to 25.05.2
* Updated Slurm 24.11 to 24.11.6
* Updated Topograph to 3.2.0
* Updated etcd to 3.5.22
* Updated cm-nvhpc to 25.5
* Added cm-kubeadm-manage-joins helper script

Fixed Issues
------------

* Fixed potential killing of slurmdbd by systemd

NVIDIA Mission Control
======================

New Features
------------

* Support for NVIDIA Mission Control (NMC) 2.0
* Updated run:ai to 2.22
* Bumped Run:ai self-hosted helm chart to 40m timeout (20m was not enough)
* Run:ai cluster installation (helm install) increased max. timeout to 20 minutes (instead of default)
* Do not check if Kyverno is disabled during AHR (Autonomous Hardware Recovery) setup
* Run:ai wizard now includes the correct mpi / kubeflow CRDs

Fixed Issues
------------

* Run:ai cluster installation stage will now install always on the correct kube cluster (in case of multiple kubernetes cluster on a single BCM cluster)
* Long hostnames could visually render incorrect TUI dialogs in the Run:ai configuration dialog.
* Run:ai self-hosted allows for custom version selection and latest patch release versions

CMDaemon
========

New Features
------------

* Standardize on array result for rest/sysinfo
* Fixed an issue with changing cluster UUID not updating in BCM
* Ensure we add the KubernetesCertsExpiration healthcheck to existing deployments
* CMDaemon runs scontrol reconfig once for both slurm.conf and topology.conf updates
* Fixed an issue with labeled entities not being listed in the base-view monitoring view
* BCM can set node-role.kubernetes.io/runai-gpu-worker=true for GPU nodes (and cpu worker variant).
* Opt-in through BCM GlobalConfig to disable writing out root kubeconfig configuration files on kubernetes worker nodes.

Fixed Issues
------------

* Make sure hostname changes are actually host name changes + restore etcd:etcd ownership after restoring pki backup (done before and after kubeadm reset)
* Introduce new KubernetesCertsExpiration healthcheck, that warns if nodes have expiring certificates (within 30 days)
* Fixed regression in KubernetesChildNode healthcheck (and others)
* Extend v1/rest/status to include rack and system name
* cm-kubeadm-manage helper script should never overwrite etcd certificates handled by BCM
* Fixed regressions in metrics collection scripts in Kubernetes
* Only invoke kubeadm kubeconfig user commands when actually needed

COD
===

New Features
------------
* ALL: Change default BCM version to 11.0.
* ALL: Create inbound rules for both protocols (TCP/UDP) if the user didn't explicitly request a protocol.
* AWS: Preserve internal IP address when recovering broken AWS HA head node.
* AWS: Create placement group with spread strategy for better HA head node separation.
* Azure: Added support for NAT gateway for outbound connectivity.
* GCP: Add cloudsetting network_performance_config.
* GCP: Always delete empty subnets in BCM created networks.
* OCI: Ensure the VM architecture is compatible with the requested image.

Fixed Issues
------------

* ALL: Improve error on missing architecture config.
* ALL: Preserve SSH host keys on cm-cloud-ha-setup.
* ALL: HA setup checks that head node IP is within CIDR.
* ALL: Harmonize --cluster-tags and --head-node-tags cluster create flags.
* ALL: Updated image search defaults and validation. By default list the latest image. Flag --all-revisions overrides --latest. When both --latest and --all-revisions are used, an info message is logged.
* Azure: Powering on multiple nodes could raise an error while creating several VHDs at the same time.
* GCP: Fix cryptic invalid refresh token grant errors.
* GCP: Set correct instance hostname on cluster create.
* GCP: Reduce API requests on cluster delete.
* GCP: Fix timeout creating HA shared storage.
* GCP: Optimize cluster list.
* GCP: Improve reliability of cluster delete.
* OCI: Fixed cluster create hang when non-existing region is specified.

cm-kubernetes-setup
===================

New Features
------------

* Started using containerd v2 by default for new Kubernetes setups
* Updated Kubeflow training operator to 1.9.2
* Remove validation for NetQ and Kubernetes version Mapping Only for NetQ version 4.15.0
* The wizard now allows choosing 'toolkit.enabled=true', delegating the install + containerd configuration to the NVIDIA GPU Operator

Fixed Issues
------------

* Fixed kube_get_available_nodes remote_request RPC call to also consider head nodes for Kubernetes workers
* CAPI templates (stored in the kubernetes submode) should not be uninstallable via cm-kubernetes-setup.
* Fixed showing MetalLB/BGP screen in cm-kubernetes-setup when Tigera operator is selected

cm-wlm-setup
============

Fixed Issues
------------

* Fixed adding pyxis plugin to secondary software images on multi-arch/multi-distro setups