Base Command Manager / Bright Cluster Manager Release Notes

#############################################################
Release notes for NVIDIA Base Command™ Manager (BCM) 10.25.03
#############################################################

*Released: 28 March 2025*

General
=======

New Features
------------

* Updated multiple Kubernetes operators when performing new setups: ingress-nginx 4.12.1 (fix CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24513, CVE-2025-24514), kube-prometheus-stack 70.3.0 (fix CVE-2025-22868, CVE-2025-22870), kube-state-metrics 5.31.0. Existing Kubernetes setups need to be updated manually.
* Added BFB pre/post-install sections in cm-dpu-manage
* Added CUDA 12.6 packages
* Added mlnx-ofed24.10 packages
* Updated Slurm 24.05 to 24.05.7
* Updated Slurm 24.11 to 24.11.3
* Updated cm-nvhpc to 25.1
* Updated cuda-driver to 570.124.06
* Updated cuda-driver-535 to 535.216.03
* Updated cuda-driver-550 to 550.127.08
* Updated cuda12.6 to 12.6.3
* Updated mlnx-ofed23.10 to 23.10-4.0.9.1
* Updated mlnx-ofed58 to 5.8-6.0.4.2
* Updated the Ubuntu 24.04 base distribution to Ubuntu 24.04.1
* Updated BaseOS image to version 7.0.2

Fixed Issues
------------

* Updated the grub images provided by cm-tftpboot to support booting PE32+ Linux kernels. Without this update, aarch64 nodes fail to boot with RHEL 9.5

CMDaemon
========

New Features
------------

* Fixed an issue where, in the case of a write failure, CMDaemon may generate a large number of websocket-related error log messages
* Fixed an issue parsing the BGP information in cm-lite-daemon
* Fixed an issue where an invalid "plain/text" MIME type may be used by CMDaemon
* Added new advanced configuration option HttpStrictTransportSecurityMaxAge
* Update the Slurm topology.conf file when a cloud node goes UP or DOWN
* Allow the option to configure network security group for a specific VNIC in OCI
* Fixed an issue where DPU apply does not work for a brief period immediately after the node becomes UP
* Added bound checks in the monitoring storage to prevent possible crashes due to data corruption
* Extend cm-deploy-lite-daemon to download packages on the head node for switches that are not connected to the external network
* Fixed an issue where CMDaemon configures the ntp service instead of ntpsec on Ubuntu 24.04
* Include the arch/os software image information in the CMDaemon XML dump file
* Allow the option to specify onboot=no for network interfaces via extra configuration values
* Added total cpu and memory utilization metrics
* Fixed an issue with generating the Slurm topology configuration from switches connected to other switches
* Restrict the ability to use cmsh foreach or range commands for inefficient power or terminate operations on a large number of devices
* Add a new REST API endpoint for WLM drain operations
* Allow the option to override the list of enabled OCI agent plugins
* Allow the option to customize /var/lib/kubelet/config.yaml via the Kubelet role
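
For readers unfamiliar with the Slurm topology items above, the following is a minimal sketch of how a switch hierarchy maps onto ``SwitchName=... Nodes=...`` and ``SwitchName=... Switches=...`` lines in topology.conf. It is not CMDaemon's implementation: the function name ``render_topology``, its parameters, and the switch and node names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def render_topology(
    leaf_nodes: Dict[str, List[str]],
    uplinks: Dict[str, List[str]],
) -> str:
    """Render topology.conf lines for a tree of switches.

    leaf_nodes maps a switch name to the compute nodes attached to it.
    uplinks maps a switch name to the switches connected below it.
    Both mappings are hypothetical inputs, not a BCM data structure.
    """
    lines = []
    # Leaf switches list the nodes plugged into them.
    for switch in sorted(leaf_nodes):
        nodes = ",".join(sorted(leaf_nodes[switch]))
        lines.append(f"SwitchName={switch} Nodes={nodes}")
    # Higher-level switches list the switches connected beneath them.
    for switch in sorted(uplinks):
        children = ",".join(sorted(uplinks[switch]))
        lines.append(f"SwitchName={switch} Switches={children}")
    return "\n".join(lines) + "\n"


# Hypothetical inventory: two leaf switches under one spine switch.
example = render_topology(
    leaf_nodes={"leaf01": ["node001", "node002"], "leaf02": ["node003"]},
    uplinks={"spine01": ["leaf01", "leaf02"]},
)
print(example, end="")
```

In this sketch a switch connected to other switches becomes a ``Switches=`` line, which is the case the topology generation item above refers to.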

Fixed Issues
------------

* An issue where Redfish metrics containing a ~ symbol are being exported to Prometheus
* Perform periodic checks for certificate signing requests (CSRs) for new Kubelets and certificate rotations. Without these checks, CSR approvals may not take place
* Rare CMDaemon crash when performing PDU-port power operations
* An issue where the DPU apply RPC timeout is too short for the operation to complete
* Improved pagination of the REST API /network/topology endpoint
* An issue where the sysinfo GPU UUID does not match the nvidia-smi UUID
* An issue where switching to a different consolidator can leave behind the old monitoring data
* An issue where kill-no-job-user-ssh-sessions returns no-data instead of failing
* An issue where recently added labeled entities may be removed and prevent returning correct PromQL results
* An issue where sysinfo disk information is shown multiple times for devices managed by the lite-daemon
* An issue with parsing jobs information when group name is not set
* Rare deadlock within the head node CMDaemon process when both head nodes and many other nodes are being committed at the same time from different threads
* An issue where the slurmctld service may be restarted when CMDaemon is restarted
* An issue where chargeback queries can incorrectly report the value is out of range
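
As background for the Redfish metrics item above: the classic Prometheus metric-name pattern is ``[a-zA-Z_:][a-zA-Z0-9_:]*``, which does not include ``~``. The sketch below shows one way such names could be filtered before export. How CMDaemon actually handles these metrics is not described in these notes, so the filtering approach, the function names, and the sample metric names are assumptions made for illustration.

```python
import re
from typing import Dict

# Classic Prometheus metric-name pattern; "~" is not an allowed character.
LEGACY_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_exportable(metric_name: str) -> bool:
    """Return True if the whole name matches the classic pattern."""
    return LEGACY_METRIC_NAME.fullmatch(metric_name) is not None


def exportable_metrics(samples: Dict[str, float]) -> Dict[str, float]:
    """Keep only samples whose names match the classic pattern (assumed policy)."""
    return {name: value for name, value in samples.items() if is_exportable(name)}


# Hypothetical sample names, not actual BCM metric names.
print(exportable_metrics({"redfish_fan_speed": 4200.0, "redfish_fan~1": 4100.0}))
```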

Node Installer
==============

Fixed Issues
------------

* An issue with copying dangling symbolic links from the /cm/conf/* configuration directories to the node

COD
===

New Features
------------

* Disable public network access by default for the storage account in Azure

Machine Learning
================

New Features
------------

* Added NCCL 2.25.1 and cuDNN 9.6 and 9.7 for CUDA 12.8

cm-kubernetes-setup
===================

New Features
------------

* Modify the default configuration for NVIDIA container toolkit to match the Run:ai requirements
* Tune the Kubernetes API Server to more sensible defaults for production systems
* Various improvements to the out-of-the-box Kube Prometheus Stack configuration. Existing BCM setups from before 10.25.02 can be patched with cm-kubernetes-setup --patch-kube-prometheus-stack
* Simplify the BCM landing page ingress, which no longer needs a running Pod
* Enable Typha by default on clusters with fewer than 50 nodes when setting up Kubernetes with Calico CNI
* Allow the option to set up NetQ 4.13 with cm-kubernetes-setup using Kubernetes v1.31 on Ubuntu 22.04
* Allow the option to install NVIDIA GPU Operator without installing BCM NVIDIA GPU packages

Improvements
------------

* Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough

Fixed Issues
------------

* An issue where the cluster-admin service account is not created in the correct 'default' namespace
* Suppress an incorrect warning message about existing /etc/kubernetes directory as a symlink on the secondary head node
* Allow the option to install BCM NVIDIA GPU packages without installing NVIDIA GPU Operator
* Wait for the Etcd information to become available to prevent cm-kubernetes-setup failures with error message 'NoneType' object has no attribute 'advertiseClientUrls'
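
The last item above describes a wait-until-available pattern: reading an attribute such as ``advertiseClientUrls`` from a value that is still ``None`` raises the quoted ``'NoneType' object has no attribute`` error. A minimal, generic sketch of polling until a value is available follows. It is not the cm-kubernetes-setup code: ``wait_for``, its parameters, and the default timeout and interval are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    fetch: Callable[[], Optional[T]],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> T:
    """Call fetch() repeatedly until it returns something other than None.

    Raises TimeoutError if no value is available once the timeout has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not become available in time")
        time.sleep(interval)
```

A caller would pass a function that returns ``None`` while the information is missing, and only access attributes on the returned value.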

cm-wlm-setup
============

New Features
------------

* Allow the option to select NRT GPU configuration settings in cm-wlm-setup

Fixed Issues
------------

* An issue with setting up pyxis if the secondary head node is down

cmsh
====

Fixed Issues
------------

* An issue with creating a 'node' type execution multiplexer in cmsh
* An issue with the addinterface command which can result in a crash of cmsh

jupyter
=======

Fixed Issues
------------

* An issue where kernel icons are not available when the certificates are generated with openssl 3.2.2

pythoncm
========

New Features
------------

* Added pythoncm parallel MIG function RPC