Base Command Manager / Bright Cluster Manager Release Notes

############################################################
Release notes for NVIDIA Base Command Manager (BCM) 10.24.11
############################################################

*Released: 12 December 2024*


General
=======

New Features
------------

* Added cuda-driver-550 and cuda-fabric-manager-550 packages
* Added mlnx-ofed24.07 package
* Added support for SUSE Linux Enterprise Server (SLES) 15 SP6

Improvements
------------

* Updated Nsight Systems to 2024.6.1
* Updated cm-openssl to 3.1.7
* Updated cuda-driver to 565.57.01
* Updated cuda-driver-535 to 535.216.01
* Updated cuda12.6-toolkit to 12.6 Update 2
* Updated freeipmi to 1.6.14

CMDaemon
========

Improvements
------------

* Reduced time required for all compute nodes to reconnect when the head node CMDaemon is restarted
* Use 64-bit OID versions for in and out octets in the SNMP switch monitoring sampler
* Added a REST endpoint to get the network topology
* Allow the option to configure /home exports per category for individual users/tenants
* Improved REST rack API call result to include the device type information
* Allow the option to configure additional DNS forward zones for networks defined in CMDaemon
* Added support for Equal-Cost Multi-Path (ECMP) routing to IP Routing in layer3 setups
* Added the /[a-f0-9]{12}_[hc]/ regex to the default list of options for the IgnoreInotifyInterface advanced configuration option
* Added subgroups support in the WLM check-alloc implementation for allowing or denying user logins to compute nodes
* Allow the option to deploy cm-lite-daemon with a cm-deploy-lite-daemon.sh deployment script without using ZTP on Cumulus switches
* Improved CMDaemon commit validation for bootable networks when there is an overlap in the network ranges/CIDR
* Allow the option to disable the JSON login for a list of users defined in the advanced configuration option DisableLoginServiceUsers
* Added Raritan PDU monitoring sampler script
* Allow the option to use FQDN for the compute nodes with the global configuration option ShortHostname=0
* Kubernetes module files will now be created on all nodes with kubelet or firewall roles
* Use the Cumulus nv commands for setting up the username and password on Cumulus 5.9 and newer
* Sample the DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS metric for GPU NVSwitches
* CMDaemon will now manage the mst service on all nodes with a DPU
* Added a ZTP stage directory /cm/local/apps/cmd/etc/htdocs/dpu/ztp for scripts executed on DPUs after BFB push
* Added new endpoint to REST API for power management
* Added new REST API endpoint for node categories
* Added Forge IB interface support
* Added CMDaemon health check for expiring Kubernetes certificates
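
As an illustration of the new IgnoreInotifyInterface default listed above, the following minimal sketch shows which names the ``[a-f0-9]{12}_[hc]`` expression accepts. The interface names are hypothetical, and the use of a full-string match is an assumption; these notes do not state how CMDaemon applies the expression.

```python
import re

# Pattern copied from the release note; the surrounding slashes are delimiters.
IGNORE_PATTERN = re.compile(r"[a-f0-9]{12}_[hc]")


def is_ignored(interface_name: str) -> bool:
    """Return True if the whole name matches the pattern.

    Full-string matching is an assumption made for this sketch.
    """
    return IGNORE_PATTERN.fullmatch(interface_name) is not None


# Hypothetical interface names, for illustration only.
for name in ["3f9a0c1b2d4e_h", "3f9a0c1b2d4e_c", "eth0", "ABCDEF012345_h"]:
    print(name, "ignored" if is_ignored(name) else "not ignored")
```

Only names made of exactly twelve lowercase hexadecimal characters followed by ``_h`` or ``_c`` match; uppercase hexadecimal digits and ordinary names such as ``eth0`` do not.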

Fixed Issues
------------

* An issue where Azure cloud node instance creation failures may not be correctly reported by cmsh
* An issue with the head node HA shared interfaces not being brought up automatically after they are manually brought down
* An issue with the duplex regex in the interfaces health check
* An issue with the Slurm takeover script
* An issue where the unit of the PDUUptime and SwitchUptime metrics is not correctly shown in cmsh/Base View
* An issue where automatic file system exports may not be removed when a network configuration is updated
* An issue with missing newlines in /var/spool/cmd/events.log
* An issue where cmd -x can produce an XML configuration file with duplicate values for the "revision" or "extra values" properties
* An issue where the TotalGPUTemperature metric is reported as 0
* An issue with configuring the Slurm accounting service on edge setups
* An issue where the sgeexecd service on the compute nodes may be restarted when CMDaemon is restarted
* An issue with applying the search domain index in the partition and category settings when generating the resolver configuration
* An issue with applying the search domain index in the network settings when generating the resolver configuration
* An issue where level 3 switches are not being added to the Slurm topology.conf file when the tree Slurm plugin is configured
* An issue with the Prometheus monitoring sampler data collection when a username and a password are configured
* An issue with the Slurm power management scripts raising an AttributeError exception
* An issue where empty configuration options in the Kubelet role may not result in updates to the Kubernetes manifest files
* An issue where duplicate provisioning requests may be queued when a cloud director is not UP
* An issue where the megaraid health checks may not be able to report a failure because the FAIL message is written to an incorrect file descriptor
* An issue with the node-installer unable to copy symbolic links from the /cm/conf/ directories
* A timing issue with image update of compute nodes running systemd-managed automount filesystems, where CMDaemon may not detect the automounted filesystem mount point and may not add it to the exclude list
* An issue where CMDaemon may not be able to update the kernel hash in the MySQL database when a software image initrd is updated
* An issue with the Redfish monitoring sampler printing informational messages to an incorrect file descriptor
* An issue where, in some cases, ramdisk creation may be started while CMDaemon is stopping
* An issue where Azure compute nodes may be left dangling after powering on more nodes than the allowed quota
* An issue where the WLM slots values may be expressed in bytes in cmsh/Base View
* An issue where switch control scripts may not correctly separate the stdout and stderr output
* An issue where the restart required flag is set for DPUs when they are not running CMDaemon
* An issue where the dhcpd service configuration file may include compute node BMC interfaces which are not in use
* An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file
* An issue with the system interrupts metrics not being expressed in number of interrupts per second
* An issue where CPU* metrics are displayed in Jiffies instead of Jiffies/s
* An issue where ProcSNMP metrics such as IpInDelivers are not configured as cumulative
* An issue where the SlurmState metrics may not include hostnames that include hyphens
* An issue where a ramdisk creation task does not transition to a failed state when trying to create a ramdisk for a locked image
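
Several of the metric fixes above concern values that should be reported as per-second rates derived from cumulative counters. As a rough sketch of that general technique (not CMDaemon's actual implementation), a rate can be computed from two successive samples as follows. The function name, its parameters, and the sample values are hypothetical.

```python
def counter_rate(prev: int, curr: int, interval_s: float, counter_bits: int = 64) -> float:
    """Per-second rate between two samples of a cumulative counter.

    Hypothetical helper: counter_bits is the width of the counter, so a
    single wraparound between the two samples is handled correctly.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # Modular subtraction yields the true increase even if the counter wrapped once.
    delta = (curr - prev) % modulus
    return delta / interval_s


# Counter grew by 3000 over 30 seconds.
print(counter_rate(1000, 4000, 30))  # 100.0

# A 32-bit counter that wrapped between samples: the increase is 500 over 10 seconds.
print(counter_rate(2**32 - 100, 400, 10, counter_bits=32))  # 50.0
```

A wider counter wraps far less often, so a rate computed from two samples is less likely to miss multiple wraparounds within a single sampling interval.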

Node Installer
==============

Fixed Issues
------------

* An issue where the /etc/machine-id files of the compute nodes are not unique
* An issue where, for nodes running Ubuntu, the bond options of a bonded interface are not added correctly to the networking configuration file

COD
===

Fixed Issues
------------

* COD OpenStack: An issue where the cluster start does not wait for renamed nodes

Machine Learning
================

New Features
------------

* Added NCCL 2.23.4 for CUDA12.6

cm-bios-tools
=============

Fixed Issues
------------

* An issue with random Redfish disconnect errors
* An issue with performing flash operations of H100 GPU tray firmware in parallel

cm-cluster-extension
====================

Fixed Issues
------------

* A validation issue in the advanced settings dialog which can result in validation error messages such as "Create tunnel networks is not integer"

cm-kubernetes-setup
===================

Improvements
------------

* Allow the option to configure Kubernetes Ingress HTTPS on port 443 on the head node with SSL passthrough
* Allow the option to set up Kubernetes version 1.31 with cm-kubernetes-setup. Kubernetes versions 1.27 and older are no longer available options for performing new Kubernetes setups
* The use of kube-rbac-proxy is now deprecated in the Jupyter operator and in the permissions manager in favor of using the internal kubebuilder mechanism
* Added support for the NIM operator in cm-kubernetes-setup
* Updated Kubernetes OVN CNI to 1.1.13
* Updated local path provisioner to version 0.0.29
* Improved retry mechanism when Kubernetes certificate signing requests time out

Fixed Issues
------------

* An issue with using cm-kubernetes-setup --pull command line option on Ubuntu 24.04
* An issue with handling older versions of the Kubernetes permission manager where not all API endpoints exist

cm-lite-daemon
==============

Improvements
------------

* Added MemoryUtilization metric for devices running cm-lite-daemon
* Added ARP table information to the switch overview
* Added reported network speed metric

cm-scale
========

Improvements
------------

* Allow the option to reboot compute nodes when the software image changes instead of performing a power off and on cycle

cmsh
====

Improvements
------------

* Fixed the calculation of IPs when cloning devices with cmsh in a "layer3" network setup
* Allow the option to override the default timeout for monitoring scripts when running samplenow with --max-run-time option
* Fixed the display of monitoring data in cmsh when using monitoringdump --uncompress
* cmsh will now show the Azure availability zone also in the cases when the zone has been auto-selected by Azure

Fixed Issues
------------

* An issue where, when using --next-ip to clone a device with multiple network interfaces on the same network, the resulting IPs of the cloned device may be identical
* An issue with using regular expressions with the foreach command in the interfaces submode for devices
* An issue where the network IP is incorrectly updated when an interface is configured with startif = active and the device is cloned with cmsh
* An issue where the cmsh monitoringbackuprings command does not take into account that a backup role may be disabled when showing the information
* An issue with alignment of the power results table when some hostnames are too long
* A timing issue in cmsh where a device power operation may not be executed if it is initiated shortly (within ~2s) after the device is committed

pythoncm
========

Improvements
------------

* Added send_warning_event pythoncm cluster method

slurm
=====

Improvements
------------

* Updated Slurm 24.05.4 Sharp Plugin to 1.0.1

topograph
=========

Improvements
------------

* The cluster-topology-generator is now renamed to topograph