Base Command Manager / Bright Cluster Manager Release Notes
############################################################
Release notes for NVIDIA Base Command Manager (BCM) 10.24.09
############################################################
*Released: 7 October 2024*
General
=======
New Features
------------
* Added the compose and buildx CLI plugins to cm-docker
* Added CuDNN 8.9 for CUDA 12.4
* Added the cm-ngc-cli package for RHEL9 and Ubuntu22/24
* Added support for Ubuntu 24.04
* Added CuDNN 9.3
* Added cm-iperf.py wrapper around iperf3 to make testing multiple nodes easier
* Added mlnx-ofed24.04 packages
* Added CUDA 12.5 packages
Improvements
------------
* Updated cm-nvidia-container-toolkit to v1.16.2 (CVE-2024-0132 and CVE-2024-0133)
* Updated cm-docker to v26.1.5
* Allow DPU nodes to be defined without an OOB interface
* Updated NCCL2 for CUDA 12.5
* Added pmix3/pmix4 plugin to Slurm 23.02, 23.11 and 24.05
* Updated mlnx-ofed58 to 5.8-5.1.1.2
* Updated mlnx-ofed23.10 to 23.10-3.2.2.0
* Upgraded cm-nsight-systems to 2024.5.1
* Updated cm-iperf to 3.17.1
* Public IP address resources for Azure cluster extension are now created with the "Standard" SKU
Fixed Issues
------------
* An issue with Slurm jobs chargeback when a job requests MIGs
* An issue with cmdaemonctl on the compute nodes using cm-cmd-ports, which is only installed on the head nodes
* An issue that allowed a user with the readonly profile to restart services
* An issue with cm-nfs-checker not writing its logs
CMDaemon
========
New Features
------------
* Allow the prometheus service and exporter to be authenticated with a bearer token
* Added flag to enable creation of /tftpboot/pxelinux.cfg/ symlinks
* Added option to log events to a hook script
* Allow the configuration of the DNS resolver on Ubuntu in stub or uplink mode via the extra_values flag resolv=stub or resolv=uplink in the nodes or categories
* Added /rest/v1/status/wait REST endpoint
* Fetch job history from the workload managers even if cmdaemon was down when the job was running
* Added support for specifying SSH-authorized keys for users via notes
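
As an illustration of the bearer-token item above, the following minimal sketch builds an HTTP request that presents a token in the standard ``Authorization: Bearer`` header. The host name, port, and token value are hypothetical placeholders and are not taken from these release notes; how the token is configured in CMDaemon is not shown here.

```python
import urllib.request

# Hypothetical values for illustration only; neither the URL nor the token
# comes from the release notes.
EXPORTER_URL = "https://headnode.example.com:8081/exporter"
BEARER_TOKEN = "replace-with-your-token"


def build_scrape_request(url: str, token: str) -> urllib.request.Request:
    """Return a GET request carrying a bearer token in the Authorization header."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


request = build_scrape_request(EXPORTER_URL, BEARER_TOKEN)
# Sending is left to the caller, for example:
#   urllib.request.urlopen(request, timeout=10)
```

The request is only constructed, not sent, so the sketch can be adapted to whatever endpoint and TLS settings a given cluster actually uses.
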
Improvements
------------
* Optimized WLM job operations by executing them in batches
* Reduced load on the head node caused by many nodes changing states quickly
* Reduced load caused by Slurm config writer calling MIG status too often
* Changed audit log to write in the machine local timezone instead of GMT
* Use home_mode instead of umask from /etc/login.defs when setting home directory permissions for a new user
* Display new certificate request event with info, not notice, when autosign takes care of it
* Sped up verification of profile/token access to RPC
* Allow DataTranslator::delay to throw away data that fails to resolve for 60s, to prevent it from staying in the cache forever when metrics are deleted in the meantime
* Added advanced config option DisableLdap=1 to disable all LDAP integration
* Added advanced config option SoftwareImageDisableZFS=1 to disable ZFS
* Added support for BMC event logs using Redfish
* Optimized Trigger::Actuator evaluation of triggers for all data
* Sped up the entity, measurable, and parameter comparison in triggers
* Reduced number of log messages when /cm/shared is not mounted
* Increased the BFB push timeout from 15m to 30m
* Allow the option to manage separated DPUs without a MAC
* Configure an SSH ProxyJump via the DPU host when a DPU is running in embedded mode
* Improved BF3 support in cm-dpu-manage
* Added "Disable PXE" extra_value flag to prevent CMDaemon from writing the dhcpd.conf
* Added rshim to the services managed by CMDaemon
* Enabled the "Compute RDMA GPU Monitoring" agent plugin for OCI
* Allow Ubuntu-based distributions to configure a layer3 network setup with a /31 connection between the node and the switch
* Spread out the schedulers healthcheck to reduce the flood of parallel WLM calls
* Optimized memory usage on large numbers of jobs when job tracing is disabled
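
The ``home_mode`` item above can be pictured with a small sketch of the precedence documented in ``login.defs(5)``: ``HOME_MODE`` is used when it is set, otherwise the mode is derived from ``UMASK`` (default ``022``). The function name ``home_dir_mode`` is hypothetical, and this is an assumption about the general rule, not CMDaemon's actual implementation.

```python
import re


def home_dir_mode(login_defs_text: str) -> int:
    """Return the permission bits for a new home directory.

    HOME_MODE wins when present; otherwise fall back to 0o777 masked by
    UMASK (default 022), following login.defs(5).
    """
    settings = {}
    for line in login_defs_text.splitlines():
        # Drop comments and surrounding whitespace before parsing.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^(\w+)\s+(\S+)$", line)
        if match:
            settings[match.group(1)] = match.group(2)
    if "HOME_MODE" in settings:
        return int(settings["HOME_MODE"], 8)
    umask = int(settings.get("UMASK", "022"), 8)
    return 0o777 & ~umask
```

For example, a file containing both ``UMASK 022`` and ``HOME_MODE 0700`` yields ``0o700`` in this sketch, whereas the umask alone would have given ``0o755``.
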
Fixed Issues
------------
* An issue preventing diskless nodes from running rsync with --xattrs and --acls options
* An issue with CMD_SERVER_IP pointing to the node IP in /tftpboot/
* An issue where incomplete HTTP headers can result in cmdaemon threads reading SSL and consuming memory
* An issue with cmdaemon prometheus scraper not accepting metrics without labels
* An issue with adding AllocNodes to the Slurm partition parameters
* An issue with the prometheus /exporter endpoint not returning data grouped by metric
* An issue with Trigger::Actuator being too slow to process all incoming samples
* An issue with shared_mutex lock that can cause a crash or increasing memory consumption if two threads change the same data
* An issue in gzip inflate that can cause an RPC to loop while consuming CPU resources
* An issue with service cmd stop taking too long on compute nodes when the active head node is down
* An issue in cm-lite-daemon that prevented the calculation of the derivative for cumulative metrics in samplenow
* An issue preventing ZTP from getting a new certificate when cluster.pem has been updated
* An issue with an unequal split of the DHCPD range when it is shared between two head nodes
* An issue with removing stopped service information from the database
* An issue when adding AccountingStorageTRES to slurm.conf
* An issue with cmdaemonctl logconf
* An issue with slurm_states sampler returning exit code 1 when presented with a node state that it doesn't handle
* An issue with the reporting of power status for non-instantiated nodes in Azure
* An issue with Slurm configuration update when bcm slurmautodetect is set
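
For readers unfamiliar with the cumulative-metric item in the list above, the sketch below shows the general idea of deriving a rate from a monotonically increasing counter: the difference between consecutive values divided by the elapsed time. The function name and sample layout are hypothetical, and counter resets are deliberately not handled; this is not cm-lite-daemon's code.

```python
def rate_from_cumulative(samples):
    """Turn (timestamp, cumulative_value) pairs into per-second rates.

    Each rate is reported at the timestamp of the later sample of a pair.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (v1 - v0) / elapsed))
    return rates
```

With samples taken every 10 seconds, a counter that moves from 100 to 150 produces a rate of 5 units per second for that interval.
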
Node Installer
==============
Fixed Issues
------------
* An issue with node-installer not always setting the selected node MAC when rebooting
cmsh
====
New Features
------------
* Added cmsh redfishsubscriptions command
* Added mlxconfig reset to the DPU commands
Improvements
------------
* Improved help for the GPU profiling command
* Improved tab completion to support metrics with a space in their name
Fixed Issues
------------
* An issue when setting the value of the switchport property from devices to a single value after the property had multiple values
* An issue entering selinux mode
* An issue with cmsh sometimes hanging when running with -c
* An issue with event acknowledge
* An issue when automatically setting IP in the clone command
pythoncm
========
New Features
------------
* Added pythoncm.Network.devices
* Added RPC timeouts to pythoncm
* Added an example of overloading the event handler in pythoncm
Base View
=========
Fixed Issues
------------
* An issue with run command not showing the full output
Base View NG
============
Fixed Issues
------------
* An issue with package updates not showing in Ubuntu
Cluster Tools
=============
Improvements
------------
* Ensure cm-container-registry-setup writes the registry certificates to software images without assigned nodes
* Update shorewall routes with Kube Service Network for air-gapped Kubernetes deployments
* Improved NetQ version vs. Kubernetes version compatibility checks in cm-kubernetes-setup
* Improved checks for the prerequisite SSH configuration required for NetQ in cm-kubernetes-setup
* Do not allow cm-chroot-sw-img to run on the passive head node or provisioning nodes unless forced
Fixed Issues
------------
* An issue with the retry of NetQ installation where a bootstrap reset purge-db was not done before retrying
* An issue with an infinite loop in request-remote-assistance with nohup
* An issue in request-license performing an incorrect hostname comparison and then unnecessarily trying to ssh to itself
* An issue with cmha-setup that prevented clusters with BCME licenses from setting up HA
* An issue when modifying the secondary head node entity during cm-cloud-ha-setup
COD
===
New Features
------------
* Added support for setting up accelerated networking for head nodes when creating clusters with cm-cod-azure
Improvements
------------
* Set the default region for cm-cod-oci to us-sanjose-1
* Allow enabling/disabling creation of the NAT gateway and shared public IP in cm-cloud-ha-setup via the GUI
* Store the cluster password hashed (instead of plain text) on the head node
* Improved robustness of cluster deletion in cm-cod-oci
Fixed Issues
------------
* An issue when recreating clusters created with managed identity
pyxis-sources
=============
Improvements
------------
* Updated pyxis to 0.20.0
slurm
=====
New Features
------------
* Added integration with the Topology Generation Service
* Set the ENROOT_MOUNT_HOME configuration option to "no" by default for new setups
slurm23.02
==========
Improvements
------------
* Updated Slurm 23.02 to 23.02.8
slurm23.11
==========
Improvements
------------
* Updated Slurm 23.11 to 23.11.10
slurm24.05
==========
New Features
------------
* Added extra Slurm packages with support for NVIDIA SHARP
Improvements
------------
* Updated Slurm 24.05 to 24.05.3