Base Command Manager / Bright Cluster Manager Release Notes
#############################################################
Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.05
#############################################################
*Released: 24 May 2024*
General
=======
New Features
------------
* Added mlnx-ofed24.01 package
* Added CUDA 12.4 toolkit packages
* Added PBS Professional 2024 packages
* cmdaemon-apidocs has been replaced with cm-api-docs; the documentation can be accessed via the landing page
* Added cm-nsight-systems-cli package containing the latest CLI version of the nsight-systems package, replacing the current cm-nsight-systems package
Improvements
------------
* Updated mlnx-ofed23.10 to 23.10-2.1.3.1
* For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons
Fixed Issues
------------
* An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass
* An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver-* packages
CMDaemon
========
New Features
------------
* Mark the devices monitored by MQTT as UP when recent monitoring data exists
* Allow the option to run Slurm accounting database daemon in high-availability mode
Improvements
------------
* Added per-node Slurm state metric
* Allow the option to automatically select a random free port for the IMEX service
* Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces
* Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics
* Improved performance of the cm-mqtt service
* Allow the option to automatically set the MAC addresses of the bond and its members in the CMDaemon node interfaces entities when the node boots
* Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested
* Added lxc* interfaces to the default exclude list for the ProcNetDev monitoring producer
* Allow the option to disable MQTT with a flag in the configuration file
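
As an illustration of the "random free port" option listed above for the IMEX service, the following is a minimal sketch of one common way to obtain an unused port: bind a socket to port 0 and let the kernel choose. This is not CMDaemon's implementation; the function name ``pick_free_port`` and the default host are assumptions made for the example.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port that was unused at the time of the call.

    Hypothetical helper for illustration only: binding to port 0 asks
    the kernel to assign any currently free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(pick_free_port())
```

The port is released when the socket closes, so another process could claim it before the real service binds to it. A service that needs a guarantee would typically keep the socket open or retry on a bind failure.
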
Fixed Issues
------------
* An issue with validating the LDAP group during commit of a user
* An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed
* An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks
* An issue where monitoring consolidators may not be created for all entity-measurable pairs
* An issue where a temporary resolv.conf bind mount created in the software images is being added to the CMDaemon monitoring database
* Added a retry mechanism around gethostbyname when writing the IMEX configuration files, which can otherwise throw an exception
* An issue with chargeback calculations using per-node requested CPU/GPU information
* An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage
* An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon
* An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed
* An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover
* An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover
* An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners
* An issue with the reporting of the GPU chargeback information
* An issue where a category can be removed while another category's provisioning role still has a reference to it
* An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node's power status is ON
* An issue where failed AWS node power-on actions produce the error message "Unable to parse output" instead of the AWS error message
* An issue with the cmsh dropunused command that can result in removing too many measurables
* An issue with the cmsh device syncinfo command when specifying an fspart path
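
The retry mechanism around gethostbyname mentioned in the fixed issues above can be sketched as follows. This is only an illustration of the general technique, not the CMDaemon code; the function name ``resolve_with_retry``, its parameters, and the default attempt count and delay are assumptions.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=3, delay=1.0,
                       resolver=socket.gethostbyname):
    """Resolve a hostname, retrying on lookup failures.

    Hypothetical helper for illustration only. The resolver argument
    exists so that the lookup can be replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError as error:
            # socket.gaierror and socket.herror are subclasses of OSError.
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last lookup error to the caller.
    raise last_error
```

A caller would wrap the lookup used while writing a configuration file, for example ``resolve_with_retry("node001")``, where the node name is a placeholder.
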
Base View
=========
Fixed Issues
------------
* An issue with displaying the SNMP system information data for switches
Cluster Tools
=============
Fixed Issues
------------
* An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package
COD
===
New Features
------------
* Added support for creating HA COD clusters in Azure
* Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool
* Added support for OCI defined tags. This changes the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option, --head-node-defined-tags
Improvements
------------
* Allow the option to select the Azure availability zone on the command line of the cluster create command
Machine Learning
================
New Features
------------
* Introduced ML NCCL and CuDNN packages for CUDA 12.4
Fixed Issues
------------
* An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start
cm-clone-install
================
Fixed Issues
------------
* An issue with handling the configuration of bond interfaces and bond members on the Ubuntu base distribution
cm-cluster-extension
====================
Fixed Issues
------------
* An issue where 'germany' is incorrectly listed as an Azure region
cm-create-image
===============
Fixed Issues
------------
* An issue where missing modular metadata for the 'default' package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images
cm-kubernetes-setup
===================
Improvements
------------
* Ensure the /var/lib/etcd directory has the correct permissions (0700) so that an etcd member is able to join the etcd cluster
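
For administrators who want to verify the same condition by hand, the permission check described above can be sketched as below. This is a minimal example and not what cm-kubernetes-setup runs internally; the helper name ``ensure_private_dir`` is an assumption.

```python
import os
import stat


def ensure_private_dir(path, mode=0o700):
    """Create the directory if needed and force its permission bits.

    Hypothetical helper for illustration only. Returns the permission
    bits of the directory after the adjustment.
    """
    os.makedirs(path, exist_ok=True)
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != mode:
        # Only change the mode when it differs from the requested one.
        os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)
```

On an etcd node this would be called as ``ensure_private_dir("/var/lib/etcd")``, which requires sufficient privileges to change the directory mode.
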
Fixed Issues
------------
* A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters
* An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled
cm-scale
========
New Features
------------
* Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed
Fixed Issues
------------
* An issue where the shutdown state from files may be used incorrectly
cm-wlm-setup
============
Fixed Issues
------------
* Setting up pyxis will no longer configure it to clean the data directory from epilog since enroot can perform this automatically
cmsh
====
Improvements
------------
* Allow the option to specify the IP increment with the cmsh addinterface command
* Include the job run time data in the cmsh WLM jobs info command
* Allow the option to specify the network CIDR on the cmsh "add network" command line
Fixed Issues
------------
* An issue where the cmsh monitoring trigger info command does not show grouped expressions
* An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history
jupyter
=======
New Features
------------
* Allow the option to use sqsh files to run Jupyter kernels based on enroot
* Restrict the access to Jupyter based on group memberships
Improvements
------------
* Allow the option to install and configure VNC when setting up Jupyter
pythoncm
========
Fixed Issues
------------
* An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used
pyxis-sources
=============
Improvements
------------
* Updated pyxis-sources to 0.19.0