Base Command Manager / Bright Cluster Manager Release Notes

#############################################################
Release notes for NVIDIA Base Command™ Manager (BCM) 10.24.05
#############################################################

*Released: 24 May 2024*

General
=======

New Features
------------

* Added mlnx-ofed24.01 package
* Added CUDA 12.4 toolkit packages
* Added PBS Professional 2024 packages
* cmdaemon-apidocs has been replaced with cm-api-docs; the documentation can be accessed via the landing page
* Added the cm-nsight-systems-cli package, which contains the latest CLI version of the nsight-systems package and replaces the current cm-nsight-systems package

Improvements
------------

* Updated mlnx-ofed23.10 to 23.10-2.1.3.1
* For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons

Fixed Issues
------------

* An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass
* An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver-* packages

CMDaemon
========

New Features
------------

* Mark the devices monitored by MQTT as UP when recent monitoring data exists
* Allow the option to run the Slurm accounting database daemon in high-availability mode

Improvements
------------

* Added a per-node Slurm state metric
* Allow the option to automatically select a random free port for the IMEX service
* Allow the option to define an exclude list in the network interfaces healthcheck to skip specified interfaces
* Added /var/lib/rancher/.* to the default exclude list for the ProcMounts monitoring producer, which otherwise can create unnecessary monitoring metrics
* Improved performance of the cm-mqtt service
* Allow the option to automatically set the MAC addresses of the bond and its members in the CMDaemon node interface entities when the node boots
* Include the perm-mac-address under a bond interface when verifying the license, which resolves an issue with verifying the license when a bond network interface is created after the license is requested
* Added lxc* interfaces to the default exclude list for the ProcNetDev monitoring producer
* Allow the option to disable MQTT with a flag in the configuration file
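
The exclude-list entries above (``/var/lib/rancher/.*`` for ProcMounts, ``lxc*`` for ProcNetDev) can be read as patterns that drop matching names before metrics are created. The following is only an illustrative sketch: the function name ``filter_excluded`` is hypothetical, and treating each entry as a regular expression that must match the whole name is an assumption, not documented CMDaemon behavior.

```python
import re


def filter_excluded(names, exclude_patterns):
    """Return the names that do not fully match any exclude pattern.

    Assumption: each pattern is a regular expression applied to the
    whole name. CMDaemon's actual matching rules may differ.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    return [
        name
        for name in names
        if not any(regex.fullmatch(name) for regex in compiled)
    ]


# Hypothetical mount points, used only to exercise the filter.
mounts = ["/", "/home", "/var/lib/rancher/k3s/agent/kubelet/pods/abc"]
kept = filter_excluded(mounts, [r"/var/lib/rancher/.*"])
print(kept)
```

With the sample data, only the mount point under ``/var/lib/rancher/`` is dropped, so no measurables would be created for it.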

Fixed Issues
------------

* An issue with validating the LDAP group during commit of a user
* An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed
* An issue where DNS allow-query configuration entries are not added on edge directors for Kubernetes networks, preventing queries from these networks
* An issue where monitoring consolidators may not be created for all entity-measurable pairs
* An issue where temporary resolv.conf bind mounts created in the software images are being added to the CMDaemon monitoring database
* Added a retry mechanism around gethostbyname when writing the IMEX configuration files, which can otherwise throw an exception
* An issue with chargeback calculations using per-node requested CPU/GPU information
* An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage
* An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon
* An issue which can lead to high CMDaemon memory usage if the post-provisioning monitoring-resume operation has failed
* An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover
* An issue in the RPC status code which in some cases can result in an infinite recursion on the passive head node in the event of a failover
* An issue with the node profile missing the UPDATE_CONFIG_FILES_AFTER_IMAGE_UPDATE_TOKEN which prevents the /cm/conf files from being copied from the software image to the provisioned nodes when using non-head-node provisioners
* An issue with the reporting of the GPU chargeback information
* An issue where a category can be removed while another category's provisioning role still has a reference to it
* An issue where Azure cloud compute nodes are cloned with an incorrect power status when the original node's power status is ON
* An issue where failed AWS node power-on actions produce the error message "Unable to parse output" instead of the AWS error message
* An issue with the cmsh dropunused command that can result in removing too many measurables
* An issue with the cmsh device syncinfo command when specifying an fspart path
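
The retry mechanism around gethostbyname mentioned in the list above can be illustrated with a short sketch. This is not CMDaemon's implementation: the helper name ``resolve_with_retry``, the attempt count, and the delay are assumptions chosen for illustration, and the ``resolver`` parameter exists only so the behavior can be exercised without real DNS.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=3, delay=0.5,
                       resolver=socket.gethostbyname):
    """Resolve a hostname, retrying when the lookup raises an error.

    Hypothetical helper: the attempt count and delay are illustrative
    defaults, not values taken from CMDaemon.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError as error:
            # socket.gaierror and socket.herror are subclasses of OSError.
            last_error = error
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last lookup error to the caller.
    raise last_error
```

A caller writing a configuration file would use ``resolve_with_retry("node001")`` in place of a bare ``socket.gethostbyname("node001")``, so that a transient lookup failure no longer aborts the write on the first exception.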

Base View
=========

Fixed Issues
------------

* An issue with displaying the SNMP system information data for switches

Cluster Tools
=============

Fixed Issues
------------

* An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package

COD
===

New Features
------------

* Added support for creating HA COD clusters in Azure
* Allow the option to skip shared storage setup with the cm-cloud-ha-setup tool
* Added support for OCI defined tags. This renames the original --head-node-tags command line option to --head-node-freeform-tags and adds a new command line option, --head-node-defined-tags

Improvements
------------

* Allow the option to select the Azure availability zone on the command line of the cluster create command

Machine Learning
================

New Features
------------

* Introduced ML NCCL and cuDNN packages for CUDA 12.4

Fixed Issues
------------

* An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start

cm-clone-install
================

Fixed Issues
------------

* An issue with handling the configuration of bond interfaces and bond members on the Ubuntu base distribution

cm-cluster-extension
====================

Fixed Issues
------------

* An issue where 'germany' is incorrectly listed as an Azure region

cm-create-image
===============

Fixed Issues
------------

* An issue where missing modular metadata for the 'default' package group on the RHEL8 and RHEL9 ISOs can prevent the creation of software images

cm-kubernetes-setup
===================

Improvements
------------

* Ensure the /var/lib/etcd directory has the correct permissions (0700) so that an etcd member is able to join the etcd cluster
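
The permission requirement above can be checked or enforced with a few lines of Python. This is a minimal sketch of the idea, not the code used by cm-kubernetes-setup: the function name ``ensure_mode`` is hypothetical, and the example runs against a temporary directory rather than the real /var/lib/etcd.

```python
import os
import stat
import tempfile


def ensure_mode(path, mode=0o700):
    """Set the permission bits of path to mode if they differ.

    Returns the permission bits found after the check.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != mode:
        os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration on a throwaway directory standing in for /var/lib/etcd.
with tempfile.TemporaryDirectory() as data_dir:
    os.chmod(data_dir, 0o755)  # simulate permissions that are too open
    result = ensure_mode(data_dir)
    print(oct(result))
```

On a real node, the same call would be made as root with the path ``/var/lib/etcd``.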

Fixed Issues
------------

* A regression in cm-kubernetes-setup that allows the user to select nodes in a way that results in the overlap of compute nodes between different Kubernetes clusters
* An issue where cm-kubernetes-setup --pull is unable to complete if the pod is evicted due to disk pressure while the images are being pulled

cm-scale
========

New Features
------------

* Allow the option to reboot the nodes in FULL install mode when the cm-scale engine is changed

Fixed Issues
------------

* An issue where the shutdown state from files may be used incorrectly

cm-wlm-setup
============

Fixed Issues
------------

* Setting up pyxis no longer configures it to clean the data directory from the epilog, since enroot can perform this automatically

cmsh
====

Improvements
------------

* Allow the option to specify the IP increment with the cmsh addinterface command
* Include the job run time data in the cmsh WLM jobs info command
* Allow the option to specify the network CIDR on the cmsh "add network" command line

Fixed Issues
------------

* An issue where the cmsh monitoring trigger info command does not show grouped expressions
* An issue with importing older formats of the .cmshhistory file, which can result in duplicating all entries in the cmsh command history

jupyter
=======

New Features
------------

* Allow the option to use sqsh files to run Jupyter kernels based on enroot
* Restrict the access to Jupyter based on group memberships

Improvements
------------

* Allow the option to install and configure VNC when setting up Jupyter

pythoncm
========

Fixed Issues
------------

* An issue in the pythoncm cluster.py implementation where an incorrect logger variable name is being used

pyxis-sources
=============

Improvements
------------

* Updated pyxis-sources to 0.19.0