Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.0-21

== General ==

- Improvements

* Added CUDA 12.1 packages
* Added CUDA 12.2 packages
* Added CUDA 12.3 packages
* Added cuda-driver-legacy-470 package: 470 version of NVIDIA driver to support older datacenter/Tesla GPUs
* Added mlnx-ofed23.04
* Added mlnx-ofed23.07
* Added mlnx-ofed23.10
* The mlnx-ofed packages' installation scripts will now pin down the kernel packages for Ubuntu when deploying MOFED
* Updated cm-nvhpc to 23.11
* Updated cm-openssl to 1.1.1u
* Updated cuda-driver-legacy-470 to 470.223.02
* Updated cuda-driver to 545.23.08
* Updated mlnx-ofed49 to 4.9-7.1.0.0
* Updated mlnx-ofed54 to 5.4-3.7.5.0
* Updated mlnx-ofed58 to 5.8-4.1.5.0
* Updated openssl to 1.1.1w

- Fixed Issues

* Changed the architecture of the Lmod package from independent (noarch/all) to architecture dependent, which resolves the "module 'bit32' not found" issue on Ubuntu
* An issue where the 90-cm-sysctl.conf file is not marked as a configuration file on Ubuntu-based distributions

== CMDaemon ==

- Improvements

* Added rules, alert, and alertmanagers Prometheus endpoints
* Disable the CMDaemon monitoring engine if it detects missing or truncated monitoring database files to prevent a possible CMDaemon crash
* The cpuspeedGovernor node and category property is no longer supported and has been removed from the CMDaemon configuration
* Ensure malformed strings in the GPU information do not corrupt the JSON serialization in CMDaemon
* Allow the option to use storcli software with the CMDaemon megaraid healthcheck
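
As an illustration of the JSON serialization item above, the following is a minimal sketch of how malformed strings can be cleaned before they are serialized. It is not CMDaemon code: the function name clean_text and the sample GPU record are hypothetical, and it assumes that the malformed data consists of invalid UTF-8 bytes and control characters.

```python
import json


def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 and drop non-printable characters.

    Undecodable bytes become U+FFFD (the Unicode replacement character),
    so the result is always a valid string for json.dumps.
    """
    text = raw.decode("utf-8", errors="replace")
    return "".join(ch for ch in text if ch.isprintable())


# Hypothetical GPU record whose name contains an invalid byte and a NUL byte.
gpu_info = {"index": 0, "name": clean_text(b"NVIDIA A100\xff\x00")}

# Serialization succeeds and the output parses back to the same record.
serialized = json.dumps(gpu_info)
```

Replacing invalid bytes instead of raising an error keeps the rest of the record usable, at the cost of losing the original bytes.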

- Fixed Issues

* An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
* An issue where, in some cases, CMDaemon may fail to trigger a provisioning request for a modified file when the names of two images begin with the same substring
* An issue with printing informational messages in the mounts health check implementation
* An issue where a restart of CMDaemon on the head node can cause CMDaemon on the compute nodes to perform a generally harmless restart of the Slurmd service even when there are no configuration changes
* An issue where the cumulative flag passed to CMDaemon by a JSON monitoring sampler script is not interpreted during initialization
* An issue where the interfaces health check can report failure on compute nodes with a ConnectX IB card in UEFI mode as the BOOTIF interface
* An issue where, in some cases, cm-diagnose may not collect the required information from the primary/passive head node when the secondary head node is the active head node
* An issue with consecutive executions of "open --failbeforedown" to open devices with cmsh when the value of the failbeforedown counter is not changed
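
To illustrate why measurable names with spaces matter for the Prometheus exporter item above: Prometheus metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*, so a name containing a space is not a valid metric name. The sketch below shows one possible way to map an arbitrary name to a valid one. It assumes that measurable names are exported as metric names, and the function to_metric_name is hypothetical; the actual mapping used by CMDaemon is not described in these notes.

```python
import re

# Any character that is not allowed in a Prometheus metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def to_metric_name(measurable: str) -> str:
    """Map a measurable name to a string that is a valid metric name."""
    # Replace every disallowed character, including spaces, with "_".
    name = _INVALID_CHARS.sub("_", measurable.strip())
    # A metric name must not be empty and must not start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


print(to_metric_name("Free Space"))
print(to_metric_name("1min load"))
```

Note that such a mapping is not reversible: two different measurable names, for example "Free Space" and "Free_Space", map to the same metric name.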

== Node Installer ==

- Improvements

* Prevent the node-installer from halting with the message "Unable to determine accelerators" due to temporary issues with listing the devices with lspci

- Fixed Issues

* An issue with provisioning compute nodes with separate /usr filesystem
* An issue that prevented cloning head nodes with btrfs filesystems
* An issue with the node-installer disk scripts being unable to assemble MD RAID arrays
* An issue with the bootif_detect and getclientid scripts on compute nodes that are PXE booting from ConnectX-3 cards and using the GRUB bootloader
* An issue where the RDMA settings are not added to the corresponding entries in the /etc/fstab file when using NFS over RDMA
* An issue where, in some cases, the bootif_detect script is unable to detect the correct InfiniBand (IB) device when there are multiple IB interfaces

== Head Node Installer ==

- Improvements

* Slurm 23.02 is now installed by default for new cluster installations

== Machine Learning ==

- New Features

* Introduced ML packages cm-cudnn8.9-cuda12.1 and cm-cudnn8.9-cuda12.0
* Introduced ML package cm-cudnn8.5-cuda11.8

== cm-cluster-extension ==

- Fixed Issues

* Fixed various issues related to Azure caused by changes in the Azure API

== cm-diagnose ==

- Improvements

* Sanitize all mysqldumps in cm-diagnose

== cm-jupyter-setup ==

- Fixed Issues

* An issue that prevented cm-jupyter-setup from running in multi-distro environments

== cm-wlm-setup ==

- Fixed Issues

* An issue where Ubuntu-based cm-wlm-setup is unable to complete the setup of Slurm if the Slurm packages had previously been removed using the purge option of the package manager

== jupyter ==

- Improvements

* Update the JupyterLab and JupyterHub dependencies to the most recent versions

== openpbs23.06 ==

- Improvements

* Added OpenPBS 23.06 packages

== pythoncm ==

- Fixed Issues

* An issue in the collapse Bracket code that in some cases can produce the error "Solver::find, cleared zero bits" when handling a selection of hostnames
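
For readers unfamiliar with bracket notation for hostnames, the following sketch shows the general idea of collapsing a selection of hostnames into ranges such as node[001-003,005]. It is an independent illustration and not the pythoncm implementation: the function collapse_hostnames is hypothetical, and it only handles a single trailing number per hostname.

```python
import re
from collections import defaultdict


def collapse_hostnames(hostnames):
    """Collapse names such as node001..node003 into node[001-003]."""
    groups = defaultdict(set)  # (prefix, digit width) -> set of numbers
    plain = set()              # names without a trailing number
    for name in hostnames:
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match:
            prefix, digits = match.groups()
            groups[(prefix, len(digits))].add(int(digits))
        else:
            plain.add(name)

    result = sorted(plain)
    for prefix, width in sorted(groups):
        ordered = sorted(groups[(prefix, width)])
        ranges = []
        start = prev = ordered[0]
        # The trailing None flushes the last open range.
        for number in ordered[1:] + [None]:
            if number is not None and number == prev + 1:
                prev = number
                continue
            if start == prev:
                ranges.append(f"{start:0{width}d}")
            else:
                ranges.append(f"{start:0{width}d}-{prev:0{width}d}")
            start = prev = number
        if len(ranges) == 1 and "-" not in ranges[0]:
            # A single host needs no brackets.
            result.append(f"{prefix}{ranges[0]}")
        else:
            result.append(f"{prefix}[{','.join(ranges)}]")
    return result


print(collapse_hostnames(["node001", "node002", "node003", "node005", "login"]))
```

Grouping by both prefix and digit width keeps names such as node1 and node01 apart, so zero padding is preserved in the collapsed form.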