Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.1-18

== General ==

- Improvements

* Added cuda-driver-legacy-470 package: 470 version of NVIDIA driver to support older datacenter/Tesla GPUs.
* Added CUDA 12.2 packages
* Added CUDA 12.3 packages
* Added mlnx-ofed23.04 package
* Added mlnx-ofed23.07 package
* Added mlnx-ofed23.10 package
* The mlnx-ofed packages' installation scripts will now pin down the kernel packages for Ubuntu when deploying MOFED
* Updated cm-nvhpc to 23.11
* Updated CUDA 12.1 to 12.1 update 1
* Updated cuda-driver to 545.23.08
* Updated mlnx-ofed49 to 4.9-7.1.0.0
* Updated mlnx-ofed54 to 5.4-3.7.5.0
* Updated mlnx-ofed58 to 5.8-4.1.5.0
* Updated cm-openssl to 1.1.1w

== CMDaemon ==

- Improvements

* Redirect the output from cm-burn to tty1
* Added a periodic CMDaemon maintenance task to free the heap allocations that are not automatically freed, which reduces the memory usage
* Added an hourly cron job to clean up files left behind when the sample-ipmi script is killed
* Added "rules", "alert", and "alertmanagers" Prometheus endpoints
* Fixed an issue where cm-cmd-ports --get does not return the requested configuration settings when it cannot connect to the active CMDaemon
* Disable the CMDaemon monitoring engine if it detects missing or truncated monitoring database files, to prevent a possible CMDaemon crash
* The cpuspeedGovernor node and category property is no longer supported and has been removed from the CMDaemon configuration
* Allow the option to use storcli software with the CMDaemon megaraid healthcheck

- Fixed Issues

* An issue with the passive head node forwarding labeled entity information to the active, preventing it from being used in PromQL queries
* An issue with parsing the requested memory information of Slurm jobs
* An issue where CMDaemon may not restart the Slurm services automatically when the number of CPUs configuration settings for some nodes change
* An issue with sorting by timestamp in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests
* An issue with collecting GPU job metrics for containerized Pyxis jobs
* An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
* An issue where a linear interpolation is used for health check data instead of the last known value, which can affect the representation of the data in Bright View
* An issue where CMDaemon may crash in ArchOSInfo::is_arch_os when the cm-config-os-arch package is not installed on the head node
* An issue where new devices' metrics may not be saved to the monitoring DB if CMDaemon is restarted right after the new devices are created
* An issue where, in some cases, CMDaemon may fail to trigger a provisioning request for a modified file when two image names start with the same substring
* An issue where the monitoring data for old jobs may not be removed when trimming the cache, causing CMDaemon to crash
* An issue with consecutive executions of "open --failbeforedown" to open devices with cmsh when the value of the failbeforedown counter is not changed
* An issue with printing informational messages in the mounts health check implementation
* An issue where an imageupdate of the compute nodes can remove the Kubernetes controller-manager configuration and service files, preventing the (re)start of the service
* An issue where the cumulative flag passed to CMDaemon by a JSON monitoring sampler script is not interpreted during initialization
* An issue where the CMDaemon remote mount check function does not take into account a custom port specified by the NFSCheckerPort advanced configuration option
* An issue with collecting UGE job information if UGE job accounting rotation is configured
* An issue where, in some cases, CMDaemon may crash in the provisioning status code after canceling a provisioning request
* An issue where imageupdate --pattern can sync files not matching the pattern
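
The health check interpolation item in the list above can be illustrated with a minimal sketch. It is not CMDaemon code: the function names, the sample layout, and the numeric encoding of health check states (0 = PASS, 1 = FAIL) are assumptions made for illustration only. It shows why a linear interpolation between two health check samples produces a value that corresponds to no real state, while holding the last known value does not.

```python
from bisect import bisect_right

# Hypothetical encoding of health check states, for illustration only.
PASS, FAIL = 0, 1


def linear_interpolate(samples, t):
    """Linearly interpolate between the two samples surrounding time t.

    samples is a list of (timestamp, value) tuples sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample has been taken yet at time t
    return samples[i - 1][1]


# A health check that passes at t=0 and fails at t=60.
samples = [(0, PASS), (60, FAIL)]

halfway_linear = linear_interpolate(samples, 30)
halfway_held = last_known_value(samples, 30)
print(halfway_linear, halfway_held)
```

Under these assumptions, the interpolated value halfway between the two samples is neither PASS nor FAIL, whereas the held value is still PASS until the failing sample arrives.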

== Node Installer ==

- Fixed Issues

* An issue with provisioning compute nodes with separate /usr filesystem
* An issue that prevented cloning head nodes with btrfs filesystems
* An issue with the node-installer disk scripts being unable to assemble MD raids
* An issue with the bootif_detect and getclientid scripts on compute nodes that PXE boot from ConnectX-3 cards and use the GRUB bootloader

== Head Node Installer ==

- Improvements

* Slurm 23.02 is now installed by default for new cluster installations

== Machine Learning ==

- New Features

* Introduced ML package cm-cudnn8.8-cuda*
* Introduced ML packages cm-cudnn8.9-cuda12.1 and cm-cudnn8.9-cuda12.0

== cm-cluster-extension ==

- Fixed Issues

* Fixed various issues related to Azure caused by changes in the Azure API

== cm-diagnose ==

- Improvements

* Include syslog in cm-diagnose
* Sanitize all mysqldumps in cm-diagnose

- Fixed Issues

* An issue where, in some cases, cm-diagnose may not collect the required information from the primary/passive head node when the secondary head node is the active head node

== cm-jupyter-setup ==

- Fixed Issues

* An issue that prevented cm-jupyter-setup from running in multi-distro environments

== cm-kubernetes-setup ==

- Fixed Issues

* A race condition in the Kubernetes certificate generation during setup, which in some cases can prevent the kubelet services from starting

== cm-scale ==

- Fixed Issues

* Auto Scaler now takes the Slurm mincpus parameter into account
* An issue where, in some cases, terminating cloud nodes may fail when multiple clone operations are also executed in parallel
* An issue where config.py is replaced when the cm-scale package is updated

== cm-wlm-setup ==

- Fixed Issues

* An issue where Ubuntu-based cm-wlm-setup is unable to complete the setup of Slurm if the Slurm packages had previously been removed by using the purge package manager option

== cmsh ==

- Fixed Issues

* An issue where cmsh may not include the K, M, or G suffixes when printing consolidated averages for data without units
* An issue where cmsh may crash when cloning an entity without specifying a name in the genericresources submode
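
The K, M, and G suffix behavior mentioned in the list above can be sketched as follows. This is not the cmsh implementation: the function name, the base of 1000, and the two-decimal output format are all assumptions chosen for illustration.

```python
def human_readable(value, base=1000.0):
    """Format a unitless number with a K, M, or G suffix.

    The base (1000) and the two-decimal precision are assumptions,
    not documented cmsh behavior.
    """
    suffixes = ["", "K", "M", "G"]
    magnitude = abs(value)
    index = 0
    # Scale down until the value fits, or the largest suffix is reached.
    while magnitude >= base and index < len(suffixes) - 1:
        magnitude /= base
        index += 1
    sign = "-" if value < 0 else ""
    return f"{sign}{magnitude:.2f}{suffixes[index]}"


print(human_readable(42))
print(human_readable(1500))
print(human_readable(2_500_000))
```

Without the suffix, a scaled average such as 1.50 is indistinguishable from a raw value of 1.50, which is why a missing suffix makes the printed data misleading.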

== jupyter ==

- Improvements

* Update the JupyterLab and JupyterHub dependencies to the most recent versions

== openpbs23.06 ==

- Improvements

* Added OpenPBS 23.06 packages

== pythoncm ==

- Fixed Issues

* An issue in the collapseBracket code, which in some cases can produce an error "Solver::find, cleared zero bits" when handling a selection of hostnames
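
For readers unfamiliar with bracket collapsing of hostnames, the general idea can be sketched as below. This is not the pythoncm collapseBracket code: the function name collapse_hostnames, the output format, and the handling of zero padding are assumptions made for illustration.

```python
import re
from itertools import groupby


def collapse_hostnames(hostnames):
    """Collapse hostnames with numeric suffixes into bracket notation.

    Hypothetical helper, for illustration only.
    """
    pattern = re.compile(r"^(.*?)(\d+)$")
    groups = {}  # (prefix, digit width) -> set of numeric suffixes
    plain = []   # hostnames without a numeric suffix
    for name in hostnames:
        match = pattern.match(name)
        if match is None:
            plain.append(name)
            continue
        prefix, digits = match.groups()
        groups.setdefault((prefix, len(digits)), set()).add(int(digits))

    result = sorted(set(plain))
    for (prefix, width), numbers in sorted(groups.items()):
        ordered = sorted(numbers)
        ranges = []
        # Within a run of consecutive numbers, value minus index is constant.
        for _, run in groupby(enumerate(ordered), key=lambda p: p[1] - p[0]):
            run = [n for _, n in run]
            first, last = run[0], run[-1]
            if first == last:
                ranges.append(f"{first:0{width}d}")
            else:
                ranges.append(f"{first:0{width}d}-{last:0{width}d}")
        if len(ranges) == 1 and "-" not in ranges[0]:
            # A single host needs no brackets.
            result.append(prefix + ranges[0])
        else:
            result.append(f"{prefix}[{','.join(ranges)}]")
    return ",".join(result)


print(collapse_hostnames(["node001", "node002", "node003", "node005"]))
```

Grouping by both prefix and digit width keeps differently padded names such as node1 and node001 apart, which is one simple way to avoid merging names that only look alike.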

== pyxis-sources ==

- New Features

* Updated pyxis sources package to 0.17.0

== slurm23.11 ==

- New Features

* Added Slurm 23.11 packages. The cm-setup and cmdaemon packages need to be updated to their most recent versions before installing Slurm 23.11