Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.2-15

== General ==

- New Features

* Added support for SLES15 SP5
* Added RHEL 9.2 and Rocky 9.2 installation ISOs
* Added RHEL 8.8 and Rocky 8.8 installation ISOs
* Added mlnx-ofed23.10 package
* Added mlnx-ofed23.07 package

- Improvements

* Updated cm-openssl to 3.0.12
* Updated cuda-driver-legacy-470 to 470.223.02
* Updated cuda-driver package to 535.129.03
* Updated cuda-driver to 535.104.12
* Updated cm-libprometheus to 0.47.0

== CMDaemon ==

- New Features

* Update the Slurm state for AWS spot instances to DOWN/FUTURE when they are terminated outside of CMDaemon. This resolves an issue with slurmctld reporting it is unable to resolve the host names of the terminated nodes
* Clean up the list of nodes stored in the CMDaemon database for re-queued Slurm jobs
* Send warning event when provisioning request is stalled (for over 2h)

- Improvements

* Prevent already outdated monitoring data with timestamps in the past from being saved in the CMDaemon database
* An issue where the rogueprocess health check kill processes action may not take into account the whitelisted users
* Added rules, alert, and alertmanagers Prometheus end points
* Improved selection for the internal IP address used by etcd server on the internal network in case of multiple internal networks
* An issue where cm-cmd-ports --get does not return the requested configuration settings when it cannot connect to the active CMDaemon
* An issue where cm-burn is unable to complete when the pre or post stages are not defined
* Disable the CMDaemon monitoring engine if it detects missing or truncated monitoring database files to prevent possible CMDaemon crash
* An issue where the kubelet service may not be able to start on compute nodes when assigning the Kubernetes roles due to an exclude list preventing the kubelet.service file from being synced to the nodes
* Store the availability zones for networks created by COD or manually, which can enable cm-scale to distribute the loads between AZs in COD deployments
* Allow the option to use storcli software with the CMDaemon megaraid healthcheck

- Fixed Issues

* An issue with sorting on timestamps in a monitoring plot consisting of raw and consolidated data, which in some cases can result in CMDaemon returning no monitoring data for certain requests
* In some cases, a timing issue that may prevent the pbsmom service from starting in an on-prem+edge workload manager setup
* An issue where the service account CA Kubernetes certificate may be removed if the Kubernetes master role is assigned to a compute node and then unassigned from the head node
* An issue with the Prometheus exporter for monitoring data when the data contains measurable names with spaces
* An issue where a linear interpolation is used for health check data instead of the last known value, which can affect the visual representation of the data in Bright View
* An issue with moving the software image revisions directories when updating the path of the parent software image
* Allow the option to change a user's home directory to a directory path that already exists
* An issue where CMDaemon may crash in ArchOSInfo::is_arch_os when the cm-config-os-arch package is not installed on the head node
* An issue where new devices' metrics may not be saved to the monitoring DB if CMDaemon is restarted right after the new devices are created
* An issue in the prejob prolog script which can prevent WLM jobs from starting when prejob health checks are enabled in CMDaemon
* An issue where if cloud instances are terminated while the director is down, they might be listed with an UP+terminated state
* An issue where the "Reboot required: Interfaces have been modified" event may be generated for nodes when they have a VLAN interface on top of a bridge interface which includes a bond interface
* Fixed issue with cm-cloud-storage-setup when using us-east-1 region
* In some cases, CMDaemon may fail to trigger a provisioning request for a modified file when two image names start with the same substring
* An issue where the monitoring data for old jobs may not be removed when trimming the cache, causing CMDaemon to crash
* An issue with configuring a default gateway for edge nodes running Ubuntu base distribution
* An issue where Slurm GRES configuration settings may also be generated when addtogresconf is set to no in CMDaemon
* In some cases, duplicate nameserver entries may be written in the /etc/resolv.conf file
* Allow appending or skipping the Slurm drain reason when a healthcheck fails with the drain action enabled
* In some cases, imageupdate --pattern may sync files not matching the pattern
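
The health check interpolation fix listed above can be illustrated with a small sketch. This is not CMDaemon code: the function names and the sample data are hypothetical, and the encoding of PASS as 0 and FAIL as 1 is an assumption made only for this example. It shows why linear interpolation gives misleading intermediate values for state-like data, while holding the last known value does not.

```python
from bisect import bisect_right


def linear_value(samples, t):
    """Linearly interpolate between the two samples surrounding time t."""
    times = [s[0] for s in samples]
    i = bisect_right(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t."""
    times = [s[0] for s in samples]
    i = bisect_right(times, t)
    if i == 0:
        return samples[0][1]
    return samples[i - 1][1]


# Hypothetical health check samples: (timestamp in seconds, 0 = PASS, 1 = FAIL)
samples = [(0, 0), (120, 1)]

linear_mid = linear_value(samples, 60)    # halfway between PASS and FAIL
held_mid = last_known_value(samples, 60)  # still PASS until the next sample
```

With linear interpolation, a plot would show the check gradually drifting from PASS to FAIL between the two samples. With the last known value, the plot stays at PASS until the FAIL sample arrives.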

== Bright View ==

- New Features

* Notify about the availability of BCM package updates from within Base View

== COD ==

- Improvements

* Allow the option to configure cluster-on-demand clusters spanning multiple regions

== Head Node Installer ==

- Improvements

* By default, the head node installer now includes Slurm 23.02

== cm-diagnose ==

- Improvements

* Sanitize all mysqldumps in cm-diagnose

== cm-harbor ==

- Fixed Issues

* In some cases, a race condition where Harbor from the cm-harbor package and Shorewall are concurrently updating the iptables rules, which can prevent enabling the required iptables rules

== cm-kubernetes-setup ==

- Improvements

* Deploy the NVIDIA GPU Operator with toolkit.enabled=false by default

- Fixed Issues

* An issue where the wizard may use the commonName instead of the user name when adding users to Kubernetes

== cm-scale ==

- Fixed Issues

* Auto Scaler now takes Slurm mincpus parameter into account
* An issue where cm-scale may not start terminated cloud nodes when the nodes have been terminated while the node-installer was still running
* In some cases, an issue with terminating cloud nodes when multiple clone operations are also being executed in parallel
* An issue where cm-scale may fail to start spot instances after a no-capacity event occurs in an AWS availability zone

== cm-setup ==

- Fixed Issues

* A regression in cm-container-registry-setup for Harbor on HA head node setup, which can result in "no such file or directory" error messages for files not present on the passive head node

== cm-wlm-setup ==

- Fixed Issues

* Crash in cm-wlm-setup when disabling Slurm setup where the primary server option has been set in CMDaemon
* An issue where enroot is not configured by default on a head node when the pyxis Slurm plugin is enabled

== cmsh ==

- Improvements

* Allow the option to set the IP address when adding new lite nodes with cmsh

- Fixed Issues

* An issue where cmsh may not include the K, M, or G suffixes when printing consolidated averages for data without units
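
As a rough illustration of the suffix handling described above, the following sketch formats unitless values with K, M, or G suffixes. The helper name `with_suffix`, the two-decimal output format, and the use of powers of 1000 are all assumptions for this example; the actual cmsh formatting rules may differ.

```python
def with_suffix(value, base=1000.0):
    """Format a number with a K, M, or G suffix (hypothetical helper)."""
    # Try the largest factor first so 2.5e6 becomes "2.50M", not "2500.00K".
    for suffix, factor in (("G", base ** 3), ("M", base ** 2), ("K", base)):
        if abs(value) >= factor:
            return f"{value / factor:.2f}{suffix}"
    # Values below the smallest factor are printed without a suffix.
    return f"{value:.2f}"
```

For example, `with_suffix(1500)` yields `1.50K`, whereas dropping the suffix would print `1.50` and understate the value by a factor of a thousand.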

== jupyter ==

- Improvements

* Update the JupyterLab and JupyterHub dependencies to the most recent versions

== pythoncm ==

- Fixed Issues

* An issue where in some cases the pythoncm implementation may fail to expand a range of integer numbers and then cause cm-scale not to start the required number of nodes for Slurm job arrays
* An issue in the collapse Bracket code, which in some cases can produce an error "Solver::find, cleared zero bits" when handling a selection of hostnames
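
To show what expanding a range of integers means in the first item above, here is a minimal sketch. It is not the pythoncm implementation: the function name `expand_int_ranges` and the accepted syntax (comma-separated integers and inclusive `start-end` ranges) are assumptions for this example. If such an expansion fails or returns too few values, a consumer that counts the result, such as a scaler sizing nodes for a job array, would request too few nodes.

```python
def expand_int_ranges(spec):
    """Expand a spec such as "1-3,7" into [1, 2, 3, 7] (hypothetical helper)."""
    values = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range: "1-3" covers 1, 2, and 3.
            start, end = (int(x) for x in part.split("-", 1))
            values.extend(range(start, end + 1))
        else:
            values.append(int(part))
    return values
```

For example, `expand_int_ranges("0-9")` returns ten task indices, so a job array of that form would need capacity for ten tasks.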

== Slurm ==

- Improvements

* Updated Slurm 22.05 to 22.05.10 and 23.02 to 23.02.6 (CVE-2023-41914)