Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.2-4

== General ==

- New Features

* cm-docker has been upgraded to v20.10.17

- Improvements

* mlnx-ofed56: updated to version 5.6-2.0.9.0

- Fixed Issues

* An issue with building the ice 1.6.7 driver from intel-wired-ethernet-drivers on RHEL8 compute nodes
* mlnx-ofed56, mlnx-ofed55, mlnx-ofed54, mlnx-ofed49: added Mellanox OFED KMOD/KMP package build functionality for RPM based distributions

- Known Issues

* Upgrading the CM packages on SLES12 can result in a conflict between the cm-dhcp and the base distro dhcp packages. It is safe to answer “yes” to replace the conflicting dhcp files with files from cm-dhcp

== cmdaemon ==

- New Features

* REST endpoint for workload/jobs

- Improvements

* Improved logic when invalidating the nscd hosts cache on the compute nodes, to avoid cases where an outdated cache interferes with hostname lookups
* cmdaemon certificates are now generated with a start date of 1 calendar day before the issue date, instead of the Unix epoch
* Fixed an issue where cm-manipulate-advanced-config.py is missing a Python import, resulting in a crash when executed
* Ensure that the Kubernetes NetworkPolicy feature works when the kube-proxy masqueradeAll flag is disabled
* Fixed an issue where monitoring data for labeled entities is not preserved after the entity has been dropped and automatically re-added
* Added an endpoint prometheus/api/v1/status/buildinfo for the latest Grafana
* Fixed an issue where the physical CPU IDs and counts of lite nodes are not correctly set
* Add dmidecode information to the lite nodes system info
* Ignore /dev/loop devices in the lite nodes system info
* Allow for monitoring triggers to set post-drain actions
* Add REST support to dump monitoring data for jobs
* Optimize the internal cmdaemon WLM state checks, which reduces the cmdaemon load when the job information is being cleaned up
* Rewrite of the mysql health check, so that it does not require the mysql password to be included on the command line
* pythoncm now includes periodic checks during the provisioning wait, to ensure that tools such as cm-wlm-setup do not time out while the nodes are being provisioned
* Ensure the head node(s) do not fall back to running in a compute node mode when mariadb is not in a good state while cmdaemon is starting
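
The certificate start date change listed above can be illustrated with a small date calculation. This is a minimal sketch only, not the cmdaemon implementation: the function name not_before_for and the sample issue date are hypothetical, and the release note does not state the reason for the change (tolerating clock skew between nodes is a common reason for backdating a certificate).

```python
from datetime import datetime, timedelta, timezone

# Previous behaviour per the release note: validity started at the Unix epoch.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def not_before_for(issue_time: datetime) -> datetime:
    """Return a validity start date one calendar day before the issue time.

    Hypothetical helper for illustration; cmdaemon's own logic is not shown
    in the release notes.
    """
    return issue_time - timedelta(days=1)


# Sample issue date, chosen arbitrarily for the example.
issued = datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc)
start = not_before_for(issued)

print(start.isoformat())   # 2022-06-30T12:00:00+00:00
print(start > UNIX_EPOCH)  # True
```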

- Fixed Issues

* An issue where cmdaemon can crash if the Bright View monitoring tree call does not pass a context
* Add full support for multi-value http request parameters
* An issue where drain actions are not executed after a Kubernetes cluster node has been drained
* An issue with GPU MIG profiles configuration for Slurm, which can lead to MIG devices not being detected (correctly)
* In some cases, cmdaemon can crash when an instantiated AWS cloud compute node is removed
* An issue where cmdaemon may occasionally hang on SSL_read while stopping
* An issue with setting the correct numerical value for the RealMemory parameter in slurm.conf
* An issue where the default gateway may not be set on a cluster with an aliased external network interface
* An issue with updating the values of the SelectType and SelectTypeParameters parameters in the slurm.conf file
* An issue where the oomkiller health check may not detect that the OOM killer has run on RHEL8 compute nodes
* An issue where the monitoring trigger information is not pulled from all monitoring nodes
* In some cases, cmdaemon stop may be too slow due to an issue in the WLM NodeRunningJobCache
* An issue where the monitoring pickup intervals for lite nodes are not shown in cmsh
* Increase the length of the cmdaemon category names to a maximum of 128 characters
* Improve the error messages from the MIG scripts that are displayed in cmsh
* An issue with saving the LSF WLM job nodes in the cmdaemon DB
* An issue where cloud nodes may briefly go from PENDING to DOWN after they are powered on
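
As background for the multi-value HTTP request parameters item above: a multi-value parameter is a query key that appears more than once in a request. The sketch below uses only the Python standard library to show the difference between keeping all values and keeping only the last one. The parameter names hostname and limit are hypothetical and are not taken from the cmdaemon REST API.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string in which "hostname" is given twice.
query = "hostname=node001&hostname=node002&limit=10"

# parse_qs collects every value for a repeated key into a list.
params = parse_qs(query)
print(params["hostname"])  # ['node001', 'node002']
print(params["limit"])     # ['10']

# Collapsing the pairs into a plain dict keeps only the last value,
# which is how a multi-value parameter can be lost by a naive parser.
last_only = dict(parse_qsl(query))
print(last_only["hostname"])  # node002
```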

== node-installer ==

- Improvements

* Allow the node-installer to continue with configuring IPMI after encountering a failure to set username and password when the user already exists

- Fixed Issues

* An issue in the ilo_power.pl script, which can break the remote power management for nodes that use an ilo0 interface for the power control

== Bright View ==

- Fixed Issues

* An issue where cloning an instantiated cloud compute node with Bright View can result in the source node's cloud instance ID and IP address being carried over to the cloned node, which can result in the wrong node being power reset or terminated by cmdaemon
* An issue where cloning a software image with Bright View may result in an incorrect value for the FSPart of the cloned image, resulting in the cloned image using the original image's directory path
* An issue with using the root shell feature in Bright View
* An issue with cloning WLM job queues with Bright View

== cm-create-image ==

- Fixed Issues

* In some cases, cm-create-image can fail to copy /etc/resolv.conf from the head node due to a broken /etc/resolv.conf symlink in the software image

== cm-kubernetes-setup ==

- New Features

* Add support for NVIDIA GPU Operator, Prometheus Operator Stack, Prometheus Adapter
* Support for DGX Ubuntu and RHEL8 software images

- Fixed Issues

* Ensure that the cm-kubernetes-setup --default-cni-bin-dir flag updates all relevant roles

== cm-libpam ==

- New Features

* Allow groups to be whitelisted for WLM+PAM in /etc/security/pam_bright.d/pam_whitelist_group.conf

== cm-scale ==

- Fixed Issues

* An issue with setting the AS_ENGINE environment variable in the Auto Scaler allocation prolog and epilog

== cm-setup ==

- Fixed Issues

* An issue where cm-container-registry-setup does not correctly set up multiple registry certificates for containerd

== cm-uge ==

- Improvements

* Update the default settings in cm-uge to allow running OpenMPI jobs without involving ssh

== cm-wlm-setup ==

- Improvements

* Fix an issue with installing Pyxis on a multi-arch/multi-distro setup
* Deployment of IBM Spectrum LSF Suite is no longer supported. The supported option remains the deployment of LSF Standard Edition
* Automatically remove the WLM settings from the Auto Scaler configuration when the WLM is disabled

- Fixed Issues

* An issue that can occur in some cases when deploying a (Slurm) WLM on a cluster with multiple categories and software images

== cmsh ==

- Fixed Issues

* An issue where the XSD validation is not always loaded in cmsh when configuring a disk setup for the compute nodes

== head node installer client ==

- Fixed Issues

* An issue where using the "Continue remotely" feature of the head node installer can result in double execution of the installation steps

== ml ==

- New Features

* Introduced packages cm-cudnn8.2-cuda11.4
* Introduced packages cm-cudnn8.4-cuda11.4

== openpbs20 ==

- Fixed Issues

* Rebuild OpenPBS 20 with hwloc version 1, to resolve an issue where qsub -V crashes

== pythoncm ==

- Fixed Issues

* An issue where the pythoncm drain_status command in node.py passes an incorrectly named "wlms" argument, which leads to a crash when executed

== slurm ==

- Fixed Issues

* Rebuild the Ubuntu Slurm packages with cm-pmix3

== slurm21.08 ==

- Improvements

* Add a systemd service file to the slurm21.08-slurmrestd package