Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.1-13

== General ==

- New Features

* cm-docker has been upgraded to v20.10.17

- Improvements

* cm-kubernetes: Increase the resource requests and limits for flannel to prevent the cgroups from running out of memory
* Add Mellanox 5.6 OFED stack (mlnx-ofed56 packages)

- Fixed Issues

* mlnx-ofed: Incorrect values for the LD* environment variables in the Ubuntu openmpi module file
* An issue with installing individual Bright packages on RHEL8 / Rocky8 clusters with FIPS enabled due to the use of MD5 file digests rather than SHA256 file digests
* cm-kubernetes: make the https://:30443/dashboard ingress redirect to /dashboard/ to resolve browser-side issues wherein the browser shows an empty page instead of the dashboard
* An issue with using the pip command with cm-python38 due to a missing PackageFinder package
* mlnx-ofed56, mlnx-ofed55, mlnx-ofed54, mlnx-ofed49: added Mellanox OFED KMOD/KMP package build functionality for RPM based distributions
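
The FIPS item above comes down to which digest algorithm is used for file checksums: MD5 is generally not permitted on FIPS-enabled systems, while SHA-256 is. As an illustration only (this is not how the Bright packages are built or verified), the sketch below computes a SHA-256 file digest with Python's standard hashlib module. The helper names file_sha256 and demo are hypothetical.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Hash a small temporary file (hypothetical content) and clean up."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abc")
        path = tmp.name
    try:
        return file_sha256(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Reading in chunks keeps memory use flat for large package files.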

== cmdaemon ==

- New Features

* REST endpoint for workload/jobs

- Improvements

* Fix possible deadlock in the Prometheus manager
* Added an endpoint prometheus/api/v1/status/buildinfo for the latest Grafana
* Allow for monitoring triggers to set post-drain actions
* Add REST support to dump monitoring data for jobs
* Optimize the internal cmdaemon WLM state checks, which reduces the cmdaemon load when the job information is being cleaned up
* Rewrite of the mysql health check, so that it does not require the mysql password to be included on the command line
* An issue where the charge back filter for user project managers does not return all matches
* Include the command line arguments in the information events generated by cmdaemon when a kubectl command times out
* Modifying a network in cmdaemon that is used by Kubernetes will now request the relevant Kubernetes services to update their configuration and restart
* An issue where the monitoring data for completed jobs is not always removed, which can lead to cmdaemon allocating too much memory
* Add capability to monitor and capture CUDA GPU XID errors
* Ensure the head node(s) do not fall back to running in a compute node mode when mariadb is not in a good state while cmdaemon is starting
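
For the prometheus/api/v1/status/buildinfo item above, the following is a minimal sketch of how a client might read the version from such a reply. It assumes the endpoint mirrors the response shape of the upstream Prometheus HTTP API (a "status" field and a "data" object containing "version"), which these notes do not confirm. The sample body and the helper name parse_buildinfo are made up for illustration, and no cmdaemon URL, port, or authentication scheme is implied.

```python
import json

# Hypothetical reply body, shaped like the upstream Prometheus buildinfo response.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"version": "2.13.1", "revision": "example", "branch": "example",
          "buildUser": "example", "buildDate": "example", "goVersion": "go1.13.1"}}
"""


def parse_buildinfo(body):
    """Return the reported version string, or raise ValueError on a non-success reply."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("buildinfo request did not succeed")
    return payload["data"]["version"]


if __name__ == "__main__":
    print(parse_buildinfo(SAMPLE_RESPONSE))
```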

- Fixed Issues

* Fix an issue with extra trailing whitespace in the CMD_NODE_INSTALLER_PATH environment variable passed to the ilo_power.pl script, which affects power operations against nodes with iLO BMCs
* An issue where the oomkiller health check may not detect that the OOM killer has run on RHEL8 compute nodes
* An issue where the monitoring trigger information is not pulled from all monitoring nodes
* In some cases, cmdaemon stop may be too slow due to an issue in the WLM NodeRunningJobCache
* An issue where the monitoring pickup intervals for lite nodes are not shown in cmsh
* Increase the length of the cmdaemon category names to a maximum of 128 characters
* An issue where password crypt can generate duplicate edge site secret hashes
* An issue where some older base distribution versions of openssl are unable to create FIPS compliant DH parameters during add-on installation
* An issue with configuring the Postfix root alias in /etc/aliases on distros using Postfix 3.0 and higher, where emails to root on the compute nodes can no longer be delivered
* Fix a typo in the cmdaemon cookie manager, which in some cases can result in users being unable to log in to the user portal
* Add cm-apt-conf-image by default to the software image packages for Ubuntu, with a matching implementation in cmdaemon, to resolve an issue where an update of initramfs-tools in the software image prevents the compute nodes from booting on the Ubuntu base distro
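
The CMD_NODE_INSTALLER_PATH item above is an instance of a general pitfall: whitespace at the end of an environment variable becomes part of any path built from it. The sketch below shows the effect and one defensive way to handle it. The directory value and the helper name clean_env_path are hypothetical and are not taken from cmdaemon or ilo_power.pl.

```python
import posixpath


def clean_env_path(env, name):
    """Return env[name] with surrounding whitespace removed ('' if unset)."""
    return env.get(name, "").strip()


# Hypothetical value with an accidental trailing space.
env = {"CMD_NODE_INSTALLER_PATH": "/example/node-installer "}

# Joining the raw value leaves the space inside the resulting path.
raw = posixpath.join(env["CMD_NODE_INSTALLER_PATH"], "scripts")
# Stripping first yields the intended path.
fixed = posixpath.join(clean_env_path(env, "CMD_NODE_INSTALLER_PATH"), "scripts")

print(repr(raw))
print(repr(fixed))
```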

== cluster-tools ==

- Fixed Issues

* An issue with cm-chroot-sw-img being unable to chroot to the software image on SLES12sp3 due to a missing /bin/bash file
* An issue with the external-user-cert tool skipping the remaining users in a list if a certificate for one of the users already exists

== cm-clone-install ==

- Fixed Issues

* Fix a typo in cm-clone-install which causes scp failures and an error in logs

== cm-create-image ==

- Fixed Issues

* An issue where the sanity checks fail for archives created with a leading "./"
* In some cases, cm-create-image may fail on RHEL8 as it attempts to install an empty list of packages
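
The leading "./" item above arises because tar archives created from inside a directory (for example with "tar -cf archive.tar .") store member names such as "./etc/hostname" rather than "etc/hostname". The sketch below shows one way a check could normalize such names before comparing them. It is an assumption-laden illustration, not the actual cm-create-image sanity check; the function names and sample members are hypothetical.

```python
import io
import posixpath
import tarfile


def normalized_names(archive_bytes):
    """List member names with any leading "./" removed."""
    names = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        for member in tar.getmembers():
            name = posixpath.normpath(member.name)
            if name != ".":  # skip the archive root entry itself
                names.append(name)
    return names


def build_sample_archive():
    """Build an in-memory tar whose members carry a leading "./"."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in (("./etc/hostname", b"node001\n"), ("./bin/bash", b"")):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    print(normalized_names(build_sample_archive()))
```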

== cm-kubernetes-setup ==

- Improvements

* The 'enabled' fields under the 'calico:' and 'flannel:' blocks in the cm-kubernetes-setup configuration files are no longer used and have been removed
* Use the --overwrite command line flag when running kubectl taint to avoid errors when the taint already exists
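
As a rough illustration of the kubectl taint item above, the sketch below builds such a command line with the --overwrite flag, so that re-applying a taint updates it instead of failing because it already exists. The node name, taint key, value, and the helper name taint_command are hypothetical; the exact commands issued by cm-kubernetes-setup are not documented here. The command is only printed, not executed.

```python
import shlex


def taint_command(node, key, value, effect):
    """Build a kubectl taint command line that updates an existing taint in place."""
    return ["kubectl", "taint", "nodes", node, f"{key}={value}:{effect}", "--overwrite"]


# Hypothetical node and taint, for illustration only.
cmd = taint_command("node001", "example-key", "example-value", "NoSchedule")
print(shlex.join(cmd))
```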

- Fixed Issues

* Ensure swap is disabled on the compute nodes running Kubernetes
* A crash in cm-kubernetes-setup that can occur when dealing with interfaces without a network (such as interfaces that are part of a bond)
* Allow shorewall traffic between calico (cali+) wildcard interfaces to be routed back to the same interface, to resolve an issue where some services are unable to connect and are reporting a timeout
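
For the swap item above, one simple way to confirm that swap is off on a Linux node is to inspect /proc/swaps, which lists a header line followed by one line per active swap area. The sketch below parses text in that format. It is an illustrative check under that assumption, not the mechanism cm-kubernetes-setup uses; the function names and sample text are hypothetical.

```python
def active_swap_devices(proc_swaps_text):
    """Return the swap device names listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; each later line is one active swap area.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def read_proc_swaps(path="/proc/swaps"):
    """Read the kernel's swap table (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical sample contents for a node with and without swap.
WITH_SWAP = (
    "Filename    Type       Size     Used  Priority\n"
    "/dev/dm-1   partition  8388604  0     -2\n"
)
WITHOUT_SWAP = "Filename    Type       Size     Used  Priority\n"

print(active_swap_devices(WITH_SWAP))
print(active_swap_devices(WITHOUT_SWAP))
```

An empty list means no swap areas are active.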

== cm-libpam ==

- New Features

* Allow groups to be whitelisted for WLM+PAM in /etc/security/pam_bright.d/pam_whitelist_group.conf

== cm-wlm-setup ==

- Improvements

* Deployment of IBM Spectrum LSF Suite is no longer supported. The supported option remains the deployment of LSF Standard Edition

== cmdaemon ==

- Fixed Issues

* An issue with creating the ramdisk on the SLES15SP3-HPC base distro

== user portal ==

- Fixed Issues

* An issue with filtering jobs by date in the User Portal