Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.0-19

== General ==

- New Features

* Update cm-docker to v20.10.17

- Improvements

* Add cuda11.7 packages
* Add Mellanox 5.6 OFED stack (mlnx-ofed56 packages)
* Update cuda-dcgm to version 2.4.6.1
* Update cuda-driver to version 510.47.03
* Update cuda11.6 toolkit packages to 11.6 update 2
* Update mlnx-ofed49 to version 4.9-5.1.0.0
* Update mlnx-ofed56 to version 5.6-2.0.9.0
* Update nvhpc to version 22.3

- Fixed Issues

* mlnx-ofed: Incorrect values for the LD* environment variables in the Ubuntu openmpi module file
* An issue with installing individual Bright packages on RHEL8 / Rocky8 clusters with FIPS enabled due to the use of MD5 file digests rather than SHA256 file digests
* cm-kubernetes: make the https://:30443/dashboard ingress redirect to /dashboard/ to resolve browser-side issues where the browser will show an empty page instead of the dashboard
* mlnx-ofed56, mlnx-ofed55, mlnx-ofed54, mlnx-ofed49: added Mellanox OFED KMOD/KMP package build functionality for RPM based distributions
* cuda-driver: Load the nvidia_drm kernel module from the cuda-driver script, which otherwise can result in missing EGL devices in /dev/dri
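
As a rough illustration of the FIPS item above: MD5 is generally not an approved algorithm in FIPS mode, so file digests have to use an approved algorithm such as SHA-256. The sketch below is not Bright code. The helper name file_sha256 and the sample payload are hypothetical, and only the Python standard library is used.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-check against a temporary file with a hypothetical payload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    sample_path = tmp.name
sample_digest = file_sha256(sample_path)
os.remove(sample_path)
```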

== CMDaemon ==

- New Features

* Introduce new CMDaemon advanced configuration options for customizing global nginx.conf values
* Introduce new CMDaemon advanced configuration flag that will allow the use of the head node hostname instead of the default master value for the AccountingStorageHost and ControlAddr parameters in the slurm.conf file
* Introduce new CMDaemon advanced configuration flags that will allow specifying the From hostname in the email address for emails sent by the sendemail monitoring action
* Allow setting a custom per network interface MTU or disabling setting the MTU in the network interface configuration file

- Improvements

* Reduce verbosity of the 'result for obsolete tracker' messages, so that they are no longer included by default in the CMDaemon log file
* Improved logic when invalidating the nscd hosts cache on the compute nodes, to avoid cases where an outdated cache interferes with hostname lookups
* CMDaemon certificates are now generated with a start date of 1 calendar day before the issue date, instead of the Unix epoch
* An issue where cm-manipulate-advanced-config.py is missing a python import, resulting in a crash when executed
* Added an endpoint prometheus/api/v1/status/buildinfo for the latest Grafana
* Allow for monitoring triggers to set post-drain actions
* Improved mysql health check no longer requires the mysql password to be included on the command line
* Include the command line arguments in the information events generated by CMDaemon when a kubectl command times out
* Modifying a network in CMDaemon that is used by Kubernetes will now request the relevant Kubernetes services to update their configuration and restart
* An issue where monitoring data for completed jobs is not always removed, which in some cases leads to CMDaemon allocating too much memory
* Add category labels to the devices in PromQL
* New Program Runner tracing levels to make it less verbose by default, which can decrease the number of logged lines in the CMDaemon log file
* Disable the software image /boot directory associations for cloud directors with a list of localimages and allimages set to "no", which means that /boot of unrelated software images will no longer be synced to the cloud director
* In some cases, the cmsh monitoringdataproducer command can crash CMDaemon if the command is executed while in the category mode
* Added extra API endpoints to the Prometheus interface for Grafana for version 8.0
* Ensure the head node(s) do not fall back to running in a compute node mode when mariadb is not in a good state while CMDaemon is starting
* Decrease the timeouts for the CMDaemon service so that CMDaemon is stopped faster
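
The certificate start date item above can be sketched as simple date arithmetic. This is only an illustration of "1 calendar day before the issue date", not the CMDaemon implementation. The function name certificate_not_before and the issue time are hypothetical, and the release notes do not state the motivation for the change.

```python
from datetime import datetime, timedelta, timezone


def certificate_not_before(issue_time):
    """Return a validity start one calendar day before the issue time."""
    return issue_time - timedelta(days=1)


# Hypothetical issue time, chosen only for the example.
issued = datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc)
not_before = certificate_not_before(issued)
```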

- Fixed Issues

* An issue with CMDaemon event delivery to edge nodes, which can result in outdated information about committed entities
* An issue with setting up Kubernetes when the passive head node is the active leader according to Etcd, which in some cases results in Kubernetes not being able to initialize properly
* An issue where PBS queue options set in CMDaemon may not be set in the PBS server configuration
* An issue with generating a valid Kubernetes kubeconfig for users with special characters in their login name
* Performance improvements of the user manager
* Rare crash in CMDaemon while cloning an image
* An issue where CMDaemon can crash if the Bright View monitoring tree call does not pass a context
* Add full support for multi-value http request parameters, which resolves an issue where the "CMDaemon ready" service is not able to handle a list of services by name
* In some cases, terminating spot instances with CMDaemon may fail if the spot request has been cancelled outside of CMDaemon
* An issue where CMDaemon may occasionally hang on SSL_read while stopping
* An issue where the oomkiller health check may not detect the OOM killer has run on RHEL8 compute nodes
* An issue where password crypt can generate duplicate edge site secret hashes
* An issue where some older base distribution versions of openssl are unable to create FIPS compliant DH parameters during add-on installation
* An issue with configuring the Postfix root alias in /etc/aliases on distros using Postfix 3.0 and higher, where emails to root on the compute nodes can no longer be delivered
* Do not retry CMProc::rexecCommand when the ptracker is no longer defined, which otherwise can result in error messages in the CMDaemon log file
* An issue with dumping the data for all entities and measurables when using the REST API
* An issue with the pythoncm programrunnerstatus kill method not working in some cases
* An issue where CMDaemon may attempt to start slurmdbd service before its configuration file has been updated after HA takeover
* A typo in the CMDaemon cookie manager, which in some cases can result in users being unable to log in to the user portal
* An issue deploying openpbs with the server role assigned to multiple compute nodes
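
To illustrate what a multi-value HTTP request parameter looks like (see the "CMDaemon ready" item above), the sketch below repeats one parameter in a query string and parses it back into a list. The parameter name "name" and the service names are hypothetical. The actual CMDaemon request format is not given in these notes.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical parameter name and service names, for illustration only.
query = urlencode([("name", "slurmd"), ("name", "nslcd")])

# parse_qs collects repeated keys into a list instead of keeping one value.
params = parse_qs(query)
services = params.get("name", [])
```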

== Bright View ==

- Fixed Issues

* An issue with updating properties such as fsmount when a node or category also has static routes, resulting in an error message "The destination cannot be empty"
* An issue with showing the Last Change date for users
* An issue with clearing the BMC user-id setting in Bright View when the value is negative

== Node Installer ==

- Improvements

* New disableNodeInstallerNFSCertificateStore configuration setting in the node-installer.conf file to allow for disabling the certificates mount
* Allow the node-installer to continue configuring IPMI after a failure to set username and password if the user already exists

- Fixed Issues

* An issue with the configure_ipmi.pl script not working when the user id is set to 0
* An issue where disabled provisioning associations in the node-installer may still be rsynced

== Cluster Tools ==

- Fixed Issues

* An issue with cloning the mysql database using cmha dbreclone when a /root/.my.cnf configuration file with other mysql credentials exists

== cmjob ==

- Fixed Issues

* An issue with transferring the outputs of PBS job arrays

== Machine Learning ==

- New Features

* Introduce packages cm-cudnn8.2-cuda11.4
* Introduce packages cm-cudnn8.4-cuda11.4

- Improvements

* Introduce environment variable JUPYTER_KERNEL_TEMPLATES_DIR for cm-jupyter-kernel-creator templates

== cm-create-image ==

- Fixed Issues

* An issue where images created with cm-create-image do not preserve the xattrs of the base tar image
* An issue where node-installer images created using the cm-create-image tool do not have an updated rsyslog.conf file
* An issue where the sanity checks fail for archives created with a leading "./"
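
The leading "./" item above comes down to two spellings of the same archive member path. A minimal sketch of normalizing such names before comparing them is shown below. It does not reproduce the cm-create-image sanity checks. The helper name normalize_member and the member names are hypothetical.

```python
import posixpath


def normalize_member(name):
    """Collapse a leading './' so './etc/hosts' and 'etc/hosts' compare equal."""
    return posixpath.normpath(name)


# Hypothetical archive member names, with and without the leading "./".
members = ["./", "./etc/hosts", "etc/hosts", "./boot/"]
normalized = [normalize_member(member) for member in members]
```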

== cm-kubernetes-setup ==

- Improvements

* Enable by default the selection of newer Kubernetes versions in the cm-kubernetes-setup screens, which until now was available only if a special command line option was used
* An issue with Kubernetes on Edge deployments, where the stage "waiting for Root Service Account" is performed too early and may not complete successfully in some cases
* In the Kubernetes module files, remove the MANPATH definitions which are no longer used
* The 'enabled' fields under the 'calico:' and 'flannel:' blocks in the cm-kubernetes-setup configuration files are no longer used and have been removed
* Use the --overwrite command line flag when running kubectl taint to avoid errors when the taint already exists
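
The kubectl taint item above can be sketched as building the command line with the --overwrite flag appended. This is only an illustration and does not run kubectl. The helper name taint_command, the node name and the taint are hypothetical and are not taken from cm-kubernetes-setup.

```python
import shlex


def taint_command(node, taint, overwrite=True):
    """Build a kubectl taint argument list, optionally with --overwrite."""
    argv = ["kubectl", "taint", "nodes", node, taint]
    if overwrite:
        # Lets kubectl replace an existing taint instead of rejecting the update.
        argv.append("--overwrite")
    return argv


# Hypothetical node name and taint.
cmd = taint_command("node001", "dedicated=gpu:NoSchedule")
printable = shlex.join(cmd)
```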

- Fixed Issues

* Allow shorewall traffic between calico (cali+) wildcard interfaces to be routed back to the same interface, to resolve an issue where some services are unable to connect and are reporting a timeout

== cm-scale ==

- Fixed Issues

* An issue where failures to create cloud node instances are not detected in some cases

== cm-uge ==

- Improvements

* Update the default settings in cm-uge to allow running OpenMPI jobs without involving ssh

== cm-wlm-setup ==

- Improvements

* Deployment of IBM Spectrum LSF Suite is no longer supported. The supported option remains the deployment of LSF Standard Edition
* Automatically remove the WLM settings from the Auto Scaler configuration when the WLM is disabled

- Fixed Issues

* An issue with making the pbs.service file available on the compute nodes with offloaded PBSPro server role, which prevents the PBSPro server from starting during the setup

== cmsh ==

- Fixed Issues

* An issue where the XSD validation is not always loaded in cmsh when configuring a disk setup for the compute nodes
* An issue where tab completions do not work in the cmsh role mode
* cmsh color off command doesn't turn off all colors
* An issue where cloning users or groups in cmsh does not reset some of the settings to the correct default values
* cmsh permissions on Ubuntu are 700 instead of 755
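
For the permissions item above, the difference between modes 700 and 755 can be made visible with the Python standard library. The sketch only renders the two modes in ls-style notation. The helper name describe_mode is hypothetical and no cmsh file is inspected.

```python
import stat


def describe_mode(mode):
    """Render the permission bits of a regular file in ls-style notation."""
    return stat.filemode(stat.S_IFREG | mode)


restricted = describe_mode(0o700)  # owner only
expected = describe_mode(0o755)    # owner, group and others can read and execute

# With 700, users other than the owner lack the execute bit.
others_can_execute_700 = bool(0o700 & stat.S_IXOTH)
others_can_execute_755 = bool(0o755 & stat.S_IXOTH)
```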

== openpbs22.05 ==

- Improvements

* Add OpenPBS 22.05 integration

== slurm ==

- Improvements

* Rebuild the Ubuntu Slurm packages with cm-pmix3

== slurm21.08 ==

- Improvements

* Upgrade to 21.08.6

- Fixed Issues

* An issue where srun, at the end of its execution, produces messages that it is unable to read files under /sys/fs/cgroup