Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.1-12

== General ==

- New Features

* Update cm-docker to v20.10.14
* Support for SUSE Linux Enterprise Server 15 SP3

- Improvements

* Update nvhpc to version 22.3
* Update cuda-dcgm to version 2.3.5-1
* Update openssl to 3.0.2 / 1.1.1n for CVE-2022-0778
* Update cuda-driver to version 510.47.03
* Update cuda11.6 toolkit packages to 11.6 update 2
* Use kernel version 5.13.0 by default when installing Bright with Ubuntu 20.04 base distribution
* Added CUDA 11.6 packages
* Added mlnx-ofed55 package
* Updated mlnx-ofed49 to version 4.9-4.1.7.0

- Fixed Issues

* Load the nvidia_drm kernel module from the cuda-driver script; otherwise EGL devices can be missing from /dev/dri

== cmdaemon ==

- New Features

* Added an option to get the latest monitoring counters using REST (see the sketch after this list)
* Allow setting a custom MTU per network interface, or disabling setting the MTU, in the network interface configuration file
* Allow the compute nodes to request a quick pickup of the monitoring data if a health check fails so that monitoring actions can be executed sooner
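
The REST item above can be pictured with a minimal Python sketch. Everything product-specific in it is an assumption made for illustration: the port, the endpoint path, and the certificate locations are placeholders rather than the documented cmdaemon REST API, so the cmdaemon documentation should be consulted for the real names.

    # Sketch only: the port, endpoint path, and certificate paths are placeholders,
    # not the documented cmdaemon REST API.
    import requests

    BASE_URL = "https://head-node:8081"                     # assumed cmdaemon HTTPS port
    CERT = ("/root/.cm/admin.pem", "/root/.cm/admin.key")   # assumed admin certificate pair

    # Hypothetical endpoint returning the latest monitoring counters
    resp = requests.get(f"{BASE_URL}/rest/v1/monitoring/latest",
                        cert=CERT, verify=False, timeout=30)
    resp.raise_for_status()
    for counter in resp.json():
        print(counter)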

- Improvements

* Add category labels to the devices in PromQL (see the sketch after this list)
* New Program Runner tracing levels that are less verbose by default, which can decrease the number of lines logged in the cmdaemon log file
* Allow /cm/shared to be provisioned to the passive edge director for cases without shared storage at the edge
* Added the wlm filter cmsh command to cm-diagnose so that some workload manager job information can be collected
* New REST endpoints for license, version, device, and sysinfo
* Addressed cases where the cmsh monitoringdataproducer command could crash cmdaemon when executed in category mode
* Added extra API endpoints to the Prometheus interface for Grafana version 8.0
* Allow extra lines to be added to the ifcfg configuration file from the network interface definition
* Added new accounting and reporting GPU queries for group, account, job_name, and total GPU*s used
* Determine if a user has a running job using cmutil PS process tracking
* Validate that ramdisk creation is enabled for edge directors
* Assigning a director role to a category or an overlay is no longer allowed
* Add rsync excludes for named (DNS) files for edge and cloud directors
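
A sketch of how the PromQL category labels and the Prometheus-compatible endpoints mentioned above could be queried is shown below. The base path, the metric name, and the label name are assumptions for illustration only; the standard Prometheus /api/v1/query request shape is used here, but the real paths and names should be taken from the monitoring documentation.

    # Sketch only: base path, metric name, and label name are assumptions.
    import requests

    BASE_URL = "https://head-node:8081"                     # assumed cmdaemon HTTPS port
    CERT = ("/root/.cm/admin.pem", "/root/.cm/admin.key")   # assumed admin certificate pair

    # Hypothetical instant query selecting a metric by the new category label
    query = 'some_metric{category="default"}'               # placeholder metric/label names
    resp = requests.get(f"{BASE_URL}/prometheus/api/v1/query",
                        params={"query": query},
                        cert=CERT, verify=False, timeout=30)
    resp.raise_for_status()
    print(resp.json()["data"]["result"])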

- Fixed Issues

* Add the monitoring action 'info' parameter value to the monitoring email
* Crash in cm-manipulate-advanced-config.py
* Do not retry CMProc::rexecCommand when the ptracker is no longer defined, which otherwise can result in error messages in the cmdaemon log file
* An issue with sampling the user counts metric for a head node
* GPU utilization metric can show a large number instead of no-data when MIG is enabled
* An issue with dumping the data for all entities and measurables when using the REST API
* An issue with pythoncm programrunnerstatus kill method not working in some cases
* A rare cmdaemon crash while terminating cloud nodes
* Sort the workload manager jobs before applying a filter when listing them in the User Portal or Bright View
* An issue where cmdaemon may attempt to start the slurmdbd service before its configuration file has been updated after an HA takeover
* Exclude /proc and /sys from the /cm/node-installer rsync
* An issue with regenerating user certificates when a new license is installed without re-using the private key
* In the case of head node HA, cmd.conf global configuration settings can get stored and sent to nodes multiple times
* In some cases, job detection can stop working due to kernel inotify issues
* An issue where cmdaemon still tries to get job-end data from workload managers even when, due to configuration settings, the workload managers may not provide that data
* An issue with the pbspro/openpbs server not becoming available in time while deploying pbspro/openpbs on clusters with a head node HA setup
* An issue deploying openpbs with the server role assigned to multiple compute nodes
* An issue with the generated DNS zone files when the cluster is extended to two or more cloud regions, which in some cases can prevent the named service from starting

== node-installer ==

- Improvements

* New disableNodeInstallerNFSCertificateStore configuration setting in the node-installer.conf file to allow for disabling the certificates mount
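
As an illustration of the setting above, the snippet below ensures the option is present in the node-installer.conf file. The file path, the "key = value" syntax, and the accepted value are assumptions made for this sketch; the comments in the shipped node-installer.conf are the authoritative reference.

    # Sketch only: the conf file path, "key = value" syntax, and the value "true"
    # are assumptions, not verified against the shipped node-installer.conf.
    from pathlib import Path

    conf = Path("/cm/node-installer/scripts/node-installer.conf")   # assumed location
    text = conf.read_text()
    if "disableNodeInstallerNFSCertificateStore" not in text:
        conf.write_text(text.rstrip("\n") + "\ndisableNodeInstallerNFSCertificateStore = true\n")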

- Fixed Issues

* An issue where disabled provisioning associations in the node-installer may still be rsynced

== cluster-tools ==

- Fixed Issues

* Improve the log messages from the DAS shared storage mount and umount scripts to include the hostname of the head node
* An issue with executing the pre- and post-failover scripts for mounting and umounting DAS shared storage for head node HA setup

== Bright View ==

- Fixed Issues

* An issue with clearing the BMC user-id setting in Bright View when the value is negative
* An issue where some setup wizards were not available in Bright View when using Rocky base distribution

== cm-image ==

- Fixed Issues

* On multi-arch or multi-distro clusters, create a new soft-link /cm/node-installer to the distro/arch-specific node-installer to allow for tools such as the cm-clone-install script to detect the head node disk layout

== cm-kubernetes-setup ==

- Fixed Issues

* An issue with deploying Kubernetes when using Flannel for networking
* An issue with updating the Shorewall configuration when deploying Kubernetes on head nodes with bonded network interfaces
* An issue where newer versions of Kubernetes may not be allowed even when using the --include-newer-versions command line option

== cm-scale ==

- Improvements

* Improved detection of a job's GPU requests for Slurm version >= 21
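
As a generic illustration of the item above (not cm-scale's actual code), the helper below pulls a GPU count out of a Slurm TRES-style request string; the example input formats are assumptions chosen to show why newer Slurm versions may need different parsing.

    # Generic sketch, not cm-scale's implementation: extract a GPU count from a
    # Slurm TRES-style string. The example formats are assumptions for illustration.
    import re

    def gpu_count(tres: str) -> int:
        m = re.search(r"gres[/:]gpu(?::[^=:,]+)?[=:](\d+)", tres)
        return int(m.group(1)) if m else 0

    print(gpu_count("cpu=4,mem=16G,gres/gpu=2"))   # -> 2
    print(gpu_count("gres:gpu:a100:4"))            # -> 4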

- Fixed Issues

* In some cases, an issue with detecting failures to create cloud node instances
* An issue with draining multiple LSF nodes at once via cmdaemon
* An issue with parsing the CPU cores from the cloud node settings for AWS

== cm-wlm-setup ==

- Fixed Issues

* An issue with setting up lsf-suite workload manager when /tmp is in a different partition
* An issue with setting up a second UGE instance when the first one is already set up on the head node(s) and /cm/shared is on NFS

== cmsh ==

- Improvements

* Display the full labeled entity index when the --verbose flag is used in cmsh

- Fixed Issues

* Tab completion for zones and policies in the cmsh roles mode
* An issue where tab completions do not work in the cmsh roles mode
* cmsh color off command doesn't turn off all colors
* An issue where cloning users or groups in cmsh does not reset some of the settings to the correct default values
* Incomplete list of network interface types on the help page of the cmsh addinterface command
* cmsh permissions on Ubuntu are 700 instead of 755

== cod ==

- Improvements

* Also apply the tags specified during the creation of an AWS COD cluster to the AWS nodes and EBS volumes created by the head node after the cluster is set up

== ml ==

- New Features

* Updated cm-tensorflow-* packages to v2.7.0
* Introduced support for Machine Learning packages on py39/gcc9

- Improvements

* Introduced the JUPYTER_KERNEL_TEMPLATES_DIR environment variable for cm-jupyter-kernel-creator templates (see the sketch after this list)
* Stopped upgrading the cm-dynet-* ML packages
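
A minimal sketch of setting the new environment variable from Python is shown below; the template directory path is a placeholder, and how cm-jupyter-kernel-creator consumes the variable should be taken from its documentation.

    # Sketch only: point the kernel creator at a custom template directory.
    # The directory path is a placeholder.
    import os

    os.environ["JUPYTER_KERNEL_TEMPLATES_DIR"] = "/home/demo/jupyter-kernel-templates"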

== slurm21.08 ==

- Improvements

* Upgrade to 21.08.6

- Fixed Issues

* NVIDIA MIG autodetection in slurm 21.08

== user portal ==

- Fixed Issues

* An issue where the User Portal may report "Cannot read properties of undefined" when visiting the cluster overview page