Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.1-12
== General ==
- New Features
* Update cm-docker to v20.10.14
* Support for SUSE Linux Enterprise Server 15 SP3
- Improvements
* Update nvhpc to version 22.3
* Update cuda-dcgm to version 2.3.5-1
* Update openssl to 3.0.2 / 1.1.1n for CVE-2022-0778
* Update cuda-driver to version 510.47.03
* Update cuda11.6 toolkit packages to 11.6 update 2
* Use kernel version 5.13.0 by default when installing Bright with Ubuntu 20.04 base distribution
* Added CUDA 11.6 packages
* Added mlnx-ofed55 package
* Updated mlnx-ofed49 to version 4.9-4.1.7.0
- Fixed Issues
* Load the nvidia_drm kernel module from the cuda-driver script; without it, EGL devices can be missing from /dev/dri
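
The OpenSSL update listed above names 3.0.2 and 1.1.1n as the releases that address CVE-2022-0778. As a rough illustration only, the sketch below compares an upstream-style version string against those two releases. The function names are hypothetical and not part of any Bright tool, and distribution packages often backport fixes without changing the upstream version string, so a negative result here is not proof that a system is vulnerable.

```python
import re


def parse_openssl_version(text):
    """Split a version such as '1.1.1n' or '3.0.2' into (major, minor, patch, letter)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized OpenSSL version: {text!r}")
    major, minor, patch, letter = match.groups()
    return int(major), int(minor), int(patch), letter


def has_cve_2022_0778_fix(version):
    """Return True if the upstream version is at or past 3.0.2 or 1.1.1n.

    Only the 3.x and 1.1.1 series named in the release note are considered;
    any other series is reported as not fixed (an assumption of this sketch).
    """
    major, minor, patch, letter = parse_openssl_version(version)
    if major >= 3:
        return (major, minor, patch) >= (3, 0, 2)
    if (major, minor, patch) == (1, 1, 1):
        # Compare by length first so that a two-letter suffix sorts after 'z'.
        return (len(letter), letter) >= (1, "n")
    return False


if __name__ == "__main__":
    for candidate in ("1.1.1m", "1.1.1n", "3.0.1", "3.0.2"):
        print(candidate, has_cve_2022_0778_fix(candidate))
```

The version string itself would typically come from the package manager or from `openssl version`; how it is obtained is left out of the sketch.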
== cmdaemon ==
- New Features
* Added an option to get the latest monitoring counters using REST
* Allow setting a custom per network interface MTU or disabling setting the MTU in the network interface configuration file
* Allow the compute nodes to request a quick pickup of the monitoring data if a health check fails so that monitoring actions can be executed sooner
- Improvements
* Add category labels to the devices in PromQL
* New Program Runner tracing levels to make it less verbose by default, which can decrease the number of logged lines in the cmdaemon log file
* Allow /cm/shared to be provisioned to the passive edge director for cases without shared storage at the edge
* Added the wlm filter cmsh command to cm-diagnose to collect workload manager job information
* New REST endpoints for license, version, device, and sysinfo
* Fixed a case where the cmsh monitoringdataproducer command can crash cmdaemon when executed in category mode
* Added extra API endpoints to the Prometheus interface for Grafana for version 8.0
* Allow extra lines to be added to the ifcfg configuration file from the network interface definition
* Added new accounting and reporting GPU queries for group, account, job_name, and total GPU*s used
* Determine if a user has a running job using cmutil PS process tracking
* Validate ramdisk creation is enabled for edge directors
* Assigning a director role to a category or an overlay is no longer allowed
* Add rsync excludes for named (DNS) files for edge and cloud directors
- Fixed Issues
* Add the monitoring action 'info' parameter value to the monitoring email
* Crash in cm-manipulate-advanced-config.py
* Do not retry CMProc::rexecCommand when the ptracker is no longer defined, which otherwise can result in error messages in the cmdaemon log file
* An issue with sampling the user counts metric for a head node
* GPU utilization metric can show a large number instead of no-data when MIG is enabled
* An issue with dumping the data for all entities and measurables when using the REST API
* An issue with pythoncm programrunnerstatus kill method not working in some cases
* A rare cmdaemon crash while terminating cloud nodes
* Sort the workload manager jobs before applying a filter when listing them in the User Portal or Bright View
* An issue where cmdaemon may attempt to start slurmdbd service before its configuration file has been updated after HA takeover
* Exclude /proc and /sys from the /cm/node-installer rsync
* An issue with regenerating user certificates when a new license is installed without re-using the private key
* In the case of head node HA, cmd.conf global configuration settings can get stored and sent to nodes multiple times
* In some cases, job detection can stop working due to kernel inotify issues
* An issue where cmdaemon still tries to get the job-end data from workload managers even when, due to configuration settings, the workload managers may not provide it
* An issue with pbspro/openpbs server not becoming available on time while deploying pbspro/openpbs on clusters with head node HA setup
* An issue deploying openpbs with the server role assigned to multiple compute nodes
* An issue with the generated DNS zone files when the cluster is extended to two or more cloud regions, which in some cases can prevent the named service from starting
== node-installer ==
- Improvements
* New disableNodeInstallerNFSCertificateStore configuration setting in the node-installer.conf file to allow for disabling the certificates mount
- Fixed Issues
* An issue where disabled provisioning associations in the node-installer may still be rsynced
== cluster-tools ==
- Fixed Issues
* Improve the log messages from the DAS shared storage mount and umount scripts to include the hostname of the head node
* An issue with executing the pre- and post-failover scripts for mounting and umounting DAS shared storage for head node HA setup
== Bright View ==
- Fixed Issues
* An issue with clearing the BMC user-id setting in Bright View when the value is negative
* An issue where some setup wizards were not available in Bright View when using Rocky base distribution
== cm-image ==
- Fixed Issues
* On multi-arch or multi-distro clusters, create a new soft-link /cm/node-installer to the distro/arch-specific node-installer to allow for tools such as the cm-clone-install script to detect the head node disk layout
== cm-kubernetes-setup ==
- Fixed Issues
* An issue with deploying Kubernetes when using Flannel for networking
* An issue with updating the Shorewall configuration when deploying Kubernetes on head nodes with bonded network interfaces
* An issue where newer versions of Kubernetes may not be allowed even when using the --include-newer-versions command line option
== cm-scale ==
- Improvements
* Improved detection of the job's GPU requests for Slurm version >= 21
- Fixed Issues
* In some cases, an issue with detecting failures to create cloud node instances
* An issue with draining multiple LSF nodes at once via cmdaemon
* An issue with parsing the CPU cores from the cloud node settings for AWS
== cm-wlm-setup ==
- Fixed Issues
* An issue with setting up lsf-suite workload manager when /tmp is in a different partition
* An issue with setting up a second UGE instance when the first one is already set up on the head node(s) and /cm/shared is on NFS
== cmsh ==
- Improvements
* Display the full labeled entity index when the --verbose flag is used in cmsh
- Fixed Issues
* Tab completion for zones and policies in the cmsh roles mode
* An issue where tab completions do not work in the cmsh role mode
* cmsh color off command doesn't turn off all colors
* An issue where cloning users or groups in cmsh does not reset some of the settings to the correct default values
* Incomplete list of network interface types on the help page of the cmsh addinterface command
* cmsh permissions on Ubuntu are 700 instead of 755
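
The permissions item above (700 instead of 755) can be checked on a node with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the location of the cmsh binary is not stated in these notes, so the demonstration runs against a temporary file rather than a real installation path.

```python
import os
import stat
import tempfile


def permission_bits(path):
    """Return only the permission bits of a file, e.g. 0o755."""
    return stat.S_IMODE(os.stat(path).st_mode)


def usable_by_other_users(path):
    """True if group and others both have read and execute permission."""
    return permission_bits(path) & 0o055 == 0o055


if __name__ == "__main__":
    # Stand-in file; on a real node, pass the path of the cmsh executable.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        sample = handle.name
    try:
        os.chmod(sample, 0o700)
        print(oct(permission_bits(sample)), usable_by_other_users(sample))
        os.chmod(sample, 0o755)
        print(oct(permission_bits(sample)), usable_by_other_users(sample))
    finally:
        os.remove(sample)
```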
== cod ==
- Improvements
* Apply the tags specified during the creation of an AWS COD cluster also to the AWS nodes and EBS volumes created by the head node after the cluster is set up
== ml ==
- New Features
* Updated cm-tensorflow-* packages to v2.7.0
* Introduced support for Machine Learning packages on py39/gcc9
- Improvements
* Introduced environment variable JUPYTER_KERNEL_TEMPLATES_DIR for cm-jupyter-kernel-creator templates
* Stopped upgrading ML packages cm-dynet-*
== slurm21.08 ==
- Improvements
* Upgrade to 21.08.6
- Fixed Issues
* NVIDIA MIG autodetection in slurm 21.08
== user portal ==
- Fixed Issues
* An issue where the User Portal may report "Cannot read properties of undefined" when visiting the cluster overview page