Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.2-14

== General ==

- New Features

* Add cm-list-image-conf-files.py script to list all special files in /cm/conf/
* Add cuda12.2 packages
* Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470

- Improvements

* Preserve files in /cm/images//cm/conf/{node,category}/ while updating images with rsync
* Remove field for the CPU frequency scaling governor
* Update cm-openssl package to 3.0.10
* Update mlnx-ofed58 package to 5.8-3.0.7.0
* Update mlnx-ofed54 package to 5.4-3.7.5.0
* Update mlnx-ofed49 package to 4.9-7.1.0.0

- Fixed Issues

* Delete duplicate entries in /etc/nginx/nginx.conf

== CMDaemon ==

- Improvements

* Allow cm-mig-manage to support GPUs that do not have index = minorID
* Improve the daily cron script that creates monthly backup files for openldap-servers so that it also includes backups older than 1 year
* Do not populate status for each node in the environment to avoid multiple slow RPCs
* Redirect all stdout/stderr from a cmburn test script to a log file
* Add --certificate and --key options to cmsh help

- Fixed Issues

* Fix killing jobs on a node when CMDaemon is restarted on that node
* Update node environment cache when automatically changing FS exports
* Make image updates on provisioning nodes wait for provisioning operations on other nodes to complete before proceeding
* Detect xvd* disk in sysinfo
* Fix help of cmsh cert removerequest command
* Ensure named gets reloaded when network changes are made
* Fix doPrint call in mounts health check
* Fix false negative open --failbeforedown when a status value is unchanged
* Fix typo guage -> gauge
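
As an illustration of the "Detect xvd* disk in sysinfo" item above, here is a minimal Python sketch of glob-style matching on block device names. The pattern list and the helper name is_disk are assumptions made for this example only; they are not taken from CMDaemon's actual sysinfo logic.

```python
from fnmatch import fnmatch

# Hypothetical pattern list for illustration; xvd* covers Xen virtual
# disks such as xvda, which a list without that pattern would skip.
DISK_PATTERNS = ["sd*", "vd*", "nvme*", "xvd*"]


def is_disk(name: str) -> bool:
    """Return True if the device name matches any known disk pattern."""
    return any(fnmatch(name, pattern) for pattern in DISK_PATTERNS)


# Filter a sample list of device names down to the recognized disks.
print([n for n in ["sda", "xvda", "loop0", "xvdb"] if is_disk(n)])
# prints ['sda', 'xvda', 'xvdb']
```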

== Node Installer ==

- Fixed Issues

* Fix booting of compute nodes with separate /usr filesystem

== Cloud ==

- Fixed Issues

* Fix various issues with Azure locations caused by Azure API errors
* Improve support for AWS spot instances

== Kubernetes ==

- Improvements

* Update GPU operator to 23.3.2
* Update Kyverno to 3.0.4 (due to incompatibility with Kubernetes 1.27.x)

- Fixed Issues

* Fix NVIDIA GPU Operator deployment always resulting in NVIDIA packages being installed
* Update exclude lists for Kubernetes to avoid failures on "grabimage"

== Workload Management ==

- New Features

* cm-wlm-setup now installs enroot on login nodes if pyxis is set up

- Improvements

* Update slurm23.02 package to 23.02.2
* Update PMIX to 4.1.3

== Machine Learning ==

- New Features

* Add ML package cm-cudnn8.8-cuda*

== Container Registries ==

- Fixed Issues

* Generate containerd certificates when a registry mirror is not configured

== Monitoring ==

- New Features

* Add support for Grafana 10

- Improvements

* Reduce memory usage spike when using PromQL over short timespans
* Multiply metric value by 100 when displaying % in pythoncm
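
The percentage item above can be sketched as follows. This is a minimal illustration that assumes the metric is stored as a fraction between 0 and 1 and carries the unit "%"; the function name format_metric is hypothetical and is not part of the pythoncm API.

```python
def format_metric(value: float, unit: str) -> str:
    """Render a metric value; values with unit '%' are scaled by 100."""
    if unit == "%":
        # Assumption: the stored value is a fraction (0.25 means 25%).
        return f"{value * 100:.1f}%"
    return f"{value:g} {unit}".strip()


print(format_metric(0.25, "%"))   # prints 25.0%
print(format_metric(512, "MiB"))  # prints 512 MiB
```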

- Fixed Issues

* Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts
* Fix samplenow CPUUsage metric
* Ensure first data sample of a Prometheus sampler is stored to the database
* Fix metrics sampling when temperatures are not provided by the Redfish API