Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.2-14
== General ==
- New Features
* Add cm-list-image-conf-files.py script to list all special files in /cm/conf/
* Add cuda12.2 packages
* Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470
- Improvements
* Preserve files in /cm/images//cm/conf/{node,category}/ while updating images with rsync
* Remove field for the CPU frequency scaling governor
* Update cm-openssl package to 3.0.10
* Update mlnx-ofed58 package to 5.8-3.0.7.0
* Update mlnx-ofed54 package to 5.4-3.7.5.0
* Update mlnx-ofed49 package to 4.9-7.1.0.0
- Fixed Issues
* Delete duplicate entries in /etc/nginx/nginx.conf
== CMDaemon ==
- Improvements
* Allow cm-mig-manage to support GPUs that do not have index = minorID
* Improve the daily cron script that creates monthly backup files for openldap-servers so that it also includes backups older than 1 year
* Do not populate status for each node in the environment to avoid multiple slow RPCs
* Redirect all stdout/stderr from a cmburn test script to a log file
* Add the --certificate and --key options to the cmsh help
- Fixed Issues
* Fix killing jobs on a node when CMDaemon is restarted on that node
* Update node environment cache when automatically changing FS exports
* Make image updates on provisioning nodes wait for provisioning operations on other nodes to complete before proceeding
* Detect xvd* disk in sysinfo
* Fix help of cmsh cert removerequest command
* Ensure named gets reloaded when network changes are made
* Fix doPrint call in mounts health check
* Fix false negative open --failbeforedown when a status value is unchanged
* Fix typo guage -> gauge
== Node Installer ==
- Fixed Issues
* Fix booting of compute nodes with separate /usr filesystem
== Cloud ==
- Fixed Issues
* Fix various issues with Azure locations caused by Azure API errors
* Improve support for AWS spot instances
== Kubernetes ==
- Improvements
* Update GPU operator to 23.3.2
* Update Kyverno to 3.0.4 (due to incompatibility with Kubernetes 1.27.x)
- Fixed Issues
* Fix NVIDIA GPU Operator deployment always resulting in NVIDIA packages being installed
* Update exclude lists for Kubernetes to avoid failures on "grabimage"
== Workload Management ==
- New Features
* cm-wlm-setup now installs enroot on login nodes if pyxis is set up
- Improvements
* Update slurm23.02 package to 23.02.2
* Update PMIx to 4.1.3
== Machine Learning ==
- New Features
* Add ML package cm-cudnn8.8-cuda*
== Container Registries ==
- Fixed Issues
* Generate containerd certificates when a registry mirror is not configured
== Monitoring ==
- New Features
* Add support for Grafana 10
- Improvements
* Reduce memory usage spike when using PromQL over short timespans
* Multiply metric value by 100 when displaying % in pythoncm
- Fixed Issues
* Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts
* Fix samplenow CPUUsage metric
* Ensure first data sample of a Prometheus sampler is stored to the database
* Fix metrics sampling when temperatures are not provided by the Redfish API