Base Command Manager / Bright Cluster Manager Release Notes
Release notes for BCM 10.23.11
== General ==
New Features
* Added support for SLES15 SP5
Improvements
* Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false -> true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true -> false)
* Updated cuda-driver package to 535.129.03
== CMDaemon ==
New Features
* Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
* Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd
* Added cmsh command to show dhcpd leases
* Added Border Gateway Protocol (BGP) overview for Cumulus switches
* Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
* Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
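
The special default gateway value in the list above (255.255.255.255) acts as a sentinel: when it is set, the gateway supplied by dhcpd is used instead. A minimal sketch of that resolution logic, assuming hypothetical names (`effective_gateway`, `dhcp_offered`) that are illustrative only and not part of any BCM API:

```python
# Hypothetical sketch of sentinel-based gateway resolution.
# Function and parameter names are illustrative, not BCM APIs.
from typing import Optional

# Special value from the release note: "use the gateway provided by dhcpd".
DHCP_GATEWAY_SENTINEL = "255.255.255.255"


def effective_gateway(configured: str, dhcp_offered: Optional[str]) -> Optional[str]:
    """Return the gateway a node would end up using."""
    if configured == DHCP_GATEWAY_SENTINEL:
        # Defer to whatever router the DHCP server offered (may be None).
        return dhcp_offered
    # Any other value is treated as an explicit gateway address.
    return configured
```

For example, `effective_gateway("255.255.255.255", "10.141.255.254")` yields the DHCP-provided address, while any other configured value is returned unchanged.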
Improvements
* Allow nodes to be automatically powered off or reset upon installer failure
* Allow devices to be identified by serial in DHCP
* Relaxed SSL checks when registering a new Cumulus switch via ZTP
* Improved CMDaemon startup speed in HA mode
* Prevent multiple identical failover group statuses
* Added a flag to allow changing a user home directory to an existing directory
* Added a flag that lets pythoncm.cluster run entity.commit without suffering from update race conditions
* Write chrony.conf instead of ntp.conf in node-installer on RHEL9
* Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with '+'
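
The last item above, removal of exclude list entries via snippets starting with '+', can be pictured roughly as follows. This is a simplified sketch under stated assumptions: entries are treated as bare paths, a snippet line of the form `+<entry>` removes a matching entry, and any other line adds one. The real exclude list syntax and merge rules may differ, and `apply_snippets` is a hypothetical helper, not a BCM function.

```python
# Simplified, hypothetical model of exclude list snippets.
# Assumption: "+<entry>" removes <entry>; any other line adds an entry.
from typing import Iterable, List


def apply_snippets(base_entries: Iterable[str], snippet_lines: Iterable[str]) -> List[str]:
    """Return the exclude list after applying snippet lines in order."""
    result = list(base_entries)
    for raw in snippet_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("+"):
            target = line[1:].strip()
            # Drop the matching entry so the path is no longer excluded.
            result = [entry for entry in result if entry != target]
        elif line not in result:
            result.append(line)
    return result
```

With a base list of `["/var/cache", "/tmp"]`, the snippet line `+ /var/cache` would leave only `/tmp` excluded in this model.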
Fixed Issues
* Fixed counting of nodes and accelerators towards the license limit
* Fixed service status in cmsh of a lite-node
* Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
* Store services added to a lite-node in the database
* Fixed cmsh imageupdate --pattern
== Workload Management ==
New Features
* Automatically configure non-MIG GPUs in Slurm when detected
* Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
* Added new package pyxis-sources to allow building pyxis in air-gapped environments
Improvements
* Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf
Fixed Issues
* Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
* Cleaned up database node entries of Slurm jobs that were requeued
* Fixed pyxis epilog failure when unpacked images are shared and the user does not specify a container name
* Install enroot dependencies on Ubuntu 20.04
== Container Engines ==
Improvements
* Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
* Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates
== Monitoring ==
New Features
* Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
* Added ManagedServicesOk health check to lite devices
Improvements
* Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
* Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
* Do not use linear interpolation for health check data, but rather the last known value
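
The difference between linear interpolation and last-known-value lookup, as in the final item above, can be illustrated with a small sketch. It assumes health check results are stored as time-sorted `(timestamp, value)` pairs with a numeric encoding (here 0.0 for pass and 1.0 for fail, which is an assumption for illustration). Both function names are hypothetical and do not reflect CMDaemon internals.

```python
# Illustrative comparison of two ways to read a value between samples.
# Assumptions: samples are sorted by timestamp; 0.0 = pass, 1.0 = fail.
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp, value)


def last_known(samples: List[Sample], t: float) -> Optional[float]:
    """Value of the most recent sample at or before time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None


def linear(samples: List[Sample], t: float) -> Optional[float]:
    """Linearly interpolated value at time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # t is before the first sample
    if i == len(samples):
        return samples[-1][1]  # t is at or after the last sample
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

For samples `[(0, 0.0), (10, 1.0)]` queried at `t = 5`, linear interpolation produces 0.5, which corresponds to neither pass nor fail, whereas the last known value is 0.0 (pass). This is one plausible reason a discrete health check state is better served by the last known value.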
Fixed Issues
* Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
* Fixed job-metrics in the base-view monitoring tree