Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.2-17

== General ==

- New Features

* Added CUDA 12.4 packages
* Added CUDA 12.5 packages
* Added cuda-driver-550 and cuda-fabric-manager-550 packages
* Added mlnx-ofed24.01 packages
* Added mlnx-ofed24.04 packages
* Added mlnx-ofed24.07 packages
* Added support for SUSE Linux Enterprise Server (SLES) 15 SP6

- Improvements

* For new Ubuntu 22.04 head node installations, use fixed port numbers for the NFS lockd, statd, and mountd daemons
* Include the gsp firmware with the cuda-driver* packages
* Updated cuDNN to 9.1.1
* Updated PBS Professional 2024 to 2024.1.1
* Updated Ubuntu 22.04 base distribution to 22.04.4
* Updated cm-nvhpc to 24.5
* Updated cm-openssl to 3.0.15
* Updated cuda-driver to 565.57.01
* Updated cuda-driver-legacy-470 to 470.239.06
* Updated cuda12.6-toolkit to 12.6 Update 2
* Updated enroot to 3.5.0
* Updated munge to 0.5.16
* Updated mlnx-ofed23.10 to 23.10-3.2.2.0
* Updated mlnx-ofed58 to 5.8-5.1.1.2
* Updated NCCL2 for CUDA 12.5
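
One way to confirm that the NFS lockd, statd, and mountd daemons are registered on fixed ports is to compare the output of "rpcinfo -p" against the expected values. The following is a minimal sketch, not part of the product: the port numbers in EXPECTED_PORTS are hypothetical placeholders (these notes do not state the values used by the installer), and the sample text only imitates the usual rpcinfo column layout. The daemons appear in rpcinfo under their RPC service names: "status" for statd and "nlockmgr" for lockd.

```python
# Hypothetical fixed ports, chosen for illustration only; substitute the
# values actually configured on the head node.
EXPECTED_PORTS = {"status": 4000, "nlockmgr": 4001, "mountd": 4002}


def parse_rpcinfo(output):
    """Map each RPC service name to the set of ports it is registered on."""
    ports = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have five columns: program, version, protocol, port, service.
        # The header row is skipped because its port column is not numeric.
        if len(fields) == 5 and fields[3].isdigit():
            ports.setdefault(fields[4], set()).add(int(fields[3]))
    return ports


def unexpected_ports(output, expected=EXPECTED_PORTS):
    """Return the services that are not registered on exactly the expected port."""
    seen = parse_rpcinfo(output)
    return {
        name: sorted(seen.get(name, set()))
        for name, port in expected.items()
        if seen.get(name, set()) != {port}
    }


# Sample text in the layout printed by "rpcinfo -p" (illustrative values).
SAMPLE = """\
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100024    1   udp   4000  status
    100024    1   tcp   4000  status
    100005    3   tcp   4002  mountd
    100021    4   tcp   4001  nlockmgr
"""

if __name__ == "__main__":
    print(unexpected_ports(SAMPLE))
```

On a live system the sample text could be replaced by the captured output of "rpcinfo -p"; an empty result means every listed daemon is on its expected port.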

== CMDaemon ==

- Improvements

* Added new advanced configuration option SoftwareImageDisableZFS to disable ZFS
* Added the cm-remove-orphaned-pending-job-info.py helper script, which removes job information from the CMDaemon database for WLM jobs that CMDaemon has cached as pending but that have since been removed from the WLM
* Prevent cm-chroot-sw-img from running on the passive head node or provisioning nodes unless using the "force" CLI option
* Improved speed of the entity, measurable, and parameter comparisons in triggers
* Improved speed of the monitoring triggers evaluation when regular expressions are used
* Optimized CMDaemon memory usage when a large number of jobs are running and job tracing is disabled
* Reduced load on the head node caused by many nodes changing state quickly
* Allow the option to use FQDN for the compute nodes with the global configuration option ShortHostname=0
* Allow the option to update the FrozenFile setting with the cm-manipulate-advanced-config.py helper script
* Allow the option to perform a periodic check if the head node IP has been changed on external DHCP renewal
* Allow the option to change the behavior of the monitoring drain action to not set a drain reason when draining the node(s)
* Allow the option to use negative matching such as "!resource!=category-name" in the monitoring comparison expressions

- Fixed Issues

* An issue where switch control scripts may not correctly separate the stdout and stderr output
* An issue with a shared_mutex lock that can cause CMDaemon to crash or lead to increased memory usage if two threads update the same data
* An issue where AccountingStorageTRES setting is not added to the Slurm configuration file
* An issue where request-license may try to ssh to the same head node on which it is running
* An issue with sample_ibmetrics.py not returning floating point values
* An issue with the interfaces healthcheck when interface speeds are defined with a unit
* An issue where the dhcpd range may not be equally distributed between the two head nodes in an HA setup
* An issue where stopping the CMDaemon service on compute nodes takes a long time when the active head node is down
* An issue where invalid SysInfoCollector entries with uniqueKey = 0 can be left in the CMDaemon database after an HA takeover
* An issue where incomplete HTTP headers can result in CMDaemon threads reading SSL and consuming memory
* An issue with setting the user/group ownership of static configuration files managed by a generic role
* An issue where executing MIG, BIOS, or DPU commands may not clear the "busy" flag if the commands time out
* An issue with clearing the memory when a large number of entities that have failing health checks are added and then removed
* An issue in gzip inflate that can cause an RPC to loop forever, consuming one CPU
* An issue with the monitoring trigger actuator when many samples are picked up at once or when using complex expressions
* An issue where CMDaemon may perform systemctl daemon-reload even when the Slurm service drop-in file has not changed
* An issue with parsing of requested CPUs setting for UGE jobs
* An issue in the drain action manager code that in some cases can lead to high CMDaemon memory usage
* An issue with determining the number of requested CPUs for multi-node Slurm jobs when storing the jobs information in CMDaemon
* An issue with hard-coded references to /sbin/arping which in some cases can prevent CMDaemon from using arping in the event of a failover
* An issue where the head nodes can provision each other with some ProvisioningRole configurations
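
The arping item above concerns a hard-coded /sbin/arping path. As a general illustration of avoiding that class of problem, a tool can be located at run time instead of through a fixed path. This is a minimal sketch under stated assumptions, not CMDaemon's actual code: find_tool and its fallback directory list are hypothetical names introduced here.

```python
import os
import shutil


def find_tool(name, extra_dirs=("/usr/sbin", "/sbin", "/usr/bin", "/bin")):
    """Return the full path to an executable, or None if it cannot be found.

    The current PATH is searched first; the extra directories are a fallback
    for services started with a minimal PATH.
    """
    found = shutil.which(name)
    if found:
        return found
    return shutil.which(name, path=os.pathsep.join(extra_dirs))


if __name__ == "__main__":
    arping = find_tool("arping")
    if arping is None:
        print("arping not found")
    else:
        print("using", arping)
```

Resolving the path this way tolerates distributions that install arping in /usr/sbin or /usr/bin rather than /sbin.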

== Node Installer ==

- Fixed Issues

* An issue with setting up bonded network interfaces on diskless nodes
* An issue where the auditd service may not be disabled by node-installer when SELinux is not enabled

== Cluster Tools ==

- Fixed Issues

* An issue with an infinite loop in request-remote-assistance with nohup
* An issue that prevents cm-diagnose from completing when single quotes are used in cmd.conf for dbuser/dbpass
* An issue where cm-mysql-sanitize.py, which is required by cm-diagnose, is not part of the cluster-tools package

== Head Node Installer ==

- Improvements

* Updated the STIG disk setup configurations

== Machine Learning ==

- Improvements

* Migrated WLM kernel templates to Jupyter Kernel Starter

- Fixed Issues

* An issue where WLM kernels may be unexpectedly restarted if one of the kernels fails to start

== cm-cluster-extension ==

- Fixed Issues

* An issue where 'germany' is incorrectly listed as an Azure region

== cm-kubernetes-setup ==

- Improvements

* Added support for Kubernetes 1.28
* The use of kube-rbac-proxy is now deprecated in the Jupyter operator and in the permissions manager in favor of using the internal kubebuilder mechanism
* Updated local path provisioner to version 0.0.29

- Fixed Issues

* An issue with handling older versions of the Kubernetes permission manager where not all API endpoints exist
* An issue in cm-kubernetes-setup --pull where the process would fail to complete if a pod was evicted due to disk pressure while pulling images

== cm-scale ==

- Fixed Issues

* An issue where, in some cases, the state of drained nodes is not saved correctly when the head node is restarted, which can prevent the Auto Scaler from considering the nodes as available
* An issue where the shutdown state from files may be used incorrectly

== cmsh ==

- Improvements

* Allow the option to override the default timeout for monitoring scripts when running samplenow with the --max-run-time option

- Fixed Issues

* An issue where cmsh may crash when showing a category after clearing the image property
* An issue where the cmsh monitoring trigger info command does not show grouped expressions

== jupyter ==

- New Features

* Restrict the access to Jupyter based on group memberships

- Improvements

* Migrated WLM kernel templates to Jupyter Kernel Starter
* Allow the option to install and configure VNC when setting up Jupyter

== pyxis-sources ==

- Improvements

* Updated pyxis to 0.20.0

== slurm ==

- New Features

* Added Slurm 24.05 packages
* Added pmix3 plugin to Slurm 23.02, 23.11 and 24.05 in addition to the already existing pmix4 plugin
* Set the ENROOT_MOUNT_HOME configuration option to "no" by default for new setups

- Improvements

* Updated Slurm 23.02 to 23.02.8
* Updated Slurm 23.11 to 23.11.10