Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.2-16
== General ==
- Improvements
* Added cuda-driver-535 package
* Added CUDA 12.3 packages
* Updated cuda-driver to 550.54.15
* The GSP firmware is now included with the cuda-driver package
* Updated cm-nvhpc to 23.11
* Updated cm-openssl to 3.0.13
* Updated mlnx-ofed58 to 5.8-4.1.5.0
* Updated mlnx-ofed23.10 to 23.10-1.1.9.0
* The mlnx-ofed packages' installation scripts will now pin down the kernel packages for Ubuntu when deploying MOFED
- Fixed Issues
* An issue with the runtime and PID path settings in the nvidia-persistenced service unit file from the cuda-driver-* packages
* An issue where head nodes could hang under heavy load on Ubuntu 22.04; the stack and nofile limits for the root user in cm-config-limits have been increased
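The stack and nofile limits in effect can be checked from a root session on the head node. The following is a minimal Python sketch that is not part of the cm-config-limits package; the helper names current_limits and fmt are illustrative only. It reports the limits of the process that runs it, and assumes nothing about the values that the package sets.

```python
import resource


def current_limits():
    """Return the soft and hard stack and nofile limits of this process."""
    names = {"stack": resource.RLIMIT_STACK, "nofile": resource.RLIMIT_NOFILE}
    limits = {}
    for name, rlimit in names.items():
        soft, hard = resource.getrlimit(rlimit)
        limits[name] = {"soft": soft, "hard": hard}
    return limits


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, pair in current_limits().items():
        print(f"{name}: soft={fmt(pair['soft'])} hard={fmt(pair['hard'])}")
```

Running the script as root in a fresh login session shows whether the updated limits have been picked up by new sessions.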
== CMDaemon ==
- Improvements
* Support for instantaneous MIG profiles on H100
* Redirect the output from cm-burn to tty1
* Added a periodic CMDaemon maintenance task to free the heap allocations that are not automatically freed, which reduces the memory usage
* Added an hourly cron job to clean up files left behind when the sample-ipmi script is terminated
- Fixed Issues
* An issue with the passive head node forwarding labeled entity information to the active head node, which prevented that information from being used in PromQL queries
* An issue with sorting the data passed to the PromQL engine, which can result in an "expanding series: closed SeriesSet" error message when running instant queries
* An issue with parsing the requested memory information of Slurm jobs
* An issue with the job wasted GPU calculation when using 1 out of 2 (or more) GPUs
* An issue where the exclude list snippets are not being cloned when cloning a software image
* An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface
* An issue where /etc/systemd/resolved.conf was not added to the imageupdate exclude list for compute nodes
* An issue with the Prometheus exporter when entities have recently been removed
* An issue where CMDaemon may not restart the Slurm services automatically when the configured number of CPUs changes for some nodes
* An issue where the cmsh call to create a certificate may return before the certificate is written
* An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes
* An issue with collecting GPU job metrics for containerized Pyxis jobs
* An issue where an imageupdate of the compute nodes can remove the Kubernetes controller-manager configuration and service files, preventing the (re)start of the service
== Head Node Installer ==
- Fixed Issues
* An issue with head node installations with Lmod where the DefaultModules.lua module file is not created by default, resulting in messages about an empty LMOD_SYSTEM_DEFAULT_MODULES environment variable
== cm-diagnose ==
- Improvements
* Include syslog in cm-diagnose
== cm-kubernetes-setup ==
- New Features
* Allow the option to set up Kubernetes v1.27
- Fixed Issues
* An issue with the interactive uninstall question in cm-kubernetes-setup when the Kubernetes API is not responsive
== cm-scale ==
- New Features
* A lack of vCPUs in AWS is now handled in the same way as a lack of capacity
== cm-wlm-setup ==
- Fixed Issues
* Allow the option to set up AGE 2023.1.1 (8.8.1) with cm-wlm-setup
== cmsh ==
- Improvements
* Added a new command, pidsgpus, to the cmsh WLM jobs mode, which lists the PIDs and GPUs used by a WLM job
- Fixed Issues
* An issue in cmsh user mode with the case-sensitive comparison of profile names
* An issue with entering the SlurmJobQueueAccessList submode of the SlurmSubmit role when the role is assigned directly to a node
== jupyter ==
- Improvements
* Prevent the creation of duplicated pods or services that could, in some cases, result from a race condition in the Kubernetes API
== openpbs23.06 ==
- Improvements
* Added OpenPBS 23.06 packages
== pyxis-sources ==
- New Features
* Updated pyxis sources package to 0.17.0
== slurm23.11 ==
- New Features
* Added Slurm 23.11 packages. NOTE: Both the cm-setup and cmdaemon packages need to be updated to their most recent versions to support Slurm 23.11 setup