Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.0-16

== General ==

- Improvements

* Added CUDA 11.4 packages
* Added Mellanox OFED 5.4 packages
* Added mlnx-ofed53 package
* An issue with Grafana, which since v8.0 defaults to POST instead of GET for requests
* Update mlnx-ofed52 to version 5.2-2.2.3.0
* Update mlnx-ofed49 to version 4.9-3.1.5.0

== cmdaemon ==

- Improvements

* In some cases, an issue with the directory permissions for Kubernetes certificates
* Ability to process all Jobs in the job-info-management.py maintenance script
* Improved Bright head node HA failover handling with respect to Kubernetes

- Fixed Issues

* Parallel versions of RPC missing in the API docs
* An issue with showing metrics owned by a category in the monitoring tree
* An issue with ssh access to non-master job nodes when usernodelogin=onlywhenjob for PBS, UGE, and LSF
* sshd banner can cause failure of the ssh2node health check
* In some cases, an issue with sorting the WLM server nodes by their rank
* In some cases, failure to collect partition metrics after cmd has been running for a long time
* A BMC interface can be marked as an external interface if it is the only interface on the external network in an edge setup
* Ensure Bright Kubernetes labels are set correctly for category and rack names that contain spaces
* Improved power operation result handling when a cloud node AWS instance has been terminated outside of cmdaemon
* Allow spaces in Slurm queue {Allow,Deny}{Groups,Accounts} parameters
* In some cases, a bad nullptr check can cause the configuration dumper to crash
* In some cases, cloning an image with revisions in pythoncm duplicates too much data, which can result in an image that cannot be cleanly removed
* Added nullptr checks for image-remove, to prevent possible crashes on incorrectly configured revision information
* Negative intervals passed to PromQL can cause cmdaemon to crash
* An issue with hwloc integration on Ubuntu
* An issue with time conversion of the additional node information active timeout
* Call 'nscd -i hosts' after hostname of a device is changed to invalidate the cache
* An issue with cmdaemon keeping track of canceled job arrays that have never been scheduled with Slurm
* In some cases, the ssh configuration for the edge nodes is not generated until the edge site is committed again
* Improved detection of, and waiting for, RAID arrays to assemble

== node-installer ==

- Fixed Issues

* Update the standalone script to work with RHEL/CentOS 8.4
* Exclude dracut network modules from being added to the original initrd image for RHEL
* In some cases, an issue with detecting InfiniBand boot device names
* Support for boot-over-IB with grub

== Bright View ==

- Improvements

* Do not hide the grid columns when dragged out of the grid

- Fixed Issues

* An issue with the redirect to the active head node when connected to the passive head node

== cm-kubernetes-setup ==

- New Features

* Added Kubernetes state metrics addon for additional metrics

== cm-openssl ==

- New Features

* Upgrade cm-openssl packages to 1.1.1l (CVE-2021-3711)

== cm-scale ==

- Fixed Issues

* The Auto Scaler now honors the Slurm job start time, when a job is submitted with a begin time in the future

== cm-setup ==

- Improvements

* An issue with adding shorewall rules to both head nodes in the case of head node HA

== cm-wlm-setup ==

- Fixed Issues

* An issue with the mail program path in smail utility on Ubuntu
* An issue with configuring Slurm slots for GPU nodes

== cmsh ==

- Fixed Issues

* Escape '>' in cmsh tab completion
* An issue with parsing selection/pattern options for the grabimage command

== ml ==

- New Features

* Updated cm-opencv4-* packages to v4.5.3
* Updated cm-fastai2-* packages to v2.4.1
* Updated cm-pytorch-* packages to v1.9.0
* Updated cm-horovod-* packages to v0.22.1
* Updated cm-gpytorch-* packages to v1.5.0
* Updated cm-xgboost-* packages to v1.4.2
* Extended support for ML packages to CUDA 11.2

- Improvements

* Deprecate cm-openmpi-geib-* packages
* Extended support of B4DS addon to NVIDIA DGXs

== pbspro2020 ==

- New Features

* Upgrade PBS Pro 2020 to 2020.1.4

== pbspro2021 ==

- New Features

* Introduce pbspro2021 packages

== pythoncm ==

- Improvements

* New pythoncm example to calculate the percentage of time a node is UP
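The kind of calculation such an example performs can be sketched in plain Python. This is not the shipped pythoncm example and does not use the pythoncm API. The function name up_percentage, the (timestamp, state) input format, and the "UP"/"DOWN" state strings are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def up_percentage(transitions, start, end):
    """Return the percentage of [start, end) that a node spent in state "UP".

    transitions is a hypothetical input format: a list of (timestamp, state)
    tuples sorted by timestamp, where each state holds until the next entry.
    """
    total = (end - start).total_seconds()
    if total <= 0:
        raise ValueError("end must be later than start")

    up_seconds = 0.0
    for i, (ts, state) in enumerate(transitions):
        # A state lasts until the next transition, or until the window ends.
        next_ts = transitions[i + 1][0] if i + 1 < len(transitions) else end
        # Clip the segment to the requested window.
        seg_start = max(ts, start)
        seg_end = min(next_ts, end)
        if state == "UP" and seg_end > seg_start:
            up_seconds += (seg_end - seg_start).total_seconds()

    return 100.0 * up_seconds / total


if __name__ == "__main__":
    t0 = datetime(2021, 9, 1)
    history = [
        (t0, "UP"),
        (t0 + timedelta(hours=6), "DOWN"),
        (t0 + timedelta(hours=12), "UP"),
    ]
    print(up_percentage(history, t0, t0 + timedelta(hours=24)))
```

In practice the state history would come from the cluster's monitoring data rather than a hand-built list.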

- Fixed Issues

* An issue with the pythoncm parallel.version_information RPC call

== slurm ==

- New Features

* slurm20 and slurm20.11 are now built with JWT

== slurm20.11 ==

- Improvements

* Update slurm20.11 packages to 20.11.8

== user portal ==

- Fixed Issues

* An issue where the user portal may use timestamps in milliseconds instead of seconds for API calls
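The unit mismatch behind this class of bug can be illustrated with a short Python sketch. It does not reflect the user portal's actual code or API. The function name to_epoch_seconds and the magnitude threshold are assumptions chosen for illustration.

```python
# Assumption: epoch values at or above this magnitude are in milliseconds.
# 1e11 seconds is far in the future (beyond year 5000), while 1e11
# milliseconds is in 1973, so present-day timestamps are unambiguous.
MILLISECONDS_THRESHOLD = 10**11


def to_epoch_seconds(timestamp):
    """Normalize an integer epoch timestamp to whole seconds."""
    if timestamp >= MILLISECONDS_THRESHOLD:
        # Looks like milliseconds: drop the sub-second part.
        return timestamp // 1000
    return timestamp


if __name__ == "__main__":
    print(to_epoch_seconds(1630454400000))  # millisecond input
    print(to_epoch_seconds(1630454400))     # second input, returned unchanged
```

A heuristic like this is only a safety net. An API that documents a single unit and converts once at its boundary avoids the ambiguity altogether.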