Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.0-18
== General ==
- New Features
* Update cm-docker to v20.10.12
* Update golang-go-latest to 1.16.12
* Added CUDA 11.6 packages
* Added CUDA 11.5 packages, updated cuda-driver to version 495.29.05
* Added mlnx-ofed55 package
* mlnx-ofed49: updated to version 4.9-4.1.7.0
* mlnx-ofed54: updated to version 5.4-3.1.0.0
== cmdaemon ==
- Improvements
* Allow extra lines to be added to the ifcfg configuration file from the network interface definition
* Determine if a user has a running job using cmutil PS process tracking
* Validate ramdisk creation is enabled for edge directors
* Assigning a director role to a category or an overlay is no longer allowed
* Add rsync excludes for named (DNS) files for edge and cloud directors
* In cmsh and Bright View, show the PBS job array tasks with their final stdout and stderr paths
* Increase the hashed_secret MySQL table size
* Clarified the error message displayed if a user does not have running jobs and is not allowed to log in to a node
* New advanced configuration option "GenderAll" added to change the "all" group in /etc/genders
* New advanced configuration option "PostfixAdditionalMyDestinationEntries" added to add custom entries to the main.cf mydestination line
* Upgrade PromQL
- Fixed Issues
* Exclude /proc and /sys from the /cm/node-installer rsync
* An issue with regenerating user certificates when a new license is installed without re-using the private key
* In the case of head node HA, global configs can get stored and sent to nodes multiple times
* In some cases, job detection can stop working due to kernel inotify issues
* Optimized the power operation tracker to speed up large-scale operations
* An issue with executing monitoring actions on incorrectly filtered-out data
* An issue with removing image-update provisioning requests when the node-installer is running, which can cause an unexpected image update right after cmdaemon has started while the mount scripts may still be busy
* ntp health check can time out on cloud nodes because the director IP is not inside the environment
* An issue with handling trailing '/' in the mounts health check
* An issue where cmsh promql table can show no data for some queries
* DellSettings serviceTab and modelName are read-only fields
* The node status is not changed to DOWN if the node is powered off while it is BOOTING
* cmjob is not able to detect the WLM scheduler when using Lmod modules
* Allow an excludelist to be specified for the provisioning of fsparts, such as for the provisioning of /cm/shared to directors
* An issue with the generated DNS zone files when the cluster is extended to two or more cloud regions, which in some cases can prevent the named service from starting
== node-installer ==
- Fixed Issues
* In some cases, an issue with umount while creating a ramdisk
== Bright View ==
- Fixed Issues
* Increase the software image revision property size
== cm-scale ==
- Improvements
* An issue with detecting the job's GPU requests for Slurm version >= 21
* Allow stopping or terminating nodes that take too long to start
- Fixed Issues
* An issue with handling CONFIGURING Slurm jobs' state
* An issue with handling Slurm job array tasks' pending reasons and dependencies
* An issue with starting nodes for PBS jobs with memory requirements not defined in bytes
* In some cases, an issue with using a generic tracker; the generic tracker example has also been updated
== cm-wlm-setup ==
- Fixed Issues
* An issue with setting up a second UGE instance when the first one is already set up on the head node(s) and /cm/shared is on NFS
== cmsh ==
- Fixed Issues
* An issue where cmsh -q does not quit on a format command failure
* An issue with the autoscaler engine name lookup in cmsh
== cmsub/cmjob ==
- Fixed Issues
* An issue with pbspro job array submission and handling of the job name
== jupyter ==
- Improvements
* Updated Jupyter's Node.js to v13.14.0 to fix 2-minute timeouts for kernel submission
== ml ==
- New Features
* Introduce support for Machine Learning packages on py39/gcc9
* Introduce ML package cm-cudnn8.2 for CUDA 10.2 and CUDA 11.4
* Updated cm-chainer-* packages to v7.8.0
* Updated cm-cub-cuda11.2 packages to v1.14.0
* Updated cm-fastai2-* packages to v2.5.1
* Updated cm-gpytorch-* packages to v1.5.1
* Updated cm-nccl2-* packages to v2.11.4
* Updated cm-opencv4-* packages to v4.5.4
* Updated cm-pytorch-* packages to v1.9.1
* Updated cm-tensorflow2-* packages to v2.5.2
* Updated cm-tensorflow-* packages to v2.7.0
* Updated cm-tensorrt-* packages to v8.0.3
* Updated cm-xgboost-* packages to v1.5.0
- Changes
* Stopped upgrading ML packages cm-dynet-*
* Stopped upgrading the machine learning packages for sles12
* Stopped upgrading the machine learning packages for cuda10.2
== slurm ==
- Fixed Issues
* In some cases, incorrect permissions on the slurmdbd.conf file
== slurm21.08 ==
- Improvements
* Introduce Slurm 21.08 packages
- Fixed Issues
* NVIDIA MIG autodetection in Slurm 21.08