Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.0-15
== General ==
- Fixed Issues
* Critical cmdaemon security issue
* Critical NVIDIA DCGM security issue
- Improvements
* Updated cuda-driver to version 460.73.01
* Added CUDA 11.3 packages
* Updated cuda11.2 to version 11.2.2
* Updated Kubernetes to version 1.18.15
* Changed the default Bright repository definitions on Ubuntu from using mirrorlists to using the us-east server, to avoid an issue where apt may hang while trying to access the servers when using mirrors
* The administrator can select a geographically-close Bright Ubuntu packages server from the list in the cm.list and cm-ml.list files in /etc/apt/sources.list.d/
== cmdaemon ==
- New Features
* New adv. config. option WlmDefaultDrainMessage, which allows the default drain message to be changed
* Ability to define (mode,type) conditional extra rsync arguments
* Allow specifying a per-network security group ID for extra network interfaces of cloud nodes in AWS
- Improvements
* Added extra checks for a rare crash in head nodes IPs RPC
* Allow the occupation rate to be sampled for groupings other than the partition
* New adv. config. option RsyncAlwaysExclude, which adds a global exclude list to all rsyncs
* Improved power scheduler, now using less CPU when more than 128 operations are done in parallel
* Added cmd.service hooks so that starts, stops, or crashes can be reported by means other than email
- Fixed Issues
* In some cases, namerange with maximal_groups=1 segfaults
* An issue with the megaraid healthcheck when megacli returns less information than usual
* In some cases, corrupted monitoring data could lead to memory hogs and a crash
* Improved handling of InsufficientInstanceCapacity when powering on multiple AWS cloud compute nodes
* In some cases, an issue with generating a hostlist expression in [] format
* Exclude loop devices from disk metrics with newer psutil.disk_io_counters
* New global conf. option KeepOutsideSectionFSTabContent to allow the node-installer to keep lines outside of the auto-generated section of /etc/fstab of the compute nodes
* An issue with adding Slurm NodeName custom parameters located in the slurmclient role's nodecustomizations
* Rare crash when performing PDU power operations
* Power delay for a 2nd node is not adhered to
* An issue with Unicode decode in the rogueprocess health check
* Make sure certain Kubernetes resource names generated for the users adhere to the DNS naming rules
* Make sure the correct home directory for users is used in the PodSecurityPolicy definition
* An issue in the external-user-cert script with determining the home path for users
* In some cases, an issue with submitting pbspro jobs from submission-only hosts
* In the case of pbspro, allow job submission from non-server nodes by default (by setting flatuid=true, which can be disabled if required)
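
Note on the Kubernetes resource naming fix above: many Kubernetes resource names must be valid RFC 1123 DNS labels (lowercase alphanumerics and '-', starting and ending with an alphanumeric, at most 63 characters). The following is only a minimal sketch of how a user name could be mapped to such a label. The helper name to_dns_label and its replacement strategy are assumptions made for illustration and do not describe how cmdaemon generates names.

```python
import re

MAX_LABEL_LEN = 63  # RFC 1123 label length limit


def to_dns_label(name: str) -> str:
    """Hypothetical helper: derive an RFC 1123 DNS label from a user name."""
    # Labels are lowercase only.
    label = name.lower()
    # Replace every run of disallowed characters with a single hyphen.
    label = re.sub(r"[^a-z0-9-]+", "-", label)
    # Truncate first, then strip hyphens, so the result still starts
    # and ends with an alphanumeric character after truncation.
    label = label[:MAX_LABEL_LEN].strip("-")
    if not label:
        raise ValueError(f"cannot derive a DNS label from {name!r}")
    return label
```

A mapping like this is lossy: two different user names can collapse to the same label, so a real implementation would also need a way to keep generated names unique.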
== node-installer ==
- Improvements
* Improved detection and wait for the raid arrays to assemble
- Fixed Issues
* Added BOOTIF information to the kernel cmdline when booting with grub, which improves the boot interface detection by the node-installer
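
Note on the BOOTIF fix above: a minimal sketch of how a boot interface MAC address could be read from a kernel command line. It assumes the PXELINUX-style format BOOTIF=01-aa-bb-cc-dd-ee-ff, where the leading 01 is the Ethernet hardware type. The exact format that Bright passes through grub, and the function name bootif_mac, are assumptions made for illustration. This is not the node-installer's own code.

```python
import re
from typing import Optional

# Optional "01-" hardware-type prefix, then six hex octets separated by '-' or ':'.
BOOTIF_RE = re.compile(
    r"(?:^|\s)BOOTIF=(?:01-)?((?:[0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2})(?=\s|$)"
)


def bootif_mac(cmdline: str) -> Optional[str]:
    """Return the BOOTIF MAC as lowercase colon-separated hex, or None if absent."""
    match = BOOTIF_RE.search(cmdline)
    if match is None:
        return None
    return match.group(1).replace("-", ":").lower()
```

On a running node the input would typically be the contents of /proc/cmdline, and the returned MAC could then be compared against the addresses of the local network interfaces.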
== cluster-tools ==
- Fixed Issues
* An issue with Slurm DB storage user credentials when Slurm is set up after head node HA has been set up
== Bright View ==
- Fixed Issues
* Allow the Bright View Kubernetes wizard to assign an etcd role to the head nodes
== cm-lite-daemon ==
- Fixed Issues
* An issue when using websocket 0.59.0
== cm-scale ==
- Fixed Issues
* In some cases, an issue with handling UGE jobs' restarted states
* An issue when mix_locations=false
== cm-tftpboot ==
- Fixed Issues
* Added BOOTIF information to the kernel cmdline when booting with grub, which improves the boot interface detection by the node-installer
== cmsh ==
- New Features
* New --type selector for the cmsh fspart foreach command
- Improvements
* New --class option to samplenow to select metrics / healthchecks
* Monitoring counters are displayed for the wrong node in cmsh
- Fixed Issues
* An issue with setting the postfix relayhost port number in the main.cf configuration file
== cuda-dcgm ==
- Fixed Issues
* An issue with DCGM python bindings linking with DCGM libraries
== head node installer client ==
- Fixed Issues
* In some cases, the generated disksetup XML labels may not be unique for some RAID layouts for the head node installation
== jupyter ==
- Fixed Issues
* An issue with Jupyter default Python 3 kernel paths handling during upgrades
== ml ==
- New Features
* Introduced ML package cm-cub for CUDA 11.2
* Introduced ML package cm-dynet for CUDA 11.2
* Introduced ML package cm-fastai2 for CUDA 11.2
* Introduced ML package cm-gpytorch for CUDA 11.2
* Introduced ML package cm-pytorch-extra for CUDA 10.2 and CUDA 11.2
* Introduced ML package cm-pytorch for CUDA 11.2
* Updated cm-cub-* packages to v1.12.0
* Updated cm-cutensor-* packages to v1.3.0
* Updated cm-fastai2-* packages to v2.3.1
* Updated cm-gpytorch-* packages to v1.4.1
* Updated cm-mxnet-* packages to v1.8.0
* Updated cm-nccl2-* packages to v2.9.8
* Updated cm-opencv3-* packages to v3.4.14
* Updated cm-opencv4-* packages to v4.5.2
* Updated cm-pytorch-* packages to v1.8.1 and moved extra dependencies to cm-pytorch-extra-* packages (e.g. torchvision, torchtext)
* Updated cm-tensorrt-* packages to v7.2.3.4 (cuDNN 8.1)
* Updated cm-xgboost-* packages to v1.4.1
- Improvements
* Stopped upgrading PyTorch and its related ML packages for sles12
* Unified the git packages under cm-git
== pbspro2020 ==
- New Features
* Upgraded PBS Pro 2020 to 2020.1.3