Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.1-7
== General ==
- Fixed Issues
* Critical cmdaemon security issue
* Critical NVIDIA DCGM security issue
- Improvements
* Added CUDA 11.3 packages
* Updated mlnx-ofed49 to version 4.9-3.1.5.0
* Updated mlnx-ofed52 to version 5.2-2.2.3.0
* Updated cuda-driver to version 460.73.01
* Updated cuda11.2 to version 11.2.2
* Updated cm-nvidia-container-toolkit to version 3.5.0
* In some cases, RHEL7u6 Dell kmod packages were added to RHEL7u8/7u9 ISOs
* Changed the default Bright repository definitions on Ubuntu from using mirror lists to using the us-east server, to avoid an issue where apt may hang while trying to access the servers when using mirrors
* The administrator can select a geographically-close Bright Ubuntu packages server from the list in the cm.list and cm-ml.list files in /etc/apt/sources.list.d/
== cmdaemon ==
- New Features
* New advanced configuration option WlmDefaultDrainMessage, which allows the default drain message to be changed
* Ability to define (mode,type) conditional extra rsync arguments
* New configurable job_metadata_last_change_timestamp metric
* New metrics UsedFiles and FreeFiles for each mount point
* New contiguous memory metric
* Allow a per-network security group ID to be specified for extra network interfaces of cloud nodes in AWS
* Added extra checks for a rare crash in head nodes IPs RPC
* Allow the occupation rate to be sampled for groupings other than the partition
* New advanced configuration option RsyncAlwaysExclude, which allows a global exclude list to be added to all rsyncs
* Improved power scheduler, now using less CPU when more than 128 operations are done in parallel
* Added cmd.service hooks so that starts, stops, or crashes can be reported by means other than email
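
As a rough illustration of what per-mount-point file counts look like, the sketch below reads inode totals with Python's os.statvfs. It assumes that UsedFiles and FreeFiles correspond to used and free inodes, which these notes do not state. The helper name file_counts is hypothetical and is not part of cmdaemon.

```python
import os


def file_counts(mount_point):
    """Return (used_files, free_files) for the filesystem at mount_point.

    Assumption: "files" here means inodes, as reported by statvfs.
    """
    st = os.statvfs(mount_point)
    # f_files is the total number of inodes, f_ffree the number still free.
    free_files = st.f_ffree
    used_files = st.f_files - st.f_ffree
    return used_files, free_files


if __name__ == "__main__":
    used, free = file_counts("/")
    print(f"/: used inodes {used}, free inodes {free}")
```

Some filesystems report zero for both values, so a consumer of such numbers may want to treat a zero total as "not available".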
- Fixed Issues
* Possible cmdaemon crash when setting head node rack positions (different racks)
* In some cases, namerange with maximal_groups=1 segfaults
* An issue with time conversion of the additional node information active timeout
* Call 'nscd -i hosts' after the hostname of a device is changed, to invalidate the cache
* An issue with cmdaemon keeping track of canceled job arrays that have never been scheduled with Slurm
* An issue with the megaraid healthcheck when megacli returns less information than usual
* In some cases, the ssh configuration for the edge nodes is not generated until the edge site is committed again
* In some cases, corrupted monitoring data could lead to excessive memory use and a crash
* In some cases, an issue with resolv.conf on cloud nodes with the head node in the cloud
* In some cases, an issue with generating a hostlist expression in [] format
* Exclude loop devices from disk metrics with newer psutil.disk_io_counters
* New global configuration option KeepOutsideSectionFSTabContent to allow the node-installer to keep lines outside of the auto-generated section of /etc/fstab of the compute nodes
* An issue with adding Slurm NodeName custom parameters located in the slurmclient role's nodecustomizations
* An issue with LSF jobs' requested nodes parsing
* Make sure certain Kubernetes resource names generated for the users adhere to the DNS naming rules
* Make sure the correct home directory for users is used in the PodSecurityPolicy definition
* Improved detection and wait for the raid arrays to assemble
* In some cases, an issue with submitting pbspro jobs from submission-only hosts
* For pbspro, allow job submission from non-server nodes by default (by setting flatuid=true, which can be disabled if required)
* In some cases, an issue with LSF submit host configuration
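
For readers unfamiliar with the [] hostlist format mentioned in the list above, the following is a minimal sketch of how a set of node names can be folded into a bracketed range expression. It is not cmdaemon's implementation; the function compress_hostlist and the node names are hypothetical and used only for illustration.

```python
import re
from collections import defaultdict


def compress_hostlist(hostnames):
    """Fold names such as node001, node002 into node[001-002]."""
    groups = defaultdict(set)  # (prefix, digit width) -> numeric suffixes
    plain = []                 # names without a trailing number
    for name in hostnames:
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match:
            prefix, digits = match.groups()
            groups[(prefix, len(digits))].add(int(digits))
        else:
            plain.append(name)

    parts = sorted(set(plain))
    for (prefix, width), numbers in sorted(groups.items()):
        nums = sorted(numbers)
        if len(nums) == 1:
            # A single host needs no brackets.
            parts.append(f"{prefix}{nums[0]:0{width}d}")
            continue
        # Collect runs of consecutive numbers as (first, last) pairs.
        ranges = []
        start = prev = nums[0]
        for n in nums[1:]:
            if n != prev + 1:
                ranges.append((start, prev))
                start = n
            prev = n
        ranges.append((start, prev))
        body = ",".join(
            f"{a:0{width}d}" if a == b else f"{a:0{width}d}-{b:0{width}d}"
            for a, b in ranges
        )
        parts.append(f"{prefix}[{body}]")
    return ",".join(parts)


if __name__ == "__main__":
    names = ["node001", "node002", "node003", "node005", "gpu01"]
    print(compress_hostlist(names))  # gpu01,node[001-003,005]
```

Names are grouped by prefix and zero-padding width, so node9 and node10 are kept as separate entries rather than merged into one range.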
== node-installer ==
- Fixed Issues
* Support for boot-over-IB with grub
== Bright View ==
- Fixed Issues
* Allow the Bright View Kubernetes wizard to assign an etcd role to the head nodes
* The occupation rate chart in the cluster dashboard now shows the occupation rate over the past two hours by default
== cm-kubernetes-setup ==
- Fixed Issues
* Make sure Kubernetes removal correctly cleans up the opened ports in Shorewall
== cm-lite-daemon ==
- Fixed Issues
* An issue when using websocket 0.59.0
== cm-scale ==
- Fixed Issues
* In some cases, an issue with handling UGE jobs' restarted states
== cm-wlm-setup ==
- Fixed Issues
* An issue with adding GPU nodes to the new GPU configuration overlay
* An issue with configuring Slurm slots for GPU nodes
== cmsh ==
- New Features
* New --type selector for the cmsh fspart foreach command
* New --class option to samplenow to select metrics / healthchecks
- Fixed Issues
* Possible cmsh device clone crash when arguments are supplied
* Monitoring counters are displayed for the wrong node in cmsh
== head node installer client ==
- Fixed Issues
* In some cases, the generated disksetup XML labels may not be unique for some RAID layouts for the head node installation
== jupyter ==
- Fixed Issues
* An issue with Jupyter default Python 3 kernel paths handling during upgrades
== ml ==
- New Features
* Introduced ML package cm-dynet for CUDA 11.2
* Introduced ML package cm-fastai2 for CUDA 11.2
* Introduced ML package cm-gpytorch for CUDA 11.2
* Introduced ML package cm-pytorch-extra for CUDA 10.2 and CUDA 11.2
* Introduced ML package cm-pytorch for CUDA 11.2
* Introduced ML packages cm-cub-* for ubuntu2004
* Introduced ML packages cm-cudnn8.1-* for ubuntu2004
* Introduced ML package cm-cub for CUDA 11.2
* Updated cm-cub-* packages to v1.12.0
* Updated cm-cutensor-* packages to v1.3.0
* Updated cm-fastai2-* packages to v2.3.1
* Updated cm-gpytorch-* packages to v1.4.1
* Updated cm-mxnet-* packages to v1.8.0
* Updated cm-nccl2-* packages to v2.9.8
* Updated cm-opencv3-* packages to v3.4.14
* Updated cm-opencv4-* packages to v4.5.2
* Updated cm-pytorch-* packages to v1.8.1 and moved extra dependencies to cm-pytorch-extra-* packages (e.g. torchvision, torchtext)
* Updated cm-tensorrt-* packages to v7.2.3.4 (cuDNN 8.1)
* Updated cm-xgboost-* packages to v1.4.1
- Improvements
* Stopped upgrading PyTorch and its related ML packages for sles12
* Unified the git packages under cm-git
== pbspro2020 ==
- New Features
* Upgraded PBS Pro 2020 to 2020.1.3
== pbspro2021 ==
- New Features
* Introduced pbspro2021 packages
== pythoncm ==
- Improvements
* New pythoncm example to calculate the percentage of time a node is UP
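
The arithmetic behind such an UP-time percentage can be sketched as follows. This sketch does not use the pythoncm API and is not the shipped example; the function percent_up, the (timestamp, state) input shape, and the state strings are assumptions made for illustration.

```python
def percent_up(transitions, start, end):
    """Percentage of the window [start, end) during which the node was UP.

    transitions: list of (timestamp, state) pairs sorted by timestamp,
    where each state holds until the next transition (hypothetical format).
    """
    if end <= start:
        raise ValueError("end must be later than start")
    up_seconds = 0.0
    for i, (ts, state) in enumerate(transitions):
        # A state lasts until the next transition, or until the window ends.
        next_ts = transitions[i + 1][0] if i + 1 < len(transitions) else end
        # Clip the interval to the reporting window.
        lo, hi = max(ts, start), min(next_ts, end)
        if state == "UP" and hi > lo:
            up_seconds += hi - lo
    return 100.0 * up_seconds / (end - start)


if __name__ == "__main__":
    history = [(0, "UP"), (600, "DOWN"), (900, "UP")]
    print(percent_up(history, start=0, end=1200))  # 75.0
```

Time before the first recorded transition is counted as not UP, which is a deliberate simplification.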