Base Command Manager / Bright Cluster Manager Release Notes

Release notes for Bright 9.0-8

== General ==

- New Features

* Added support for Kubernetes in AWS

- Improvements

* Improved parallel executor bookkeeping so that executors do not time out after a day
* Updated cuda-driver to version 440.95.01

- Fixed Issues

* Added support for cm-openssl on RHEL 8.2
* Kubernetes metrics server was not providing any metrics for nodes

== cmdaemon ==

- New Features

* Compress large fields in the DB
* Added indentation in the JSON audit/event logger

- Improvements

* Open files in /proc as read-only
* Edge director /etc/hosts was missing spaces between some host definitions
* Ping the edge director via IP and hostname to make sure SSH keys are correct
* Monitor the cuda-dcgm service from cmdaemon if the node has NVIDIA GPUs
* Do not mark nodes as restart required if they are not UP or pingable

- Fixed Issues

* Edge hash secrets did not always get saved in the DB
* Added WLM module files to default exclude lists
* Core dump if status information was requested just after cmdaemon started
* Escape '#' in passwords in the freeipmi config file
* Monitoring reinitialization did not work for edge nodes
* Unable to update a user to set groupID = 0
* PBS Pro drain overview did not work if pbsnodes -a did not display a queue
* Data science nodes were not displayed in license info
* Automatic edge secret did not work with multi-OS
* Added a cmsh filemode display format, and use the filemask for the file in generic role configurations
* An issue with parsing commas in the OpenSSL subject fields of an existing license in request-license
* Ldapserver was not added to the SAN of the slapd certificate on directors
* Automatic fsexport disabled flag was not updated
* Jobqueue property MaxTime was not set correctly in slurm.conf (see the example after this list)
* Project manager did not warn about wrong settings
* GPU sampler was not reinitialized after DCGM was connected
* An issue with PBS Pro job array metric collection
* An issue with cm-component-certificate
* Nodes remained up after changing the head node certificate but were not allowed to communicate with the head node
* Improved default PromQL for sums of memory
* Rare deadlock when updating a device and a partition at the same time
* Slurmctld was being restarted when cmdaemon was restarted
* Prometheus data was not marked as stale after jobs ended
* Slurm queue assignment failed when the queue name included uppercase characters
* Account project manager rules were not used for Prometheus queries
* PBS Pro node properties were not updated in qmgr
* Accelerator count was not updated in the DB when a different GPU card was inserted into the node
* An issue when removing a user with a Kubernetes role binding; an event now hints when a deleted user still has role bindings in Kubernetes
* User portal showed jobs as running when they had already finished
* An issue with HA where LSF_MASTER_LIST was being reordered on a change of the active head node
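
For the MaxTime fix above, a minimal sketch of the relevant slurm.conf line; the partition name, node range, and time limit are illustrative values, not output taken from cmdaemon:

    PartitionName=defq Nodes=node[001-004] Default=YES MaxTime=01:00:00 State=UP

Here MaxTime is the per-queue wall-clock limit that cmdaemon writes into slurm.conf from the jobqueue's MaxTime property.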

== node-installer ==

- Fixed Issues

* The IP of a bond under a VLAN was not brought up (see the sketch below)
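
"Bond under a VLAN" means a VLAN interface stacked on top of a bond. A minimal sketch in Debian /etc/network/interfaces style, with illustrative device names and addresses:

    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1          # illustrative slave NICs
        bond-mode active-backup

    auto bond0.100
    iface bond0.100 inet static        # VLAN 100 stacked on the bond
        address 10.141.16.10           # illustrative address
        netmask 255.255.255.0
        vlan-raw-device bond0

The fix ensures the IP on the VLAN interface (bond0.100 here) is brought up by the node-installer.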

== cluster-tools ==

- Fixed Issues

* An issue with configuring the failover network with cmha-setup
* An issue on Ubuntu 16.04 with Kubernetes CoreDNS and /etc/resolv.conf
* An issue when removing a user with a Kubernetes role binding; an event now hints when a deleted user still has role bindings in Kubernetes

== Bright View ==

- Improvements

* Added "Sync Info" operation for software images, equivalent of updateprovisioners in cmsh

- Fixed Issues

* Additional hostnames for node interfaces were not saved
* An issue with timestamp formatting in the User edit form
* An issue with disk template selection for the category entity
* Window remained open when an entity was deleted or reverted
* Jobs list was not automatically refreshed

== bright-installer ==

- Fixed Issues

* Rename "IP address" to "Base IP address" for external network form

== cm-kubernetes-setup ==

- Improvements

* Added an ingress that exposes the Kubernetes dashboard via path-based routing on new installations (see the sketch below)
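
A minimal sketch of what such a path-based ingress can look like; the name, namespace, path, and port are assumptions for illustration, not the exact manifest generated by cm-kubernetes-setup:

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: dashboard-path-ingress       # illustrative name
      namespace: kubernetes-dashboard    # assumed namespace
    spec:
      rules:
      - http:
          paths:
          - path: /dashboard             # path-based routing; the actual path is an assumption
            backend:
              serviceName: kubernetes-dashboard
              servicePort: 443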

- Fixed Issues

* Installation was allowed to continue with nodes being down
* Missing GPU info was not handled correctly
* Docker registry certificates were not distributed to cloud nodes
* The case where system information was not available due to nodes being down was not handled correctly
* An issue regarding internal network detection when deploying Kubernetes on cluster-extension nodes

== cm-wlm-setup ==

- Fixed Issues

* An issue with munge.key synchronization to edge sites during Slurm setup

== cmha-setup ==

- Fixed Issues

* fuser path setting for Ubuntu and SLES in the dasumount.sh script

== cmsh ==

- New Features

* Added a new cmsh option to wait for the provisioning status to be completed

- Improvements

* Added an option to trigger all fsparts of a particular type
* Added a delimiter option to the instantquery and rangequery commands

- Fixed Issues

* Unable to access cgroups submode in ugeclient role
* An issue where job states were not caught from Slurm sacct and PBS qstat

== head node installer client ==

- Fixed Issues

* Rename "IP address" to "Base IP address" for external network form

== ml ==

- New Features

* Extended support for CUDA 10.2 with cm-xgboost-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-theano-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-tensorflow-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-open3d-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-mxnet-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-horovod-*-py37-cuda10.2-gcc packages
* Extended support for CUDA 10.2 with cm-gpytorch-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-fastai-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-dynet-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-chainer-py37-cuda10.2-gcc package
* Updated cm-xgboost-* packages to v1.1.1
* Updated cm-horovod-* packages to v0.19.4
* Updated cm-horovod-* packages to v0.19.3
* Updated cm-tensorflow-* packages to v1.15.3 to address security vulnerabilities
* Updated cm-xgboost-* packages to v1.1.0
* Updated cm-horovod-* packages to v0.19.2
* Updated cm-cmake-* packages to v3.17.2
* Updated cm-tensorflow-* packages to v2.2.0

- Improvements

* Introduced cm-open3d-* packages (v0.10.0)
* Introduced cm-horovod-tensorflow2-* packages (v0.19.4)

== pythoncm ==

- Improvements

* HTTP proxies are now used by default

- Fixed Issues

* An issue where HA was decided after updating the active head node host, which could otherwise interfere with cmha-setup in some cases

== slurm ==

- Improvements

* Updated Slurm packages to 20.02.3 and 19.05.7 (CVE-2020-12693)

- Fixed Issues

* Slurmctld was being restarted when cmdaemon was restarted

== slurm19 ==

- Fixed Issues

* Slurm queue assignment failed when the queue name included uppercase characters

== user portal ==

- Fixed Issues

* User portal showed jobs as running when they had already finished