Base Command Manager / Bright Cluster Manager Release Notes
Release notes for Bright 9.0-8
== General ==
- New Features
* Added support for Kubernetes in AWS
- Improvements
* Improved parallel executor bookkeeping so that executors do not time out after a day
* Updated cuda-driver to version 440.95.01
- Fixed Issues
* Added support for cm-openssl on RHEL 8.2
* Kubernetes metrics server was not providing any metrics for nodes
== cmdaemon ==
- New Features
* Compress large fields in the DB
* Added indentation in the JSON audit/event logger
- Improvements
* Open files in /proc read-only
* Added missing spaces between some host definitions in the edge director /etc/hosts file
* Ping the edge director via IP and hostname to make sure SSH keys are correct
* Monitor the cuda-dcgm service from cmdaemon if the node has NVIDIA GPUs
* Do not mark nodes as restart required if they are not UP or pingable
- Fixed Issues
* Edge hash secrets did not always get saved in the DB
* Added WLM module files to default exclude lists
* Coredump if status information was requested just after cmdaemon started
* Escape '#' in passwords in the FreeIPMI configuration file
* Monitoring reinitialization did not work for edge nodes
* Unable to update a user to set groupID = 0
* PBS Pro drain overview did not work if pbsnodes -a did not display a queue
* Data science nodes not displayed in license info
* Automatic edge secrets did not work with multiple OSes
* Added a cmsh filemode display format, and use the filemask for the file in generic role configurations
* An issue with parsing commas in the OpenSSL subject fields of an existing license in request-license
* LDAP server not added to the SAN of the slapd certificate on directors
* Automatic fsexport disabled flag was not updated
* Jobqueue property MaxTime was not set correctly in slurm.conf
* Project manager did not warn about incorrect settings
* GPU sampler was not reinitialized after DCGM was connected
* An issue with PBS Pro job array metric collection
* An issue with cm-component-certificate
* Nodes remained UP after the head node certificate was changed, but were no longer allowed to communicate with the head node
* Improved the default PromQL for sums of memory (see the sketch after this list)
* Rare deadlock when updating device and partition at the same time
* Slurmctld was being restarted when cmdaemon was restarted
* Prometheus data was not marked as stale after jobs ended
* Slurm queue assignment failed when the queue name included upper case characters
* Account project manager rules were not used for Prometheus queries
* PBS Pro node properties were not updated in qmgr
* Accelerator count was not updated in the DB when a different GPU card was inserted into the node
* An issue when removing a user with a Kubernetes role binding; an event now hints when a deleted user still has role bindings in Kubernetes
* User portal showed jobs as running when they had already finished
* An issue with HA where LSF_MASTER_LIST was reordered when the active head node changed
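For context on the PromQL item above, a minimal sketch of a "sum of memory" query run against the Prometheus HTTP API; the endpoint URL and the node_exporter metric names are illustrative assumptions, not CMDaemon's actual defaults:

    # Sketch: run a "sum of memory" PromQL query against Prometheus.
    # The endpoint and the metric names are assumptions for illustration;
    # the improved default query shipped by cmdaemon may differ.
    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS = 'http://localhost:9090'  # assumed Prometheus endpoint
    QUERY = 'sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)'

    url = PROMETHEUS + '/api/v1/query?' + urllib.parse.urlencode({'query': QUERY})
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp)['data']['result'])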
== node-installer ==
- Fixed Issues
* The IP of a bond interface under a VLAN was not brought up
== cluster-tools ==
- Fixed Issues
* An issue with configuring the failover network with cmha-setup
* An issue on Ubuntu 16.04 with Kubernetes CoreDNS and /etc/resolv.conf
* An issue when removing a user with a Kubernetes role binding; an event now hints when a deleted user still has role bindings in Kubernetes
== Bright View ==
- Improvements
* Added "Sync Info" operation for software images, equivalent of updateprovisioners in cmsh
- Fixed Issues
* Additional hostnames for node interfaces were not saved
* An issue with timestamp formatting in the user edit form
* An issue with disk template selection for the category entity
* Window remained open when an entity was deleted or reverted
* The jobs list was not automatically refreshed
== bright-installer ==
- Fixed Issues
* Rename "IP address" to "Base IP address" for external network form
== cm-kubernetes-setup ==
- Improvements
* Added an additional ingress to expose the Kubernetes dashboard via a path-based ingress on new installations (see the sketch below)
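A minimal sketch of what such a path-based ingress can look like, built as a plain manifest; the names, namespace, path, and port below are illustrative assumptions, not necessarily what cm-kubernetes-setup generates:

    # Illustrative path-based Ingress for the Kubernetes dashboard.
    # All names, the namespace, the path, and the port are assumptions;
    # the manifest cm-kubernetes-setup actually installs may differ.
    import json

    ingress = {
        'apiVersion': 'networking.k8s.io/v1',
        'kind': 'Ingress',
        'metadata': {'name': 'dashboard',
                     'namespace': 'kubernetes-dashboard'},
        'spec': {'rules': [{'http': {'paths': [{
            'path': '/dashboard',
            'pathType': 'Prefix',
            'backend': {'service': {'name': 'kubernetes-dashboard',
                                    'port': {'number': 443}}},
        }]}}]},
    }
    print(json.dumps(ingress, indent=2))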
- Fixed Issues
* Installation was allowed to continue while nodes were down; missing GPU info was not handled correctly; Docker registry certificates were not distributed to cloud nodes
* The case where system information was unavailable because nodes were down was not handled correctly
* An issue with internal network detection when deploying Kubernetes on cluster-extension nodes
== cm-wlm-setup ==
- Fixed Issues
* An issue with munge.key synchronization to edge sites during Slurm setup
== cmha-setup ==
- Fixed Issues
* The fuser path setting for Ubuntu and SLES in the dasumount.sh script
== cmsh ==
- New Features
* Added a new cmsh option to wait for provisioning status to complete
- Improvements
* Added an option to trigger all fsparts of a particular type
* Added a delimiter option to the instantquery and rangequery commands
- Fixed Issues
* Unable to access cgroups submode in ugeclient role
* An issue where job states were not picked up from Slurm sacct and PBS qstat
== head node installer client ==
- Fixed Issues
* Rename "IP address" to "Base IP address" for external network form
== ml ==
- New Features
* Extended support for CUDA 10.2 with cm-xgboost-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-theano-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-tensorflow-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-open3d-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-mxnet-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-horovod-*-py37-cuda10.2-gcc packages
* Extended support for CUDA 10.2 with cm-gpytorch-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-fastai-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-dynet-py37-cuda10.2-gcc package
* Extended support for CUDA 10.2 with cm-chainer-py37-cuda10.2-gcc package
* Updated cm-xgboost-* packages to v1.1.1
* Updated cm-horovod-* packages to v0.19.4
* Updated cm-horovod-* packages to v0.19.3
* Updated cm-tensorflow-* packages to v1.15.3 to address some vulnerability issues
* Updated cm-xgboost-* packages to v1.1.0
* Updated cm-horovod-* packages to v0.19.2
* Updated cm-cmake-* packages to v3.17.2
* Updated cm-tensorflow-* packages to v2.2.0
- Improvements
* Introduced cm-open3d-* packages (v0.10.0)
* Introduced cm-horovod-tensorflow2-* packages (v0.19.4)
== pythoncm ==
- Improvements
* HTTP proxies were used by default
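The proxy item above concerns the HTTP_PROXY/HTTPS_PROXY variables that Python HTTP clients honor; a minimal sketch for inspecting and clearing them before talking to CMDaemon (clear_proxy_env below is a hypothetical helper, not a pythoncm API):

    # Python HTTP clients pick proxies up from the environment, which is
    # how local CMDaemon traffic can end up proxied. The helper below is
    # hypothetical and not part of pythoncm.
    import os
    import urllib.request

    def clear_proxy_env():
        for var in ('HTTP_PROXY', 'HTTPS_PROXY', 'http_proxy', 'https_proxy'):
            os.environ.pop(var, None)

    clear_proxy_env()
    print(urllib.request.getproxies())  # {} once the variables are cleared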
- Fixed Issues
* An issue where HA was determined after updating the active head node host, which could in some cases interfere with cmha-setup
== slurm ==
- Improvements
* Updated Slurm packages to 20.02.3 and 19.05.7 (CVE-2020-12693)
- Fixed Issues
* Slurmctld was being restarted when cmdaemon was restarted
== slurm19 ==
- Fixed Issues
* Slurm queue assignment failed when the queue name included upper case characters
== user portal ==
- Fixed Issues
* User portal showed running jobs when jobs were already finished