OpenShift 4.2 on Red Hat OpenStack Platform 13 + GPU
Red Hat OpenShift Container Platform 4.2 introduces the general availability of full-stack automated deployments on OpenStack. With OpenShift 4.2, containers can be managed across multiple public and private clouds, including OpenStack. Red Hat and NVIDIA are working to provide the best platform for Artificial Intelligence and Machine Learning workloads.
Note: This blog post shows how to deploy GPU-enabled nodes running Red Hat Enterprise Linux CoreOS. With Red Hat OpenShift Container Platform 4.2, GPUs are supported on Red Hat Enterprise Linux 7 nodes only, so the process described here is not supported. Please see the OpenShift 4.2 release notes for details.
The OpenShift 4.2 installer can fully automate the installation on OpenStack:
- Network configuration (networks, subnets, trunks, load balancers)
- VM creation
- Storage configuration
- OpenShift setup
- Routing
Summary:
- OpenStack lab environment
- Prepare the OpenShift installer
- Deployment of OpenShift, first step with the bootstrap node and three masters
- Deployment of OpenShift, second step with three additional worker nodes
- Deployment of OpenShift, last step bootstrap node is deleted
- Check the OpenShift deployment
- Connect to the console
- Adding a GPU worker node
- Deploy the Node Feature Discovery Operator
- Deploy the Special Resource Operator
- Test nvidia-smi
- TensorFlow benchmarks with GPU
- TensorFlow benchmarks with CPU
- Grafana dashboards
- Product documentation
We will use the openshift-installer binary to spawn the OpenShift cluster.
The openshift-installer binary is directly consuming the OpenStack API.
At the end of the installation, we will have one OpenShift cluster running on seven OpenStack Virtual Machines:
- 3 x OpenShift masters VMs
- 3 x OpenShift workers for CPU workloads VMs
- 1 x OpenShift worker for GPU workload VM
You can run the same process on other IaaS platforms such as AWS or Azure.
The OpenStack Virtual Machine used as a worker for GPU workloads uses PCI passthrough to an NVIDIA Tesla V100 GPU board.
The OpenShift 4.2 cluster will use two Kubernetes operators to set up the GPU configuration (a sketch of how workloads consume the GPU follows this list):
- Node Feature Discovery for Kubernetes (NFD) to label the GPU nodes
- Special Resource Operator for Kubernetes (SRO) to enable the NVIDIA driver stack on the GPU worker node
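Once both operators are in place, the GPU is exposed to Kubernetes as an extended resource that workloads request in their pod spec. A minimal sketch of such a pod (the resource name nvidia.com/gpu is what the NVIDIA device plugin advertises; the pod name and container image below are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base   # placeholder CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU from the device plugin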
OpenStack lab environment
We are using an already deployed Red Hat OpenStack Platform 13 environment at the z8 maintenance release (13.0.8):
[stack@perflab-director ~]$ cat /etc/rhosp-release
Red Hat OpenStack Platform release 13.0.8 (Queens)
The compute nodes have two NVIDIA Tesla V100 boards with 16GB of GPU memory each.
List the PCI device IDs on one OpenStack compute node (two V100 boards installed):
[heat-admin@overcloud-compute-0 ~]$ lspci -nn | grep -i nvidia
3b:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
d8:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
Create a flavor for the master and worker nodes:
[stack@perflab-director ~]$ source ~/overcloudrc
(overcloud) [stack@perflab-director ~]$ openstack flavor create --ram 1024 --disk 200 --vcpus 2 m1.xlarge
Add swiftoperator role to admin:
(overcloud) [stack@perflab-director ~]$ openstack role add --user admin --project admin swiftoperator
Set a temporary URL property:
(overcloud) [stack@perflab-director ~]$ openstack object store account set --property Temp-URL-Key=superkey
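To verify that the key was stored, the account metadata can be displayed (the installer uses this key to generate a Swift temporary URL for the bootstrap Ignition file it uploads, as seen later with the perflab-x7szb container):
(overcloud) [stack@perflab-director ~]$ openstack object store account show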
Before the deployment we have only the external and the load balancer networks available:
Prepare the OpenShift installer
The deployment process will run in multiple steps:
To get the OpenShift installer and resources, log in with your RHN account here and click on “Red Hat OpenShift Container Platform”:
https://access.redhat.com/downloads
You will have to download two binaries and one QCOW image:
- “OpenShift v4.2 Linux Client”
- “OpenShift v4.2 Linux Installer”
- “Red Hat Enterprise Linux CoreOS - OpenStack Image (QCOW)”
I had to add the .gz extension to the downloaded Red Hat Enterprise Linux CoreOS qcow2 and uncompress it:
[stack@perflab-director x86_64]$ curl --compressed -J -L -o rhcos-4.2.0-x86_64-openstack.qcow2.gz "https://access.cdn.redhat.com/content/origin/files/sha256/XX/XXXXXXXXXXXXXX/rhcos-4.2.0-x86_64-openstack.qcow2?user=XXXXXXXXX&_auth_=XXXXXXXX_XXXXX"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 667M 100 667M 0 0 14.1M 0 0:00:47 0:00:47 --:--:-- 15.9M
[stack@perflab-director x86_64]$ du -h rhcos-4.2.0-x86_64-openstack.qcow2.gz
668M rhcos-4.2.0-x86_64-openstack.qcow2.gz
[stack@perflab-director x86_64]$ gzip -t -v rhcos-4.2.0-x86_64-openstack.qcow2.gz
rhcos-4.2.0-x86_64-openstack.qcow2.gz: OK
[stack@perflab-director x86_64]$ gunzip rhcos-4.2.0-x86_64-openstack.qcow2.gz
[stack@perflab-director x86_64]$ du -h rhcos-4.2.0-x86_64-openstack.qcow2
1.8G rhcos-4.2.0-x86_64-openstack.qcow2
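Optionally, the uncompressed image can be checked against the SHA-256 checksum published on the download page before uploading it (the expected value is not reproduced here):
[stack@perflab-director x86_64]$ sha256sum rhcos-4.2.0-x86_64-openstack.qcow2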
Upload Red Hat Enterprise Linux CoreOS image into OpenStack Glance:
(overcloud) [stack@perflab-director x86_64]$ openstack image create --container-format=bare --disk-format=qcow2 --file /var/images/x86_64/rhcos-4.2.0-x86_64-openstack.qcow2 rhcos
+------------------+------------------------------------------------------------------------------+
| Field | Value |
+------------------+------------------------------------------------------------------------------+
| checksum | 592f9d70784d1ce8ee97cdb96cdf53c7 |
| container_format | bare |
| created_at | 2019-10-24T13:59:26Z |
| disk_format | qcow2 |
| file | /v2/images/e93658ff-bbcc-4af2-9d13-f39afbedb7dc/file |
| id | e93658ff-bbcc-4af2-9d13-f39afbedb7dc |
| min_disk | 0 |
| min_ram | 0 |
| name | rhcos |
| owner | d88919769d1943b997338a89bdd991da |
| properties | direct_url='swift+config://ref1/glance/e93658ff-bbcc-4af2-9d13-f39afbedb7dc' |
| protected | False |
| schema | /v2/schemas/image |
| size | 1911160832 |
| status | active |
| tags | |
| updated_at | 2019-10-24T13:59:40Z |
| virtual_size | None |
| visibility | shared |
+------------------+------------------------------------------------------------------------------+
Verify the name and ID of the OpenStack ‘External’ network:
(overcloud) [stack@perflab-director openshift]$ openstack network list --long -c ID -c Name -c "Router Type"
+--------------------------------------+-----------+-------------+
| ID | Name | Router Type |
+--------------------------------------+-----------+-------------+
| a6e5e7e6-0ff7-4610-9940-a89c0aa11efc | external | External |
+--------------------------------------+-----------+-------------+
Disable OpenStack quotas (not mandatory, but simpler for this lab):
(overcloud) [stack@perflab-director openshift]$ openstack quota set --secgroups -1 --secgroup-rules -1 --cores -1 --ram -1 --gigabytes -1 admin
(overcloud) [stack@perflab-director openshift]$ openstack quota show admin
+----------------------+----------------------------------+
| Field | Value |
+----------------------+----------------------------------+
| backup-gigabytes | 1000 |
| backups | 10 |
| cores | -1 |
| fixed-ips | -1 |
| floating-ips | 50 |
| gigabytes | -1 |
| gigabytes_tripleo | -1 |
| groups | 10 |
| health_monitors | None |
| injected-file-size | 10240 |
| injected-files | 5 |
| injected-path-size | 255 |
| instances | 10 |
| key-pairs | 100 |
| l7_policies | None |
| listeners | None |
| load_balancers | None |
| location | None |
| name | None |
| networks | 100 |
| per-volume-gigabytes | -1 |
| pools | None |
| ports | 500 |
| project | d88919769d1943b997338a89bdd991da |
| project_name | admin |
| properties | 128 |
| ram | -1 |
| rbac_policies | 10 |
| routers | 10 |
| secgroup-rules | -1 |
| secgroups | -1 |
| server-group-members | 10 |
| server-groups | 10 |
| snapshots | 10 |
| snapshots_tripleo | -1 |
| subnet_pools | -1 |
| subnets | 100 |
| volumes | 10 |
| volumes_tripleo | -1 |
+----------------------+----------------------------------+
Create an OpenStack flavor with 32GB of RAM and 4 vCPUs:
(overcloud) [stack@perflab-director openshift]$ openstack flavor create --ram 32768 --disk 200 --vcpus 4 m1.large
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 200 |
| id | 2a90dead-ea97-434e-9bc8-8560cc0b88e4 |
| name | m1.large |
| os-flavor-access:is_public | True |
| properties | |
| ram | 32768 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------+
Prepare the OpenShift clouds.yaml configuration; first retrieve your overcloud password:
[stack@perflab-director ~]$ cat ~/overcloudrc | grep OS_PASSWORD
export OS_PASSWORD=XXXXXXXXX
Download your clouds.yaml file in OpenStack Horizon, “Project” > “API Access” > “OpenStack clouds.yaml File”:
Prepare the clouds.yaml configuration and add the password. The cloud entry is named “openstack” here; if you rename it (for example to “shiftstack”), use the same name for the “cloud:” value in install-config.yaml later:
[stack@perflab-director openshift]$ mkdir -p ~/.config/openstack/
[stack@perflab-director ~]$ cat << EOF > ~/.config/openstack/clouds.yaml
# This is a clouds.yaml file, which can be used by OpenStack tools as a source
# of configuration on how to connect to a cloud. If this is your only cloud,
# just put this file in ~/.config/openstack/clouds.yaml and tools like
# python-openstackclient will just work with no further config. (You will need
# to add your password to the auth section)
# If you have more than one cloud account, add the cloud entry to the clouds
# section of your existing file and you can refer to them by name with
# OS_CLOUD=openstack or --os-cloud=openstack
clouds:
  openstack:
    auth:
      auth_url: http://192.168.168.54:5000/v3
      username: "admin"
      password: XXXXXXXXXXXXXX
      project_id: XXXXXXXXX
      project_name: "admin"
      user_domain_name: "Default"
    region_name: "regionOne"
    interface: "public"
    identity_api_version: 3
EOF
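To confirm that the clouds.yaml entry works, a request can be issued using the cloud name instead of the overcloudrc environment variables (a quick sanity check; run it in a shell where the OS_* variables are not exported so they do not take precedence):
[stack@perflab-director ~]$ openstack --os-cloud openstack token issue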
Create an OpenShift account and download your OpenShift pull secret by clicking on “Copy Pull Secret” here; you will paste this content when prompted by the “openshift-install create install-config” command:
https://cloud.redhat.com/openshift/install/openstack/installer-provisioned
Prepare the OpenShift install-config.yaml file
(overcloud) [stack@perflab-director openshift]$ ./openshift-install create install-config --dir='/home/stack/openshift'
? SSH Public Key /home/stack/.ssh/id_rsa.pub
? Platform openstack
? Cloud openstack
? ExternalNetwork external
? APIFloatingIPAddress 192.168.168.20
? FlavorName m1.large
? Base Domain lan.redhat.com
? Cluster Name perflab
? Pull Secret [? for help] *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Set up the /etc/hosts file with the Floating IP:
echo -e "192.168.168.20 api.perflab.lan.redhat.com" | sudo tee -a /etc/hosts
Check the install-config.yaml prepared:
(overcloud) [stack@perflab-director openshift]$ cat install-config.yaml
apiVersion: v1
baseDomain: lan.redhat.com
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: perflab
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  openstack:
    cloud: openstack
    computeFlavor: m1.large
    externalNetwork: external
    lbFloatingIP: 192.168.168.28
    octaviaSupport: "0"
    region: ""
    trunkSupport: "1"
pullSecret: '{"auths":{"cloud.openshift.com":{"auth":"XXXXXXXXXXX==","email":"mymail@redhat.com"},"quay.io":{"auth":"XXXXXXXXXXXX==","email":"mymail@redhat.com"},"registry.connect.redhat.com":{"auth":"NzMxNDXXXXXXXXX","email":"mymail@redhat.com"}}}'
sshKey: |
  ssh-rsa XXXXXXXXX
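Note that “openshift-install create cluster” consumes install-config.yaml from the target directory, so it can be convenient to keep a copy for reference or a later redeployment (the backup file name is arbitrary):
(overcloud) [stack@perflab-director openshift]$ cp install-config.yaml install-config.yaml.bak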
Deployment of OpenShift, first step with the bootstrap node and three masters
Launch the OpenShift 4.2 deployment:
[stack@perflab-director openshift]$ /home/stack/openshift/openshift-install create cluster --dir='/home/stack/openshift'
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 30m0s for the Kubernetes API at https://api.perflab.lan.redhat.com:6443...
...
First step, the OpenShift bootstrap node and three masters are started:
(overcloud) [stack@perflab-director archive]$ openstack server list
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
| 48adccff-1d50-4e5a-9937-02d6801dd6d4 | perflab-x7szb-bootstrap | ACTIVE | perflab-x7szb-openshift=10.0.0.17, 192.168.168.42 | rhcos | m1.large |
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+
We can follow the installation on one OpenShift master node:
(overcloud) [stack@perflab-director ~]$ openstack console log show perflab-x7szb-master-0
...
[[0;32m OK [0m] Started Generate /run/issue.d/console-login-helper-messages.issue.
Starting Permit User Sessions...
[[0;32m OK [0m] Started Permit User Sessions.
[[0;32m OK [0m] Started Getty on tty1.
[[0;32m OK [0m] Started Serial Getty on ttyS0.
[[0;32m OK [0m] Reached target Login Prompts.
[[0;32m OK [0m] Started Authorization Manager.
[[0;32m OK [0m] Started RPC Bind.
[[0;32m OK [0m] Started NFS status monitor for NFSv2/3 locking..
[[0;32m OK [0m] Started RPM-OSTree System Management Daemon.
[[0;32m OK [0m] Started Log RPM-OSTree Booted Deployment Status To Journal.
[[0;32m OK [0m] Started CRI-O Auto Update Script.
Starting Open Container Initiative Daemon...
[[0;32m OK [0m] Started Afterburn Hostname.
[[0;32m OK [0m] Started Open Container Initiative Daemon.
Starting Kubernetes Kubelet...
[[0;32m OK [0m] Started Kubernetes systemd probe.
[[0;32m OK [0m] Started Kubernetes Kubelet.
[[0;32m OK [0m] Reached target Multi-User System.
Starting Update UTMP about System Runlevel Changes...
[[0;32m OK [0m] Started Update UTMP about System Runlevel Changes.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0 (Ootpa) 4.2
SSH host key: SHA256:+eXqSFWX/4ki72aOTPLY/14V+FAeTGrQvEV6vzyTmPc (ECDSA)
SSH host key: SHA256:r/dd/ST8waK64EtjyStaXjbThCpu27a9BWwrVHWsL5E (ED25519)
SSH host key: SHA256:csBr96eGedSzymkINjxx2E4lu2zLlWzxhvd1VExhXiU (RSA)
ens3: 10.0.0.13 fe80::cdb9:517d:1c4c:dd77
perflab-x7szb-master-0 login:
If needed, we can check the bootkube.service logs on the OpenShift bootstrap node with the core user:
(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.42
The authenticity of host '192.168.168.42 (192.168.168.42)' can't be established.
ECDSA key fingerprint is SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw.
ECDSA key fingerprint is MD5:7d:2b:ec:9f:56:69:07:1c:55:b0:c7:77:bb:d3:75:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.42' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.
---
This is the bootstrap node; it will be destroyed when the master is fully up.
The primary service is "bootkube.service". To watch its status, run e.g.
journalctl -b -f -u bootkube.service
[core@perflab-x7szb-bootstrap ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Fri 2019-10-25 23:25:09 UTC. --
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler RunningNotReady
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-controller-manager/kube-controller-manager Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-kube-controller-manager/kube-controller-manager Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: All self-hosted control plane components successfully started
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Sending bootstrap-success event.Waiting for remaining assets to be created.
We can check the OpenStack swift containers:
(overcloud) [stack@perflab-director ~]$ swift list
perflab-x7szb
(overcloud) [stack@perflab-director ~]$ swift list perflab-x7szb
bootstrap.ign
(overcloud) [stack@perflab-director ~]$ swift download perflab-x7szb bootstrap.ign
bootstrap.ign [auth 0.803s, headers 1.028s, total 1.030s, 1.328 MB/s]
At this step we can see the OpenShift bootstrap node and the three OpenShift masters in the OpenStack dashboard:
Deployment of OpenShift, second step with three additional OpenShift worker nodes
Second step, the workers are started:
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
| 48adccff-1d50-4e5a-9937-02d6801dd6d4 | perflab-x7szb-bootstrap | ACTIVE | perflab-x7szb-openshift=10.0.0.17, 192.168.168.42 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
We can check the OpenShift bootstrap console logs:
[stack@perflab-director ~]$ openstack console log show perflab-x7szb-bootstrap
Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa) 4.2
SSH host key: SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw (ECDSA)
SSH host key: SHA256:i22uUacaVf/BTyrOZvCGNAu1ycr5C1OkoPiBYgT36s8 (ED25519)
SSH host key: SHA256:EDOvuU8/Ehs65PdKpDAVw9KB55aoCLCrFJJs9gT6uGw (RSA)
ens3: 10.0.0.17 fe80::2974:17f1:db12:3ee6
perflab-x7szb-bootstrap login:
Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa) 4.2
SSH host key: SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw (ECDSA)
SSH host key: SHA256:i22uUacaVf/BTyrOZvCGNAu1ycr5C1OkoPiBYgT36s8 (ED25519)
SSH host key: SHA256:EDOvuU8/Ehs65PdKpDAVw9KB55aoCLCrFJJs9gT6uGw (RSA)
ens3: 10.0.0.17 fe80::2974:17f1:db12:3ee6
perflab-x7szb-bootstrap login:
bootkube.service complete on the bootstrap node:
[core@perflab-x7szb-bootstrap ~]$ journalctl -b -f -u bootkube.service
Oct 25 23:38:48 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-initial-kube-controller-manager-service-account-private-key.yaml" secrets.v1./initial-service-account-private-key -n openshift-config as it already exists
Oct 25 23:38:48 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-kube-apiserver-to-kubelet-signer.yaml" secrets.v1./kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:49 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-loadbalancer-serving-signer.yaml" secrets.v1./loadbalancer-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:49 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: Tearing down temporary bootstrap control plane...
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: bootkube.service complete
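Instead of tailing the journal on the bootstrap node, the installer itself can wait for this phase to finish from the installation host (assuming the same --dir used for the deployment):
[stack@perflab-director openshift]$ /home/stack/openshift/openshift-install wait-for bootstrap-complete --dir='/home/stack/openshift' --log-level=info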
Deployment of OpenShift, last step bootstrap node is deleted
Third step, the OpenShift bootstrap node is deleted:
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
Now, with the same floating IP, we can connect to the master VIP:
(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.20
The authenticity of host '192.168.168.20 (192.168.168.20)' can't be established.
ECDSA key fingerprint is SHA256:+eXqSFWX/4ki72aOTPLY/14V+FAeTGrQvEV6vzyTmPc.
ECDSA key fingerprint is MD5:8a:fc:f1:32:01:0f:2e:a5:ff:ce:8b:02:40:4d:e8:30.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@perflab-x7szb-master-0 ~]$
The same floating IP can land on any of the masters:
(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.20
The authenticity of host '192.168.168.20 (192.168.168.20)' can't be established.
ECDSA key fingerprint is SHA256:lVNLKbg2g5qvUvZxKZBjPhDXzQyC1XZAYU3uIFmDrW4.
ECDSA key fingerprint is MD5:52:89:b8:41:8e:54:05:9f:b6:3a:7d:53:c8:97:9c:c0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@perflab-x7szb-master-2 ~]$
At the end, the “openshift-install create cluster” command prints the access information and the kubeadmin password:
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 30m0s for the Kubernetes API at https://api.perflab.lan.redhat.com:6443...
INFO API v1.16.0-beta.2+a90a577 up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 30m0s for the cluster at https://api.perflab.lan.redhat.com:6443 to initialize...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/openshift/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.perflab.lan.redhat.com
INFO Login to the console with user: kubeadmin, password: XXXXX-XXXXX-XXXXX-XXXXX
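The kubeadmin password printed above can also be used with the oc client instead of the kubeconfig file, for example (the masked password is the one from the installer output; --insecure-skip-tls-verify is only needed if the cluster CA is not trusted on the host):
[stack@perflab-director openshift]$ oc login -u kubeadmin -p XXXXX-XXXXX-XXXXX-XXXXX https://api.perflab.lan.redhat.com:6443 --insecure-skip-tls-verify=true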
Check the OpenShift deployment
Load OpenShift environment variables:
(undercloud) [stack@perflab-director ~]$ export KUBECONFIG=/home/stack/openshift/auth/kubeconfig
The OpenShift API is now listening on port 6443:
(overcloud) [stack@perflab-director ~]$ curl --insecure https://api.perflab.lan.redhat.com:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}
Check the OpenShift version:
(overcloud) [stack@perflab-director openshift]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.2.0 True False 33h Cluster version is 4.2.0
List OpenShift nodes:
(undercloud) [stack@perflab-director ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
perflab-x7szb-master-0 Ready master 30m v1.14.6+c07e432da
perflab-x7szb-master-1 Ready master 10m v1.14.6+c07e432da
perflab-x7szb-master-2 Ready master 10m v1.14.6+c07e432da
perflab-x7szb-worker-2jqns Ready worker 30m v1.14.6+c07e432da
perflab-x7szb-worker-7gk2p Ready worker 30m v1.14.6+c07e432da
perflab-x7szb-worker-v6xwp Ready worker 30m v1.14.6+c07e432da
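We can also check that all cluster operators are available and not degraded:
(overcloud) [stack@perflab-director ~]$ oc get clusteroperators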
We can ping the OpenShift console IP:
[core@perflab-c8x75-master-0 ~]$ ping console-openshift-console.apps.perflab.lan.redhat.com
PING console-openshift-console.apps.perflab.lan.redhat.com (10.0.0.7) 56(84) bytes of data.
64 bytes from 10.0.0.7 (10.0.0.7): icmp_seq=1 ttl=64 time=0.206 ms
List the OpenStack security groups:
(overcloud) [stack@perflab-director ~]$ openstack security group list
+--------------------------------------+-----------------------+------------------------+----------------------------------+
| ID | Name | Description | Project |
+--------------------------------------+-----------------------+------------------------+----------------------------------+
| 29789b3e-6b9e-41cf-82ba-55ebba5dfd76 | lb-health-mgr-sec-grp | lb-health-mgr-sec-grp | be3b187d3a264957bc2320cf77c55681 |
| 2cb57630-29ce-4376-9850-0da170f738f2 | default | Default security group | be3b187d3a264957bc2320cf77c55681 |
| 2f453b24-7b3f-43c0-8c43-d9520cd74680 | default | Default security group | c942a792fd6f447186e5bafd6d4cbce0 |
| 3db1417b-b4af-4768-8ff6-94544e72ffa5 | perflab-x7szb-master | | c942a792fd6f447186e5bafd6d4cbce0 |
| 50bc9acc-7942-41a3-962f-aa511085f3f8 | default | Default security group | |
| 6b4d4bc4-6f04-4466-9da6-27be5a0c5fb7 | perflab-x7szb-worker | | c942a792fd6f447186e5bafd6d4cbce0 |
| 93cb85c9-5821-47e8-ad85-de18706d63f5 | web | Web servers | c942a792fd6f447186e5bafd6d4cbce0 |
| cdae4bda-7040-4fc3-b28f-e7555e2225e4 | lb-mgmt-sec-grp | lb-mgmt-sec-grp | be3b187d3a264957bc2320cf77c55681 |
+--------------------------------------+-----------------------+------------------------+----------------------------------+
List OpenStack floating IPs:
(overcloud) [stack@perflab-director ~]$ openstack floating ip list
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| ID | Floating IP Address | Fixed IP Address | Port | Floating Network | Project |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| 215a6925-84f2-40fa-897a-44ce53f01dea | 192.168.168.41 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 274fa26b-1aa4-48c2-a6c5-0c07ecd62429 | 192.168.168.23 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 28965300-a668-4348-b2a0-f51660735383 | 192.168.168.44 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 290c7e9d-0c88-47ea-b214-36b93a77672d | 192.168.168.21 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 2c36f3d3-53fc-4c79-a0ee-32d92b4ff27b | 192.168.168.30 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 3b1109ee-2a4a-46cd-acaa-213c4ee6a85c | 192.168.168.33 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 4cc948ad-d636-4370-8dab-3205fe1de992 | 192.168.168.48 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 5ae03e7d-597f-4689-83f6-0ccb7fc9758b | 192.168.168.27 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 634d7790-cae2-4261-b76d-19799826761e | 192.168.168.36 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 665fcc37-82e6-4405-b68b-09757d221c79 | 192.168.168.47 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 6853d0b9-336d-45db-ae24-3ab48a5c8c65 | 192.168.168.29 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 6cfe8bf8-df5c-46df-9ab5-cfb4229d7823 | 192.168.168.25 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| ada5ba51-e8c3-449b-aba6-27a39c15720f | 192.168.168.26 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| c6d62f66-d4e8-4c55-8fcf-4e48e6fa4108 | 192.168.168.31 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| db9a742d-3534-45a1-8bf8-11455ace3255 | 192.168.168.20 | 10.0.0.5 | cfc93b3d-2559-459a-85d2-9e62b08e7905 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| e03bdb6c-c508-444e-8e3d-730a26f1dfb0 | 192.168.168.22 | None | None | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
List OpenStack trunks:
(overcloud) [stack@perflab-director ~]$ openstack network trunk list
+--------------------------------------+------------------------------+--------------------------------------+-------------+
| ID | Name | Parent Port | Description |
+--------------------------------------+------------------------------+--------------------------------------+-------------+
| 65e7f1ec-ffa7-47e4-8b55-ec1769de0f4b | perflab-x7szb-master-trunk-2 | 0e3a7e47-f22f-4f18-9400-0c7862410133 | |
| bb81afaf-a9aa-46b6-8ec7-9dd341d48717 | perflab-x7szb-master-trunk-0 | dd3e2310-9af8-4c6c-8dff-09055a62ce96 | |
| e44364f7-53e8-4d56-933f-78c42af152fe | perflab-x7szb-master-trunk-1 | c7194cee-601e-4a64-9624-51fdb461a4ee | |
| e98ef7fd-c8a4-4bfc-a412-a910d28d4914 | perflab-x7szb-worker-7gk2p | 5a28e142-1c86-470b-b2df-4c4df5d67e8c | |
| ef6b5ad3-2af3-4fa3-9a44-c799074b93dc | perflab-x7szb-worker-v6xwp | 2e7a5ebf-1bee-4860-99d9-faf1f770d11c | |
| f4de190d-7910-48f2-ad35-0dc1c959dd8f | perflab-x7szb-worker-2jqns | c7a9c5da-97fd-4ae1-a86a-911b00b22f5e | |
+--------------------------------------+------------------------------+--------------------------------------+-------------+
Detail of the OpenStack trunk:
(overcloud) [stack@perflab-director ~]$ openstack network trunk show perflab-x7szb-master-trunk-0
+-----------------+---------------------------------------+
| Field | Value |
+-----------------+---------------------------------------+
| admin_state_up | UP |
| created_at | 2019-10-25T23:24:53Z |
| description | |
| id | bb81afaf-a9aa-46b6-8ec7-9dd341d48717 |
| name | perflab-x7szb-master-trunk-0 |
| port_id | dd3e2310-9af8-4c6c-8dff-09055a62ce96 |
| project_id | c942a792fd6f447186e5bafd6d4cbce0 |
| revision_number | 2 |
| status | ACTIVE |
| sub_ports | |
| tags | [u'openshiftClusterID=perflab-x7szb'] |
| tenant_id | c942a792fd6f447186e5bafd6d4cbce0 |
| updated_at | 2019-10-25T23:25:10Z |
+-----------------+---------------------------------------+
(overcloud) [stack@perflab-director ~]$ openstack port show dd3e2310-9af8-4c6c-8dff-09055a62ce96
+-----------------------+--------------------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------------------------------------------------------------------+
| admin_state_up | UP |
| allowed_address_pairs | ip_address='10.0.0.5', mac_address='fa:16:3e:26:e9:a8' |
| | ip_address='10.0.0.6', mac_address='fa:16:3e:26:e9:a8' |
| | ip_address='10.0.0.7', mac_address='fa:16:3e:26:e9:a8' |
| binding_host_id | overcloud-compute-0.lan.redhat.com |
| binding_profile | |
| binding_vif_details | bridge_name='tbr-bb81afaf-a', datapath_type='system', ovs_hybrid_plug='True', port_filter='True' |
| binding_vif_type | ovs |
| binding_vnic_type | normal |
| created_at | 2019-10-25T23:24:44Z |
| data_plane_status | None |
| description | |
| device_id | e4fa8300-b24c-4d64-95fb-2b0c19c86b17 |
| device_owner | compute:nova |
| dns_assignment | None |
| dns_name | None |
| extra_dhcp_opts | ip_version='4', opt_name='domain-search', opt_value='perflab.lan.redhat.com' |
| fixed_ips | ip_address='10.0.0.13', subnet_id='4065cc81-0959-42e9-b36a-ccc9b0cdc073' |
| id | dd3e2310-9af8-4c6c-8dff-09055a62ce96 |
| ip_address | None |
| mac_address | fa:16:3e:26:e9:a8 |
| name | perflab-x7szb-master-port-0 |
| network_id | 9a0e321e-d825-4713-82be-fca6f6dccd1b |
| option_name | None |
| option_value | None |
| port_security_enabled | True |
| project_id | c942a792fd6f447186e5bafd6d4cbce0 |
| qos_policy_id | None |
| revision_number | 14 |
| security_group_ids | 3db1417b-b4af-4768-8ff6-94544e72ffa5 |
| status | ACTIVE |
| subnet_id | None |
| tags | openshiftClusterID=perflab-x7szb |
| trunk_details | {u'trunk_id': u'bb81afaf-a9aa-46b6-8ec7-9dd341d48717', u'sub_ports': []} |
| updated_at | 2019-10-25T23:25:12Z |
+-----------------------+--------------------------------------------------------------------------------------------------+
Connect with SSH to one OpenShift master node:
[stack@perflab-director ~]$ ssh -o "StrictHostKeyChecking=no" core@192.168.168.20
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.
---
Last login: Sat Oct 26 00:02:00 2019 from 192.168.168.2
[core@perflab-x7szb-master-2 ~]$
The DNS entry “console-openshift-console.apps.perflab.lan.redhat.com” is pointing to 10.0.0.7:
[core@perflab-x7szb-master-2 ~]$ dig console-openshift-console.apps.perflab.lan.redhat.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> console-openshift-console.apps.perflab.lan.redhat.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7306
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 35d2b3e88a842b9c (echoed)
;; QUESTION SECTION:
;console-openshift-console.apps.perflab.lan.redhat.com. IN A
;; ANSWER SECTION:
console-openshift-console.apps.perflab.lan.redhat.com. 20 IN A 10.0.0.7
;; Query time: 0 msec
;; SERVER: 10.0.0.6#53(10.0.0.6)
;; WHEN: Sat Oct 26 00:04:19 UTC 2019
;; MSG SIZE rcvd: 163
Associate an OpenStack Floating IP with the port carrying the ingress VIP so that the OpenShift console and application routes are reachable from outside:
(overcloud) [stack@perflab-director ~]$ openstack floating ip set --port 7b7afaee-79cb-4adc-b6c5-be2ed07acb6b 192.168.168.30
Set up the /etc/hosts file with the Floating IP:
echo -e "192.168.168.30 console-openshift-console.apps.perflab.lan.redhat.com" | sudo tee -a /etc/hosts
Scan the ports:
(overcloud) [stack@perflab-director ~]$ sudo nmap console-openshift-console.apps.perflab.lan.redhat.com
Starting Nmap 6.40 ( http://nmap.org ) at 2019-10-25 20:09 EDT
Nmap scan report for console-openshift-console.apps.perflab.lan.redhat.com (192.168.168.30)
Host is up (0.0015s latency).
Not shown: 997 filtered ports
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
443/tcp open https
Nmap done: 1 IP address (1 host up) scanned in 5.03 seconds
Look at the generated kubeconfig:
[stack@perflab-director ~]$ cat /home/stack/openshift/auth/kubeconfig
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    server: https://api.perflab.lan.redhat.com:6443
  name: perflab
contexts:
- context:
    cluster: perflab
    user: admin
  name: admin
current-context: admin
kind: Config
preferences: {}
users:
- name: admin
  user:
    client-certificate-data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Connect to the console
We can connect to the console in a browser at the console URL:
https://console-openshift-console.apps.perflab.lan.redhat.com
OpenShift 4.2 console prompt:
OpenShift 4.2 console home:
OpenShift 4.2 console developer:
Adding a GPU worker node
Now we have a set of master and worker nodes, but we want to add a GPU worker node using an OpenStack instance with GPU passthrough.
Check the current list of OpenShift machines:
(undercloud) [stack@perflab-director ~]$ oc get machines -n openshift-machine-api
NAME STATE TYPE REGION ZONE AGE
perflab-x7szb-master-0 ACTIVE m1.large regionOne nova 34h
perflab-x7szb-master-1 ACTIVE m1.large regionOne nova 34h
perflab-x7szb-master-2 ACTIVE m1.large regionOne nova 34h
perflab-x7szb-worker-2jqns ACTIVE m1.large regionOne nova 34h
perflab-x7szb-worker-7gk2p ACTIVE m1.large regionOne nova 34h
perflab-x7szb-worker-v6xwp ACTIVE m1.large regionOne nova 34h
Check the current list of OpenShift machinesets:
(undercloud) [stack@perflab-director ~]$ oc get machinesets -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
perflab-x7szb-worker 3 3 3 3 34h
Copy an existing worker machine set definition to use as the base for a GPU-enabled worker machine set definition:
(undercloud) [stack@perflab-director ~]$ oc get machineset perflab-x7szb-worker -n openshift-machine-api -o json > perflab-x7szb-worker.json
(undercloud) [stack@perflab-director ~]$ cp perflab-x7szb-worker.json perflab-x7szb-worker-gpu.json
Edit the GPU machine set definition: rename it to perflab-x7szb-worker-gpu, reduce the replicas from 3 to 1, and replace the m1.large flavor with m1-gpu.large (the flavor that will carry the NVIDIA V100 PCI passthrough alias):
(overcloud) [stack@perflab-director machinesets]$ diff perflab-x7szb-worker.json perflab-x7szb-worker-gpu.json
5d4
< "creationTimestamp": "2019-10-25T23:34:28Z",
12c11
< "name": "perflab-x7szb-worker",
---
> "name": "perflab-x7szb-worker-gpu",
15,16c14
< "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/perflab-x7szb-worker",
< "uid": "fcdfce7e-f77f-11e9-9d32-fa163e3cd288"
---
> "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/perflab-x7szb-worker-gpu"
19c17
< "replicas": 3,
---
> "replicas": 1,
23c21
< "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker"
---
> "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker-gpu"
33c31
< "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker"
---
> "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker-gpu"
48c46
< "flavor": "m1.large",
---
> "flavor": "m1-gpu.large",
90,91c88,89
< "availableReplicas": 3,
< "fullyLabeledReplicas": 3,
---
> "availableReplicas": 1,
> "fullyLabeledReplicas": 1,
93,94c91,92
< "readyReplicas": 3,
< "replicas": 3
---
> "readyReplicas": 1,
> "replicas": 1
Create a new GPU flavor:
(overcloud) [stack@perflab-director ~]$ openstack flavor create --ram 32768 --disk 200 --vcpus 4 m1-gpu.large
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 200 |
| id | 5c6843b5-89ae-4fe8-92c5-fac5a707c241 |
| name | m1-gpu.large |
| os-flavor-access:is_public | True |
| properties | |
| ram | 32768 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------+
Set the PCI passthrough alias on the OpenStack flavor:
(overcloud) [stack@perflab-director ~]$ openstack flavor set m1-gpu.large --property "pci_passthrough:alias"="v100:1"
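The pci_passthrough:alias property only resolves if Nova already exposes the device: the compute nodes must whitelist it, the same alias must be defined for the nova-api/scheduler services, and the PciPassthroughFilter must be appended to the scheduler's enabled filters. A minimal nova.conf sketch using the 10de:1db4 device ID listed earlier (for reference only; on this lab the equivalent configuration was already in place on the overcloud):
[pci]
passthrough_whitelist = { "vendor_id": "10de", "product_id": "1db4" }
alias = { "vendor_id": "10de", "product_id": "1db4", "device_type": "type-PCI", "name": "v100" }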
Try to boot a RHEL 7.7 instance with this flavor to validate the GPU passthrough:
(overcloud) [stack@perflab-director templates]$ openstack server create --flavor m1-gpu.large --image rhel77 --security-group web --nic net-id=perflab-x7szb-openshift --key-name lambda instance0
+-------------------------------------+-----------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | J886yg7sz7MP |
| config_drive | |
| created | 2019-10-27T11:10:26Z |
| flavor | m1-gpu.large (5c6843b5-89ae-4fe8-92c5-fac5a707c241) |
| hostId | |
| id | ad86a6cf-6115-4944-88c1-568c1bc58da0 |
| image | rhel77 (ad740f80-83ad-4af3-8fe7-f255276c0453) |
| key_name | lambda |
| name | instance0 |
| progress | 0 |
| project_id | c942a792fd6f447186e5bafd6d4cbce0 |
| properties | |
| security_groups | name='93cb85c9-5821-47e8-ad85-de18706d63f5' |
| status | BUILD |
| updated | 2019-10-27T11:10:26Z |
| user_id | 721b251122304444bfee09c97f441042 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------------+
(overcloud) [stack@perflab-director ~]$ FLOATING_IP_ID=$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )
(overcloud) [stack@perflab-director ~]$ openstack server add floating ip instance0 $FLOATING_IP_ID
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+
| ad86a6cf-6115-4944-88c1-568c1bc58da0 | instance0 | ACTIVE | perflab-x7szb-openshift=10.0.0.12, 192.168.168.41 | rhel77 | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+
Connect to the instance to check if we can find the GPU device:
(overcloud) [stack@perflab-director ~]$ ssh cloud-user@192.168.168.41
[cloud-user@instance0 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[cloud-user@instance0 ~]$ sudo lspci | grep -i nvidia
00:05.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
We are good: the OpenStack PCI passthrough is working, so we can delete this instance:
(overcloud) [stack@perflab-director ~]$ openstack server delete instance0
List the existing OpenStack nodes before adding the new machineset:
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
Import the OpenShift GPU worker machine set:
(overcloud) [stack@perflab-director ~]$ oc create -f perflab-x7szb-worker-gpu.json
machineset.machine.openshift.io/perflab-x7szb-worker-gpu created
List OpenShift machinesets:
(overcloud) [stack@perflab-director ~]$ oc get machinesets -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
perflab-x7szb-worker 3 3 3 3 36h
perflab-x7szb-worker-gpu 1 1 53s
(overcloud) [stack@perflab-director ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
perflab-x7szb-master-0 Ready master 36h v1.14.6+c07e432da
perflab-x7szb-master-1 Ready master 36h v1.14.6+c07e432da
perflab-x7szb-master-2 Ready master 36h v1.14.6+c07e432da
perflab-x7szb-worker-2jqns Ready worker 35h v1.14.6+c07e432da
perflab-x7szb-worker-7gk2p Ready worker 35h v1.14.6+c07e432da
perflab-x7szb-worker-v6xwp Ready worker 36h v1.14.6+c07e432da
(overcloud) [stack@perflab-director ~]$ oc get machines -n openshift-machine-api
NAME STATE TYPE REGION ZONE AGE
perflab-x7szb-master-0 ACTIVE m1.large regionOne nova 36h
perflab-x7szb-master-1 ACTIVE m1.large regionOne nova 36h
perflab-x7szb-master-2 ACTIVE m1.large regionOne nova 36h
perflab-x7szb-worker-2jqns ACTIVE m1.large regionOne nova 36h
perflab-x7szb-worker-7gk2p ACTIVE m1.large regionOne nova 36h
perflab-x7szb-worker-gpu-rrstz ACTIVE m1-gpu.large regionOne nova 53s
perflab-x7szb-worker-v6xwp ACTIVE m1.large regionOne nova 36h
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
Check the status during the deployment:
(overcloud) [stack@perflab-director ~]$ oc -n openshift-machine-api get machinesets | grep gpu
perflab-x7szb-worker-gpu 1 1 1 1 8m
One additional worker is spawned:
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
(overcloud) [stack@perflab-director ~]$ FLOATING_IP_ID=$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )
(overcloud) [stack@perflab-director ~]$ openstack server add floating ip perflab-x7szb-worker-gpu-rrstz $FLOATING_IP_ID
(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30, 192.168.168.41 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0 | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2 | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1 | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+
We can connect to the worker to check its status and find the NVIDIA Tesla V100:
(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.41
The authenticity of host '192.168.168.41 (192.168.168.41)' can't be established.
ECDSA key fingerprint is SHA256:D5SUxj513jGdhKE/Z2or+9s4RKl6milx+/aa5vm1bcM.
ECDSA key fingerprint is MD5:6a:fb:9b:53:fd:79:46:34:31:c8:db:8b:2e:3b:07:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.41' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@perflab-x7szb-worker-gpu-rrstz ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.2
[core@perflab-x7szb-worker-gpu-rrstz ~]$ lspci -nn |grep -i nvidia
00:05.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
The GPU machineset is available:
(overcloud) [stack@perflab-director ~]$ oc -n openshift-machine-api get machines | grep gpu
perflab-x7szb-worker-gpu-rrstz ACTIVE m1-gpu.large regionOne nova 8m47s
(overcloud) [stack@perflab-director ~]$ oc get node perflab-x7szb-worker-gpu-rrstz -o json | jq .metadata.labels
{
  "node.openshift.io/os_id": "rhcos",
  "node-role.kubernetes.io/worker": "",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "m1-gpu.large",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "perflab-x7szb-worker-gpu-rrstz",
  "kubernetes.io/os": "linux"
}
Deploy the Node Feature Discovery Operator
The Node Feature Discovery operator identifies hardware device features in nodes.
You can find all the information about the Node Feature Discovery operator in its Git repository: https://github.com/openshift/cluster-nfd-operator
To install the Node Feature Discovery operator, go to the OpenShift console, “Administrator > Operators > OperatorHub”, and search for NFD:
On the Node Feature Discovery operator detail page, click on “Install”:
Create the Operator Subscription by clicking on “Subscribe”:
Node Feature Discovery is subscribed:
The Node Feature Discovery is now “Created”:
We can follow the installation steps of Node Feature Discovery:
Check the cluster-nfd-operator container image tags:
(overcloud) [stack@perflab-director cluster-nfd-operator]$ skopeo inspect docker://quay.io/zvonkok/cluster-nfd-operator | jq ".Tag , .RepoTags"
"latest"
[
"v0.0.1",
"v4.1",
"p3",
"e2e",
"operand",
"configmap",
"nvidia-label",
"latest"
]
Check the NFD operator pod status:
(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-operators
NAME READY STATUS RESTARTS AGE
nfd-operator-fd55688bd-hrf9c 1/1 Running 0 2m46s
Check the status during the setup:
(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-master-ksslc 0/1 ContainerCreating 0 7s
nfd-master-qbzcb 0/1 ContainerCreating 0 7s
nfd-master-xw622 0/1 ContainerCreating 0 7s
nfd-worker-84fs2 0/1 ContainerCreating 0 8s
nfd-worker-ljdqk 0/1 ContainerCreating 0 8s
nfd-worker-nbxsm 0/1 ContainerCreating 0 8s
nfd-worker-sr7pq 0/1 ContainerCreating 0 8s
(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-master-ksslc 1/1 Running 0 21s
nfd-master-qbzcb 1/1 Running 0 21s
nfd-master-xw622 1/1 Running 0 21s
nfd-worker-84fs2 1/1 Running 0 22s
nfd-worker-ljdqk 1/1 Running 0 22s
nfd-worker-nbxsm 1/1 Running 0 22s
nfd-worker-sr7pq 1/1 Running 0 22s
The Node Feature Discovery Operator is available, and the GPU workers are tagged:
(overcloud) [stack@perflab-director openshift]$ oc describe node perflab-x7szb-worker-gpu-rrstz|grep 10de
feature.node.kubernetes.io/pci-10de.present=true
(overcloud) [stack@perflab-director ~]$ oc describe node perflab-x7szb-worker-gpu-rrstz | egrep 'Roles|pci'
Roles: worker
feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1af4.present=true
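The 10de label corresponds to NVIDIA's PCI vendor ID, so it can also be used to find or target GPU nodes directly. A minimal sketch, not part of the deployment itself:
(overcloud) [stack@perflab-director ~]$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
The same label works as a nodeSelector in a pod spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"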
Deploy the Special Resource Operator
Clone the special-resource-operator repository and switch to the release-4.2 branch:
(overcloud) [stack@perflab-director openshift]$ git clone https://github.com/openshift-psap/special-resource-operator
Cloning into 'special-resource-operator'...
remote: Enumerating objects: 11558, done.
remote: Counting objects: 100% (11558/11558), done.
remote: Compressing objects: 100% (5857/5857), done.
remote: Total 11558 (delta 4434), reused 11515 (delta 4396), pack-reused 0
Receiving objects: 100% (11558/11558), 15.76 MiB | 3.50 MiB/s, done.
Resolving deltas: 100% (4434/4434), done.
(overcloud) [stack@perflab-director openshift]$ cd special-resource-operator/
(overcloud) [stack@perflab-director special-resource-operator]$ git checkout release-4.2
Branch release-4.2 set up to track remote branch release-4.2 from origin.
Switched to a new branch 'release-4.2'
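Optionally, the Special Resource Operator image tags can be inspected with skopeo, as was done for the NFD image (the repository is the one substituted into the deploy manifests by the Makefile):
(overcloud) [stack@perflab-director special-resource-operator]$ skopeo inspect docker://quay.io/openshift-psap/special-resource-operator | jq .RepoTags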
(overcloud) [stack@perflab-director special-resource-operator]$ PULLPOLICY=Always make deploy
customresourcedefinition.apiextensions.k8s.io/specialresources.sro.openshift.io created
sleep 1
for obj in namespace.yaml service_account.yaml role.yaml role_binding.yaml operator.yaml crds/sro_v1alpha1_specialresource_cr.yaml; do \
sed 's+REPLACE_IMAGE+quay.io/openshift-psap/special-resource-operator:release-4.2+g; s+REPLACE_NAMESPACE+openshift-sro+g; s+Always+Always+' deploy/$obj | kubectl apply -f - ; \
done
namespace/openshift-sro created
serviceaccount/special-resource-operator created
role.rbac.authorization.k8s.io/special-resource-operator created
clusterrole.rbac.authorization.k8s.io/special-resource-operator created
rolebinding.rbac.authorization.k8s.io/special-resource-operator created
clusterrolebinding.rbac.authorization.k8s.io/special-resource-operator created
deployment.apps/special-resource-operator created
specialresource.sro.openshift.io/example-specialresource created
specialresource.sro.openshift.io/example-specialresource unchanged
The installation of the Special Resource Operator is completed:
(overcloud) [stack@perflab-director openshift]$ oc get pods -n openshift-sro
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-exporter-8hl6j 2/2 Running 0 10m
nvidia-device-plugin-daemonset-6xptf 1/1 Running 0 10m
nvidia-device-plugin-validation 0/1 Completed 0 10m
nvidia-driver-daemonset-cqp62 1/1 Running 0 12m
nvidia-driver-validation 0/1 Completed 0 12m
nvidia-feature-discovery-ckjsn 1/1 Running 0 10m
nvidia-grafana-67bdb6d6-shp8f 1/1 Running 0 10m
special-resource-operator-7cbb8f5d67-pqj84 1/1 Running 0 13m
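Once the nvidia-device-plugin-daemonset pod is running, the GPU should be advertised as an allocatable extended resource on the worker; nvidia.com/gpu is expected to appear with a value of 1 (output omitted):
(overcloud) [stack@perflab-director openshift]$ oc get node perflab-x7szb-worker-gpu-rrstz -o json | jq .status.allocatable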
We can see the final deployment with the GPU worker in the Horizon dashboard:
We can see the final topology with the GPU worker in the Horizon dashboard:
Check the security groups created in OpenStack:
Check the trunks created in OpenStack:
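The same information is available from the OpenStack CLI (output omitted):
(overcloud) [stack@perflab-director ~]$ openstack security group list
(overcloud) [stack@perflab-director ~]$ openstack network trunk list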
Test nvidia-smi
Create an nvidia-smi pod definition YAML file:
(overcloud) [stack@perflab-director openshift]$ cat << EOF > nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
  - image: nvidia/cuda
    name: nvidia-smi
    command: [ nvidia-smi ]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
EOF
Create the nvidia-smi pod:
(overcloud) [stack@perflab-director openshift]$ oc create -f nvidia-smi.yaml
pod/nvidia-smi created
(overcloud) [stack@perflab-director openshift]$ oc get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi 0/1 ContainerCreating 0 5s
(overcloud) [stack@perflab-director openshift]$ oc get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi 0/1 Completed 0 15s
Success: the NVIDIA drivers are available in the pod:
(overcloud) [stack@perflab-director openshift]$ oc logs nvidia-smi
Sun Oct 27 15:03:29 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:05.0 Off | Off |
| N/A 31C P0 25W / 250W | 0MiB / 16160MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Delete the nvidia-smi pod:
(overcloud) [stack@perflab-director openshift]$ oc delete pod nvidia-smi
pod "nvidia-smi" deleted
TensorFlow benchmarks with GPU
Create the GPU benchmark pod definition YAML file:
(overcloud) [stack@perflab-director pods]$ cat << EOF > tensorflow-benchmarks-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmarks-gpu
spec:
  containers:
  - image: nvcr.io/nvidia/tensorflow:19.09-py3
    name: cudnn
    command: ["/bin/sh","-c"]
    args: ["git clone https://github.com/tensorflow/benchmarks.git;cd benchmarks/scripts/tf_cnn_benchmarks;python3 tf_cnn_benchmarks.py --num_gpus=1 --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  restartPolicy: Never
EOF
Create the GPU benchmark pod:
(overcloud) [stack@perflab-director pods]$ oc create -f tensorflow-benchmarks-gpu.yaml
pod/tensorflow-benchmarks-gpu created
The pod switches to “Completed” status after 30 seconds:
(overcloud) [stack@perflab-director pods]$ oc get pod
NAME READY STATUS RESTARTS AGE
tensorflow-benchmarks-gpu 0/1 Completed 0 30s
Check the GPU benchmark results; the training is fast at 325.03 images/sec:
(overcloud) [stack@perflab-director pods]$ oc logs tensorflow-benchmarks-gpu
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 327.4 +/- 0.0 (jitter = 0.0) 8.108
10 images/sec: 326.5 +/- 0.7 (jitter = 1.0) 8.122
20 images/sec: 327.2 +/- 0.4 (jitter = 0.6) 7.983
30 images/sec: 327.1 +/- 0.6 (jitter = 0.5) 7.780
40 images/sec: 327.5 +/- 0.4 (jitter = 0.5) 7.848
50 images/sec: 327.3 +/- 0.4 (jitter = 0.6) 7.779
60 images/sec: 326.5 +/- 0.4 (jitter = 0.9) 7.826
70 images/sec: 326.7 +/- 0.3 (jitter = 0.7) 7.840
80 images/sec: 326.1 +/- 0.4 (jitter = 0.8) 7.819
90 images/sec: 325.5 +/- 0.4 (jitter = 1.3) 7.646
100 images/sec: 325.3 +/- 0.4 (jitter = 1.7) 7.918
----------------------------------------------------------------
total images/sec: 325.03
----------------------------------------------------------------
TensorFlow benchmarks with CPU
For comparison, create a CPU pod definition YAML file:
(overcloud) [stack@perflab-director pods]$ cat << EOF > tensorflow-benchmarks-cpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmarks-cpu
spec:
  containers:
  - image: nvcr.io/nvidia/tensorflow:19.09-py3
    name: cudnn
    command: ["/bin/sh","-c"]
    args: ["git clone https://github.com/tensorflow/benchmarks.git;cd benchmarks/scripts/tf_cnn_benchmarks;python3 tf_cnn_benchmarks.py --device=cpu --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
  restartPolicy: Never
EOF
Create the CPU benchmark pod:
(overcloud) [stack@perflab-director pods]$ oc create -f tensorflow-benchmarks-cpu.yaml
pod/tensorflow-benchmarks-cpu created
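You can keep a watch on the pod status from another terminal while the benchmark runs:
(overcloud) [stack@perflab-director pods]$ oc get pod tensorflow-benchmarks-cpu -w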
Because it takes a lot longer with CPU only, let's have a look inside the container:
(overcloud) [stack@perflab-director pods]$ oc rsh tensorflow-benchmarks-cpu
# top
top - 22:18:38 up 10:35, 0 users, load average: 6.07, 5.90, 5.10
Tasks: 5 total, 1 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu0 : 85.9 us, 2.7 sy, 0.0 ni, 10.4 id, 0.0 wa, 1.0 hi, 0.0 si, 0.0 st
%Cpu1 : 86.7 us, 2.3 sy, 0.0 ni, 8.7 id, 0.0 wa, 1.0 hi, 1.3 si, 0.0 st
%Cpu2 : 87.9 us, 2.7 sy, 0.0 ni, 8.7 id, 0.0 wa, 0.7 hi, 0.0 si, 0.0 st
%Cpu3 : 85.3 us, 3.0 sy, 0.0 ni, 10.3 id, 0.0 wa, 1.0 hi, 0.3 si, 0.0 st
KiB Mem : 32936388 total, 5645924 free, 5718028 used, 21572436 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 26818208 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30 root 20 0 27.200g 4.241g 512752 S 342.7 13.5 89:00.66 tf_cnn_benchmar
1 root 20 0 20124 3588 3268 S 0.0 0.0 0:00.02 bash
40 root 20 0 32100 10952 5444 S 0.0 0.0 0:00.03 python3
386 root 20 0 23516 7076 3392 S 0.0 0.0 0:00.03 bash
693 root 20 0 40460 3460 2976 R 0.0 0.0 0:00.00 top
We can also follow the CPU load in the console:
The pod switches to “Completed” status after 28 minutes:
[stack@perflab-director ~]$ oc get pod ; oc logs tensorflow-benchmarks-cpu|tail -20
NAME READY STATUS RESTARTS AGE
tensorflow-benchmarks-cpu 0/1 Completed 0 28m
Check the CPU benchmark results; the training is slow at 2.24 images/sec:
[stack@perflab-director ~]$ oc logs tensorflow-benchmarks-cpu
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/cpu:0']
NUMA bind: False
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 8.108
10 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 8.122
20 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.983
30 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.780
40 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.848
50 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.779
60 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.825
70 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.839
80 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.818
90 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.648
100 images/sec: 2.2 +/- 0.0 (jitter = 0.0) 7.915
----------------------------------------------------------------
total images/sec: 2.24
----------------------------------------------------------------
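The speedup quoted in the conclusion below is simply the ratio of the two measured throughputs:
(overcloud) [stack@perflab-director pods]$ awk 'BEGIN { print 325.03 / 2.24 }'
145.103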
With this setup, a single GPU-backed pod delivers roughly 145 times the ResNet-50 training throughput of the CPU-only pod (325.03 vs. 2.24 images/sec) with Red Hat OpenShift, Red Hat OpenStack Platform and an NVIDIA GPU.
Grafana dashboards
We can connect to the Grafana dashboard:
https://grafana-openshift-monitoring.apps.perflab.lan.redhat.com
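If the URL is not known in advance, the route can be looked up from the cluster monitoring stack (assuming the default route name grafana in the openshift-monitoring namespace):
(overcloud) [stack@perflab-director ~]$ oc get route grafana -n openshift-monitoring -o jsonpath='{.spec.host}'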
SRO data:
NFD data:
Prometheus data:
Etcd data: