Erwan Gallen

Oct 27, 2019 34 min read

OpenShift 4.2 on Red Hat OpenStack Platform 13 + GPU

Red Hat OpenShift Container Platform 4.2 introduces the general availability of full-stack automated deployments on OpenStack. With OpenShift 4.2, containers can be managed across multiple public and private clouds, including OpenStack. Red Hat and NVIDIA are working to provide the best platform for Artificial Intelligence and Machine Learning workloads.

Note: This blog post shows how to deploy GPU-enabled nodes running Red Hat Enterprise Linux CoreOS. With Red Hat OpenShift Container Platform 4.2, GPUs are supported on Red Hat Enterprise Linux 7 nodes only, so the process described here is not supported. Please see the OpenShift 4.2 release notes for details.

The OpenShift 4.2 installer can fully automate the installation on OpenStack:

  • Network configuration (networks, subnets, trunks, load balancers)
  • VM creation
  • Storage configuration
  • OpenShift setup
  • Routing

Summary:

We will use the openshift-install binary to deploy the OpenShift cluster.

The openshift-install binary consumes the OpenStack API directly.

At the end of the installation, we will have one OpenShift cluster running on seven OpenStack Virtual Machines:

  • 3 x OpenShift master VMs
  • 3 x OpenShift worker VMs for CPU workloads
  • 1 x OpenShift worker VM for GPU workloads

You can run the same process on other IaaS platforms such as AWS or Azure.

The OpenStack Virtual Machine used as a worker for GPU workloads uses PCI passthrough to an NVIDIA Tesla V100 GPU board.

The OpenShift 4.2 cluster will use two Kubernetes operators to set up the GPU configuration:

  • Node Feature Discovery for Kubernetes (NFD) to label the GPU nodes
  • Special Resource Operator for Kubernetes (SRO) to enable the NVIDIA driver stack on the GPU worker node

OpenStack lab environment

We are using an already deployed Red Hat OpenStack Platform 13 z8 (13.0.8):

[stack@perflab-director ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.8 (Queens)

Each compute node has two NVIDIA Tesla V100 boards with 16 GB of GPU memory.

List the PCI device IDs on one OpenStack compute node (two V100 boards installed):

[heat-admin@overcloud-compute-0 ~]$ lspci -nn | grep -i nvidia
3b:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
d8:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
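
The vendor and product IDs shown in brackets (10de:1db4 here) are the values a PCI passthrough configuration is usually keyed on. As a minimal sketch, the helper below pulls them out of lspci -nn output. The function name nvidia_pci_ids and the embedded sample are hypothetical and only serve the illustration.

```shell
# Hypothetical helper: print the unique vendor:product IDs of NVIDIA devices
# found in `lspci -nn` output read from standard input.
nvidia_pci_ids() {
  grep -i nvidia | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | sort -u
}

# Sample taken from the lspci output above.
sample='3b:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)
d8:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)'

printf '%s\n' "$sample" | nvidia_pci_ids   # prints: 10de:1db4
```

On a compute node the same function could be fed directly with "lspci -nn | nvidia_pci_ids".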

Create a flavor for the master and worker nodes:

[stack@perflab-director ~]$ source ~/overcloudrc 

(overcloud) [stack@perflab-director ~]$ openstack flavor create --ram 1024 --disk 200 --vcpus 2 m1.xlarge

Add swiftoperator role to admin:

(overcloud) [stack@perflab-director ~]$ openstack role add --user admin --project admin swiftoperator

Set a temporary URL key on the Swift account (the installer stores the bootstrap Ignition file in Swift):

(overcloud) [stack@perflab-director ~]$ openstack object store account set --property Temp-URL-Key=superkey

Before the deployment, only the external and the load balancer networks are available.

Prepare the OpenShift installer

The deployment process runs in multiple steps: the bootstrap node and the three masters are started first, then the workers, and finally the bootstrap node is deleted.

To get the OpenShift installer and resources, log in with your Red Hat account at the URL below, then click on “Red Hat OpenShift Container Platform”:
https://access.redhat.com/downloads

You will have to download two binaries and one QCOW image:

  • “OpenShift v4.2 Linux Client”
  • “OpenShift v4.2 Linux Installer”
  • “Red Hat Enterprise Linux CoreOS - OpenStack Image (QCOW)”

The downloaded Red Hat Enterprise Linux CoreOS qcow2 file is gzip-compressed, so I had to add the .gz extension and uncompress it:

[stack@perflab-director x86_64]$ curl --compressed -J -L -o rhcos-4.2.0-x86_64-openstack.qcow2.gz "https://access.cdn.redhat.com/content/origin/files/sha256/XX/XXXXXXXXXXXXXX/rhcos-4.2.0-x86_64-openstack.qcow2?user=XXXXXXXXX&_auth_=XXXXXXXX_XXXXX" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  667M  100  667M    0     0  14.1M      0  0:00:47  0:00:47 --:--:-- 15.9M

[stack@perflab-director x86_64]$ du -h rhcos-4.2.0-x86_64-openstack.qcow2.gz
668M  rhcos-4.2.0-x86_64-openstack.qcow2.gz

[stack@perflab-director x86_64]$ gzip -t -v rhcos-4.2.0-x86_64-openstack.qcow2.gz 
rhcos-4.2.0-x86_64-openstack.qcow2.gz: OK

[stack@perflab-director x86_64]$ gunzip rhcos-4.2.0-x86_64-openstack.qcow2.gz

[stack@perflab-director x86_64]$ du -h rhcos-4.2.0-x86_64-openstack.qcow2 
1.8G  rhcos-4.2.0-x86_64-openstack.qcow2
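
Since the file name alone does not tell whether the download is still compressed, checking the first bytes can help before uploading to Glance. The sketch below is an assumption-laden illustration: image_format is a hypothetical helper that only knows the gzip and qcow2 magic numbers.

```shell
# Hypothetical helper: report the format of a file from its first four bytes.
image_format() {
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    1f8b*)    echo gzip ;;     # gzip streams start with 1f 8b
    514649fb) echo qcow2 ;;    # qcow2 images start with "QFI" followed by 0xfb
    *)        echo unknown ;;
  esac
}

# Demonstration on a small scratch file instead of the real image.
tmp=$(mktemp)
printf 'demo' | gzip -c > "$tmp"
image_format "$tmp"   # prints: gzip
rm -f "$tmp"
```

Run against the uncompressed rhcos-4.2.0-x86_64-openstack.qcow2, the function would be expected to report qcow2.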

Upload Red Hat Enterprise Linux CoreOS image into OpenStack Glance:

(overcloud) [stack@perflab-director x86_64]$ openstack image create --container-format=bare --disk-format=qcow2 --file /var/images/x86_64/rhcos-4.2.0-x86_64-openstack.qcow2 rhcos
+------------------+------------------------------------------------------------------------------+
| Field            | Value                                                                        |
+------------------+------------------------------------------------------------------------------+
| checksum         | 592f9d70784d1ce8ee97cdb96cdf53c7                                             |
| container_format | bare                                                                         |
| created_at       | 2019-10-24T13:59:26Z                                                         |
| disk_format      | qcow2                                                                        |
| file             | /v2/images/e93658ff-bbcc-4af2-9d13-f39afbedb7dc/file                         |
| id               | e93658ff-bbcc-4af2-9d13-f39afbedb7dc                                         |
| min_disk         | 0                                                                            |
| min_ram          | 0                                                                            |
| name             | rhcos                                                                        |
| owner            | d88919769d1943b997338a89bdd991da                                             |
| properties       | direct_url='swift+config://ref1/glance/e93658ff-bbcc-4af2-9d13-f39afbedb7dc' |
| protected        | False                                                                        |
| schema           | /v2/schemas/image                                                            |
| size             | 1911160832                                                                   |
| status           | active                                                                       |
| tags             |                                                                              |
| updated_at       | 2019-10-24T13:59:40Z                                                         |
| virtual_size     | None                                                                         |
| visibility       | shared                                                                       |
+------------------+------------------------------------------------------------------------------+

Verify the name and ID of the OpenStack ‘External’ network:

(overcloud) [stack@perflab-director openshift]$ openstack network list --long -c ID -c Name -c "Router Type"
+--------------------------------------+-----------+-------------+
| ID                                   | Name      | Router Type |
+--------------------------------------+-----------+-------------+
| a6e5e7e6-0ff7-4610-9940-a89c0aa11efc | external  | External    |
+--------------------------------------+-----------+-------------+

Disable OpenStack quotas (not mandatory, but simpler for this lab):

(overcloud) [stack@perflab-director openshift]$ openstack quota set --secgroups -1 --secgroup-rules -1 --cores -1 --ram -1 --gigabytes -1 admin

(overcloud) [stack@perflab-director openshift]$ openstack quota show admin
+----------------------+----------------------------------+
| Field                | Value                            |
+----------------------+----------------------------------+
| backup-gigabytes     | 1000                             |
| backups              | 10                               |
| cores                | -1                               |
| fixed-ips            | -1                               |
| floating-ips         | 50                               |
| gigabytes            | -1                               |
| gigabytes_tripleo    | -1                               |
| groups               | 10                               |
| health_monitors      | None                             |
| injected-file-size   | 10240                            |
| injected-files       | 5                                |
| injected-path-size   | 255                              |
| instances            | 10                               |
| key-pairs            | 100                              |
| l7_policies          | None                             |
| listeners            | None                             |
| load_balancers       | None                             |
| location             | None                             |
| name                 | None                             |
| networks             | 100                              |
| per-volume-gigabytes | -1                               |
| pools                | None                             |
| ports                | 500                              |
| project              | d88919769d1943b997338a89bdd991da |
| project_name         | admin                            |
| properties           | 128                              |
| ram                  | -1                               |
| rbac_policies        | 10                               |
| routers              | 10                               |
| secgroup-rules       | -1                               |
| secgroups            | -1                               |
| server-group-members | 10                               |
| server-groups        | 10                               |
| snapshots            | 10                               |
| snapshots_tripleo    | -1                               |
| subnet_pools         | -1                               |
| subnets              | 100                              |
| volumes              | 10                               |
| volumes_tripleo      | -1                               |
+----------------------+----------------------------------+

Create an OpenStack flavor with 32GB of RAM and 4 vCPUs:

(overcloud) [stack@perflab-director openshift]$  openstack flavor create --ram 32768 --disk 200 --vcpus 4 m1.large
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 200                                  |
| id                         | 2a90dead-ea97-434e-9bc8-8560cc0b88e4 |
| name                       | m1.large                             |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 32768                                |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 4                                    |
+----------------------------+--------------------------------------+
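
To get a feeling for what this flavor costs at the peak of the installation (three masters, three workers and the temporary bootstrap node, all using m1.large), the totals can be worked out with shell arithmetic. This is only a rough sizing sketch; flavor_totals is a hypothetical helper.

```shell
# Hypothetical helper: total resources for N instances of one flavor.
# Arguments: instance count, vCPUs, RAM in MB, disk in GB.
flavor_totals() {
  echo "vcpus=$(( $1 * $2 )) ram_gb=$(( $1 * $3 / 1024 )) disk_gb=$(( $1 * $4 ))"
}

# 7 instances of m1.large (4 vCPUs, 32768 MB RAM, 200 GB disk).
flavor_totals 7 4 32768 200   # prints: vcpus=28 ram_gb=224 disk_gb=1400
```

These numbers can be compared with the output of "openstack quota show admin" when quotas are left enabled.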

Prepare the OpenShift clouds.yaml configuration. First, retrieve your overcloud password:

[stack@perflab-director ~]$ cat ~/overcloudrc | grep OS_PASSWORD
export OS_PASSWORD=XXXXXXXXX

Download your clouds.yaml file in OpenStack Horizon, “Project” > “API Access” > “OpenStack clouds.yaml File”.

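Prepare the clouds.yaml configuration and add the password. The cloud entry can optionally be renamed (for example from “openstack” to “shiftstack”); this lab keeps the name “openstack”, which is the cloud selected in the installer below: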

[stack@perflab-director openshift]$ mkdir -p ~/.config/openstack/

[stack@perflab-director ~]$ cat << EOF > ~/.config/openstack/clouds.yaml
# This is a clouds.yaml file, which can be used by OpenStack tools as a source
# of configuration on how to connect to a cloud. If this is your only cloud,
# just put this file in ~/.config/openstack/clouds.yaml and tools like
# python-openstackclient will just work with no further config. (You will need
# to add your password to the auth section)
# If you have more than one cloud account, add the cloud entry to the clouds
# section of your existing file and you can refer to them by name with
# OS_CLOUD=openstack or --os-cloud=openstack
clouds:
  openstack:
    auth:
      auth_url: http://192.168.168.54:5000/v3
      username: "admin"
      password: XXXXXXXXXXXXXX
      project_id: XXXXXXXXX
      project_name: "admin"
      user_domain_name: "Default"
    region_name: "regionOne"
    interface: "public"
    identity_api_version: 3
EOF
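
The installer later asks for a cloud name, and that name has to match an entry under "clouds:" in this file. As a quick sanity check, the sketch below lists those entries. cloud_names is a hypothetical helper; it assumes the two-space indentation used above and is not a real YAML parser.

```shell
# Hypothetical helper: list the entries defined directly under the
# top-level "clouds:" key of a clouds.yaml file (two-space indentation assumed).
cloud_names() {
  awk '
    /^clouds:/ { in_clouds = 1; next }
    /^[^ #]/   { in_clouds = 0 }
    in_clouds && /^  [^ #][^:]*:/ { sub(/^  /, ""); sub(/:.*/, ""); print }
  ' "$1"
}

# Demonstration on a trimmed copy of the file above.
sample_file=$(mktemp)
cat > "$sample_file" << 'EOF'
clouds:
  openstack:
    auth:
      auth_url: http://192.168.168.54:5000/v3
    region_name: "regionOne"
EOF
cloud_names "$sample_file"   # prints: openstack
rm -f "$sample_file"
```

Against the real file, the call would be "cloud_names ~/.config/openstack/clouds.yaml".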

Create an OpenShift account and get your pull secret by clicking on “Copy Pull Secret” at the URL below. You will have to paste it when running the command “openshift-install create install-config”:
https://cloud.redhat.com/openshift/install/openstack/installer-provisioned

Prepare the OpenShift install-config.yaml file

(overcloud) [stack@perflab-director openshift]$ ./openshift-install create install-config --dir='/home/stack/openshift'                                                                                        
? SSH Public Key /home/stack/.ssh/id_rsa.pub
? Platform openstack
? Cloud openstack
? ExternalNetwork external
? APIFloatingIPAddress 192.168.168.20
? FlavorName m1.large
? Base Domain lan.redhat.com
? Cluster Name perflab
? Pull Secret [? for help] *************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************

Set up the /etc/hosts file with the Floating IP:

echo -e "192.168.168.20 api.perflab.lan.redhat.com" | sudo tee -a /etc/hosts
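
Appending with tee -a adds a duplicate line every time the command is repeated. A minimal sketch of an idempotent variant follows; add_host_entry is a hypothetical helper, shown here against a scratch file rather than the real /etc/hosts (which would require root privileges).

```shell
# Hypothetical helper: append "ip hostname" to a hosts file only if that
# exact pair is not already present.
add_host_entry() {
  file=$1 ip=$2 name=$3
  if ! awk -v ip="$ip" -v name="$name" '
         $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
         END { exit !found }' "$file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$file"
  fi
}

# Demonstration on a scratch file.
hosts=$(mktemp)
add_host_entry "$hosts" 192.168.168.20 api.perflab.lan.redhat.com
add_host_entry "$hosts" 192.168.168.20 api.perflab.lan.redhat.com   # second call is a no-op
cat "$hosts"   # prints a single line: 192.168.168.20 api.perflab.lan.redhat.com
rm -f "$hosts"
```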

Check the generated install-config.yaml:

(overcloud) [stack@perflab-director openshift]$ cat install-config.yaml 
apiVersion: v1
baseDomain: lan.redhat.com
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: perflab
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  openstack:
    cloud: openstack
    computeFlavor: m1.large
    externalNetwork: external
    lbFloatingIP: 192.168.168.20
    octaviaSupport: "0"
    region: ""
    trunkSupport: "1"
pullSecret: '{"auths":{"cloud.openshift.com":{"auth":"XXXXXXXXXXX==","email":"mymail@redhat.com"},"quay.io":{"auth":"XXXXXXXXXXXX==","email":"mymail@redhat.com"},"registry.connect.redhat.com":{"auth":"NzMxNDXXXXXXXXX","email":"mymail@redhat.com"}}}'
sshKey: |
  ssh-rsa XXXXXXXXX
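
The clusterNetwork values determine how many nodes and pods the cluster can address: each node receives one subnet of size hostPrefix carved out of the cluster CIDR. The arithmetic for the values above can be sketched as follows (the variable names are only illustrative).

```shell
# Values copied from the clusterNetwork section above.
cluster_prefix=14   # cidr: 10.128.0.0/14
host_prefix=23      # hostPrefix: 23

# Number of /23 subnets that fit in the /14, i.e. the maximum number of nodes.
node_subnets=$(( 1 << (host_prefix - cluster_prefix) ))
# Addresses in one /23 minus the network and broadcast addresses.
pod_ips_per_node=$(( (1 << (32 - host_prefix)) - 2 ))

echo "node subnets: $node_subnets, pod IPs per node: $pod_ips_per_node"
# prints: node subnets: 512, pod IPs per node: 510
```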

Deployment of OpenShift, first step with the bootstrap node and three masters

Launch the OpenShift 4.2 deployment:

[stack@perflab-director openshift]$ /home/stack/openshift/openshift-install create cluster --dir='/home/stack/openshift'
INFO Consuming Install Config from target directory       
INFO Creating infrastructure resources...      
INFO Waiting up to 30m0s for the Kubernetes API at https://api.perflab.lan.redhat.com:6443...
...

First step, the OpenShift bootstrap node and three masters are started:

(overcloud) [stack@perflab-director archive]$ openstack server list
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+
| ID                                   | Name                    | Status | Networks                                          | Image | Flavor   |
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0  | ACTIVE | perflab-x7szb-openshift=10.0.0.13                 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2  | ACTIVE | perflab-x7szb-openshift=10.0.0.29                 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1  | ACTIVE | perflab-x7szb-openshift=10.0.0.22                 | rhcos | m1.large |
| 48adccff-1d50-4e5a-9937-02d6801dd6d4 | perflab-x7szb-bootstrap | ACTIVE | perflab-x7szb-openshift=10.0.0.17, 192.168.168.42 | rhcos | m1.large |
+--------------------------------------+-------------------------+--------+---------------------------------------------------+-------+----------+

We can follow the installation on one OpenShift master node:

(overcloud) [stack@perflab-director ~]$ openstack console log show perflab-x7szb-master-0
...
[  OK  ] Started Generate /run/issue.d/console-login-helper-messages.issue.
         Starting Permit User Sessions...
[  OK  ] Started Permit User Sessions.
[  OK  ] Started Getty on tty1.
[  OK  ] Started Serial Getty on ttyS0.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Authorization Manager.
[  OK  ] Started RPC Bind.
[  OK  ] Started NFS status monitor for NFSv2/3 locking..
[  OK  ] Started RPM-OSTree System Management Daemon.
[  OK  ] Started Log RPM-OSTree Booted Deployment Status To Journal.
[  OK  ] Started CRI-O Auto Update Script.
         Starting Open Container Initiative Daemon...
[  OK  ] Started Afterburn Hostname.
[  OK  ] Started Open Container Initiative Daemon.
         Starting Kubernetes Kubelet...
[  OK  ] Started Kubernetes systemd probe.
[  OK  ] Started Kubernetes Kubelet.
[  OK  ] Reached target Multi-User System.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.

Red Hat Enterprise Linux CoreOS 42.80.20191010.0 (Ootpa) 4.2
SSH host key: SHA256:+eXqSFWX/4ki72aOTPLY/14V+FAeTGrQvEV6vzyTmPc (ECDSA)
SSH host key: SHA256:r/dd/ST8waK64EtjyStaXjbThCpu27a9BWwrVHWsL5E (ED25519)
SSH host key: SHA256:csBr96eGedSzymkINjxx2E4lu2zLlWzxhvd1VExhXiU (RSA)
ens3: 10.0.0.13 fe80::cdb9:517d:1c4c:dd77
perflab-x7szb-master-0 login: 
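
The raw console log colors the systemd status markers with ANSI escape sequences, which makes it harder to read or grep once saved to a file. A small filter such as the following sketch removes the color codes; strip_ansi is a hypothetical name, and the pattern only covers color sequences ending in "m".

```shell
# Hypothetical helper: remove ANSI color sequences (ESC [ ... m) from standard input.
strip_ansi() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g"
}

# Demonstration on one colored status line.
printf '[\033[0;32m  OK  \033[0m] Started RPC Bind.\n' | strip_ansi
# prints: [  OK  ] Started RPC Bind.
```

It could be used as "openstack console log show perflab-x7szb-master-0 | strip_ansi".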

If needed, we can check the bootkube.service logs on the OpenShift bootstrap node with the core user:

(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.42
The authenticity of host '192.168.168.42 (192.168.168.42)' can't be established.
ECDSA key fingerprint is SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw.
ECDSA key fingerprint is MD5:7d:2b:ec:9f:56:69:07:1c:55:b0:c7:77:bb:d3:75:d5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.42' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service

[core@perflab-x7szb-bootstrap ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Fri 2019-10-25 23:25:09 UTC. --
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-apiserver/kube-apiserver        Ready
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        RunningNotReady
Oct 25 23:37:36 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-controller-manager/kube-controller-manager        Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-apiserver/kube-apiserver        Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-kube-controller-manager/kube-controller-manager        Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: All self-hosted control plane components successfully started
Oct 25 23:37:51 perflab-x7szb-bootstrap bootkube.sh[1667]: Sending bootstrap-success event.Waiting for remaining assets to be created.

We can check the OpenStack Swift containers:

(overcloud) [stack@perflab-director ~]$ swift list
perflab-x7szb

(overcloud) [stack@perflab-director ~]$ swift list perflab-x7szb
bootstrap.ign

(overcloud) [stack@perflab-director ~]$ swift download perflab-x7szb bootstrap.ign
bootstrap.ign [auth 0.803s, headers 1.028s, total 1.030s, 1.328 MB/s]

At this step we can see the OpenShift bootstrap node and the three OpenShift masters in the OpenStack dashboard.

Deployment of OpenShift, second step with three additional OpenShift worker nodes

Second step, the workers are started:

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
| ID                                   | Name                       | Status | Networks                                          | Image | Flavor   |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16                 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37                 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20                 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0     | ACTIVE | perflab-x7szb-openshift=10.0.0.13                 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2     | ACTIVE | perflab-x7szb-openshift=10.0.0.29                 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1     | ACTIVE | perflab-x7szb-openshift=10.0.0.22                 | rhcos | m1.large |
| 48adccff-1d50-4e5a-9937-02d6801dd6d4 | perflab-x7szb-bootstrap    | ACTIVE | perflab-x7szb-openshift=10.0.0.17, 192.168.168.42 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+-------+----------+
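
The server names make it easy to see which phase the installation has reached. As a sketch, the hypothetical helper count_roles below counts masters, workers and bootstrap nodes from a list of server names, relying on the naming pattern seen in this lab.

```shell
# Hypothetical helper: count servers per role from a list of names on standard input.
count_roles() {
  awk '
    /-master-/   { m++ }
    /-worker-/   { w++ }
    /-bootstrap/ { b++ }
    END { printf "masters=%d workers=%d bootstrap=%d\n", m, w, b }'
}

# Names taken from the server list above.
printf '%s\n' \
  perflab-x7szb-worker-2jqns perflab-x7szb-worker-7gk2p perflab-x7szb-worker-v6xwp \
  perflab-x7szb-master-0 perflab-x7szb-master-2 perflab-x7szb-master-1 \
  perflab-x7szb-bootstrap | count_roles
# prints: masters=3 workers=3 bootstrap=1
```

With the OpenStack client, the names can be produced with "openstack server list -f value -c Name".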

We can check the OpenShift bootstrap console logs:

[stack@perflab-director ~]$ openstack console log show perflab-x7szb-bootstrap
Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa) 4.2
SSH host key: SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw (ECDSA)
SSH host key: SHA256:i22uUacaVf/BTyrOZvCGNAu1ycr5C1OkoPiBYgT36s8 (ED25519)
SSH host key: SHA256:EDOvuU8/Ehs65PdKpDAVw9KB55aoCLCrFJJs9gT6uGw (RSA)
ens3: 10.0.0.17 fe80::2974:17f1:db12:3ee6
perflab-x7szb-bootstrap login: 
Red Hat Enterprise Linux CoreOS 42.80.20191002.0 (Ootpa) 4.2
SSH host key: SHA256:kUeiWFayprqQWNkmeUl0aOloCefpiyhBWMUJfwX1Hmw (ECDSA)
SSH host key: SHA256:i22uUacaVf/BTyrOZvCGNAu1ycr5C1OkoPiBYgT36s8 (ED25519)
SSH host key: SHA256:EDOvuU8/Ehs65PdKpDAVw9KB55aoCLCrFJJs9gT6uGw (RSA)
ens3: 10.0.0.17 fe80::2974:17f1:db12:3ee6
perflab-x7szb-bootstrap login:

The bootkube.service completes on the bootstrap node:

[core@perflab-x7szb-bootstrap ~]$ journalctl -b -f -u bootkube.service
Oct 25 23:38:48 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-initial-kube-controller-manager-service-account-private-key.yaml" secrets.v1./initial-service-account-private-key -n openshift-config as it already exists
Oct 25 23:38:48 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-kube-apiserver-to-kubelet-signer.yaml" secrets.v1./kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:49 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-loadbalancer-serving-signer.yaml" secrets.v1./loadbalancer-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:49 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: Tearing down temporary bootstrap control plane...
Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: bootkube.service complete
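
The final "bootkube.service complete" line is the marker to look for when scripting around the bootstrap phase. A minimal sketch, with bootkube_done as a hypothetical helper reading journal lines from standard input:

```shell
# Hypothetical helper: succeed if the journal lines on standard input
# contain the completion marker printed by bootkube.sh.
bootkube_done() {
  grep -qF 'bootkube.service complete'
}

# Demonstration with the last line of the journal above.
printf '%s\n' 'Oct 25 23:38:50 perflab-x7szb-bootstrap bootkube.sh[1667]: bootkube.service complete' \
  | bootkube_done && echo "bootstrap finished"
# prints: bootstrap finished
```

On the bootstrap node this would be fed with "journalctl -b -u bootkube.service" (without -f, which never terminates). The installer also provides "openshift-install wait-for bootstrap-complete" for the same purpose.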

Deployment of OpenShift, last step bootstrap node is deleted

Third step, the OpenShift bootstrap node is deleted:

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| ID                                   | Name                       | Status | Networks                          | Image | Flavor   |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0     | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2     | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1     | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+

Now, using the API floating IP, we can connect to the master VIP:

(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.20
The authenticity of host '192.168.168.20 (192.168.168.20)' can't be established.
ECDSA key fingerprint is SHA256:+eXqSFWX/4ki72aOTPLY/14V+FAeTGrQvEV6vzyTmPc.
ECDSA key fingerprint is MD5:8a:fc:f1:32:01:0f:2e:a5:ff:ce:8b:02:40:4d:e8:30.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.

---
[core@perflab-x7szb-master-0 ~]$ 

The same IP can reach any of the masters:

(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.20
The authenticity of host '192.168.168.20 (192.168.168.20)' can't be established.
ECDSA key fingerprint is SHA256:lVNLKbg2g5qvUvZxKZBjPhDXzQyC1XZAYU3uIFmDrW4.
ECDSA key fingerprint is MD5:52:89:b8:41:8e:54:05:9f:b6:3a:7d:53:c8:97:9c:c0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.

---
[core@perflab-x7szb-master-2 ~]$ 

At the end, the “openshift-install create cluster” command prints the access information and the kubeadmin password:

INFO Consuming Install Config from target directory       
INFO Creating infrastructure resources...      
INFO Waiting up to 30m0s for the Kubernetes API at https://api.perflab.lan.redhat.com:6443...
INFO API v1.16.0-beta.2+a90a577 up                                  
INFO Waiting up to 30m0s for bootstrapping to complete...                                                                                                                 
INFO Destroying the bootstrap resources...                                     
INFO Waiting up to 30m0s for the cluster at https://api.perflab.lan.redhat.com:6443 to initialize...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!                                   
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/openshift/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.perflab.lan.redhat.com
INFO Login to the console with user: kubeadmin, password: XXXXX-XXXXX-XXXXX-XXXXX

Check the OpenShift deployment

Load OpenShift environment variables:

(undercloud) [stack@perflab-director ~]$ export KUBECONFIG=/home/stack/openshift/auth/kubeconfig

The OpenShift API is now listening on port 6443:

(overcloud) [stack@perflab-director ~]$ curl --insecure https://api.perflab.lan.redhat.com:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}

Check the OpenShift version:

(overcloud) [stack@perflab-director openshift]$  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0     True        False         33h     Cluster version is 4.2.0

List OpenShift nodes:

(undercloud) [stack@perflab-director ~]$ oc get nodes
NAME                         STATUS   ROLES    AGE   VERSION
perflab-x7szb-master-0       Ready    master   30m   v1.14.6+c07e432da
perflab-x7szb-master-1       Ready    master   10m   v1.14.6+c07e432da
perflab-x7szb-master-2       Ready    master   10m   v1.14.6+c07e432da
perflab-x7szb-worker-2jqns   Ready    worker   30m   v1.14.6+c07e432da
perflab-x7szb-worker-7gk2p   Ready    worker   30m   v1.14.6+c07e432da
perflab-x7szb-worker-v6xwp   Ready    worker   30m   v1.14.6+c07e432da
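
When the node list grows, it is convenient to print only the nodes that still need attention. The sketch below filters "oc get nodes" output; not_ready_nodes is a hypothetical helper that treats any status other than exactly "Ready" (including "Ready,SchedulingDisabled") as not ready.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is not "Ready".
# The first input line is expected to be the header.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Demonstration with two lines from the output above.
sample='NAME                         STATUS   ROLES    AGE   VERSION
perflab-x7szb-master-0       Ready    master   30m   v1.14.6+c07e432da
perflab-x7szb-worker-2jqns   Ready    worker   30m   v1.14.6+c07e432da'
pending=$(printf '%s\n' "$sample" | not_ready_nodes)
echo "nodes not ready: ${pending:-none}"   # prints: nodes not ready: none
```

Against the live cluster the call would be "oc get nodes | not_ready_nodes".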

We can ping the OpenShift console hostname from a master node:

[core@perflab-c8x75-master-0 ~]$ ping console-openshift-console.apps.perflab.lan.redhat.com                                                                                   
PING console-openshift-console.apps.perflab.lan.redhat.com (10.0.0.7) 56(84) bytes of data.                                                                                   
64 bytes from 10.0.0.7 (10.0.0.7): icmp_seq=1 ttl=64 time=0.206 ms

List the OpenStack security groups:

(overcloud) [stack@perflab-director ~]$ openstack security group list
+--------------------------------------+-----------------------+------------------------+----------------------------------+
| ID                                   | Name                  | Description            | Project                          |
+--------------------------------------+-----------------------+------------------------+----------------------------------+
| 29789b3e-6b9e-41cf-82ba-55ebba5dfd76 | lb-health-mgr-sec-grp | lb-health-mgr-sec-grp  | be3b187d3a264957bc2320cf77c55681 |
| 2cb57630-29ce-4376-9850-0da170f738f2 | default               | Default security group | be3b187d3a264957bc2320cf77c55681 |
| 2f453b24-7b3f-43c0-8c43-d9520cd74680 | default               | Default security group | c942a792fd6f447186e5bafd6d4cbce0 |
| 3db1417b-b4af-4768-8ff6-94544e72ffa5 | perflab-x7szb-master  |                        | c942a792fd6f447186e5bafd6d4cbce0 |
| 50bc9acc-7942-41a3-962f-aa511085f3f8 | default               | Default security group |                                  |
| 6b4d4bc4-6f04-4466-9da6-27be5a0c5fb7 | perflab-x7szb-worker  |                        | c942a792fd6f447186e5bafd6d4cbce0 |
| 93cb85c9-5821-47e8-ad85-de18706d63f5 | web                   | Web servers            | c942a792fd6f447186e5bafd6d4cbce0 |
| cdae4bda-7040-4fc3-b28f-e7555e2225e4 | lb-mgmt-sec-grp       | lb-mgmt-sec-grp        | be3b187d3a264957bc2320cf77c55681 |
+--------------------------------------+-----------------------+------------------------+----------------------------------+

List OpenStack floating IPs:

(overcloud) [stack@perflab-director ~]$ openstack floating ip list
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| ID                                   | Floating IP Address | Fixed IP Address | Port                                 | Floating Network                     | Project                          |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| 215a6925-84f2-40fa-897a-44ce53f01dea | 192.168.168.41      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 274fa26b-1aa4-48c2-a6c5-0c07ecd62429 | 192.168.168.23      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 28965300-a668-4348-b2a0-f51660735383 | 192.168.168.44      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 290c7e9d-0c88-47ea-b214-36b93a77672d | 192.168.168.21      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 2c36f3d3-53fc-4c79-a0ee-32d92b4ff27b | 192.168.168.30      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 3b1109ee-2a4a-46cd-acaa-213c4ee6a85c | 192.168.168.33      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 4cc948ad-d636-4370-8dab-3205fe1de992 | 192.168.168.48      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 5ae03e7d-597f-4689-83f6-0ccb7fc9758b | 192.168.168.27      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 634d7790-cae2-4261-b76d-19799826761e | 192.168.168.36      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 665fcc37-82e6-4405-b68b-09757d221c79 | 192.168.168.47      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 6853d0b9-336d-45db-ae24-3ab48a5c8c65 | 192.168.168.29      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| 6cfe8bf8-df5c-46df-9ab5-cfb4229d7823 | 192.168.168.25      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| ada5ba51-e8c3-449b-aba6-27a39c15720f | 192.168.168.26      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| c6d62f66-d4e8-4c55-8fcf-4e48e6fa4108 | 192.168.168.31      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| db9a742d-3534-45a1-8bf8-11455ace3255 | 192.168.168.20      | 10.0.0.5         | cfc93b3d-2559-459a-85d2-9e62b08e7905 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
| e03bdb6c-c508-444e-8e3d-730a26f1dfb0 | 192.168.168.22      | None             | None                                 | 211ae1cd-eb6b-4360-ae7b-027fd66f86d1 | c942a792fd6f447186e5bafd6d4cbce0 |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+

List OpenStack trunks:

(overcloud) [stack@perflab-director ~]$ openstack network trunk list
+--------------------------------------+------------------------------+--------------------------------------+-------------+
| ID                                   | Name                         | Parent Port                          | Description |
+--------------------------------------+------------------------------+--------------------------------------+-------------+
| 65e7f1ec-ffa7-47e4-8b55-ec1769de0f4b | perflab-x7szb-master-trunk-2 | 0e3a7e47-f22f-4f18-9400-0c7862410133 |             |
| bb81afaf-a9aa-46b6-8ec7-9dd341d48717 | perflab-x7szb-master-trunk-0 | dd3e2310-9af8-4c6c-8dff-09055a62ce96 |             |
| e44364f7-53e8-4d56-933f-78c42af152fe | perflab-x7szb-master-trunk-1 | c7194cee-601e-4a64-9624-51fdb461a4ee |             |
| e98ef7fd-c8a4-4bfc-a412-a910d28d4914 | perflab-x7szb-worker-7gk2p   | 5a28e142-1c86-470b-b2df-4c4df5d67e8c |             |
| ef6b5ad3-2af3-4fa3-9a44-c799074b93dc | perflab-x7szb-worker-v6xwp   | 2e7a5ebf-1bee-4860-99d9-faf1f770d11c |             |
| f4de190d-7910-48f2-ad35-0dc1c959dd8f | perflab-x7szb-worker-2jqns   | c7a9c5da-97fd-4ae1-a86a-911b00b22f5e |             |
+--------------------------------------+------------------------------+--------------------------------------+-------------+

Show the details of one OpenStack trunk and of its parent port:

(overcloud) [stack@perflab-director ~]$ openstack network trunk show perflab-x7szb-master-trunk-0
+-----------------+---------------------------------------+
| Field           | Value                                 |
+-----------------+---------------------------------------+
| admin_state_up  | UP                                    |
| created_at      | 2019-10-25T23:24:53Z                  |
| description     |                                       |
| id              | bb81afaf-a9aa-46b6-8ec7-9dd341d48717  |
| name            | perflab-x7szb-master-trunk-0          |
| port_id         | dd3e2310-9af8-4c6c-8dff-09055a62ce96  |
| project_id      | c942a792fd6f447186e5bafd6d4cbce0      |
| revision_number | 2                                     |
| status          | ACTIVE                                |
| sub_ports       |                                       |
| tags            | [u'openshiftClusterID=perflab-x7szb'] |
| tenant_id       | c942a792fd6f447186e5bafd6d4cbce0      |
| updated_at      | 2019-10-25T23:25:10Z                  |
+-----------------+---------------------------------------+

(overcloud) [stack@perflab-director ~]$ openstack port show dd3e2310-9af8-4c6c-8dff-09055a62ce96
+-----------------------+--------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                            |
+-----------------------+--------------------------------------------------------------------------------------------------+
| admin_state_up        | UP                                                                                               |
| allowed_address_pairs | ip_address='10.0.0.5', mac_address='fa:16:3e:26:e9:a8'                                           |
|                       | ip_address='10.0.0.6', mac_address='fa:16:3e:26:e9:a8'                                           |
|                       | ip_address='10.0.0.7', mac_address='fa:16:3e:26:e9:a8'                                           |
| binding_host_id       | overcloud-compute-0.lan.redhat.com                                                               |
| binding_profile       |                                                                                                  |
| binding_vif_details   | bridge_name='tbr-bb81afaf-a', datapath_type='system', ovs_hybrid_plug='True', port_filter='True' |
| binding_vif_type      | ovs                                                                                              |
| binding_vnic_type     | normal                                                                                           |
| created_at            | 2019-10-25T23:24:44Z                                                                             |
| data_plane_status     | None                                                                                             |
| description           |                                                                                                  |
| device_id             | e4fa8300-b24c-4d64-95fb-2b0c19c86b17                                                             |
| device_owner          | compute:nova                                                                                     |
| dns_assignment        | None                                                                                             |
| dns_name              | None                                                                                             |
| extra_dhcp_opts       | ip_version='4', opt_name='domain-search', opt_value='perflab.lan.redhat.com'                     |
| fixed_ips             | ip_address='10.0.0.13', subnet_id='4065cc81-0959-42e9-b36a-ccc9b0cdc073'                         |
| id                    | dd3e2310-9af8-4c6c-8dff-09055a62ce96                                                             |
| ip_address            | None                                                                                             |
| mac_address           | fa:16:3e:26:e9:a8                                                                                |
| name                  | perflab-x7szb-master-port-0                                                                      |
| network_id            | 9a0e321e-d825-4713-82be-fca6f6dccd1b                                                             |
| option_name           | None                                                                                             |
| option_value          | None                                                                                             |
| port_security_enabled | True                                                                                             |
| project_id            | c942a792fd6f447186e5bafd6d4cbce0                                                                 |
| qos_policy_id         | None                                                                                             |
| revision_number       | 14                                                                                               |
| security_group_ids    | 3db1417b-b4af-4768-8ff6-94544e72ffa5                                                             |
| status                | ACTIVE                                                                                           |
| subnet_id             | None                                                                                             |
| tags                  | openshiftClusterID=perflab-x7szb                                                                 |
| trunk_details         | {u'trunk_id': u'bb81afaf-a9aa-46b6-8ec7-9dd341d48717', u'sub_ports': []}                         |
| updated_at            | 2019-10-25T23:25:12Z                                                                             |
+-----------------------+--------------------------------------------------------------------------------------------------+

Connect over SSH to one of the OpenShift master nodes:

[stack@perflab-director ~]$ ssh -o "StrictHostKeyChecking=no" core@192.168.168.20
Warning: Permanently added '192.168.168.20' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.

---
Last login: Sat Oct 26 00:02:00 2019 from 192.168.168.2
[core@perflab-x7szb-master-2 ~]$ 

The DNS entry “console-openshift-console.apps.perflab.lan.redhat.com” is pointing to 10.0.0.7:

[core@perflab-x7szb-master-2 ~]$ dig console-openshift-console.apps.perflab.lan.redhat.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> console-openshift-console.apps.perflab.lan.redhat.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7306
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 35d2b3e88a842b9c (echoed)
;; QUESTION SECTION:
;console-openshift-console.apps.perflab.lan.redhat.com. IN A

;; ANSWER SECTION:
console-openshift-console.apps.perflab.lan.redhat.com. 20 IN A 10.0.0.7

;; Query time: 0 msec
;; SERVER: 10.0.0.6#53(10.0.0.6)
;; WHEN: Sat Oct 26 00:04:19 UTC 2019
;; MSG SIZE  rcvd: 163

Associate one OpenStack floating IP with the port of the GPU worker node so that we can connect to it over SSH:

(overcloud) [stack@perflab-director ~]$ openstack floating ip set --port 7b7afaee-79cb-4adc-b6c5-be2ed07acb6b 192.168.168.30

Add the floating IP to the /etc/hosts file:

echo -e "192.168.168.30 console-openshift-console.apps.perflab.lan.redhat.com" | sudo tee -a /etc/hosts

Scan the ports:

(overcloud) [stack@perflab-director ~]$ sudo nmap console-openshift-console.apps.perflab.lan.redhat.com

Starting Nmap 6.40 ( http://nmap.org ) at 2019-10-25 20:09 EDT
Nmap scan report for console-openshift-console.apps.perflab.lan.redhat.com (192.168.168.30)
Host is up (0.0015s latency).
Not shown: 997 filtered ports
PORT    STATE SERVICE
22/tcp  open  ssh
80/tcp  open  http
443/tcp open  https

Nmap done: 1 IP address (1 host up) scanned in 5.03 seconds
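When nmap is not installed, the same three ports can be probed with a plain TCP connect. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, and a successful connect only shows that something is listening, not which service it is:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

From the director node, calling `port_is_open("console-openshift-console.apps.perflab.lan.redhat.com", 443)` should then match the `443/tcp open` line reported by nmap.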

Look at the generated kubeconfig:

[stack@perflab-director ~]$ cat /home/stack/openshift/auth/kubeconfig
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    server: https://api.perflab.lan.redhat.com:6443
  name: perflab
contexts:
- context:
    cluster: perflab
    user: admin
  name: admin
current-context: admin
kind: Config
preferences: {}
users:
- name: admin
  user:
    client-certificate-data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Connect to the console

We can now open the console in a browser at the following URL:
https://console-openshift-console.apps.perflab.lan.redhat.com

Figure: OpenShift 4.2 console prompt

Figure: OpenShift 4.2 console home

Figure: OpenShift 4.2 console developer

Adding a GPU worker node

Now we have a set of master and worker nodes, but we want to add a GPU worker node using an OpenStack instance with GPU passthrough.

Check the current list of OpenShift machines:

(undercloud) [stack@perflab-director ~]$ oc get machines -n openshift-machine-api
NAME                         STATE    TYPE       REGION      ZONE   AGE
perflab-x7szb-master-0       ACTIVE   m1.large   regionOne   nova   34h
perflab-x7szb-master-1       ACTIVE   m1.large   regionOne   nova   34h
perflab-x7szb-master-2       ACTIVE   m1.large   regionOne   nova   34h
perflab-x7szb-worker-2jqns   ACTIVE   m1.large   regionOne   nova   34h
perflab-x7szb-worker-7gk2p   ACTIVE   m1.large   regionOne   nova   34h
perflab-x7szb-worker-v6xwp   ACTIVE   m1.large   regionOne   nova   34h

Check the current list of OpenShift machinesets:

(undercloud) [stack@perflab-director ~]$ oc get machinesets -n openshift-machine-api
NAME                   DESIRED   CURRENT   READY   AVAILABLE   AGE
perflab-x7szb-worker   3         3         3       3           34h

Export the existing worker machine set definition and copy it to use as the basis for a GPU-enabled worker machine set:

(undercloud) [stack@perflab-director ~]$ oc get machineset perflab-x7szb-worker -n openshift-machine-api -o json > perflab-x7szb-worker.json

(undercloud) [stack@perflab-director ~]$ cp perflab-x7szb-worker.json perflab-x7szb-worker-gpu.json

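Edit the GPU machineset definition: rename it to perflab-x7szb-worker-gpu, remove the creationTimestamp and uid fields, reduce the replicas from 3 to 1, and change the flavor from m1.large to m1-gpu.large, the flavor that will carry the NVIDIA V100 GPU: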

(overcloud) [stack@perflab-director machinesets]$ diff perflab-x7szb-worker.json perflab-x7szb-worker-gpu.json 
5d4
<         "creationTimestamp": "2019-10-25T23:34:28Z",
12c11
<         "name": "perflab-x7szb-worker",
---
>         "name": "perflab-x7szb-worker-gpu",
15,16c14
<         "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/perflab-x7szb-worker",
<         "uid": "fcdfce7e-f77f-11e9-9d32-fa163e3cd288"
---
>         "selfLink": "/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/perflab-x7szb-worker-gpu"
19c17
<         "replicas": 3,
---
>         "replicas": 1,
23c21
<                 "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker"
---
>                 "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker-gpu"
33c31
<                     "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker"
---
>                     "machine.openshift.io/cluster-api-machineset": "perflab-x7szb-worker-gpu"
48c46
<                         "flavor": "m1.large",
---
>                         "flavor": "m1-gpu.large",
90,91c88,89
<         "availableReplicas": 3,
<         "fullyLabeledReplicas": 3,
---
>         "availableReplicas": 1,
>         "fullyLabeledReplicas": 1,
93,94c91,92
<         "readyReplicas": 3,
<         "replicas": 3
---
>         "readyReplicas": 1,
>         "replicas": 1

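The manual edit shown in the diff can also be scripted. The sketch below is one possible way to do it, not the method used in this post: `derive_gpu_machineset` is a hypothetical helper, and unlike the hand edit it also drops `selfLink`, `resourceVersion`, `generation` and the whole `status` block, on the assumption that the API server fills those in again when the object is created.

```python
import copy
import json

LABEL = "machine.openshift.io/cluster-api-machineset"


def derive_gpu_machineset(worker: dict, new_name: str, flavor: str,
                          replicas: int = 1) -> dict:
    """Return a copy of a worker MachineSet renamed and pointed at a new flavor."""
    gpu = copy.deepcopy(worker)
    meta = gpu["metadata"]
    meta["name"] = new_name
    # Server-populated fields: assumed safe to drop before 'oc create'.
    for key in ("creationTimestamp", "uid", "resourceVersion",
                "generation", "selfLink"):
        meta.pop(key, None)
    gpu.pop("status", None)

    spec = gpu["spec"]
    spec["replicas"] = replicas
    # The selector and the template label must both carry the new name.
    spec["selector"]["matchLabels"][LABEL] = new_name
    spec["template"]["metadata"]["labels"][LABEL] = new_name
    spec["template"]["spec"]["providerSpec"]["value"]["flavor"] = flavor
    return gpu


# Trimmed-down stand-in for perflab-x7szb-worker.json (most fields omitted).
worker = {
    "metadata": {
        "name": "perflab-x7szb-worker",
        "creationTimestamp": "2019-10-25T23:34:28Z",
        "uid": "fcdfce7e-f77f-11e9-9d32-fa163e3cd288",
    },
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {LABEL: "perflab-x7szb-worker"}},
        "template": {
            "metadata": {"labels": {LABEL: "perflab-x7szb-worker"}},
            "spec": {"providerSpec": {"value": {"flavor": "m1.large"}}},
        },
    },
    "status": {"replicas": 3, "readyReplicas": 3},
}

gpu = derive_gpu_machineset(worker, "perflab-x7szb-worker-gpu", "m1-gpu.large")
print(json.dumps(gpu, indent=2))
```

With the real file, the input would be loaded with `json.load()` from perflab-x7szb-worker.json and the result written to perflab-x7szb-worker-gpu.json. The original dictionary is left untouched because the function works on a deep copy.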
Create a new GPU flavor:

(overcloud) [stack@perflab-director ~]$ openstack flavor create --ram 32768 --disk 200 --vcpus 4 m1-gpu.large
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 200                                  |
| id                         | 5c6843b5-89ae-4fe8-92c5-fac5a707c241 |
| name                       | m1-gpu.large                         |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 32768                                |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 4                                    |
+----------------------------+--------------------------------------+

Set the PCI passthrough alias on the OpenStack flavor, requesting one device from the v100 alias:

(overcloud) [stack@perflab-director ~]$ openstack flavor set m1-gpu.large --property "pci_passthrough:alias"="v100:1"
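The property value follows an `ALIAS:COUNT` pattern, and several requests can be separated by commas. As an illustration only, here is a small Python sketch that validates such a value before it is passed to `openstack flavor set`; the helper name `parse_pci_alias` is hypothetical and this is not how Nova itself is invoked:

```python
from typing import Dict


def parse_pci_alias(value: str) -> Dict[str, int]:
    """Split a 'pci_passthrough:alias' value into {alias_name: device_count}."""
    requests: Dict[str, int] = {}
    for item in value.split(","):
        name, sep, count = item.strip().partition(":")
        if not (name and sep and count.isdigit()):
            raise ValueError(f"expected ALIAS:COUNT, got {item!r}")
        requests[name] = int(count)
    return requests


print(parse_pci_alias("v100:1"))
```

The alias name itself (`v100` here) has to match an alias declared in the Nova configuration of the OpenStack deployment.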

Try to boot a RHEL 7.7 test instance (image rhel77) with this flavor:

(overcloud) [stack@perflab-director templates]$ openstack server create --flavor m1-gpu.large --image rhel77 --security-group web --nic net-id=perflab-x7szb-openshift --key-name lambda instance0
+-------------------------------------+-----------------------------------------------------+
| Field                               | Value                                               |
+-------------------------------------+-----------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                              |
| OS-EXT-AZ:availability_zone         |                                                     |
| OS-EXT-SRV-ATTR:host                | None                                                |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                |
| OS-EXT-SRV-ATTR:instance_name       |                                                     |
| OS-EXT-STS:power_state              | NOSTATE                                             |
| OS-EXT-STS:task_state               | scheduling                                          |
| OS-EXT-STS:vm_state                 | building                                            |
| OS-SRV-USG:launched_at              | None                                                |
| OS-SRV-USG:terminated_at            | None                                                |
| accessIPv4                          |                                                     |
| accessIPv6                          |                                                     |
| addresses                           |                                                     |
| adminPass                           | J886yg7sz7MP                                        |
| config_drive                        |                                                     |
| created                             | 2019-10-27T11:10:26Z                                |
| flavor                              | m1-gpu.large (5c6843b5-89ae-4fe8-92c5-fac5a707c241) |
| hostId                              |                                                     |
| id                                  | ad86a6cf-6115-4944-88c1-568c1bc58da0                |
| image                               | rhel77 (ad740f80-83ad-4af3-8fe7-f255276c0453)       |
| key_name                            | lambda                                              |
| name                                | instance0                                           |
| progress                            | 0                                                   |
| project_id                          | c942a792fd6f447186e5bafd6d4cbce0                    |
| properties                          |                                                     |
| security_groups                     | name='93cb85c9-5821-47e8-ad85-de18706d63f5'         |
| status                              | BUILD                                               |
| updated                             | 2019-10-27T11:10:26Z                                |
| user_id                             | 721b251122304444bfee09c97f441042                    |
| volumes_attached                    |                                                     |
+-------------------------------------+-----------------------------------------------------+

(overcloud) [stack@perflab-director ~]$ FLOATING_IP_ID=$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )

(overcloud) [stack@perflab-director ~]$ openstack server add floating ip instance0 $FLOATING_IP_ID
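The two commands above pick the first floating IP that is in status DOWN and attach it to the instance. A comparable selection can be sketched in Python on the output of `openstack floating ip list -f json`. This is an assumption-laden sketch: it supposes that the JSON keys match the table column names and that an unassociated address has a null `Port`, which is a slightly different criterion from the `--status DOWN` filter; `first_unassociated_floating_ip` is a hypothetical helper.

```python
import json
from typing import List, Optional


def first_unassociated_floating_ip(floating_ips: List[dict]) -> Optional[dict]:
    """Return the first floating IP record with no port attached, or None."""
    for fip in floating_ips:
        if fip.get("Port") is None:
            return fip
    return None


# Three rows taken from the floating IP table, rewritten as assumed JSON output.
sample = json.loads("""
[
  {"ID": "db9a742d-3534-45a1-8bf8-11455ace3255",
   "Floating IP Address": "192.168.168.20",
   "Fixed IP Address": "10.0.0.5",
   "Port": "cfc93b3d-2559-459a-85d2-9e62b08e7905"},
  {"ID": "c6d62f66-d4e8-4c55-8fcf-4e48e6fa4108",
   "Floating IP Address": "192.168.168.31",
   "Fixed IP Address": null,
   "Port": null},
  {"ID": "e03bdb6c-c508-444e-8e3d-730a26f1dfb0",
   "Floating IP Address": "192.168.168.22",
   "Fixed IP Address": null,
   "Port": null}
]
""")

free = first_unassociated_floating_ip(sample)
print(free["ID"], free["Floating IP Address"])
```

Returning `None` when nothing is free makes the "no floating IP left" case explicit, whereas the shell version would silently pass an empty variable to `openstack server add floating ip`.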

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+
| ID                                   | Name                       | Status | Networks                                          | Image  | Flavor       |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+
| ad86a6cf-6115-4944-88c1-568c1bc58da0 | instance0                  | ACTIVE | perflab-x7szb-openshift=10.0.0.12, 192.168.168.41 | rhel77 | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16                 | rhcos  | m1.large     |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37                 | rhcos  | m1.large     |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20                 | rhcos  | m1.large     |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0     | ACTIVE | perflab-x7szb-openshift=10.0.0.13                 | rhcos  | m1.large     |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2     | ACTIVE | perflab-x7szb-openshift=10.0.0.29                 | rhcos  | m1.large     |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1     | ACTIVE | perflab-x7szb-openshift=10.0.0.22                 | rhcos  | m1.large     |
+--------------------------------------+----------------------------+--------+---------------------------------------------------+--------+--------------+

Connect to the instance to check that we can find the GPU device:

(overcloud) [stack@perflab-director ~]$ ssh cloud-user@192.168.168.41

[cloud-user@instance0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 (Maipo)

[cloud-user@instance0 ~]$ sudo lspci | grep -i nvidia
00:05.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
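The `grep -i nvidia` filter is enough for a manual check. For an automated smoke test of the flavor, the same filter can be written as a small Python sketch over captured `lspci` output; the helper name `nvidia_devices` and the Ethernet line in the sample are hypothetical:

```python
from typing import List, Tuple


def nvidia_devices(lspci_output: str) -> List[Tuple[str, str]]:
    """Return (pci_slot, description) for every lspci line mentioning NVIDIA."""
    devices = []
    for line in lspci_output.splitlines():
        if "nvidia" in line.lower():
            slot, _, description = line.strip().partition(" ")
            devices.append((slot, description))
    return devices


# First line is a made-up non-GPU device; second line is the output shown above.
sample = """\
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
"""
for slot, description in nvidia_devices(sample):
    print(slot, description)
```

An empty list would mean that the PCI device was not passed through to the guest.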

The OpenStack PCI passthrough is working, so we can delete this test instance:

(overcloud) [stack@perflab-director ~]$ openstack server delete instance0

List the existing OpenStack instances before adding the new machineset:

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| ID                                   | Name                       | Status | Networks                          | Image | Flavor   |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0     | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2     | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1     | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large |
+--------------------------------------+----------------------------+--------+-----------------------------------+-------+----------+

Import the OpenShift GPU worker machine set:

(overcloud) [stack@perflab-director ~]$ oc create -f perflab-x7szb-worker-gpu.json
machineset.machine.openshift.io/perflab-x7szb-worker-gpu created

List OpenShift machinesets:

(overcloud) [stack@perflab-director ~]$ oc get machinesets -n openshift-machine-api
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
perflab-x7szb-worker       3         3         3       3           36h
perflab-x7szb-worker-gpu   1         1                             53s

(overcloud) [stack@perflab-director ~]$ oc get nodes
NAME                         STATUS   ROLES    AGE   VERSION
perflab-x7szb-master-0       Ready    master   36h   v1.14.6+c07e432da
perflab-x7szb-master-1       Ready    master   36h   v1.14.6+c07e432da
perflab-x7szb-master-2       Ready    master   36h   v1.14.6+c07e432da
perflab-x7szb-worker-2jqns   Ready    worker   35h   v1.14.6+c07e432da
perflab-x7szb-worker-7gk2p   Ready    worker   35h   v1.14.6+c07e432da
perflab-x7szb-worker-v6xwp   Ready    worker   36h   v1.14.6+c07e432da

(overcloud) [stack@perflab-director ~]$ oc get machines -n openshift-machine-api
NAME                             STATE    TYPE           REGION      ZONE   AGE
perflab-x7szb-master-0           ACTIVE   m1.large       regionOne   nova   36h
perflab-x7szb-master-1           ACTIVE   m1.large       regionOne   nova   36h
perflab-x7szb-master-2           ACTIVE   m1.large       regionOne   nova   36h
perflab-x7szb-worker-2jqns       ACTIVE   m1.large       regionOne   nova   36h
perflab-x7szb-worker-7gk2p       ACTIVE   m1.large       regionOne   nova   36h
perflab-x7szb-worker-gpu-rrstz   ACTIVE   m1-gpu.large   regionOne   nova   53s
perflab-x7szb-worker-v6xwp       ACTIVE   m1.large       regionOne   nova   36h

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| ID                                   | Name                           | Status | Networks                          | Image | Flavor       |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns     | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large     |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p     | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large     |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp     | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large     |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0         | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large     |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2         | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large     |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1         | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large     |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+

Check the status during the deployment:

(overcloud) [stack@perflab-director ~]$  oc -n openshift-machine-api get machinesets | grep gpu
perflab-x7szb-worker-gpu   1         1         1       1           8m
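The blank READY and AVAILABLE columns in the earlier listing, followed by the value 1 here, suggest a simple readiness rule: the machineset is ready once the number of ready replicas reaches the desired count. A minimal sketch of that rule on the JSON form of a machineset (as returned by `oc get machineset ... -o json`), assuming `status.readyReplicas` is simply absent while no replica is ready; `machineset_is_ready` is a hypothetical helper:

```python
def machineset_is_ready(machineset: dict) -> bool:
    """Return True when readyReplicas has reached the desired replica count."""
    desired = machineset.get("spec", {}).get("replicas", 0)
    # Treat a missing readyReplicas field as zero ready replicas.
    ready = machineset.get("status", {}).get("readyReplicas", 0)
    return ready >= desired


# Reduced examples: just after creation, and once the GPU worker has joined.
just_created = {"spec": {"replicas": 1}, "status": {"replicas": 1}}
deployed = {"spec": {"replicas": 1},
            "status": {"replicas": 1, "readyReplicas": 1, "availableReplicas": 1}}

print(machineset_is_ready(just_created), machineset_is_ready(deployed))
```

Polling this condition in a loop with a timeout is one way to wait for the GPU node instead of re-running the command by hand.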

One additional worker is spawned:

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| ID                                   | Name                           | Status | Networks                          | Image | Flavor       |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns     | ACTIVE | perflab-x7szb-openshift=10.0.0.16 | rhcos | m1.large     |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p     | ACTIVE | perflab-x7szb-openshift=10.0.0.37 | rhcos | m1.large     |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp     | ACTIVE | perflab-x7szb-openshift=10.0.0.20 | rhcos | m1.large     |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0         | ACTIVE | perflab-x7szb-openshift=10.0.0.13 | rhcos | m1.large     |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2         | ACTIVE | perflab-x7szb-openshift=10.0.0.29 | rhcos | m1.large     |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1         | ACTIVE | perflab-x7szb-openshift=10.0.0.22 | rhcos | m1.large     |
+--------------------------------------+--------------------------------+--------+-----------------------------------+-------+--------------+

Attach an unused floating IP to the GPU worker so that we can reach it over SSH:

(overcloud) [stack@perflab-director ~]$ FLOATING_IP_ID=$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )
(overcloud) [stack@perflab-director ~]$ openstack server add floating ip perflab-x7szb-worker-gpu-rrstz $FLOATING_IP_ID
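
The two commands above assume that at least one floating IP is in the DOWN status; if none is, the second command runs with an empty ID. A minimal sketch of a guarded variant follows. The function name `attach_free_floating_ip` is hypothetical, and it relies only on the two `openstack` sub-commands already shown.

```shell
# Hypothetical helper: attach the first floating IP in DOWN status to a server.
# It wraps the same two openstack sub-commands used in the transcript above.
attach_free_floating_ip() {
  server="$1"
  # Take the first floating IP ID whose status is DOWN.
  ip_id=$(openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1)
  if [ -z "$ip_id" ]; then
    echo "no free floating IP available" >&2
    return 1
  fi
  openstack server add floating ip "$server" "$ip_id"
}

# Usage (server name taken from the listing above):
#   attach_free_floating_ip perflab-x7szb-worker-gpu-rrstz
```

Failing early here avoids a confusing CLI error when the floating IP pool is exhausted.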

(overcloud) [stack@perflab-director ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+
| ID                                   | Name                           | Status | Networks                                          | Image | Flavor       |
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+
| b211621b-97cb-477e-a6c5-895181e4747f | perflab-x7szb-worker-gpu-rrstz | ACTIVE | perflab-x7szb-openshift=10.0.0.30, 192.168.168.41 | rhcos | m1-gpu.large |
| c4cab741-6c4c-4360-8b52-23429d9832d9 | perflab-x7szb-worker-2jqns     | ACTIVE | perflab-x7szb-openshift=10.0.0.16                 | rhcos | m1.large     |
| 777cdf85-93f7-41fa-886f-60171e4e0151 | perflab-x7szb-worker-7gk2p     | ACTIVE | perflab-x7szb-openshift=10.0.0.37                 | rhcos | m1.large     |
| 4edf8110-26ff-42b3-880e-56dcbf43762c | perflab-x7szb-worker-v6xwp     | ACTIVE | perflab-x7szb-openshift=10.0.0.20                 | rhcos | m1.large     |
| e4fa8300-b24c-4d64-95fb-2b0c19c86b17 | perflab-x7szb-master-0         | ACTIVE | perflab-x7szb-openshift=10.0.0.13                 | rhcos | m1.large     |
| f56b855a-f02c-4b42-9009-b3b3078c3890 | perflab-x7szb-master-2         | ACTIVE | perflab-x7szb-openshift=10.0.0.29                 | rhcos | m1.large     |
| dcc0b4d3-97ec-4c6a-b841-558c1bb535a3 | perflab-x7szb-master-1         | ACTIVE | perflab-x7szb-openshift=10.0.0.22                 | rhcos | m1.large     |
+--------------------------------------+--------------------------------+--------+---------------------------------------------------+-------+--------------+

We can connect to the worker to check its status and find the NVIDIA Tesla V100:

(overcloud) [stack@perflab-director ~]$ ssh core@192.168.168.41
The authenticity of host '192.168.168.41 (192.168.168.41)' can't be established.
ECDSA key fingerprint is SHA256:D5SUxj513jGdhKE/Z2or+9s4RKl6milx+/aa5vm1bcM.
ECDSA key fingerprint is MD5:6a:fb:9b:53:fd:79:46:34:31:c8:db:8b:2e:3b:07:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.168.41' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
WARNING: Direct SSH access to machines is not recommended.

---
[core@perflab-x7szb-worker-gpu-rrstz ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.2

[core@perflab-x7szb-worker-gpu-rrstz ~]$ lspci -nn |grep -i nvidia
00:05.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)

The GPU machine is available:

(overcloud) [stack@perflab-director ~]$  oc -n openshift-machine-api get machines | grep gpu 
perflab-x7szb-worker-gpu-rrstz   ACTIVE   m1-gpu.large   regionOne   nova   8m47s

(overcloud) [stack@perflab-director ~]$ oc get node perflab-x7szb-worker-gpu-rrstz -o json | jq .metadata.labels
{
  "node.openshift.io/os_id": "rhcos",
  "node-role.kubernetes.io/worker": "",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "m1-gpu.large",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "regionOne",
  "failure-domain.beta.kubernetes.io/zone": "nova",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "perflab-x7szb-worker-gpu-rrstz",
  "kubernetes.io/os": "linux"
}

Deploy the Node Feature Discovery Operator

The Node Feature Discovery operator identifies hardware device features in nodes.

You can find all the information about the Node Feature Discovery operator in its Git repository: https://github.com/openshift/cluster-nfd-operator

To install the Node Feature Discovery operator, go to “Administrator > Operators > OperatorHub” in the OpenShift console and search for NFD: Node Feature Discovery operator installation

On the Node Feature Discovery operator detail page, click on “Install”: Node Feature Discovery operator installation

Create the Operator Subscription by clicking on “Subscribe”: Node Feature Discovery operator installation

Node Feature Discovery is subscribed: Node Feature Discovery operator installation

The Node Feature Discovery is now “Created”: Node Feature Discovery operator installation

We can list the installation steps of the Node Feature Discovery installation: Node Feature Discovery operator installation

Check the cluster-nfd-operator container image tags:

(overcloud) [stack@perflab-director cluster-nfd-operator]$ skopeo inspect docker://quay.io/zvonkok/cluster-nfd-operator | jq ".Tag , .RepoTags"
"latest"
[
  "v0.0.1",
  "v4.1",
  "p3",
  "e2e",
  "operand",
  "configmap",
  "nvidia-label",
  "latest"
]

Check the NFD operator status in the openshift-operators namespace:

(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-operators
NAME                           READY   STATUS    RESTARTS   AGE
nfd-operator-fd55688bd-hrf9c   1/1     Running   0          2m46s

Check the status during the setup:

(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-nfd
NAME               READY   STATUS              RESTARTS   AGE
nfd-master-ksslc   0/1     ContainerCreating   0          7s
nfd-master-qbzcb   0/1     ContainerCreating   0          7s
nfd-master-xw622   0/1     ContainerCreating   0          7s
nfd-worker-84fs2   0/1     ContainerCreating   0          8s
nfd-worker-ljdqk   0/1     ContainerCreating   0          8s
nfd-worker-nbxsm   0/1     ContainerCreating   0          8s
nfd-worker-sr7pq   0/1     ContainerCreating   0          8s

(overcloud) [stack@perflab-director ~]$ oc get pods -n openshift-nfd
NAME               READY   STATUS    RESTARTS   AGE
nfd-master-ksslc   1/1     Running   0          21s
nfd-master-qbzcb   1/1     Running   0          21s
nfd-master-xw622   1/1     Running   0          21s
nfd-worker-84fs2   1/1     Running   0          22s
nfd-worker-ljdqk   1/1     Running   0          22s
nfd-worker-nbxsm   1/1     Running   0          22s
nfd-worker-sr7pq   1/1     Running   0          22s

The Node Feature Discovery Operator is available, and the GPU worker is labeled with the NVIDIA PCI vendor ID (10de):

(overcloud) [stack@perflab-director openshift]$ oc describe node perflab-x7szb-worker-gpu-rrstz|grep 10de
                    feature.node.kubernetes.io/pci-10de.present=true

(overcloud) [stack@perflab-director ~]$ oc describe node perflab-x7szb-worker-gpu-rrstz | egrep 'Roles|pci'
Roles:              worker
                    feature.node.kubernetes.io/pci-1013.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1af4.present=true
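
The labels above follow the pattern `feature.node.kubernetes.io/pci-<vendor>.present=true`, where 10de matches the vendor ID reported earlier by `lspci` for the Tesla V100. As a rough sketch, the check can be scripted by filtering the `oc describe node` output. Both function names below are hypothetical, and the pattern assumes the label format shown in this output.

```shell
# Hypothetical helper: read `oc describe node <name>` output on stdin and
# print the PCI vendor IDs that NFD reported as present, one per line.
nfd_pci_vendors() {
  sed -n 's/.*feature\.node\.kubernetes\.io\/pci-\([0-9a-f]*\)\.present=true.*/\1/p'
}

# Hypothetical helper: succeed when the NVIDIA vendor ID (10de) is among them.
has_nvidia_gpu() {
  nfd_pci_vendors | grep -qx '10de'
}

# Usage:
#   oc describe node perflab-x7szb-worker-gpu-rrstz | has_nvidia_gpu && echo "GPU node"
```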

Deploy the Special Resource Operator

Clone the special-resource-operator repository and switch to the release-4.2 branch:

(overcloud) [stack@perflab-director openshift]$ git clone https://github.com/openshift-psap/special-resource-operator
Cloning into 'special-resource-operator'...
remote: Enumerating objects: 11558, done.
remote: Counting objects: 100% (11558/11558), done.
remote: Compressing objects: 100% (5857/5857), done.
remote: Total 11558 (delta 4434), reused 11515 (delta 4396), pack-reused 0
Receiving objects: 100% (11558/11558), 15.76 MiB | 3.50 MiB/s, done.
Resolving deltas: 100% (4434/4434), done.

(overcloud) [stack@perflab-director openshift]$ cd special-resource-operator/

(overcloud) [stack@perflab-director special-resource-operator]$ git checkout release-4.2
Branch release-4.2 set up to track remote branch release-4.2 from origin.
Switched to a new branch 'release-4.2'

(overcloud) [stack@perflab-director special-resource-operator]$ PULLPOLICY=Always make deploy
customresourcedefinition.apiextensions.k8s.io/specialresources.sro.openshift.io created
sleep 1
for obj in namespace.yaml service_account.yaml role.yaml role_binding.yaml operator.yaml crds/sro_v1alpha1_specialresource_cr.yaml; do               \
        sed 's+REPLACE_IMAGE+quay.io/openshift-psap/special-resource-operator:release-4.2+g; s+REPLACE_NAMESPACE+openshift-sro+g; s+Always+Always+' deploy/$obj | kubectl apply -f - ; \
done
namespace/openshift-sro created
serviceaccount/special-resource-operator created
role.rbac.authorization.k8s.io/special-resource-operator created
clusterrole.rbac.authorization.k8s.io/special-resource-operator created
rolebinding.rbac.authorization.k8s.io/special-resource-operator created
clusterrolebinding.rbac.authorization.k8s.io/special-resource-operator created
deployment.apps/special-resource-operator created
specialresource.sro.openshift.io/example-specialresource created
specialresource.sro.openshift.io/example-specialresource unchanged

The installation of the Special Resource Operator is completed:

(overcloud) [stack@perflab-director openshift]$ oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-8hl6j                   2/2     Running     0          10m
nvidia-device-plugin-daemonset-6xptf         1/1     Running     0          10m
nvidia-device-plugin-validation              0/1     Completed   0          10m
nvidia-driver-daemonset-cqp62                1/1     Running     0          12m
nvidia-driver-validation                     0/1     Completed   0          12m
nvidia-feature-discovery-ckjsn               1/1     Running     0          10m
nvidia-grafana-67bdb6d6-shp8f                1/1     Running     0          10m
special-resource-operator-7cbb8f5d67-pqj84   1/1     Running     0          13m

We can see the final deployment with the GPU worker in the Horizon dashboard: OpenStack networks

We can see the final topology with the GPU worker in the Horizon dashboard: OpenStack network topology

Check the security groups created in OpenStack: OpenStack security groups

Check the trunks created in OpenStack: OpenStack Trunks

Test nvidia-smi

Create an nvidia-smi pod definition YAML file:

(overcloud) [stack@perflab-director openshift]$ cat << EOF > nvidia-smi.yaml 
apiVersion: v1
kind: Pod
metadata:
 name: nvidia-smi
spec:
 containers:
 - image: nvidia/cuda
   name: nvidia-smi
   command: [ nvidia-smi ]
   resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
EOF

Create the nvidia-smi pod:

(overcloud) [stack@perflab-director openshift]$ oc create -f nvidia-smi.yaml
pod/nvidia-smi created

(overcloud) [stack@perflab-director openshift]$ oc get pods
NAME         READY   STATUS              RESTARTS   AGE
nvidia-smi   0/1     ContainerCreating   0          5s

(overcloud) [stack@perflab-director openshift]$ oc get pods
NAME         READY   STATUS      RESTARTS   AGE
nvidia-smi   0/1     Completed   0          15s

OpenShift bootstrap

Success, the NVIDIA drivers are available in the pod:

(overcloud) [stack@perflab-director openshift]$ oc logs nvidia-smi
Sun Oct 27 15:03:29 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:05.0 Off |                  Off |
| N/A   31C    P0    25W / 250W |      0MiB / 16160MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Delete the nvidia-smi pod:

(overcloud) [stack@perflab-director openshift]$ oc delete pod nvidia-smi
pod "nvidia-smi" deleted

TensorFlow benchmarks with GPU

Create the GPU benchmark Pod Definition YAML file:

(overcloud) [stack@perflab-director pods]$ cat << EOF > tensorflow-benchmarks-gpu.yaml
apiVersion: v1
kind: Pod 
metadata:
 name: tensorflow-benchmarks-gpu
spec:
 containers:
 - image: nvcr.io/nvidia/tensorflow:19.09-py3
   name: cudnn
   command: ["/bin/sh","-c"]
   args: ["git clone https://github.com/tensorflow/benchmarks.git;cd benchmarks/scripts/tf_cnn_benchmarks;python3 tf_cnn_benchmarks.py --num_gpus=1 --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
   resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
 restartPolicy: Never
EOF

Create the GPU benchmark pod:

(overcloud) [stack@perflab-director pods]$ oc create -f tensorflow-benchmarks-gpu.yaml
pod/tensorflow-benchmarks-gpu created

The pod switches to “Completed” status after 30 seconds:

(overcloud) [stack@perflab-director pods]$ oc get pod
NAME                        READY   STATUS      RESTARTS   AGE
tensorflow-benchmarks-gpu   0/1     Completed   0          30s

Check the GPU benchmark results. The training is fast, at 325.03 images/sec:

(overcloud) [stack@perflab-director pods]$ oc logs tensorflow-benchmarks-gpu
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NHWC
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec total_loss
1 images/sec: 327.4 +/- 0.0 (jitter = 0.0)  8.108
10  images/sec: 326.5 +/- 0.7 (jitter = 1.0)  8.122
20  images/sec: 327.2 +/- 0.4 (jitter = 0.6)  7.983
30  images/sec: 327.1 +/- 0.6 (jitter = 0.5)  7.780
40  images/sec: 327.5 +/- 0.4 (jitter = 0.5)  7.848
50  images/sec: 327.3 +/- 0.4 (jitter = 0.6)  7.779
60  images/sec: 326.5 +/- 0.4 (jitter = 0.9)  7.826
70  images/sec: 326.7 +/- 0.3 (jitter = 0.7)  7.840
80  images/sec: 326.1 +/- 0.4 (jitter = 0.8)  7.819
90  images/sec: 325.5 +/- 0.4 (jitter = 1.3)  7.646
100 images/sec: 325.3 +/- 0.4 (jitter = 1.7)  7.918
----------------------------------------------------------------
total images/sec: 325.03
----------------------------------------------------------------
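
To reuse the result in a script, the final throughput figure can be pulled out of the log. The sketch below assumes the `total images/sec:` line format shown above; the function name `total_images_per_sec` is hypothetical.

```shell
# Hypothetical helper: read tf_cnn_benchmarks logs on stdin and print the
# value from the final "total images/sec" line.
total_images_per_sec() {
  awk -F': ' '/^total images\/sec:/ { print $2 }'
}

# Usage:
#   oc logs tensorflow-benchmarks-gpu | total_images_per_sec
```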

TensorFlow benchmarks with CPU

To compare, create a CPU Pod Definition YAML file:

(overcloud) [stack@perflab-director pods]$ cat << EOF > tensorflow-benchmarks-cpu.yaml
apiVersion: v1
kind: Pod
metadata:
 name: tensorflow-benchmarks-cpu
spec:
 containers:
 - image: nvcr.io/nvidia/tensorflow:19.09-py3
   name: cudnn
   command: ["/bin/sh","-c"]
   args: ["git clone https://github.com/tensorflow/benchmarks.git;cd benchmarks/scripts/tf_cnn_benchmarks;python3 tf_cnn_benchmarks.py --device=cpu --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
 restartPolicy: Never
EOF

Create the CPU benchmark pod:

(overcloud) [stack@perflab-director pods]$ oc create -f tensorflow-benchmarks-cpu.yaml
pod/tensorflow-benchmarks-cpu created

Because it takes a lot of time with CPU only, let's have a look inside the container:

(overcloud) [stack@perflab-director pods]$ oc rsh tensorflow-benchmarks-cpu

(overcloud) [stack@perflab-director pods]$ top
top - 22:18:38 up 10:35,  0 users,  load average: 6.07, 5.90, 5.10
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 85.9 us,  2.7 sy,  0.0 ni, 10.4 id,  0.0 wa,  1.0 hi,  0.0 si,  0.0 st
%Cpu1  : 86.7 us,  2.3 sy,  0.0 ni,  8.7 id,  0.0 wa,  1.0 hi,  1.3 si,  0.0 st
%Cpu2  : 87.9 us,  2.7 sy,  0.0 ni,  8.7 id,  0.0 wa,  0.7 hi,  0.0 si,  0.0 st
%Cpu3  : 85.3 us,  3.0 sy,  0.0 ni, 10.3 id,  0.0 wa,  1.0 hi,  0.3 si,  0.0 st
KiB Mem : 32936388 total,  5645924 free,  5718028 used, 21572436 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 26818208 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    30 root      20   0 27.200g 4.241g 512752 S 342.7 13.5  89:00.66 tf_cnn_benchmar
     1 root      20   0   20124   3588   3268 S   0.0  0.0   0:00.02 bash
    40 root      20   0   32100  10952   5444 S   0.0  0.0   0:00.03 python3
   386 root      20   0   23516   7076   3392 S   0.0  0.0   0:00.03 bash
   693 root      20   0   40460   3460   2976 R   0.0  0.0   0:00.00 top

We can also follow the CPU load in the console: TensorFlow benchmarks CPU

The pod switches to “Completed” status after 28 minutes:

[stack@perflab-director ~]$ oc get pod ; oc logs tensorflow-benchmarks-cpu|tail -20
NAME                        READY   STATUS      RESTARTS   AGE
tensorflow-benchmarks-cpu   0/1     Completed   0          28m
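
Rather than re-running `oc get pod` by hand for a long job like this one, the pod phase can be polled. This is only a sketch: the function name `wait_for_pod` and its defaults are hypothetical, and it assumes that a pod listed as “Completed” reports the `Succeeded` phase in `.status.phase`.

```shell
# Hypothetical helper: poll a pod until it reaches the Succeeded or Failed phase.
# $1 = pod name, $2 = max attempts (default 60), $3 = delay in seconds (default 5)
wait_for_pod() {
  pod="$1"; tries="${2:-60}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Ask the API server for the current pod phase only.
    phase=$(oc get pod "$pod" -o jsonpath='{.status.phase}')
    case "$phase" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for pod $pod" >&2
  return 1
}

# Usage (up to 40 minutes for the CPU run):
#   wait_for_pod tensorflow-benchmarks-cpu 80 30 && oc logs tensorflow-benchmarks-cpu | tail -20
```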

Check the CPU benchmark results. The training is slow, at 2.24 images/sec:

[stack@perflab-director ~]$ oc logs tensorflow-benchmarks-cpu
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/cpu:0']
NUMA bind:   False
Data format: NHWC
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec total_loss
1 images/sec: 2.2 +/- 0.0 (jitter = 0.0)  8.108
10  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  8.122
20  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.983
30  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.780
40  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.848
50  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.779
60  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.825
70  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.839
80  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.818
90  images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.648
100 images/sec: 2.2 +/- 0.0 (jitter = 0.0)  7.915
----------------------------------------------------------------
total images/sec: 2.24
----------------------------------------------------------------

With this setup, a single pod with one NVIDIA GPU trains resnet50 about 145 times faster than the CPU-only pod (325.03 vs. 2.24 images/sec) on Red Hat OpenShift and Red Hat OpenStack Platform.
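
The ratio is simply the GPU throughput divided by the CPU throughput. A small sketch of that arithmetic, using the two totals reported in the logs above (the function name `speedup` is hypothetical):

```shell
# Hypothetical helper: ratio of two throughput figures, rounded to one decimal.
# $1 = GPU images/sec, $2 = CPU images/sec
speedup() {
  awk -v gpu="$1" -v cpu="$2" 'BEGIN { printf "%.1f\n", gpu / cpu }'
}

# Usage with the totals reported above:
#   speedup 325.03 2.24
```

Both runs use synthetic ImageNet data with the same batch size and model, so the ratio compares like with like for this particular pod configuration.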

Grafana dashboards

We can connect to the Grafana dashboard:
https://grafana-openshift-monitoring.apps.perflab.lan.redhat.com

NRO data: Grafana

NFD data: Grafana

Prometheus data: Grafana

Etcd data: Grafana

Product documentation