NVIDIA vGPU with Red Hat OpenStack Platform 14
Red Hat OpenStack Platform 14 is now generally available \o/
NVIDIA GRID capabilities are available as a technology preview to support NVIDIA virtual GPU (vGPU): multiple OpenStack instances (virtual machines) can have simultaneous, direct access to a single physical GPU. vGPU configuration is fully automated via Red Hat OpenStack Platform director.
- Description of the platform
- Install KVM on RHEL 7.6 server
- Director installation
- Prepare templates
- Virtual GPU License Server
- Launch the deployment
- Test a vGPU instance
Prerequisites:
This documentation provides the information needed to use vGPU with Red Hat OpenStack Platform 14. We will not explain in detail how to deploy RHOSP; we will focus on the NVIDIA Tesla vGPU specific configuration.
Description of the platform
For this test we use two servers; the compute node has two GPU boards (an M10 and an M60). This lab environment uses heterogeneous GPU board types. For a production environment, I advise using a single card type, such as the Tesla V100 for the data center or the Tesla T4 for the edge.
- One Dell PowerEdge R730: compute node with two GPU cards:
  - NVIDIA Tesla M60 with two physical GPUs per PCI board
  - NVIDIA Tesla M10 with four physical GPUs per PCI board
- One Dell PowerEdge R440: qemu/kvm host for the director and controller VMs. This server will host these two control VMs:
  - lab-director: director to manage the RHOSP deployment
  - lab-controller: RHOSP controller
CUDA will only run with the full vGPU profile (8Q) on these older M10 and M60 Tesla boards. I encourage you to use V100 or T4 Tesla boards to be able to use all vGPU profiles.
The platform uses network isolation with VLANs on one NIC.
Install KVM on RHEL 7.6 server
Enable repositories:
[root@lab607 ~]# subscription-manager register --username myrhnaccount
[root@lab607 ~]# subscription-manager attach --pool=XXXXXXXXXXXXX
[root@lab607 ~]# subscription-manager repos --disable=*
[root@lab607 ~]# subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-openstack-14-rpms
Install OVS and KVM packages:
[root@lab607 ~]# yum install -y libguestfs libguestfs-tools openvswitch virt-install kvm libvirt libvirt-python python-virtinst libguestfs-xfs wget net-tools nmap
Enable libvirt:
[root@lab607 ~]# systemctl start libvirtd
[root@lab607 ~]# systemctl enable libvirtd
Disable firewalld and switch to iptables on the host lab607:
[root@lab607 ~]# systemctl stop firewalld
[root@lab607 ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@lab607 ~]# yum install iptables-services
[root@lab607 ~]# systemctl start iptables
[root@lab607 ~]# systemctl enable iptables
Start and enable Open vSwitch:
[root@lab607 ~]# systemctl start openvswitch
[root@lab607 ~]# systemctl enable openvswitch
Created symlink from /etc/systemd/system/multi-user.target.wants/openvswitch.service to /usr/lib/systemd/system/openvswitch.service.
Prepare br0:
[root@lab607 ~]# cat << EOF > /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=OVSBridge
DEVICETYPE=ovs
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR="192.168.168.7"
PREFIX="24"
EOF
[root@lab607 ~]# cat << EOF > /etc/sysconfig/network-scripts/ifcfg-em2
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=em2
UUID=c0da4182-e956-4c08-a9ab-9d11ccbfdaa6
DEVICE=em2
ONBOOT=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br0
EOF
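Bring up the bridge with its new configuration (this step is assumed here, it was not captured in the original transcript):
[root@lab607 ~]# ifup br0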
Create libvirt bridge br0:
[root@lab607 ~]# cat br0.xml
<network>
<name>br0</name>
<forward mode='bridge'/>
<bridge name='br0'/>
<virtualport type='openvswitch'/>
</network>
[root@lab607 ~]# sudo virsh net-define br0.xml
Network br0 defined from br0.xml
[root@lab607 ~]# virsh net-start br0
Network br0 started
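To make the libvirt network persistent across host reboots, you can also flag it as autostarted:
[root@lab607 ~]# virsh net-autostart br0
Network br0 marked as autostarted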
Director VM
Copy the link of the latest RHEL 7.6 KVM guest image (rhel-server-7.6-x86_64-kvm.qcow2) from https://access.redhat.com/downloads/:
[egallen@lab607 ~]$ sudo mkdir -p /data/inetsoft
[egallen@lab607 ~]$ cd /data/inetsoft
[egallen@lab607 inetsoft]$ curl -o rhel-server-7.6-x86_64-kvm.qcow2 "https://access.cdn.redhat.com/content/origin/files/sha256/XX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/rhel-server-7.6-x86_64-kvm.qcow2?user=XXXXXXXXXXXXXXXXXXXXXXXXXXX&_auth_=XXXXXXXXXXXXXX_XXXXXXXXXXXXXXXXXXXXXXXX"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 653M 100 653M 0 0 54.7M 0 0:00:11 0:00:11 --:--:-- 64.5M
[egallen@lab607 inetsoft]$ ls -lah rhel-server-7.6-x86_64-kvm.qcow2
-rw-rw-r--. 1 egallen egallen 654M Jan 16 04:45 rhel-server-7.6-x86_64-kvm.qcow2
Create the qcow2 file:
[egallen@lab607 ~]$ sudo qemu-img create -f qcow2 -o preallocation=metadata /var/lib/libvirt/images/lab-director.qcow2 120G;
Formatting '/var/lib/libvirt/images/lab-director.qcow2', fmt=qcow2 size=128849018880 cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16
[egallen@lab607 ~]$ sudo virt-resize --expand /dev/sda1 /data/inetsoft/rhel-server-7.6-x86_64-kvm.qcow2 /var/lib/libvirt/images/lab-director.qcow2
[ 0.0] Examining /data/inetsoft/rhel-server-7.6-x86_64-kvm.qcow2
◓ 25% ⟦▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒════════════════════════════════════════════════════════════════════════════════════════════════════════════⟧ --:--
100% ⟦▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒⟧ --:--
**********
Summary of changes:
/dev/sda1: This partition will be resized from 7.8G to 120.0G. The
filesystem xfs on /dev/sda1 will be expanded using the 'xfs_growfs'
method.
**********
[ 18.2] Setting up initial partition table on /var/lib/libvirt/images/lab-director.qcow2
[ 18.3] Copying /dev/sda1
100% ⟦▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒⟧ 00:00
[ 43.7] Expanding /dev/sda1 using the 'xfs_growfs' method
Resize operation completed with no errors. Before deleting the old disk,
carefully check that the resized disk boots and works correctly.
[egallen@lab607 ~]$ sudo virt-customize -a /var/lib/libvirt/images/lab-director.qcow2 --uninstall cloud-init --root-password password:XXXXXXXXX
[ 0.0] Examining the guest ...
[ 1.9] Setting a random seed
[ 1.9] Setting the machine ID in /etc/machine-id
[ 1.9] Uninstalling packages: cloud-init
[ 3.2] Setting passwords
[ 4.4] Finishing off
[root@lab607 ~]# sudo virt-install --ram 32768 --vcpus 8 --os-variant rhel7 \
--disk path=/var/lib/libvirt/images/lab-director.qcow2,device=disk,bus=virtio,format=qcow2 \
--graphics vnc,listen=0.0.0.0 --noautoconsole \
--network network:default \
--network network:br0 \
--name lab-director --dry-run \
--print-xml > /tmp/lab-director.xml;
[root@lab607 ~]# sudo virsh define --file /tmp/lab-director.xml
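Optionally confirm that the domain is registered before starting it:
[root@lab607 ~]# virsh list --all | grep lab-director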
Start the VM and get the IP assigned by the dnsmasq DHCP server:
[egallen@lab607 ~]$ sudo virsh start lab-director
Domain lab-director started
[egallen@lab607 ~]$ sudo cat /var/log/messages | grep dnsmasq-dhcp
Jan 16 04:56:50 lab607 dnsmasq-dhcp[2242]: DHCPDISCOVER(virbr0) 52:54:00:5e:9c:32
Jan 16 04:56:50 lab607 dnsmasq-dhcp[2242]: DHCPOFFER(virbr0) 192.168.122.103 52:54:00:5e:9c:32
Jan 16 04:56:50 lab607 dnsmasq-dhcp[2242]: DHCPREQUEST(virbr0) 192.168.122.103 52:54:00:5e:9c:32
Jan 16 04:56:50 lab607 dnsmasq-dhcp[2242]: DHCPACK(virbr0) 192.168.122.103 52:54:00:5e:9c:32
Log into the VM:
[egallen@lab607 ~]$ ssh root@192.168.122.103
The authenticity of host '192.168.122.103 (192.168.122.103)' can't be established.
ECDSA key fingerprint is SHA256:yf9AxYVo2d5FwuLeAEswbWK83IYVYDcvTbJdtxEcv5A.
ECDSA key fingerprint is MD5:86:77:4e:e0:85:89:5c:03:82:ce:59:e0:a4:dd:92:4d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.122.103' (ECDSA) to the list of known hosts.
root@192.168.122.103's password:
[root@localhost ~]#
Check RHEL release:
[root@lab-director ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Set hostname:
[root@localhost ~]# hostnamectl set-hostname lab-director.lan.redhat.com
[root@localhost ~]# exit
[egallen@lab607 ~]$ ssh root@192.168.122.103
root@192.168.122.245's password:
Last login: Wed Jan 16 05:02:35 2019 from gateway
[root@lab-director ~]#
Add stack user:
[root@lab-director ~]# adduser stack
[root@lab-director ~]# passwd stack
Changing password for user stack.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@lab-director ~]# echo "stack ALL=(root) NOPASSWD:ALL" | tee -a /etc/sudoers.d/stack
stack ALL=(root) NOPASSWD:ALL
[root@lab-director ~]# chmod 0440 /etc/sudoers.d/stack
Copy SSH key:
[egallen@lab607 ~]$ ssh-copy-id stack@192.168.122.103
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
stack@192.168.122.103's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'stack@192.168.122.103'"
and check to make sure that only the key(s) you wanted were added.
[egallen@lab607 ~]$ ssh stack@192.168.122.103
[stack@lab-director ~]$
Set up network configuration:
[stack@lab-director ~]$ sudo tee /etc/sysconfig/network-scripts/ifcfg-eth0 << EOF
DEVICE="eth0"
BOOTPROTO="none"
ONBOOT="yes"
TYPE="Ethernet"
USERCTL="yes"
PEERDNS="yes"
IPV6INIT="yes"
PERSISTENT_DHCLIENT="1"
IPADDR=192.168.122.10
PREFIX=24
GATEWAY=192.168.122.1
DNS1=192.168.122.1
EOF
[stack@lab-director ~]$ sudo reboot
[egallen@lab607 ~]$ ssh stack@lab-director
The authenticity of host 'lab-director (192.168.122.10)' can't be established.
ECDSA key fingerprint is SHA256:yf9AxYVo2d5FwuLeAEswbWK83IYVYDcvTbJdtxEcv5A.
ECDSA key fingerprint is MD5:86:77:4e:e0:85:89:5c:03:82:ce:59:e0:a4:dd:92:4d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'lab-director,192.168.122.10' (ECDSA) to the list of known hosts.
Last login: Wed Jan 16 05:15:40 2019 from gateway
[stack@lab-director ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:5e:9c:32 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.10/24 brd 192.168.122.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe5e:9c32/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:54:84:08 brd ff:ff:ff:ff:ff:ff
Set up /etc/hosts:
[stack@lab-director ~]$ sudo tee /etc/hosts << EOF
127.0.0.1 lab-director localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
EOF
Enable RHOSP14 repositories:
[stack@lab-director ~]$ sudo subscription-manager register --username myrhnaccount
Registering to: subscription.rhsm.redhat.com:443/subscription
Password:
The system has been registered with ID: XXXXX-XXXX-XXXX-XXXXXXXX
The registered system name is: lab-director.lan.redhat.com
WARNING
The yum/dnf plugins: /etc/yum/pluginconf.d/subscription-manager.conf, /etc/yum/pluginconf.d/product-id.conf were automatically enabled for the benefit of Red Hat Subscription Management. If not desired, use "subscription-manager config --rhsm.auto_enable_yum_plugins=0" to block this behavior.
List available pools:
[stack@lab-director ~]$ sudo subscription-manager list --available
Grab the “Pool ID” that provides: “Red Hat OpenStack” and attach the pool ID:
[stack@lab-director ~]$ sudo subscription-manager attach --pool=XXXXXXXXXXXXXXXXXXXXXXXX
Successfully attached a subscription for: SKU
Disable all repositories:
[stack@lab-director ~]$ sudo subscription-manager repos --disable=*
...
Enable RHEL7 and RHOSP14 repositories only:
[stack@lab-director ~]$ sudo subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-openstack-14-rpms --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-rh-common-rpms --enable=rhel-ha-for-rhel-7-server-rpms
Repository 'rhel-7-server-rpms' is enabled for this system.
Repository 'rhel-ha-for-rhel-7-server-rpms' is enabled for this system.
Repository 'rhel-7-server-extras-rpms' is enabled for this system.
Repository 'rhel-7-server-openstack-14-rpms' is enabled for this system.
Repository 'rhel-7-server-rh-common-rpms' is enabled for this system.
Upgrade:
[stack@lab-director ~]$ sudo yum upgrade -y
…
Complete!
Make a snapshot on the KVM host of this fresh RHEL 7.6 VM:
[stack@lab-director ~]$ sudo systemctl poweroff
Connection to lab-director closed by remote host.
Connection to lab-director closed.
[egallen@lab607 ~]$ sudo virsh snapshot-list lab-director
[egallen@lab607 ~]$ sudo virsh snapshot-create-as lab-director 20190116-1-rhel76
Domain snapshot 20190116-1-rhel76 created
[egallen@lab607 ~]$ sudo virsh snapshot-list lab-director
Name Creation Time State
------------------------------------------------------------
20190116-1-rhel76 2019-01-16 05:42:35 -0500 shutoff
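If a later step breaks the VM, you can roll back to this snapshot at any time:
[egallen@lab607 ~]$ sudo virsh snapshot-revert lab-director 20190116-1-rhel76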
Install director binaries:
[stack@lab-director ~]$ time sudo yum install -y python-tripleoclient
Loaded plugins: product-id, search-disabled-repos, subscription-manager
…
Complete!
real 4m47.428s
user 1m41.595s
sys 0m26.743s
Prepare director configuration:
[stack@lab-director ~]$ cat << EOF > /home/stack/undercloud.conf
[DEFAULT]
undercloud_hostname = lab-director.lan.redhat.com
local_ip = 172.16.16.10/24
undercloud_public_host = 172.16.16.11
undercloud_admin_host = 172.16.16.12
local_interface = eth1
undercloud_nameservers = 10.19.43.29
undercloud_ntp_servers = clock.redhat.com
overcloud_domain_name = lan.redhat.com
container_images_file=/home/stack/containers-prepare-parameter.yaml
docker_insecure_registries=172.16.16.12:8787,172.16.16.10:8787,docker-registry.engineering.redhat.com,brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
subnets = ctlplane-subnet
local_subnet = ctlplane-subnet

[ctlplane-subnet]
cidr = 172.16.16.0/24
dhcp_start = 172.16.16.201
dhcp_end = 172.16.16.220
inspection_iprange = 172.16.16.221,172.16.16.230
gateway = 172.16.16.10
masquerade = true
EOF
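Before launching the undercloud install, it is worth checking that the interface referenced by local_interface actually exists on the director VM (eth1 is the NIC plugged into br0 here):
[stack@lab-director ~]$ ip link show eth1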
Generate default container configuration:
[stack@lab-director ~]$ time openstack tripleo container image prepare default --local-push-destination --output-env-file ~/containers-prepare-parameter.yaml
real 0m1.115s
user 0m0.714s
sys 0m0.251s
Check container configuration:
[stack@lab-director ~]$ cat ~/containers-prepare-parameter.yaml
# Generated with the following on 2019-01-16T10:41:41.881679
#
# openstack tripleo container image prepare default --local-push-destination --output-env-file /home/stack/containers-prepare-parameter.yaml
#
parameter_defaults:
  ContainerImagePrepare:
  - push_destination: true
    set:
      ceph_image: rhceph-3-rhel7
      ceph_namespace: registry.access.redhat.com/rhceph
      ceph_tag: latest
      name_prefix: openstack-
      name_suffix: ''
      namespace: registry.access.redhat.com/rhosp14
      neutron_driver: null
      openshift_asb_namespace: registry.access.redhat.com/openshift3
      openshift_asb_tag: v3.11
      openshift_cluster_monitoring_image: ose-cluster-monitoring-operator
      openshift_cluster_monitoring_namespace: registry.access.redhat.com/openshift3
      openshift_cluster_monitoring_tag: v3.11
      openshift_cockpit_image: registry-console
      openshift_cockpit_namespace: registry.access.redhat.com/openshift3
      openshift_cockpit_tag: v3.11
      openshift_configmap_reload_image: ose-configmap-reloader
      openshift_configmap_reload_namespace: registry.access.redhat.com/openshift3
      openshift_configmap_reload_tag: v3.11
      openshift_etcd_image: etcd
      openshift_etcd_namespace: registry.access.redhat.com/rhel7
      openshift_etcd_tag: latest
      openshift_gluster_block_image: rhgs-gluster-block-prov-rhel7
      openshift_gluster_image: rhgs-server-rhel7
      openshift_gluster_namespace: registry.access.redhat.com/rhgs3
      openshift_gluster_tag: latest
      openshift_grafana_namespace: registry.access.redhat.com/openshift3
      openshift_grafana_tag: v3.11
      openshift_heketi_image: rhgs-volmanager-rhel7
      openshift_heketi_namespace: registry.access.redhat.com/rhgs3
      openshift_heketi_tag: latest
      openshift_kube_rbac_proxy_image: ose-kube-rbac-proxy
      openshift_kube_rbac_proxy_namespace: registry.access.redhat.com/openshift3
      openshift_kube_rbac_proxy_tag: v3.11
      openshift_kube_state_metrics_image: ose-kube-state-metrics
      openshift_kube_state_metrics_namespace: registry.access.redhat.com/openshift3
      openshift_kube_state_metrics_tag: v3.11
      openshift_namespace: registry.access.redhat.com/openshift3
      openshift_oauth_proxy_tag: v3.11
      openshift_prefix: ose
      openshift_prometheus_alertmanager_tag: v3.11
      openshift_prometheus_config_reload_image: ose-prometheus-config-reloader
      openshift_prometheus_config_reload_namespace: registry.access.redhat.com/openshift3
      openshift_prometheus_config_reload_tag: v3.11
      openshift_prometheus_node_exporter_tag: v3.11
      openshift_prometheus_operator_image: ose-prometheus-operator
      openshift_prometheus_operator_namespace: registry.access.redhat.com/openshift3
      openshift_prometheus_operator_tag: v3.11
      openshift_prometheus_tag: v3.11
      openshift_tag: v3.11
      tag: latest
    tag_from_label: '{version}-{release}'
Install director:
[stack@lab-director ~]$ time openstack undercloud install
…
########################################################
Deployment successfull!
########################################################
Writing the stack virtual update mark file /var/lib/tripleo-heat-installer/update_mark_undercloud
##########################################################
The Undercloud has been successfully installed.
Useful files:
Password file is at ~/undercloud-passwords.conf
The stackrc file is at ~/stackrc
Use these files to interact with OpenStack services, and
ensure they are secured.
##########################################################
real 30m53.087s
user 10m50.567s
sys 1m30.979s
Test Keystone:
(undercloud) [stack@lab-director ~]$ source stackrc
(undercloud) [stack@lab-director ~]$ openstack token issue
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| expires | 2019-01-17T00:04:18+0000 |
| id | XXXXXXXXXXXXXXXXXXXXXXXXXXXX |
| project_id | 8eaa8db43a6743c5851cc5219056dd80 |
| user_id | f9a61486cc59444481e401b4851ac3d3 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Set up networks
List the undercloud ctlplane subnets:
(undercloud) [stack@lab-director ~]$ openstack subnet list
+--------------------------------------+-----------------+--------------------------------------+----------------+
| ID | Name | Network | Subnet |
+--------------------------------------+-----------------+--------------------------------------+----------------+
| e0eb61d1-932c-4fb2-bd89-86c0d43adbdb | ctlplane-subnet | 8088e8ba-21ad-4b0f-98ec-83f24cc3ca39 | 172.16.16.0/24 |
+--------------------------------------+-----------------+--------------------------------------+----------------+
Show the ctlplane subnet details:
(undercloud) [stack@lab-director ~]$ openstack subnet show ctlplane-subnet
+-------------------+----------------------------------------------------------+
| Field | Value |
+-------------------+----------------------------------------------------------+
| allocation_pools | 172.16.16.201-172.16.16.220 |
| cidr | 172.16.16.0/24 |
| created_at | 2019-01-16T17:20:41Z |
| description | |
| dns_nameservers | 10.19.43.29 |
| enable_dhcp | True |
| gateway_ip | 172.16.16.10 |
| host_routes | destination='169.254.169.254/32', gateway='172.16.16.12' |
| id | e0eb61d1-932c-4fb2-bd89-86c0d43adbdb |
| ip_version | 4 |
| ipv6_address_mode | None |
| ipv6_ra_mode | None |
| name | ctlplane-subnet |
| network_id | 8088e8ba-21ad-4b0f-98ec-83f24cc3ca39 |
| project_id | 8eaa8db43a6743c5851cc5219056dd80 |
| revision_number | 0 |
| segment_id | None |
| service_types | |
| subnetpool_id | None |
| tags | |
| updated_at | 2019-01-16T17:20:41Z |
+-------------------+----------------------------------------------------------+
Prepare overcloud default image for the controller
Import images into glance:
[stack@lab-director ~]$ sudo yum install -y rhosp-director-images rhosp-director-images-ipa libguestfs-tools
[stack@lab-director ~]$ sudo mkdir -p /var/images/x86_64
[stack@lab-director ~]$ sudo chown stack:stack /var/images/x86_64
[stack@lab-director ~]$ cd /var/images/x86_64
(undercloud) [stack@lab-director x86_64]$ for i in /usr/share/rhosp-director-images/overcloud-full-latest-14.0.tar /usr/share/rhosp-director-images/ironic-python-agent-latest-14.0.tar; do tar -xvf $i; done
Customize and upload the default overcloud image, which will be used only for the controller:
[stack@lab-director x86_64]$ virt-customize -a overcloud-full.qcow2 --root-password password:XXXXXX
[ 0.0] Examining the guest ...
[ 40.3] Setting a random seed
[ 40.5] Setting the machine ID in /etc/machine-id
[ 40.6] Setting passwords
[ 59.1] Finishing off
[stack@lab-director ~]$ source ~/stackrc
(undercloud) [stack@lab-director ~]$ openstack overcloud image upload --update-existing --image-path /var/images/x86_64
Image "overcloud-full-vmlinuz" was uploaded.
+--------------------------------------+------------------------+-------------+---------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+------------------------+-------------+---------+--------+
| 216aea4a-912a-42a2-9cab-441d289f4fb6 | overcloud-full-vmlinuz | aki | 6639920 | active |
+--------------------------------------+------------------------+-------------+---------+--------+
Image "overcloud-full-initrd" was uploaded.
+--------------------------------------+-----------------------+-------------+----------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+-----------------------+-------------+----------+--------+
| d16fce37-fab6-4eb7-b19a-4b3f572e0214 | overcloud-full-initrd | ari | 60133074 | active |
+--------------------------------------+-----------------------+-------------+----------+--------+
Image "overcloud-full" was uploaded.
+--------------------------------------+----------------+-------------+------------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+----------------+-------------+------------+--------+
| ee868592-8a69-4636-9c02-92cd56805de0 | overcloud-full | qcow2 | 1031274496 | active |
+--------------------------------------+----------------+-------------+------------+--------+
Image "bm-deploy-kernel" was uploaded.
+--------------------------------------+------------------+-------------+---------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+------------------+-------------+---------+--------+
| f1b189e3-d3a0-4b33-acad-81f01ade39f4 | bm-deploy-kernel | aki | 6639920 | active |
+--------------------------------------+------------------+-------------+---------+--------+
Image "bm-deploy-ramdisk" was uploaded.
+--------------------------------------+-------------------+-------------+-----------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+-------------------+-------------+-----------+--------+
| f8c47730-5fa7-459e-b3ec-95c4d2c792ae | bm-deploy-ramdisk | ari | 420645365 | active |
+--------------------------------------+-------------------+-------------+-----------+--------+
Check the uploaded images:
(undercloud) [stack@lab-director ~]$ openstack image list
+--------------------------------------+------------------------+--------+
| ID | Name | Status |
+--------------------------------------+------------------------+--------+
| f1b189e3-d3a0-4b33-acad-81f01ade39f4 | bm-deploy-kernel | active |
| f8c47730-5fa7-459e-b3ec-95c4d2c792ae | bm-deploy-ramdisk | active |
| ee868592-8a69-4636-9c02-92cd56805de0 | overcloud-full | active |
| d16fce37-fab6-4eb7-b19a-4b3f572e0214 | overcloud-full-initrd | active |
| 216aea4a-912a-42a2-9cab-441d289f4fb6 | overcloud-full-vmlinuz | active |
+--------------------------------------+------------------------+--------+
Prepare overcloud GPU image for compute nodes
Install iso tooling:
[stack@lab-director x86_64]$ sudo yum install genisoimage -y
[stack@lab-director x86_64]$ mkdir nvidia-package
Prepare an ISO file with the NVIDIA vGPU manager RPM downloaded from the NVIDIA web site:
[stack@lab-director nvidia-package]$ ls
NVIDIA-vGPU-rhel-7.6-410.91.x86_64.rpm
[stack@lab-director x86_64]$ genisoimage -o nvidia-package.iso -R -J -V NVIDIA-PACKAGE nvidia-package/
I: -input-charset not specified, using utf-8 (detected in locale settings)
67.01% done, estimate finish Wed Jan 16 16:57:14 2019
Total translation table size: 0
Total rockridge attributes bytes: 279
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 0
7472 extents written (14 MB)
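You can optionally check that the RPM landed on the ISO image (isoinfo is shipped with the genisoimage package):
[stack@lab-director x86_64]$ isoinfo -R -l -i nvidia-package.iso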
Prepare a customization script for the GPU image:
[stack@lab-director x86_64]$ cat << EOF > install-vgpu-grid-drivers.sh
#!/bin/bash
# Remove nouveau driver
if grep -Fxq "modprobe.blacklist=nouveau" /etc/default/grub
then
  # Nouveau already removed
  echo "Nothing to do for Nouveau driver"
  echo ">>>> /etc/default/grub:"
  cat /etc/default/grub | grep nouveau
else
  # To fix https://bugzilla.redhat.com/show_bug.cgi?id=1646871
  sed -i 's/crashkernel=auto\"\ /crashkernel=auto\ /' /etc/default/grub
  sed -i 's/net\.ifnames=0$/net.ifnames=0\"/' /etc/default/grub
  # Nouveau driver not removed
  sed -i 's/GRUB_CMDLINE_LINUX="[^"]*/& modprobe.blacklist=nouveau/' /etc/default/grub
  echo ">>>> /etc/default/grub:"
  cat /etc/default/grub | grep nouveau
  grub2-mkconfig -o /boot/grub2/grub.cfg
fi
# NVIDIA GRID package
mkdir /tmp/mount
mount LABEL=NVIDIA-PACKAGE /tmp/mount
rpm -ivh /tmp/mount/NVIDIA-vGPU-rhel-7.6-410.91.x86_64.rpm
EOF
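Since this script will run unattended inside virt-customize, a quick syntax check before embedding it is cheap:
[stack@lab-director x86_64]$ bash -n install-vgpu-grid-drivers.sh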
Customize GPU image:
[stack@lab-director x86_64]$ cp overcloud-full.qcow2 overcloud-gpu.qcow2
[stack@lab-director x86_64]$ virt-customize --attach nvidia-package.iso -a overcloud-gpu.qcow2 -v --run install-vgpu-grid-drivers.sh
[ 0.0] Examining the guest ...
libguestfs: creating COW overlay to protect original drive content
libguestfs: command: run: qemu-img
libguestfs: command: run: \ info
libguestfs: command: run: \ --output json
libguestfs: command: run: \ ./nvidia-package.iso
libguestfs: parse_json: qemu-img info JSON output:\n{\n "virtual-size": 15302656,\n "filename": "./nvidia-package.iso",\n "format": "raw",\n "actual-size": 15302656,\n
Relabel with SELinux:
[stack@lab-director x86_64]$ time virt-customize -a overcloud-gpu.qcow2 --selinux-relabel
[ 0.0] Examining the guest ...
[ 25.3] Setting a random seed
[ 25.5] SELinux relabelling
[ 387.5] Finishing off
real 6m28.262s
user 0m0.086s
sys 0m0.059s
Prepare vmlinuz and initrd files for glance:
[stack@lab-director x86_64]$ sudo mkdir /tmp/initrd
[stack@lab-director x86_64]$ guestmount -a overcloud-gpu.qcow2 -i --ro /tmp/initrd
[stack@lab-director x86_64]$ df -T | grep fuse
/dev/fuse fuse 3524864 2207924 1316940 63% /tmp/initrd
[stack@lab-director x86_64]$ ls -la /tmp/initrd/boot
total 138476
dr-xr-xr-x. 4 root root 4096 Jan 16 17:21 .
drwxr-xr-x. 17 root root 224 Jan 3 07:21 ..
-rw-r--r--. 1 root root 151922 Nov 15 12:41 config-3.10.0-957.1.3.el7.x86_64
drwxr-xr-x. 3 root root 17 Jan 3 06:18 efi
drwx------. 5 root root 97 Jan 16 17:05 grub2
-rw-------. 1 root root 53704670 Jan 3 06:25 initramfs-0-rescue-0bc51786d56a4645b23dcd0510229358.img
-rw-------. 1 root root 70778142 Jan 16 17:20 initramfs-3.10.0-957.1.3.el7.x86_64.img
-rw-r--r--. 1 root root 314072 Nov 15 12:41 symvers-3.10.0-957.1.3.el7.x86_64.gz
-rw-------. 1 root root 3544010 Nov 15 12:41 System.map-3.10.0-957.1.3.el7.x86_64
-rwxr-xr-x. 1 root root 6639920 Jan 3 06:25 vmlinuz-0-rescue-0bc51786d56a4645b23dcd0510229358
-rwxr-xr-x. 1 root root 6639920 Nov 15 12:41 vmlinuz-3.10.0-957.1.3.el7.x86_64
-rw-r--r--. 1 root root 170 Nov 15 12:41 .vmlinuz-3.10.0-957.1.3.el7.x86_64.hmac
[stack@lab-director x86_64]$ cp /tmp/initrd/boot/vmlinuz*x86_64 ./overcloud-gpu.vmlinuz
[stack@lab-director x86_64]$ cp /tmp/initrd/boot/initramfs-3*x86_64.img ./overcloud-gpu.initrd
[stack@lab-director x86_64]$ ls -lah overcloud-gpu*
-rw-------. 1 stack stack 68M Jan 17 02:52 overcloud-gpu.initrd
-rw-r--r--. 1 qemu qemu 1.5G Jan 16 17:49 overcloud-gpu.qcow2
-rwxr-xr-x. 1 stack stack 6.4M Jan 17 02:52 overcloud-gpu.vmlinuz
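The image is still guest-mounted read-only at /tmp/initrd; unmount it before uploading (guestunmount ships with libguestfs):
[stack@lab-director x86_64]$ guestunmount /tmp/initrd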
Upload GPU image into director glance service:
[stack@lab-director x86_64]$ source ~/stackrc
(undercloud) [stack@lab-director x86_64]$ openstack overcloud image upload --update-existing --os-image-name overcloud-gpu.qcow2
Image "overcloud-gpu-vmlinuz" was uploaded.
+--------------------------------------+-----------------------+-------------+---------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+-----------------------+-------------+---------+--------+
| 89db0719-ef1e-41a0-b9f7-6ff0460ec2ef | overcloud-gpu-vmlinuz | aki | 6639920 | active |
+--------------------------------------+-----------------------+-------------+---------+--------+
Image "overcloud-gpu-initrd" was uploaded.
+--------------------------------------+----------------------+-------------+----------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+----------------------+-------------+----------+--------+
| 4307ee93-bf2e-4c9e-85e6-c9679401268c | overcloud-gpu-initrd | ari | 70778142 | active |
+--------------------------------------+----------------------+-------------+----------+--------+
Image "overcloud-gpu" was uploaded.
+--------------------------------------+---------------+-------------+------------+--------+
| ID | Name | Disk Format | Size | Status |
+--------------------------------------+---------------+-------------+------------+--------+
| f23cb412-c6d3-4301-b84b-9188fc2507c2 | overcloud-gpu | qcow2 | 1508376576 | active |
+--------------------------------------+---------------+-------------+------------+--------+
Image "bm-deploy-kernel" is up-to-date, skipping.
Image "bm-deploy-ramdisk" is up-to-date, skipping.
Image file "/var/lib/ironic/httpboot/agent.kernel" is up-to-date, skipping.
Image file "/var/lib/ironic/httpboot/agent.ramdisk" is up-to-date, skipping.
(undercloud) [stack@lab-director x86_64]$ openstack image list
+--------------------------------------+------------------------+--------+
| ID | Name | Status |
+--------------------------------------+------------------------+--------+
| f1b189e3-d3a0-4b33-acad-81f01ade39f4 | bm-deploy-kernel | active |
| f8c47730-5fa7-459e-b3ec-95c4d2c792ae | bm-deploy-ramdisk | active |
| ee868592-8a69-4636-9c02-92cd56805de0 | overcloud-full | active |
| d16fce37-fab6-4eb7-b19a-4b3f572e0214 | overcloud-full-initrd | active |
| 216aea4a-912a-42a2-9cab-441d289f4fb6 | overcloud-full-vmlinuz | active |
| 3c59f45e-5bdf-466f-b0d0-1c68a1afbce6 | overcloud-gpu | active |
| a25fee19-85c3-4994-b2a9-3269f0cda2fe | overcloud-gpu-initrd | active |
| 4d6be38e-2a53-4fd0-a6c9-98aef0549363 | overcloud-gpu-vmlinuz | active |
+--------------------------------------+------------------------+--------+
Copy default roles templates:
(undercloud) [stack@lab-director ~]$ cp -r /usr/share/openstack-tripleo-heat-templates/roles ~/templates
Create the new ComputeGpu role file:
(undercloud) [stack@lab-director ~]$ cp /home/stack/templates/roles/Compute.yaml ~/templates/roles/ComputeGpu.yaml
Update the new ComputeGpu role file:
(undercloud) [stack@lab-director ~]$ sed -i s/Role:\ Compute/Role:\ ComputeGpu/g ~/templates/roles/ComputeGpu.yaml
(undercloud) [stack@lab-director ~]$ sed -i s/name:\ Compute/name:\ ComputeGpu/g ~/templates/roles/ComputeGpu.yaml
(undercloud) [stack@lab-director ~]$ sed -i s/Basic\ Compute\ Node\ role/GPU\ Compute\ Node\ role/g ~/templates/roles/ComputeGpu.yaml
(undercloud) [stack@lab-director ~]$ sed -i '/CountDefault:\ 1/a \ \ ImageDefault: overcloud-gpu' ~/templates/roles/ComputeGpu.yaml
(undercloud) [stack@lab-director ~]$ sed -i s/-compute-/-computegpu-/g /home/stack/templates/roles/ComputeGpu.yaml
(undercloud) [stack@lab-director ~]$ sed -i s/compute\.yaml/compute-gpu\.yaml/g ~/templates/roles/ComputeGpu.yaml
Generate the NVIDIA GPU role:
(undercloud) [stack@lab-director ~]$ openstack overcloud roles generate --roles-path /home/stack/templates/roles -o /home/stack/templates/gpu_roles_data.yaml Controller Compute ComputeGpu
Check the role template generated:
(undercloud) [stack@lab-director ~]$ cat /home/stack/templates/gpu_roles_data.yaml | grep -i gpu
# Role: ComputeGpu #
- name: ComputeGpu
GPU Compute Node role
HostnameFormatDefault: '%stackname%-computegpu-%index%'
List bare metal nodes:
(undercloud) [stack@lab-director ~]$ openstack baremetal node list
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | None | power off | available | False |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | None | power off | available | False |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@lab-director templates]$ openstack baremetal node set --property capabilities='profile:compute-gpu,boot_option:local' lab-compute
(undercloud) [stack@lab-director templates]$ openstack overcloud profiles list
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
| Node UUID | Node Name | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | available | control | |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | available | compute-gpu | |
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
Create a GPU flavor:
(undercloud) [stack@lab-director templates]$ openstack flavor create --id auto --ram 6144 --disk 40 --vcpus 4 compute-gpu
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 40 |
| id | b098aaf1-f32b-458f-b4bd-4f48dba27a7a |
| name | compute-gpu |
| os-flavor-access:is_public | True |
| properties | |
| ram | 6144 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------+
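For the director scheduling to match this flavor with the node tagged compute-gpu, the flavor also needs the capability properties carried by the default deployment flavors (a step assumed here, consistent with the node capabilities set above):
(undercloud) [stack@lab-director templates]$ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-gpu" compute-gpu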
(undercloud) [stack@lab-director ~]$ openstack flavor list
+--------------------------------------+---------------+------+------+-----------+-------+-----------+
| ID | Name | RAM | Disk | Ephemeral | VCPUs | Is Public |
+--------------------------------------+---------------+------+------+-----------+-------+-----------+
| 0372c486-0983-43ee-97ca-6ec5cc7c6065 | block-storage | 4096 | 40 | 0 | 1 | True |
| 18baee14-3a48-4c11-90f3-9b280c03dba4 | ceph-storage | 4096 | 40 | 0 | 1 | True |
| 5b750b0f-f3ac-473c-ac81-078f81d525bb | baremetal | 4096 | 40 | 0 | 1 | True |
| 6893f4c3-0129-4b0b-a641-0cca91f07e3a | control | 4096 | 40 | 0 | 1 | True |
| afbf6532-ca93-4529-a20f-71c2e0232667 | swift-storage | 4096 | 40 | 0 | 1 | True |
| b098aaf1-f32b-458f-b4bd-4f48dba27a7a | compute-gpu | 6144 | 40 | 0 | 4 | True |
| be857621-558c-4908-8939-fe6df4b0f4cf | compute | 4096 | 40 | 0 | 1 | True |
+--------------------------------------+---------------+------+------+-----------+-------+-----------+
Introspection
Prepare instackenv.json:
[stack@lab-director ~]$ cat << EOF > ~/instackenv.json
{
  "nodes": [
    {
      "capabilities": "profile:control",
      "name": "lab-controller",
      "pm_user": "admin",
      "pm_password": "XXXXXXXX",
      "pm_port": "6230",
      "pm_addr": "192.168.122.1",
      "pm_type": "pxe_ipmitool",
      "mac": [
        "52:54:00:f9:aa:aa"
      ]
    },
    {
      "capabilities": "profile:compute",
      "name": "lab-compute",
      "pm_user": "root",
      "pm_password": "XXXXXXXXX",
      "pm_addr": "10.19.152.193",
      "pm_type": "pxe_ipmitool",
      "mac": [
        "18:66:da:fc:aa:aa"
      ]
    }
  ]
}
EOF
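The lab-controller entry points at 192.168.122.1:6230, an IPMI endpoint assumed to be provided by virtualbmc on the KVM host; a minimal sketch of that setup:
[root@lab607 ~]# yum install -y python-virtualbmc
[root@lab607 ~]# vbmc add lab-controller --port 6230 --username admin --password XXXXXXXX
[root@lab607 ~]# vbmc start lab-controller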
Import the bare metal node information into the Ironic database:
(undercloud) [stack@lab-director ~]$ openstack overcloud node import ~/instackenv.json
Waiting for messages on queue 'tripleo' with no timeout.
2 node(s) successfully moved to the "manageable" state.
Successfully registered node UUID 6031ff6f-4e90-4ec0-8ec6-0a360881dac7
Successfully registered node UUID 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03
List bare metal nodes:
(undercloud) [stack@lab-director ~]$ openstack baremetal node list
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | None | power off | manageable | False |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | None | power off | manageable | False |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
Launch introspection:
(undercloud) [stack@lab-director ~]$ openstack overcloud node introspect --all-manageable --provide
Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 completed. Status:SUCCESS. Errors:None
Introspection of node 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 completed. Status:SUCCESS. Errors:None
Successfully introspected 2 node(s).
Introspection completed.
Waiting for messages on queue 'tripleo' with no timeout.
2 node(s) successfully moved to the "available" state.
Introspection done, check bare metal status:
(undercloud) [stack@lab-director ~]$ openstack baremetal node list
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | None | power off | available | False |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | None | power off | available | False |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@lab-director ~]$ openstack baremetal introspection list
+--------------------------------------+---------------------+---------------------+-------+
| UUID | Started at | Finished at | Error |
+--------------------------------------+---------------------+---------------------+-------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | 2019-01-16T22:28:51 | 2019-01-16T22:30:49 | None |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | 2019-01-16T22:28:51 | 2019-01-16T22:35:01 | None |
+--------------------------------------+---------------------+---------------------+-------+
Check introspection:
(undercloud) [stack@lab-director ~]$ openstack baremetal introspection interface list lab-controller
+-----------+-------------------+----------------------+-------------------+----------------+
| Interface | MAC Address | Switch Port VLAN IDs | Switch Chassis ID | Switch Port ID |
+-----------+-------------------+----------------------+-------------------+----------------+
| eth0 | 52:54:00:f9:d1:7a | [] | None | None |
+-----------+-------------------+----------------------+-------------------+----------------+
(undercloud) [stack@lab-director ~]$ openstack baremetal introspection interface list lab-compute
+-----------+-------------------+----------------------+-------------------+----------------+
| Interface | MAC Address | Switch Port VLAN IDs | Switch Chassis ID | Switch Port ID |
+-----------+-------------------+----------------------+-------------------+----------------+
| p2p1 | 3c:fd:fe:1f:bc:80 | [] | None | None |
| em4 | 18:66:da:fc:9c:2f | [] | None | None |
| p2p2 | 3c:fd:fe:1f:bc:82 | [] | None | None |
| em3 | 18:66:da:fc:9c:2e | [] | None | None |
| em2 | 18:66:da:fc:9c:2d | [] | None | None |
| em1 | 18:66:da:fc:9c:2c | [179] | 08:81:f4:a7:5f:00 | ge-1/0/35 |
+-----------+-------------------+----------------------+-------------------+----------------+
Prepare templates
Prepare deploy script:
(undercloud) [stack@lab-director ~]$ cat << EOF > ~/overcloud-deploy.sh
#!/bin/bash
time openstack overcloud deploy \
--templates /usr/share/openstack-tripleo-heat-templates \
-r /home/stack/templates/gpu_roles_data.yaml \
-e /home/stack/templates/node-info.yaml \
-e /home/stack/templates/gpu.yaml \
-e /home/stack/templates/custom-domain.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/templates/network-environment.yaml \
-e /home/stack/containers-prepare-parameter.yaml \
--timeout 60
EOF
To determine the vGPU types supported by your physical GPU device, check the mdev types it exposes once the NVIDIA vGPU manager is loaded. You can perform these steps by booting a live Linux on the server:
[root@host1 ~]# ls /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/
nvidia-11 nvidia-12 nvidia-13 nvidia-14 nvidia-15 nvidia-16 nvidia-17 nvidia-18 nvidia-19 nvidia-20 nvidia-21 nvidia-210 nvidia-22
[root@host1 ~]# cat /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/nvidia-18/description
num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=4096x2160, max_instance=4
NVIDIA types available for the M10 (0000:84:00.0 on this host), detail of nvidia-35 (512M):
[root@host10 ~]# ls /sys/class/mdev_bus/0000\:84\:00.0/mdev_supported_types/
nvidia-155 nvidia-208 nvidia-35 nvidia-36 nvidia-37 nvidia-38 nvidia-39 nvidia-40 nvidia-41 nvidia-42 nvidia-43 nvidia-44 nvidia-4
[root@overcloud-computegpu-0 heat-admin]# cat /sys/class/mdev_bus/0000\:84\:00.0/mdev_supported_types/nvidia-35/description
num_heads=2, frl_config=45, framebuffer=512M, max_resolution=2560x1600, max_instance=16
NVIDIA types available for the M60 (0000:06:00.0 on this host):
[root@host10 ~]# ls /sys/class/mdev_bus/0000\:06\:00.0/mdev_supported_types/
nvidia-11 nvidia-12 nvidia-13 nvidia-14 nvidia-15 nvidia-16 nvidia-17 nvidia-18 nvidia-19 nvidia-20 nvidia-21 nvidia-210 nvidia-22
Detail of nvidia-43 (4G), the M10 profile used later in this deployment:
[root@overcloud-computegpu-0 heat-admin]# cat /sys/class/mdev_bus/0000\:84\:00.0/mdev_supported_types/nvidia-43/description
num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=4096x2160, max_instance=2
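To dump every type exposed by every physical GPU in one pass, a small loop over sysfs also works (a sketch; name and available_instances are standard mdev sysfs attributes):
for t in /sys/class/mdev_bus/*/mdev_supported_types/*; do
  # print the type directory, its marketing name and remaining capacity
  echo "$t: $(cat $t/name), available_instances=$(cat $t/available_instances)"
done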
Create GPU customization template:
(undercloud) [stack@lab-director templates]$ cat << EOF > ~/templates/gpu.yaml
parameter_defaults:
  ComputeGpuExtraConfig:
    nova::compute::vgpu::enabled_vgpu_types:
      - nvidia-43
EOF
Prepare network template:
(undercloud) [stack@lab-director ~]$ cat << EOF > ~/templates/network-environment.yaml
resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute.yaml
  OS::TripleO::ComputeGpu::Net::SoftwareConfig: /home/stack/templates/nic-configs/compute-gpu.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/templates/nic-configs/controller.yaml
  #OS::TripleO::AllNodes::Validation: OS::Heat::None
parameter_defaults:
  # This sets 'external_network_bridge' in l3_agent.ini to an empty string
  # so that external networks act like provider bridge networks (they
  # will plug into br-int instead of br-ex)
  NeutronExternalNetworkBridge: "''"
  # Internal API used for private OpenStack Traffic
  InternalApiNetCidr: 172.16.2.0/24
  InternalApiAllocationPools: [{'start': '172.16.2.50', 'end': '172.16.2.100'}]
  InternalApiNetworkVlanID: 102
  # Tenant Network Traffic - will be used for VXLAN over VLAN
  TenantNetCidr: 172.16.3.0/24
  TenantAllocationPools: [{'start': '172.16.3.50', 'end': '172.16.3.100'}]
  TenantNetworkVlanID: 103
  # Public Storage Access - e.g. Nova/Glance <--> Ceph
  StorageNetCidr: 172.16.4.0/24
  StorageAllocationPools: [{'start': '172.16.4.50', 'end': '172.16.4.100'}]
  StorageNetworkVlanID: 104
  # Private Storage Access - i.e. Ceph background cluster/replication
  StorageMgmtNetCidr: 172.16.5.0/24
  StorageMgmtAllocationPools: [{'start': '172.16.5.50', 'end': '172.16.5.100'}]
  StorageMgmtNetworkVlanID: 105
  # External Networking Access - Public API Access
  ExternalNetCidr: 172.16.0.0/24
  # Leave room for floating IPs in the External allocation pool (if required)
  ExternalAllocationPools: [{'start': '172.16.0.50', 'end': '172.16.0.250'}]
  # Set to the router gateway on the external network
  ExternalInterfaceDefaultRoute: 172.16.0.1
  ExternalNetworkVlanID: 101
  # Add in configuration for the Control Plane
  ControlPlaneSubnetCidr: "24"
  ControlPlaneDefaultRoute: 172.16.16.10
  EC2MetadataIp: 172.16.16.10
  DnsServers: ['10.16.36.29']
EOF
Prepare node information template:
(undercloud) [stack@lab-director templates]$ cat << EOF > ~/templates/node-info.yaml
parameter_defaults:
  OvercloudControllerFlavor: control
  OvercloudComputeFlavor: compute
  OvercloudComputeGpuFlavor: compute-gpu
  ControllerCount: 1
  ComputeCount: 0
  ComputeGpuCount: 1
  NtpServer: 'clock.redhat.com'
  NeutronNetworkType: 'vxlan,vlan'
  NeutronTunnelTypes: 'vxlan'
EOF
Prepare the nic-configs template files based on the documentation.
Deploy
Check status before deployment:
(undercloud) [stack@lab-director ~]$ openstack baremetal node list
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | None | power off | available | False |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | None | power off | available | False |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@lab-director ~]$ openstack overcloud profiles list
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
| Node UUID | Node Name | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
| 6031ff6f-4e90-4ec0-8ec6-0a360881dac7 | lab-controller | available | control | |
| 5fa3d40c-29b5-4540-85c9-e76ee5cb6c03 | lab-compute | available | compute-gpu | |
+--------------------------------------+--------------------+-----------------+-----------------+-------------------+
(undercloud) [stack@lab-director ~]$ openstack stack list
(undercloud) [stack@lab-director ~]$ openstack server list
Deploy:
(undercloud) [stack@lab-director ~]$ ./overcloud-deploy.sh
...
real 57m50.873s
user 0m7.347s
sys 0m0.794s
During the deployment, a watch command can be launched in a second console to monitor the progress:
(undercloud) [stack@lab-director ~]$ watch "openstack server list;openstack baremetal node list;openstack stack list"
Check the deployment
List the overcloud servers:
(undercloud) [stack@lab-director ~]$ openstack server list
+--------------------------------------+------------------------+--------+------------------------+----------------+-------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------+----------------+-------------+
| 94928e09-4b49-484c-a41c-1e5d7fb5ff6d | overcloud-computegpu-0 | ACTIVE | ctlplane=172.16.16.216 | overcloud-gpu | compute-gpu |
| ab1f84a3-b221-44d1-a97f-e3360774f4f3 | overcloud-controller-0 | ACTIVE | ctlplane=172.16.16.203 | overcloud-full | control |
+--------------------------------------+------------------------+--------+------------------------+----------------+-------------+
Check that the nova hypervisor is registered:
[stack@lab-director ~]$ source overcloudrc
(overcloud) [stack@lab-director ~]$ openstack hypervisor list
+----+---------------------------------------+-----------------+-------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------------------------+-----------------+-------------+-------+
| 1 | overcloud-computegpu-0.lan.redhat.com | QEMU | 172.16.2.51 | up |
+----+---------------------------------------+-----------------+-------------+-------+
Verify that the NVIDIA vGPU software package is installed and loaded correctly by checking for the VFIO drivers in the list of loaded kernel modules:
[heat-admin@overcloud-computegpu-0 ~]$ lsmod | grep vfio
nvidia_vgpu_vfio 49475 0
nvidia 16633974 10 nvidia_vgpu_vfio
vfio_mdev 12841 0
mdev 20336 2 vfio_mdev,nvidia_vgpu_vfio
vfio_iommu_type1 22300 0
vfio 32656 3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1
As expected, the nouveau module is not loaded:
[root@overcloud-computegpu-0 ~]# lsmod | grep nouveau
[root@overcloud-computegpu-0 ~]#
Obtain the PCI device bus/device/function (BDF) of the physical GPU:
[heat-admin@overcloud-computegpu-0 ~]$ lspci | grep NVIDIA
06:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
07:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)
84:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Tesla M10] (rev a2)
85:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Tesla M10] (rev a2)
86:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Tesla M10] (rev a2)
87:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Tesla M10] (rev a2)
Check NVIDIA driver status:
[root@overcloud-computegpu-0 sysconfig]# nvidia-smi
Wed Oct 31 04:09:58 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.68 Driver Version: 410.68 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:06:00.0 Off | Off |
| N/A 32C P8 23W / 150W | 14MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:07:00.0 Off | Off |
| N/A 31C P8 23W / 150W | 14MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M10 On | 00000000:84:00.0 Off | N/A |
| N/A 27C P8 10W / 53W | 11MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M10 On | 00000000:85:00.0 Off | N/A |
| N/A 27C P8 10W / 53W | 11MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla M10 On | 00000000:86:00.0 Off | N/A |
| N/A 27C P8 10W / 53W | 11MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla M10 On | 00000000:87:00.0 Off | N/A |
| N/A 27C P8 10W / 53W | 11MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Check the configuration applied by the director deployment:
[root@overcloud-computegpu-0 heat-admin]# cat /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf | grep -i nvidia
# Some pGPUs (e.g. NVIDIA GRID K1) support different vGPU types. User can use
# enabled_vgpu_types = GRID K100,Intel GVT-g,MxGPU.2,nvidia-11
enabled_vgpu_types=nvidia-43
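If the osc-placement client plugin is installed on the undercloud, you can also verify that the compute node now reports a VGPU inventory to the Placement service (a sketch; take the provider UUID from the first command):
(overcloud) [stack@lab-director ~]$ openstack resource provider list
(overcloud) [stack@lab-director ~]$ openstack resource provider inventory list <provider-uuid>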
Check NVIDIA manager service status:
[root@overcloud-computegpu-0 heat-admin]# systemctl status nvidia-vgpu-mgr
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2019-01-18 14:26:03 UTC; 5h 50min ago
Main PID: 10409 (nvidia-vgpu-mgr)
Tasks: 1
Memory: 220.0K
CGroup: /system.slice/nvidia-vgpu-mgr.service
└─10409 /usr/bin/nvidia-vgpu-mgr
Jan 18 14:26:01 localhost.localdomain systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jan 18 14:26:03 localhost.localdomain systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jan 18 14:26:09 overcloud-computegpu-0 nvidia-vgpu-mgr[10409]: notice: vgpu_mgr_log: nvidia-vgpu-mgr daemon started
Initialize a testing tenant
Prepare a script to initialize the tenant and image:
[stack@lab-director ~]$ cat << EOF > overcloud-prepare.sh
#!/bin/bash
source /home/stack/overcloudrc
openstack network create --external --share --provider-physical-network datacentre --provider-network-type vlan --provider-segment 101 external
openstack subnet create external --gateway 172.16.0.1 --allocation-pool start=172.16.0.20,end=172.16.0.49 --no-dhcp --network external --subnet-range 172.16.0.0/24
openstack router create router0
openstack router set --external-gateway external router0
openstack network create internal0
openstack subnet create subnet0 --network internal0 --subnet-range 172.31.0.0/24 --dns-nameserver 10.19.43.29
openstack router add subnet router0 subnet0
curl "https://launchpad.net/~egallen/+sshkeys" -o ~/.ssh/id_rsa_lambda.pub
openstack keypair create --public-key ~/.ssh/id_rsa_lambda.pub lambda
openstack flavor create --ram 1024 --disk 40 --vcpus 2 m1.small
curl http://download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img -o /var/images/x86_64/cirros-0.3.5-x86_64-disk.img
openstack image create cirros035 --file /var/images/x86_64/cirros-0.3.5-x86_64-disk.img --disk-format qcow2 --container-format bare --public
openstack floating ip create external
openstack floating ip create external
openstack floating ip create external
openstack floating ip create external
openstack security group create web --description "Web servers"
openstack security group rule create --protocol icmp web
openstack security group rule create --protocol tcp --dst-port 22:22 --src-ip 0.0.0.0/0 web
openstack server create --flavor m1.small --image cirros035 --security-group web --nic net-id=internal0 --key-name lambda instance0
FLOATING_IP_ID=\$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )
openstack server add floating ip instance0 \$FLOATING_IP_ID
openstack server list
echo "Type: openstack server ssh --login cirros instance0"
EOF
Prepare the tenant:
[stack@lab-director ~]$ source ~/overcloudrc
(overcloud) [stack@lab-director ~]$ ./overcloud-prepare.sh
+--------------------------------------+-----------+--------+-----------------------------------+-----------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------+--------+-----------------------------------+-----------+----------+
| 83974e70-63ad-4887-b8c7-e5c068a2a740 | instance0 | ACTIVE | internal0=172.31.0.4, 172.16.0.26 | cirros035 | m1.small |
+--------------------------------------+-----------+--------+-----------------------------------+-----------+----------+
(overcloud) [stack@lab-director ~]$ openstack server ssh instance0 --login cirros
The authenticity of host '172.16.0.26 (172.16.0.26)' can't be established.
RSA key fingerprint is SHA256:dl+U9Ky5xnaWZCcAjmgyTQ/e230qog/Z2gylPaMX2X0.
RSA key fingerprint is MD5:8e:bf:c4:52:e2:97:2d:31:89:d9:25:7b:9e:13:d4:76.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.16.0.26' (RSA) to the list of known hosts.
cirros@172.16.0.26's password:
$ uptime
20:31:31 up 1 min, 1 users, load average: 0.00, 0.00, 0.00
Delete the instance:
(overcloud) [stack@lab-director ~]$ openstack server delete instance0
Prepare guest image with GRID
Create an NVIDIA license file (gridd.conf):
[stack@lab-director x86_64]$ mkdir -p /var/images/x86_64/nvidia-guest/
[stack@lab-director x86_64]$ cat << EOF > /var/images/x86_64/nvidia-guest/gridd.conf
# /etc/nvidia/gridd.conf.template - Configuration file for NVIDIA Grid Daemon
# This is a template for the configuration file for NVIDIA Grid Daemon.
# For details on the file format, please refer to the nvidia-gridd(1)
# man page.
# Description: Set License Server Address
# Data type: string
# Format: "<address>"
ServerAddress=dhcp158-15.virt.lab.eng.bos.redhat.com
# Description: Set License Server port number
# Data type: integer
# Format: <port>, default is 7070
ServerPort=7070
# Description: Set Backup License Server Address
# Data type: string
# Format: "<address>"
#BackupServerAddress=
# Description: Set Backup License Server port number
# Data type: integer
# Format: <port>, default is 7070
#BackupServerPort=
# Description: Set Feature to be enabled
# Data type: integer
# Possible values:
# 0 => for unlicensed state
# 1 => for GRID vGPU
# 2 => for Quadro Virtual Datacenter Workstation
FeatureType=2
# Description: Parameter to enable or disable Grid Licensing tab in nvidia-settings
# Data type: boolean
# Possible values: TRUE or FALSE, default is FALSE
EnableUI=TRUE
# Description: Set license borrow period in minutes
# Data type: integer
# Possible values: 10 to 10080 mins(7 days), default is 1440 mins(1 day)
#LicenseInterval=1440
# Description: Set license linger period in minutes
# Data type: integer
# Possible values: 0 to 10080 mins(7 days), default is 0 mins
#LingerInterval=10
EOF
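Most of that file is template comments; a quick grep (nothing NVIDIA-specific, just filtering out comments and blank lines) confirms the four directives that are actually active:
# show only active directives: ServerAddress, ServerPort, FeatureType, EnableUI
grep -Ev '^\s*(#|$)' /var/images/x86_64/nvidia-guest/gridd.conf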
Copy the NVIDIA guest driver (NVIDIA-Linux-x86_64-410.71-grid.run, used by the step 2 script below) into /var/images/x86_64/nvidia-guest/, then build an ISO file from that directory:
[root@lab607 guest]# genisoimage -o nvidia-guest.iso -R -J -V NVIDIA-GUEST /var/images/x86_64/nvidia-guest/
I: -input-charset not specified, using utf-8 (detected in locale settings)
9.06% done, estimate finish Mon Jan 21 05:09:34 2019
18.08% done, estimate finish Mon Jan 21 05:09:34 2019
27.14% done, estimate finish Mon Jan 21 05:09:34 2019
36.17% done, estimate finish Mon Jan 21 05:09:34 2019
45.22% done, estimate finish Mon Jan 21 05:09:34 2019
54.25% done, estimate finish Mon Jan 21 05:09:34 2019
63.31% done, estimate finish Mon Jan 21 05:09:34 2019
72.34% done, estimate finish Mon Jan 21 05:09:34 2019
81.39% done, estimate finish Mon Jan 21 05:09:34 2019
90.42% done, estimate finish Mon Jan 21 05:09:34 2019
99.48% done, estimate finish Mon Jan 21 05:09:34 2019
Total translation table size: 0
Total rockridge attributes bytes: 358
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 0
55297 extents written (108 MB)
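Before baking the ISO into the image, you can list its contents with isoinfo (shipped with the genisoimage package); the file names assume the 410.71 GRID driver used in the step 2 script:
# list the Joliet directory tree; gridd.conf and NVIDIA-Linux-x86_64-410.71-grid.run should both appear
isoinfo -J -i nvidia-guest.iso -l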
Prepare customization script step1:
[root@lab607 guest]# cat << EOF > nvidia-guest_step1.sh
#!/bin/bash
# Remove nouveau driver
if grep -Fxq "modprobe.blacklist=nouveau" /etc/default/grub
then
echo "Nothing to do for Nouveau driver"
else
# Nouveau driver not removed
# To fix https://bugzilla.redhat.com/show_bug.cgi?id=1646871
sed -i 's/crashkernel=auto\"\ /crashkernel=auto\ /' /etc/default/grub
sed -i 's/net\.ifnames=0$/net.ifnames=0\"/' /etc/default/grub
# Remove Nouveau
sed -i 's/GRUB_CMDLINE_LINUX="[^"]*/& modprobe.blacklist=nouveau/' /etc/default/grub
echo ">>>> /etc/default/grub:"
grep nouveau /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
fi
# Add build tooling
subscription-manager register --username myrhnaccount --password XXXXXXX
subscription-manager attach --pool=XXXXXXXXXXXXXXXXXXXXXXXXXX
subscription-manager repos --disable=*
subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-extras-rpms
yum upgrade -y
yum install -y gcc make kernel-devel cpp glibc-devel glibc-headers kernel-headers libmpc mpfr
EOF
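The script runs unattended inside virt-customize, so a mistake only surfaces deep in the verbose log; a plain syntax check before attaching it is cheap insurance:
# parse the script without executing it
bash -n nvidia-guest_step1.sh && echo "syntax OK"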
Prepare customization script step2:
[stack@lab-director x86-64]$ cat << EOF > nvidia-guest_step2.sh
#!/bin/bash
# NVIDIA GRID guest script
mkdir /tmp/mount
mount LABEL=NVIDIA-GUEST /tmp/mount
/bin/sh /tmp/mount/NVIDIA-Linux-x86_64-410.71-grid.run -s --kernel-source-path /usr/src/kernels/3.10.0-957.1.3.el7.x86_64
mkdir -p /etc/nvidia
cp /tmp/mount/gridd.conf /etc/nvidia
EOF
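Note that --kernel-source-path is hard-coded to the kernel that the step 1 yum upgrade happens to install; if the base image moves to a newer kernel, the driver build silently targets the wrong tree. A hedged variant that derives the path from the installed kernel-devel tree (assuming step 1 installs exactly one):
#!/bin/bash
# NVIDIA GRID guest script, locating the kernel source tree dynamically
# (assumes exactly one kernel-devel tree is present under /usr/src/kernels)
KSRC=$(ls -d /usr/src/kernels/* | head -n 1)
mkdir -p /tmp/mount
mount LABEL=NVIDIA-GUEST /tmp/mount
/bin/sh /tmp/mount/NVIDIA-Linux-x86_64-410.71-grid.run -s --kernel-source-path "$KSRC"
mkdir -p /etc/nvidia
cp /tmp/mount/gridd.conf /etc/nvidia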
Apply customization:
[stack@lab-director x86-64]$ cp rhel-server-7.6-x86_64-kvm.qcow2 rhel-server-7.6-x86_64-kvm-guest.qcow2
[stack@lab-director x86-64]$ sudo virt-customize -a rhel-server-7.6-x86_64-kvm-guest.qcow2 --root-password password:XXXXXXXX
[ 0.0] Examining the guest ...
[ 2.8] Setting a random seed
[ 2.8] Setting the machine ID in /etc/machine-id
[ 2.8] Setting passwords
[ 4.3] Finishing off
[stack@lab-director x86-64]$ time sudo virt-customize --memsize 8192 --attach nvidia-guest.iso -a rhel-server-7.6-x86_64-kvm-guest.qcow2 -v --run nvidia-guest_step1.sh
real 4m32.928s
user 0m0.128s
sys 0m0.159s
[stack@lab-director x86-64]$ time sudo virt-customize -a rhel-server-7.6-x86_64-kvm-guest.qcow2 --selinux-relabel
real 0m18.080s
user 0m0.083s
sys 0m0.064s
[stack@lab-director x86-64]$ time sudo virt-customize --memsize 8192 --attach nvidia-guest.iso -a rhel-server-7.6-x86_64-kvm-guest.qcow2 -v --run nvidia-guest_step2.sh
real 0m18.898s
user 0m0.123s
sys 0m0.152s
[stack@lab-director x86-64]$ time sudo virt-customize -a rhel-server-7.6-x86_64-kvm-guest.qcow2 --selinux-relabel
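Before uploading, it is worth confirming the driver really installed inside the image; virt-cat (from libguestfs-tools, already installed on the host) can read the NVIDIA installer's default log path:
# the last lines should report a successful 410.71 driver installation
sudo virt-cat -a rhel-server-7.6-x86_64-kvm-guest.qcow2 /var/log/nvidia-installer.log | tail -n 5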
Upload the image into Glance:
(overcloud) [stack@lab-director tmp]$ openstack image create rhel76vgpu --file /tmp/rhel-server-7.6-x86_64-kvm-guest.qcow2 --disk-format qcow2 --container-format bare --public
+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| checksum | 9c4bd93ad42a9f67501218a765be8c58 |
| container_format | bare |
| created_at | 2019-01-22T13:20:49Z |
| disk_format | qcow2 |
| file | /v2/images/13b8fb20-6c84-4b8b-83c9-df7a88b29e7b/file |
| id | 13b8fb20-6c84-4b8b-83c9-df7a88b29e7b |
| min_disk | 0 |
| min_ram | 0 |
| name | rhel76vgpu |
| owner | 83458e214cb14e388208207ac42a701d |
| properties | direct_url='swift+config://ref1/glance/13b8fb20-6c84-4b8b-83c9-df7a88b29e7b', os_hash_algo='sha512', os_hash_value='a7d243ea7fdce2e0a43e1157aea030048d3ea0a8a1a09e7347412663e173bf80714f1124e1e8d8885127aa71a87a041dd3cffafea731aa012a5a05c2c953f406', os_hidden='False' |
| protected | False |
| schema | /v2/schemas/image |
| size | 2030567424 |
| status | active |
| tags | |
| updated_at | 2019-01-22T13:21:15Z |
| virtual_size | None |
| visibility | public |
+------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
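Glance's checksum field is the MD5 digest of the stored file, so a local hash makes a quick integrity check of the 2 GB upload:
# should match the "checksum" field above (9c4bd93ad42a9f67501218a765be8c58)
md5sum /tmp/rhel-server-7.6-x86_64-kvm-guest.qcow2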
Create an NVIDIA GPU flavor:
(overcloud) [stack@lab-director ~]$ openstack flavor create --vcpus 6 --ram 8192 --disk 100 m1.small-gpu
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 100 |
| id | 34576fcf-b2a8-4505-832f-6f9c393d2b34 |
| name | m1.small-gpu |
| os-flavor-access:is_public | True |
| properties | |
| ram | 8192 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 6 |
+----------------------------+--------------------------------------+
Customize GPU flavor:
(overcloud) [stack@lab-director ~]$ openstack flavor set m1.small-gpu --property "resources:VGPU=1"
(overcloud) [stack@lab-director ~]$ openstack flavor show m1.small-gpu
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| disk | 100 |
| id | 34576fcf-b2a8-4505-832f-6f9c393d2b34 |
| name | m1.small-gpu |
| os-flavor-access:is_public | True |
| properties | resources:VGPU='1' |
| ram | 8192 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 6 |
+----------------------------+--------------------------------------+
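The resources:VGPU=1 property tells the Nova scheduler to claim one unit of the VGPU resource class from the Placement service, so instances using this flavor can only land on a compute node advertising vGPU inventory. If the osc-placement client plugin is installed, you can inspect that inventory (the provider UUID below is a placeholder):
# list resource providers, then dump the inventory of the GPU compute node
openstack resource provider list
openstack resource provider inventory list <compute-node-provider-uuid>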
Virtual GPU License Server
NVIDIA vGPU software is a licensed product. Licensed vGPU functionalities are activated during guest OS boot by the acquisition of a software license served over the network from an NVIDIA vGPU software license server.
To manage a pool of floating licenses for NVIDIA vGPU software licensed products, you need to install a Virtual GPU License Server; more information is available here: https://docs.nvidia.com/grid/latest/grid-license-server-user-guide/index.html
After the installation, create an SSH tunnel from your laptop to reach the NVIDIA license server web interface:
[egallen@my-desktop ~]$ ssh -N -L 8080:localhost:8080 root@my-nvidia-licence-server.redhat.com
Check the NVIDIA license server at http://localhost:8080/licserver/ before booting the instance:
No licensed client is registered before the boot:
Test a vGPU instance
Create an instance with vGPU:
(overcloud) [stack@lab-director ~]$ openstack server create --flavor m1.small-gpu --image rhel76vgpu --security-group web --nic net-id=internal0 --key-name lambda instance0
+-------------------------------------+-----------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | bEY8Tx7UxjkX |
| config_drive | |
| created | 2019-01-22T13:22:21Z |
| flavor | m1.small-gpu (34576fcf-b2a8-4505-832f-6f9c393d2b34) |
| hostId | |
| id | 2a6a9ae7-a630-431a-8cb9-42a57abb44a4 |
| image | rhel76vgpu (13b8fb20-6c84-4b8b-83c9-df7a88b29e7b) |
| key_name | lambda |
| name | instance0 |
| progress | 0 |
| project_id | 83458e214cb14e388208207ac42a701d |
| properties | |
| security_groups | name='fd72c7f2-e6c0-452d-b448-411f790397a3' |
| status | BUILD |
| updated | 2019-01-22T13:22:21Z |
| user_id | b0f1fa1cbcef44cb8ae0958afb7d5de6 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------------+
(overcloud) [stack@lab-director ~]$ FLOATING_IP_ID=$( openstack floating ip list -f value -c ID --status 'DOWN' | head -n 1 )
(overcloud) [stack@lab-director ~]$ echo $FLOATING_IP_ID
44a9fbed-b4f7-4627-bd1e-b639e642e1da
(overcloud) [stack@lab-director ~]$ openstack server add floating ip instance0 $FLOATING_IP_ID
(overcloud) [stack@lab-director ~]$ openstack server list
+--------------------------------------+-----------+--------+-----------------------------------+------------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------+--------+-----------------------------------+------------+--------------+
| 658c5862-fbea-4676-a848-2a7338ba506a | instance0 | ACTIVE | internal0=172.31.0.4, 172.16.0.26 | rhel76vgpu | m1.small-gpu |
+--------------------------------------+-----------+--------+-----------------------------------+------------+--------------+
(overcloud) [stack@lab-director ~]$ ping 172.16.0.26
PING 172.16.0.26 (172.16.0.26) 56(84) bytes of data.
64 bytes from 172.16.0.26: icmp_seq=9 ttl=62 time=0.755 ms
64 bytes from 172.16.0.26: icmp_seq=10 ttl=62 time=0.667 ms
64 bytes from 172.16.0.26: icmp_seq=11 ttl=62 time=0.730 ms
(overcloud) [stack@lab-director ~]$ ssh cloud-user@172.16.0.26
The authenticity of host '172.16.0.26 (172.16.0.26)' can't be established.
ECDSA key fingerprint is SHA256:LJSDFi3ktg0somXrzyXm68zxbx44Szf8bdt1X486gmY.
ECDSA key fingerprint is MD5:48:d5:3b:be:82:f2:c9:e1:4f:fb:b6:52:0c:44:54:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.16.0.26' (ECDSA) to the list of known hosts.
[cloud-user@instance0 ~]$ nvidia-smi
Tue Jan 22 08:30:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.71 Driver Version: 410.71 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID M10-4Q On | 00000000:00:05.0 Off | N/A |
| N/A N/A P8 N/A / N/A | 272MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
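On the compute node, the same vGPU is visible as a mediated device; the standard mdev sysfs layout gives a quick cross-check (device UUIDs and PCI addresses will differ):
# each running vGPU instance owns one mdev; the symlink target shows the backing physical GPU
ls -l /sys/bus/mdev/devices/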
Check nvidia-gridd service:
[cloud-user@instance0 ~]$ sudo systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2019-01-22 08:29:35 EST; 2min 27s ago
Process: 18377 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
Main PID: 18378 (nvidia-gridd)
CGroup: /system.slice/nvidia-gridd.service
└─18378 /usr/bin/nvidia-gridd
Jan 22 08:29:35 instance0 systemd[1]: Starting NVIDIA Grid Daemon...
Jan 22 08:29:35 instance0 systemd[1]: Started NVIDIA Grid Daemon.
Jan 22 08:29:35 instance0 nvidia-gridd[18378]: Started (18378)
Jan 22 08:29:35 instance0 nvidia-gridd[18378]: Ignore Service Provider Licensing.
Jan 22 08:29:35 instance0 nvidia-gridd[18378]: Calling load_byte_array(tra)
Jan 22 08:29:36 instance0 nvidia-gridd[18378]: Acquiring license for GRID vGPU Edition.
Jan 22 08:29:36 instance0 nvidia-gridd[18378]: Calling load_byte_array(tra)
Jan 22 08:29:39 instance0 nvidia-gridd[18378]: License acquired successfully. (Info: http://dhcp158-15.virt.lab.eng.bos.redhat.com:7070/request; GRID...l-WS,2.0)
Hint: Some lines were ellipsized, use -l to show in full.
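The licensing state can also be queried from inside the guest; on GRID drivers, nvidia-smi -q includes a licensed-product section (exact field names vary between driver releases):
# look for "License Status: Licensed" in the query output
nvidia-smi -q | grep -i -A 2 license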
Check the NVIDIA license server again after the instance launch; the guest should now appear in the licensed clients list:
Appendix: How to get the NVIDIA GRID RPM packages
To enable NVIDIA GRID you will need:
- one package for the RHOSP compute KVM host (bare metal), matching the host RHEL release: NVIDIA-vGPU-rhel-7.5-430.46.x86_64.rpm, NVIDIA-vGPU-rhel-7.6-430.46.x86_64.rpm, NVIDIA-vGPU-rhel-7.7-430.46.x86_64.rpm or NVIDIA-vGPU-rhel-8.0-430.46.x86_64.rpm (if you have issues with the packages, you can use the installer script NVIDIA-Linux-x86_64-430.46-vgpu-kvm.run instead)
- and one script for the guest instance (virtual machine): NVIDIA-Linux-x86_64-430.46-grid.run
These files can be downloaded for trial here: https://vgputestdrive.nvidia.com/accounts/register/ or https://www.nvidia.com/en-us/data-center/resources/nvidia-enterprise-account/
You can contact a sales representative to buy the vGPU software stack: https://www.nvidia.com/en-us/contact/sales/#assistance