New Content for NVIDIA T4 GPU Support
Created 2 new topics in Node management - HW Acceleration Devices:
- Configure NVIDIA GPU Operator for PCI Passthrough
- Delete the GPU Operator
Patch 4: Added NVIDIA information in Planning - Verified Comm HW
Patch 5: Acted on Greg's comment
Patch 6: updated Index as requested in review
	worked on comments from Ghada
Patch 7 and 8: acted on Mary's comments
Added 'release-caveat'
Acted on Ron's comments
Story: 2008434
Task: 42220
https://review.opendev.org/c/starlingx/docs/+/785251
Signed-off-by: Adil <mohamed.adilassakkali@windriver.com>
Change-Id: I337e33e805d89621436b35c238aca800b0727e0b
(cherry picked from commit 3053ff6e40)
			
			
		@@ -53,3 +53,5 @@
.. product capabilities

.. |max-workers| replace:: 99

.. |release-caveat| replace:: This is a pre-release feature and may not function as described in |prod| 5.
@@ -0,0 +1,171 @@

.. fgy1616003207054
.. _configure-nvidia-gpu-operator-for-pci-passthrough:

=================================================
Configure NVIDIA GPU Operator for PCI Passthrough
=================================================

|release-caveat|

This section describes how to configure the NVIDIA GPU Operator.
			
.. rubric:: |context|

.. note::
    The NVIDIA GPU Operator is supported only with the standard performance
    kernel profile. The low-latency performance kernel profile is not
    supported.
			
The NVIDIA GPU Operator automates the installation, maintenance, and management
of the NVIDIA software needed to provision NVIDIA GPUs and to provision pods
that require nvidia.com/gpu resources.

The NVIDIA GPU Operator is delivered as a Helm chart that installs a number of
services and pods to automate the provisioning of NVIDIA GPUs with the required
NVIDIA software components. These components include:

.. _fgy1616003207054-ul-sng-blk-z4b:

-   NVIDIA drivers \(to enable CUDA, a parallel computing platform\)

-   Kubernetes device plugin for GPUs

-   NVIDIA Container Runtime

-   Automatic node labelling

-   DCGM \(NVIDIA Data Center GPU Manager\) based monitoring
			
.. rubric:: |prereq|

Download the **gpu-operator-v3-1.6.0.3.tgz** file from
`http://mirror.starlingx.cengn.ca/mirror/starlingx/
<http://mirror.starlingx.cengn.ca/mirror/starlingx/>`__.

Use the following steps to configure the GPU Operator container:

.. rubric:: |proc|
			
#.  Lock the host\(s\).

    .. code-block:: none

        ~(keystone_admin)]$ system host-lock <hostname>
			
#.  Configure the Container Runtime host path to the NVIDIA runtime, which
    will be installed by the GPU Operator Helm deployment.

    .. code-block:: none

        ~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
			
#.  Unlock the host\(s\). Each host reboots automatically after it is
    unlocked.

    .. code-block:: none

        ~(keystone_admin)]$ system host-unlock <hostname>
			
#.  Create the RuntimeClass resource definition and apply it to the system.

    .. code-block:: none

        cat > nvidia.yml << EOF
        kind: RuntimeClass
        apiVersion: node.k8s.io/v1beta1
        metadata:
          name: nvidia
        handler: nvidia
        EOF

    .. code-block:: none

        ~(keystone_admin)]$ kubectl apply -f nvidia.yml
			
#.  Install the GPU Operator Helm charts.

    .. code-block:: none

        ~(keystone_admin)]$ helm install --name gpu-operator /path/to/gpu-operator-1.6.0.3.tgz
			
#.  Check that the GPU Operator is deployed, using the following command.

    .. code-block:: none

        ~(keystone_admin)]$ kubectl get pods -A
        NAMESPACE               NAME      READY  STATUS     RESTARTS  AGE
        default                 g-node..  1/1    Running    1         7h54m
        default                 g-node..  1/1    Running    1         7h54m
        default                 gpu-ope.  1/1    Running    1         7h54m
        gpu-operator-resources  gpu-..    1/1    Running    4         28m
        gpu-operator-resources  nvidia..  1/1    Running    0         28m
        gpu-operator-resources  nvidia..  1/1    Running    0         28m
        gpu-operator-resources  nvidia..  1/1    Running    0         28m
        gpu-operator-resources  nvidia..  0/1    Completed  0         7h53m
        gpu-operator-resources  nvidia..  1/1    Running    0         28m

    The plugin validation pod is shown with a status of Completed.
			
#.  Check that the nvidia.com/gpu resources are available, using the following
    command.

    .. code-block:: none

        ~(keystone_admin)]$ kubectl describe nodes <hostname> | grep nvidia
			
#.  Create a pod that uses the NVIDIA RuntimeClass and requests an
    nvidia.com/gpu resource. Update the nvidia-usage-example-pod.yml file to
    launch a pod that uses an NVIDIA GPU. For example:

    .. code-block:: none

        cat <<EOF > nvidia-usage-example-pod.yml
        apiVersion: v1
        kind: Pod
        metadata:
          name: nvidia-usage-example-pod
        spec:
          runtimeClassName: nvidia
          containers:
          - name: nvidia-usage-example-pod
            image: nvidia/samples:cuda10.2-vectorAdd
            imagePullPolicy: IfNotPresent
            command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 300000; done;" ]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
        EOF
			
#.  Create the pod using the following command.

    .. code-block:: none

        ~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml
			
#.  Check that the pod has been set up correctly. The status of the NVIDIA
    device is displayed in the table.

    .. code-block:: none

        ~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |                               |                      |               MIG M. |
        |===============================+======================+======================|
        |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
        | N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
        |                               |                      |                  N/A |
        +-------------------------------+----------------------+----------------------+

        +-----------------------------------------------------------------------------+
        | Processes:                                                                  |
        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        |        ID   ID                                                   Usage      |
        |=============================================================================|
        |  No running processes found                                                 |
        +-----------------------------------------------------------------------------+

    For information on deleting the GPU Operator, see :ref:`Delete the GPU
    Operator <delete-the-gpu-operator>`.
			
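The deployment check in the procedure above can be scripted. The following is a minimal sketch, not part of |prod| or the GPU Operator: the helper name ``gpu_pods_not_ready`` is hypothetical, and it assumes the default ``kubectl get pods -A`` column layout, in which STATUS is the fourth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration only).
# Reads `kubectl get pods -A` style output on stdin and prints
# "<pod name> <status>" for every pod in the given namespace whose
# STATUS is neither Running nor Completed.
gpu_pods_not_ready() {
    local namespace="$1"
    awk -v ns="$namespace" '
        # The header row never matches because its first field is NAMESPACE.
        $1 == ns && $4 != "Running" && $4 != "Completed" {
            print $2 " " $4
        }'
}

# Usage on a live system (assumes kubectl is already configured):
#   kubectl get pods -A | gpu_pods_not_ready gpu-operator-resources
```

Empty output suggests that every pod in the namespace has reached Running or Completed; any line printed names a pod that may need further inspection.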
@@ -0,0 +1,59 @@

.. nsr1616019467549
.. _delete-the-gpu-operator:

=======================
Delete the GPU Operator
=======================

|release-caveat|

Use the commands in this section to delete the GPU Operator, if required.

.. rubric:: |prereq|

Ensure that all user-generated pods with access to ``nvidia.com/gpu`` resources
are deleted first.
			
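One possible way to find such pods is sketched below. It is an illustration only: the helper name ``pods_using_gpu`` is hypothetical, and it assumes a three-column ``kubectl`` custom-columns listing (namespace, pod name, GPU limit) in which pods without a GPU limit show ``<none>``.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration only).
# Reads a three-column listing (namespace, pod name, GPU limit) on stdin,
# skips the header row, and prints "<namespace>/<pod>" for every pod whose
# GPU limit column is not "<none>".
pods_using_gpu() {
    awk 'NR > 1 && $3 != "<none>" { print $1 "/" $2 }'
}

# Usage on a live system (assumes kubectl is already configured; the column
# expression escapes the dot in the nvidia.com/gpu resource name):
#   kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,GPU:.spec.containers[*].resources.limits.nvidia\.com/gpu' \
#       | pods_using_gpu
```

Each pod that is listed can then be deleted with ``kubectl delete pod`` before the GPU Operator is removed.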
			
.. rubric:: |proc|

#.  Remove the GPU Operator pods from the system using the following commands:

    .. code-block:: none

        ~(keystone_admin)]$ helm delete --purge gpu-operator
        ~(keystone_admin)]$ kubectl delete runtimeclasses.node.k8s.io nvidia
			
#.  Remove the GPU Operator, and remove the service parameter
    ``platform container_runtime custom_container_runtime`` from the system,
    using the following commands:

    #.  Lock the host\(s\).

        .. code-block:: none

            ~(keystone_admin)]$ system host-lock <hostname>
			
    #.  List the service parameters using the following command.

        .. code-block:: none

            ~(keystone_admin)]$ system service-parameter-list
			
    #.  Remove the service parameter
        ``platform container_runtime custom_container_runtime`` from the
        system, using the following command.

        .. code-block:: none

            ~(keystone_admin)]$ system service-parameter-delete <service param ID>

        where ``<service param ID>`` is the ID of the service parameter, for
        example, 3c509c97-92a6-4882-a365-98f1599a8f56.
			
    #.  Unlock the host\(s\).

        .. code-block:: none

            ~(keystone_admin)]$ system host-unlock <hostname>

    For information on configuring the GPU Operator, see :ref:`Configure NVIDIA
    GPU Operator for PCI Passthrough
    <configure-nvidia-gpu-operator-for-pci-passthrough>`.
			
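Looking up the service parameter ID by hand can be avoided with a small filter. The sketch below is hypothetical and not part of |prod|: the helper name ``service_param_uuid`` is invented for illustration, and it assumes that ``system service-parameter-list`` prints a pipe-delimited table whose first four columns are uuid, service, section, and name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration only).
# Reads a pipe-delimited table on stdin and prints the uuid (first column)
# of every row whose section (third column) and name (fourth column) match.
service_param_uuid() {
    local section="$1" name="$2"
    awk -F'|' -v section="$section" -v name="$name" '
        {
            # Fields 2..5 are uuid, service, section, name; trim padding.
            for (i = 2; i <= 5; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($4 == section && $5 == name) print $2
        }'
}

# Usage on a live system (assumes the table layout described above):
#   system service-parameter-list \
#       | service_param_uuid container_runtime custom_container_runtime
```

The printed ID can then be passed to ``system service-parameter-delete`` in place of ``<service param ID>``.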
@@ -273,17 +273,6 @@ Node inventory tasks
Hardware acceleration devices
-----------------------------

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/uploading-a-device-image
   hardware_acceleration_devices/listing-uploaded-device-images
   hardware_acceleration_devices/listing-device-labels
   hardware_acceleration_devices/removing-a-device-image
   hardware_acceleration_devices/removing-a-device-label
   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
   hardware_acceleration_devices/displaying-the-status-of-device-images

************************
Intel N3000 FPGA support
************************
@@ -295,8 +284,22 @@ Intel N3000 FPGA support
   hardware_acceleration_devices/updating-an-intel-n3000-fpga-image
   hardware_acceleration_devices/n3000-fpga-forward-error-correction
   hardware_acceleration_devices/showing-details-for-an-fpga-device
   hardware_acceleration_devices/uploading-a-device-image
   hardware_acceleration_devices/common-device-management-tasks

Common device management tasks
******************************

.. toctree::
   :maxdepth: 2

   hardware_acceleration_devices/listing-uploaded-device-images
   hardware_acceleration_devices/listing-device-labels
   hardware_acceleration_devices/removing-a-device-image
   hardware_acceleration_devices/removing-a-device-label
   hardware_acceleration_devices/initiating-a-device-image-update-for-a-host
   hardware_acceleration_devices/displaying-the-status-of-device-images

***********************************************
vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
***********************************************
@@ -306,6 +309,17 @@ vRAN Accelerator ACC100 Adapter \(Mount Bryce\)
   hardware_acceleration_devices/enabling-mount-bryce-hw-accelerator-for-hosted-vram-containerized-workloads
   hardware_acceleration_devices/set-up-pods-to-use-sriov


*******************
NVIDIA GPU Operator
*******************

.. toctree::
   :maxdepth: 1

   hardware_acceleration_devices/configure-nvidia-gpu-operator-for-pci-passthrough
   hardware_acceleration_devices/delete-the-gpu-operator


------------------------
Host hardware management
------------------------
@@ -176,6 +176,10 @@ Verified and approved hardware components for use with |prod| are listed here.
    | Hardware Accelerator Devices Verified for PCI-Passthrough or PCI SR-IOV Access | -   ACC100 Adapter \(Mount Bryce\) - SRIOV only                                                                                                                                                                                                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | GPUs Verified for PCI Passthrough                                              | -   NVIDIA Corporation: VGA compatible controller - GM204GL \(Tesla M60 rev a1\)                                                                                                                                                                                                                                                                                                                                                       |
    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    |                                                                                | -   NVIDIA T4 TENSOR CORE GPU                                                                                                                                                                                                                                                                                                                                                                                                          |
    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Board Management Controllers                                                   | -   HPE iLO3                                                                                                                                                                                                                                                                                                                                                                                                                           |
    |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                        |