Merge "support virtual persistent memory"

Zuul
2019-06-27 16:21:23 +00:00
committed by Gerrit Code Review


..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=================================
support virtual persistent memory
=================================
https://blueprints.launchpad.net/nova/+spec/virtual-persistent-memory
Virtual persistent memory is now supported in both QEMU and
libvirt. This spec seeks to enable this support in OpenStack Nova.
Problem description
===================
For many years computer applications organized their data between
two tiers: memory and storage. Emerging `persistent memory`_
technologies introduce a third tier. Persistent memory
(or ``pmem`` for short) is accessed like volatile memory, using processor
load and store instructions, but it retains its contents across power
loss like storage.
The virtualization layer already supports virtual persistent memory,
which means virtual machines can now use physical persistent memory
as the backend of their virtual persistent memory. As far as Nova is
concerned, several problems need to be addressed:
* How is the physical persistent memory managed and presented as
virtual persistent memory
* The discovery and resource tracking of persistent memory
* How does the user specify the desired amount of virtual persistent
memory
* What is the life cycle of virtual persistent memory
Use Cases
---------
Provide applications with the ability to load large contiguous segments
of memory that retain their data across power cycles.
Besides data persistence, persistent memory is less expensive than DRAM
and comes with much larger capacities. This is an appealing feature for
scenarios that request huge amounts of memory such as high performance
computing (HPC).
There has been some exploration by applications that use memory heavily,
such as in-memory databases. To name a few: redis_, rocksdb_,
oracle_, `SAP HANA`_ and Aerospike_.
.. note::
This spec only intends to enable virtual persistent memory
for the libvirt KVM driver.
Proposed change
===============
Background
----------
The most efficient way for an application to use persistent memory is
to memory map (mmap()) a portion of persistent memory into the address
space of the application. Once the mapping is done, the application
accesses the persistent memory ``directly`` (also called ``direct access``),
meaning without going through the kernel or any other software in the
middle. Persistent memory has two types of hardware interfaces --
"PMEM" and "BLK". Since "BLK" adopts an aperture model to access
persistent memory, it does not support ``direct access``.
For the sake of efficiency, this spec only proposes to use persistent
memory accessed by "PMEM" interface as the backend for QEMU virtualized
persistent memory.
Persistent memory must be partitioned into `pmem namespaces`_ for
applications to use. There are several modes of pmem namespaces for
different use scenarios. Mode ``devdax`` and mode ``fsdax`` both
support ``direct access``. Mode ``devdax`` gives out a character
device for a namespace, so applications can mmap() the entire
namespace into their address spaces, whereas mode ``fsdax`` gives
out a block device. It is recommended to use mode ``devdax`` to
assign persistent memory to virtual machines.
Please refer to `virtual NVDIMM backends`_ and
`NVDIMM Linux kernel document`_ for details.
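As an illustration of ``direct access``, an application maps a ``devdax``
namespace straight into its address space and then reads and writes it with
ordinary loads and stores. The sketch below is illustrative only; the device
path and mapping length are made up, and durable persistence would
additionally require cache flushing (e.g. via libpmem), which is omitted here.

.. code-block:: python

    import mmap
    import os

    DEVDAX_PATH = '/dev/dax0.0'   # hypothetical devdax namespace device
    LENGTH = 2 * 1024 * 1024      # must respect the namespace alignment

    fd = os.open(DEVDAX_PATH, os.O_RDWR)
    try:
        # Loads/stores through this mapping reach persistent memory directly,
        # bypassing the kernel page cache.
        buf = mmap.mmap(fd, LENGTH, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
        buf[0:5] = b'hello'
        buf.close()
    finally:
        os.close(fd)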
.. important::

   This spec only proposes to use persistent memory namespaces in
   ``devdax`` mode as QEMU virtual persistent memory backends.
The ``devdax`` persistent memory namespaces require contiguous physical
space and are not managed in pages as ordinary system memory.
This introduces a fragmentation issue when multiple namespaces
are created and used by multiple applications. As shown in the diagram below,
four applications are using four namespaces, each of size 100GB::
+-----+ +-----+ +-----+ +-----+
|app1 | |app2 | |app3 | |app4 |
+--+--+ +--+--+ +--+--+ +--+--+
| | | |
| | | |
+----v----+----v----+----v----+-----v---+
| | | | |
| 100GB | 100GB | 100GB | 100GB |
| | | | |
+---------+---------+---------+---------+
After app2 and app4 terminate, the layout becomes::
+-----+ +-----+
|app1 | |app3 |
+--+--+ +--+--+
| |
| |
+----v----+---------+----v----+---------+
| | | | |
| 100GB | 100GB | 100GB | 100GB |
| | | | |
+---------+---------+---------+---------+
The total free space is 200GB. However, a 200GB ``devdax`` mode
namespace cannot be created because the free space is not contiguous.
Persistent memory namespace management and resource tracking
------------------------------------------------------------
Due to the aforementioned fragmentation issue, persistent memory cannot
be managed in the same way as system memory. In other words,
dynamically creating and deleting persistent memory namespaces upon
VM creation and deletion will result in fragmentation and also make
persistent memory resource tracking a challenge.

The proposed approach is to use pre-created, fixed-size namespaces.
In other words, the cloud admin creates persistent memory namespaces of the
desired sizes before Nova is deployed on a certain host, and puts
the namespace information into the nova config file (details below).
The nova compute agent discovers the namespaces by parsing the config file
to determine what namespaces it can allocate to a guest. The discovered
persistent memory namespaces will be reported to the placement service
as inventories of a custom resource class associated with the ROOT
resource provider.
Custom resource classes are used to represent persistent memory namespace
resources. The naming convention of the custom resource classes is::

    CUSTOM_PMEM_NAMESPACE_$LABEL

``$LABEL`` is the variable part of the resource class name, defined by the
admin to be associated with a certain number of persistent memory namespaces.
It is normally the size of the namespaces in any desired unit.
It can also be a string describing the capacity -- such as 'SMALL',
'MEDIUM' or 'LARGE'. The admin shall properly define the value of '$LABEL'
for each namespace.
The association between ``$LABEL`` and persistent memory namespaces
is defined by a new configuration option 'CONF.libvirt.pmem_namespaces'.
This config option is a string in the following format::
"$LABEL:$NSNAME[|$NSNAME][,$LABEL:$NSNAME[|$NSNAME]]"
``$NSNAME`` is the name of the persistent memory namespace that falls
into the resource class named ``CUSTOM_PMEM_NAMESPACE_$LABEL``.
A name can be given to a persistent memory namespace upon creation via
the "-n/--name" option to the `ndctl`_ command.
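For example, the admin might create and name a ``devdax`` namespace roughly
as below (the size and name are illustrative; consult the ndctl documentation
for the options available in a given version)::

    ndctl create-namespace --mode=devdax --size=128G --name=ns0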
To give an example, on a certain host there might be the following
configuration::
"128G:ns0|ns1|ns2|ns3,262144MB:ns4|ns5,MEDIUM:ns6|ns7"
The interpretation of the above configuration is that this host has 4
persistent memory namespaces (ns0, ns1, ns2, ns3) of resource class
``CUSTOM_PMEM_NAMESPACE_128G``, 2 namespaces (ns4, ns5) of resource class
``CUSTOM_PMEM_NAMESPACE_262144MB``, and 2 namespaces (ns6, ns7) of resource
class ``CUSTOM_PMEM_NAMESPACE_MEDIUM``.
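The sketch below shows one way the libvirt driver could parse this option
into a label-to-namespaces mapping. The function name is hypothetical and is
not the actual Nova implementation.

.. code-block:: python

    def parse_pmem_namespaces(conf_value):
        """Parse a CONF.libvirt.pmem_namespaces style string.

        "128G:ns0|ns1|ns2|ns3,262144MB:ns4|ns5,MEDIUM:ns6|ns7" becomes
        {'128G': ['ns0', 'ns1', 'ns2', 'ns3'],
         '262144MB': ['ns4', 'ns5'],
         'MEDIUM': ['ns6', 'ns7']}
        """
        mapping = {}
        if not conf_value:
            return mapping
        for entry in conf_value.split(','):
            label, names = entry.split(':', 1)
            mapping[label.strip()] = [n.strip() for n in names.split('|') if n]
        return mapping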
The 'total' value of the inventory is the *number* of the
persistent memory namespaces belonging to this resource class.
The 'max_unit' is set to the same value as 'total' since it is possible
to attach all of the persistent memory namespaces in a certain resource
class to one instance.
The values of 'min_unit' and 'step_size' are 1.
The value of 'allocation_ratio' is 1.0.
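A minimal sketch of how the parsed mapping could be turned into inventory
records following the rules above (the helper name is hypothetical; the
``CUSTOM_PMEM_NAMESPACE_`` prefix comes from this spec):

.. code-block:: python

    def pmem_inventories(label_to_namespaces):
        """Build placement inventories from the parsed config mapping."""
        inventories = {}
        for label, namespaces in label_to_namespaces.items():
            total = len(namespaces)
            inventories['CUSTOM_PMEM_NAMESPACE_%s' % label] = {
                'total': total,
                'max_unit': total,  # one instance may consume all of them
                'min_unit': 1,
                'step_size': 1,
                'reserved': 0,
                'allocation_ratio': 1.0,
            }
        return inventories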
In the case of the above example, the response to a ``GET`` request for this
resource provider's inventories is::
"inventories": {
...
"CUSTOM_PMEM_NAMESPACE_128G": {
"allocation_ratio": 1.0,
"max_unit": 4,
"min_unit": 1,
"reserved": 0,
"step_size":1,
"total": 4
},
"CUSTOM_PMEM_NAMESPACE_262144MB": {
"allocation_ratio": 1.0,
"max_unit": 2,
"min_unit": 1,
"reserved": 0,
"step_size": 1,
"total": 2
},
"CUSTOM_PMEM_NAMESPACE_MEDIUM": {
"allocation_ratio": 1.0,
"max_unit": 2,
"min_unit": 1,
"reserved": 0,
"step_size": 1,
"total":2
},
...
}
Please note, this is just an example to show different ways to configure
persistent memory namespaces and how they are tracked. There is certainly
some flexibility in the naming of the resource classes. It is up to
the admin to configure the namespaces properly.
.. note::
Resource class names are opaque. For example, a request
for CUSTOM_PMEM_NAMESPACE_128GB cannot be fulfilled by a
CUSTOM_PMEM_NAMESPACE_131072MB resource even though they are
(presumably) the same size.
Different units do not convert freely from one to another while embedded
in custom resource class names. That is, a request for a 128GB persistent
memory namespace can be fulfilled by a CUSTOM_PMEM_NAMESPACE_128GB
resource, but cannot be fulfilled by a CUSTOM_PMEM_NAMESPACE_131072MB
resource even though they are of the same quantity.
Persistent memory is by nature NUMA sensitive. However, for the initial
iteration, the resource inventories are put directly under the ROOT resource
provider of the compute host. Persistent memory NUMA affinity will be
addressed by a separate follow-on spec.
A configuration change will stop the nova compute agent from
(re)starting if it removes any namespaces that are still in use
by guests.
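A hedged sketch of the intended check (names are illustrative, not the
actual Nova code):

.. code-block:: python

    def assert_no_used_namespace_removed(configured, in_use):
        """Refuse to (re)start the compute agent if a namespace that is
        still assigned to a guest was removed from the configuration.
        """
        missing = set(in_use) - set(configured)
        if missing:
            raise RuntimeError(
                'pmem namespaces still used by guests were removed from '
                'CONF.libvirt.pmem_namespaces: %s' % ', '.join(sorted(missing)))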
Virtual persistent memory specification
---------------------------------------
Virtual persistent memory information is added to guest hardware flavor
extra specs in the form of::
hw:pmem=$LABEL[,$LABEL]
``$LABEL`` is the variable part of a resource class name as defined
in the `Persistent memory namespace management and resource tracking`_
section. Each appearance of a '$LABEL' means a requirement for one
persistent memory namespace of the ``CUSTOM_PMEM_NAMESPACE_$LABEL``
resource class, so there can be multiple appearances of the same
$LABEL in one specification. To give an example::
hw:pmem=128GB,128GB
This means a resource requirement of two 128GB persistent memory
namespaces.
The libvirt domain specification requires each virtual persistent memory
device to be associated with one guest NUMA node. If guest NUMA topology
is specified in the flavor, the guest virtual persistent memory
devices are put under guest NUMA node 0. If guest NUMA topology is not
specified in the flavor, a guest NUMA node 0 is constructed implicitly
and all guest virtual persistent memory devices are put under it.
Please note, in the second case (implicitly constructing
a guest NUMA node 0), the construction of guest NUMA node 0 happens
in the libvirt driver while the guest libvirt domain specification
is being built. The NUMA topology logic in the scheduler is not
applied, and from the perspective of any other part of Nova, this
guest is still a non-NUMA guest.
Examples::
    One NUMA node, one 512GB virtual persistent memory device:

        hw:numa_nodes=1
        hw:pmem=512GB

    One NUMA node, two 512GB virtual persistent memory devices:

        hw:numa_nodes=1
        hw:pmem=512GB,512GB

    Two NUMA nodes, two 512GB virtual persistent memory devices:

        hw:numa_nodes=2
        hw:pmem=512GB,512GB

        Both virtual persistent memory devices are put under NUMA node 0.

    No guest NUMA topology, two 512GB virtual persistent memory devices:

        hw:pmem=512GB,512GB

        A guest NUMA node 0 is constructed implicitly.
        Both virtual persistent memory devices are put under it.
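In every case above, each virtual persistent memory device ends up as an
``nvdimm`` memory device in the libvirt domain. A ``devdax``-backed device
under guest NUMA node 0 looks roughly like the fragment below; the path and
sizes are made up, and the exact elements Nova generates may differ::

    <memory model='nvdimm'>
      <source>
        <path>/dev/dax0.0</path>
        <alignsize unit='KiB'>2048</alignsize>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>134217728</size>
        <node>0</node>
        <label>
          <size unit='KiB'>2048</size>
        </label>
      </target>
    </memory>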
.. important::

   QEMU does not support backing one virtual persistent memory device
   with multiple physical persistent memory namespaces, no matter whether
   they are contiguous or not. So any virtual persistent memory device
   requested by a guest is backed by one physical persistent memory
   namespace of the exact same resource class.
The extra specs are translated to placement API requests accordingly.
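A sketch of that translation; the helper name is hypothetical, and real code
would feed the result into the request spec / placement query machinery:

.. code-block:: python

    import collections

    def pmem_extra_spec_to_resources(extra_specs):
        """Turn 'hw:pmem=128GB,128GB' into placement resource amounts.

        The example above yields {'CUSTOM_PMEM_NAMESPACE_128GB': 2}.
        """
        spec = extra_specs.get('hw:pmem')
        if not spec:
            return {}
        counts = collections.Counter(
            label.strip() for label in spec.split(',') if label.strip())
        return {'CUSTOM_PMEM_NAMESPACE_%s' % label: amount
                for label, amount in counts.items()}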
Virtual persistent memory disposal
----------------------------------
Due to the persistent nature of host PMEM namespaces, the content
of virtual persistent memory in guests shall be zeroed out immediately
once the virtual persistent memory is no longer associated with any VM
instance (cases like VM deletion, cold/live migration, shelve, evacuate,
etc.). Otherwise there will be security concerns.
Since persistent memory devices are typically of large size, this may
introduce a performance penalty to guest deletion or any other actions
involving erasing PMEM namespaces.
The standard I/O APIs (read/write) cannot be used with DAX (direct access)
devices. The nova compute libvirt driver uses the `daxio`_ utility (wrapped
by privsep library functions) for this purpose.
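A minimal sketch of the zeroing step; the real driver is expected to wrap
this in a privsep helper, and the device path here is hypothetical:

.. code-block:: python

    from oslo_concurrency import processutils

    def zero_pmem_namespace(devdax_path):
        """Wipe a devdax namespace, e.g. /dev/dax0.0, using daxio."""
        # 'daxio --zero' fills the device DAX output with zeros.
        processutils.execute('daxio', '-z', '-o', devdax_path,
                             run_as_root=True)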
VM rebuild
----------
The persistent memory namespaces are zeroed out during VM rebuild to
return the VM to its initial state.
VM resize
---------
Adding new virtual persistent memory devices to an instance is allowed.
As for a concrete virtual persistent memory device, changing its backing
namespace's resource class is not allowed. This is because it is not
always possible to compare the sizes of two namespaces in different resource
classes (e.g. CUSTOM_PMEM_NAMESPACE_128G and CUSTOM_PMEM_NAMESPACE_MEDIUM).
By default the content of the original virtual persistent memory is copied
to the new virtual persistent memory (if there is one). This could be time
consuming, so a flavor extra spec is introduced as a flag::
hw:allow_pmem_copy=true|false (default false)
If either the source or target has this flag set to ``true``, the
data in virtual persistent memory is copied.
If both the source and target have this flag set to ``false``, the
data in virtual persistent memory is not copied. This (not copying data)
is useful in scenarios where virtual persistent memory is used as a cache.
On a graceful shutdown (which resize does), the data in the cache
is flushed, so there is no need to copy the data.
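The decision boils down to the sketch below (illustrative only; real code
would read the flag from the old and new flavors of the resize):

.. code-block:: python

    from oslo_utils import strutils

    def should_copy_pmem_data(old_extra_specs, new_extra_specs):
        """Copy vPMEM contents on resize if either flavor opts in."""
        def _flag(extra_specs):
            return strutils.bool_from_string(
                extra_specs.get('hw:allow_pmem_copy', 'false'))
        return _flag(old_extra_specs) or _flag(new_extra_specs)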
The nova compute libvirt driver uses the daxio_ utility to read the data from
the source persistent memory namespace and write it into the target
persistent memory namespace.
If the source and target persistent memory namespaces are not on the
same host, an ssh tunnel is used to channel the data transfer. This ssh
tunnel uses the same ssh key used for moving VM disks during resize. Copying
over the network will not work in the case of cross-cell resize since there
is no direct connectivity between the source and target hosts.
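A sketch of the same-host copy path; paths are hypothetical, and the
cross-host case pipes the daxio output through the ssh tunnel mentioned above:

.. code-block:: python

    from oslo_concurrency import processutils

    def copy_pmem_namespace(src_devdax, dst_devdax):
        """Copy one devdax namespace to another of the same resource class."""
        processutils.execute('daxio', '-i', src_devdax, '-o', dst_devdax,
                             run_as_root=True)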
Live migration
--------------
Live migration with virtual persistent memory is supported by QEMU.
QEMU treats virtual persistent memory as volatile memory during
live migration. The migration just takes longer due to the typically large
capacity of virtual persistent memory.
Virtual persistent memory hotplug
---------------------------------
This spec does not address the hot plugging of virtual persistent memory.
VM snapshot
-----------
The current VM snapshots do not include memory images. For the current
phase, virtual persistent memory images are not included in VM snapshots.
In the future, virtual persistent memory images could be stored in Glance as
a separate image format, and flavor extra specs could be used to specify
whether to save the virtual persistent memory image during a VM snapshot.
VM shelve/unshelve
------------------
Shelving a VM uploads the VM snapshot to the Glance service. Since the
virtual persistent memory image is not included in the VM snapshot,
VM shelve/unshelve does not automatically save/restore the virtual
persistent memory for the current iteration.

As with snapshots, saving/restoring virtual persistent memory images could be
supported once persistent memory images can be stored in Glance.
The persistent memory namespaces belonging to a shelved VM are zeroed out
after the VM is shelve-offloaded.
Alternatives
------------
Persistent memory namespaces could be created/destroyed on the fly upon VM
creation/deletion. This way is more flexible than the fixed-size
approach; however, it will result in fragmentation as detailed in the
`Background`_ section.

Another fixed-size model, other than the proposed one, could
be to evenly partition the entire persistent memory space into namespaces
of the same size and set the ``step_size`` of the persistent
memory inventory to the size of each namespace. However, this
model assumes a larger namespace can be assembled from multiple smaller
namespaces (a 256GB persistent memory requirement may land on 2x128GB
namespaces), which is not the case.
Persistent memory demonstrates a certain similarity with block devices
in its non-volatile nature and life cycle management. It would be possible
to fit it into the block device mapping (BDM) interface. However, NUMA
affinity support is planned for persistent memory, and BDM is not
the ideal interface to describe NUMA.
Data model impact
-----------------
A new VirtualPMEM object is introduced to track the virtual PMEM information
of an instance. It stands for a virtual persistent memory device backed
by a physical persistent memory namespace:
.. code-block:: python
class VirtualPMEM(base.NovaObject):
# Version 1.0: Initial version
VERSION = "1.0"
fields = {
'rc_name': fields.StringField(),
'ns_name': fields.StringField(nullable=True),
}
In addition a VirtualPMEMList object is introduced to represent a list
of VirtualPMEM objects:
.. code-block:: python
class VirtualPMEMList(base.NovaObject):
# Version 1.0: Initial version
VERSION = "1.0"
fields = {
'vpmems': fields.ListOfObjectsField('VirtualPMEM'),
}
A 'vpmems' deferred-load column is added to class InstanceExtra,
which stores a serialized VirtualPMEMList object for a given instance:
.. code-block:: python
class InstanceExtra(BASE, NovaBase, models.SoftDeleteMixin):
...
migration_context = orm.deferred(Column(Text))
keypairs = orm.deferred(Column(Text))
trusted_certs = orm.deferred(Column(Text))
vpmems = orm.deferred(Column(Text))
instance = orm.relationship(Instance,
...
Two new fields are introduced to MigrationContext to hold the
old and new virtual persistent memory devices during migration:
.. code-block:: python
class MigrationContext(base.NovaPersistentObject, base.NovaObject):
...
fields = {
'old_pci_requests': fields.ObjectField('InstancePCIRequests',
nullable=True),
'new_vpmems': fields.ObjectField('VirtualPMEMList',
nullable=True),
'old_vpmems': fields.ObjectField('VirtualPMEMList',
nullable=True),
}
REST API impact
---------------
Flavor extra specs already accept arbitrary data.
No new microversion is introduced.
Security impact
---------------
Host persistent memory namespaces need to be erased (zeroed) before being reused.
Notifications impact
--------------------
None.
Other end user impact
---------------------
End users choose flavors with desired virtual persistent memory sizes.
Performance Impact
------------------
PMEM namespaces tend to be large. Zeroing out a persistent memory
namespace requires a considerable amount of time. This may introduce
a negative performance impact when deleting a guest with large
virtual persistent memory devices.
Other deployer impact
---------------------
The deployer needs to create persistent memory namespaces of the desired
sizes before nova is deployed on a certain host.
Developer impact
----------------
None.
Upgrade impact
--------------
None.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
xuhj
Other contributors:
luyaozhong
rui-zang
Work Items
----------
* Object: add DB model and Nova object.
* Compute: virtual persistent memory life cycle management.
* Scheduler: translate virtual persistent memory request to
placement requests.
* API: parse virtual persistent memory flavor extra specs.
Dependencies
============
* Kernel version >= 4.2
* QEMU version >= 2.9.0
* Libvirt version >= 5.0.0
* ndctl version >= 4.7
* daxio version >= 1.6
Testing
=======
Unit tests.
Third party CI is required for testing on real hardware.
Persistent memory nested virtualization works for QEMU/KVM, so for
the third party CI, tempest tests are executed in a VM with
virtual persistent memory backed by physical persistent memory.
Documentation Impact
====================
The cloud administrator docs need to describe how to create
and configure persistent memory namespaces. Add a persistent
memory section to the Nova "advanced configuration" document.
The end user needs to be made aware of this feature. Add the
flavor extra spec details to the Nova flavors document.
References
==========
.. _`persistent memory`: http://pmem.io/
.. _redis: https://redislabs.com/blog/persistent-memory-and-redis-enterprise/
.. _rocksdb: http://istc-bigdata.org/index.php/nvmrocks-rocksdb-on-non-volatile-memory-systems/
.. _oracle: https://blogs.vmware.com/apps/2018/09/accelerating-oracle-performance-using-vsphere-persistent-memory-pmem.html
.. _`SAP HANA`: https://blogs.sap.com/2018/12/03/sap-hana-persistent-memory/
.. _Aerospike: https://www.aerospike.com/resources/videos/aerospike-intel-persistent-memory-2/
.. _`pmem namespaces`: http://pmem.io/ndctl/ndctl-create-namespace.html
.. _`virtual NVDIMM backends`: https://github.com/qemu/qemu/blob/19b599f7664b2ebfd0f405fb79c14dd241557452/docs/nvdimm.txt#L145
.. _`NVDIMM Linux kernel document`: https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
.. _ndctl: http://pmem.io/ndctl/
.. _daxio: http://pmem.io/pmdk/daxio/
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Train
- Introduced