..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Support volume local cache
==========================

https://blueprints.launchpad.net/nova/+spec/support-volume-local-cache

This blueprint proposes to add support for volume local cache in Nova. Cache
software such as open-cas [4]_ can use a fast NVMe SSD or persistent memory
to cache a slow remote volume.

Problem description
===================

There are now various types of fast NVMe SSDs, such as the Intel Optane SSD,
with latency as low as 10 us. In addition, persistent memory, which aims at
SSD-like capacity with near-DRAM speed, is becoming popular; its typical
latency can be as low as hundreds of nanoseconds. By contrast, the typical
latency of a remote volume for a VM (iSCSI / rbd) is at the millisecond
level. These fast SSDs or persistent memory can therefore be attached
locally on compute nodes and used as a cache for remote volumes.

Caching software such as open-cas makes this straightforward: you specify a
block device as the cache device, and it can then cache other block devices.
This is transparent to both the upper and lower layers. The guest does not
know it is using an emulated block device, and the backend volume does not
know it is being cached; the data in the backend volume is not changed by
the cache. That means even if the cache is lost for some reason, the backend
volume can be mounted elsewhere and is available immediately. This spec adds
volume local cache using such caching software.
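
To illustrate how little setup such caching software needs, the sketch below
builds the open-cas management commands for pairing a cache device with a
backend volume. The exact ``casadm`` flags are the author's reading of the
open-cas documentation and should be treated as assumptions, not a verified
interface:

```python
def build_opencas_setup(cache_dev, core_dev, cache_id=1):
    """Build casadm commands pairing a cache device with a backend volume.

    Assumed flags (verify against your open-cas version): -S starts a
    cache instance, -A adds a core device to it, -i selects the instance.
    """
    return [
        # Start cache instance <cache_id> on the fast local device.
        ["casadm", "-S", "-i", str(cache_id), "-d", cache_dev],
        # Attach the slow backend volume as a core device; open-cas then
        # exposes an emulated block device such as /dev/cas1-1.
        ["casadm", "-A", "-i", str(cache_id), "-d", core_dev],
    ]
```

The emulated device produced by the second step is what the guest actually
reads and writes, which is why both layers stay unaware of the cache.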

Like all local cache solutions, multi-attach cannot work, because the cache
on node1 does not know about changes made to the backend volume by node2.

This feature requires the cache mode "Write-Through", which ensures that the
cache is fully synced with the backend volume at all times. Given this, it
is transparent to live migration. "Write-Through" is also the default cache
mode for open-cas.
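
To make the "fully synced" claim concrete: a write-through cache writes to
the backend on every write, so the backend is never stale. A minimal, purely
illustrative sketch (not the open-cas implementation):

```python
class WriteThroughCache:
    """Toy write-through cache: every write goes to the backend, so
    losing the cache never loses data."""

    def __init__(self, backend):
        self.backend = backend      # dict of block -> data (slow volume)
        self.cache = {}             # fast local copy

    def write(self, block, data):
        self.backend[block] = data  # backend updated synchronously
        self.cache[block] = data    # cache kept in sync

    def read(self, block):
        if block in self.cache:     # cache hit: fast path
            return self.cache[block]
        data = self.backend[block]  # cache miss: fetch and populate
        self.cache[block] = data
        return data
```

Because the backend is updated on every write, dropping the cache at any
point leaves the backend immediately usable elsewhere, which is exactly why
this mode is transparent to live migration.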

This feature can only cache backend volumes that are first mounted on the
host OS as block devices. Volumes mounted by QEMU itself (where
LibvirtNetVolumeDriver is used), such as rbd and sheepdog, cannot be cached.
Details can be found in the list libvirt_volume_drivers in [5]_.
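
A hypothetical helper makes the distinction explicit: only attachments that
appear on the host as block devices can be intercepted by a local cache. The
function name and the set below are invented for illustration; the real
check would consult the libvirt_volume_drivers list:

```python
# Volume types mounted by QEMU itself (LibvirtNetVolumeDriver) never show
# up as host block devices, so a local cache cannot intercept them.
# Illustrative subset only.
QEMU_MOUNTED_TYPES = {"rbd", "sheepdog"}

def is_locally_cacheable(driver_volume_type):
    """Return True if the volume is attached via a host block device
    (e.g. iscsi, fibre_channel, nvmeof) rather than inside QEMU."""
    return driver_volume_type not in QEMU_MOUNTED_TYPES
```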

In some high performance environments, RDMA may be chosen. RDMA effectively
narrows the latency gap between local and remote volumes. In an experimental
environment, without a network switch and without read/write I/O to a real
volume, the point-to-point RDMA network link latency can be as low as 3 us
in the best case. This is the pure network link latency, and it does not
mean RDMA is faster than local PCIe, because the RDMA NICs in the host and
target machines are themselves PCIe devices. For the RDMA scenario,
persistent memory is recommended as the cache device; otherwise there may be
no performance gain.

Use Cases
---------

Users want to use fast NVMe SSDs to cache remote slow volumes. This is
extremely useful in clouds where operators want to boost disk I/O
performance for specific volumes.

Proposed change
===============

All volumes cached by the same cache instance share the same cache mode. The
operator can change the cache mode dynamically using the cache software's
management tool. os-brick just accepts the cache name and cache IDs from
Nova. The cache name identifies which cache software to use; currently only
'opencas' is supported. More than one cache instance is allowed on a single
compute node. The cache IDs identify the cache instances that can be used.
The cache mode is transparent to os-brick.
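
The "find the best instance" step delegated to os-brick could look roughly
like the sketch below. The selection criteria (most free space, then fewest
cached volumes) and the stats fields are assumptions based on this spec's
description, with the real numbers coming from the cache admin tool such as
casadm:

```python
def pick_cache_instance(allowed_ids, stats):
    """Pick the best cache instance among the configured IDs.

    stats maps instance id -> dict with 'free_bytes' and 'num_volumes'
    (as would be reported by the cache admin tool). Prefer the instance
    with the most free space, breaking ties on fewer cached volumes.
    """
    candidates = [i for i in allowed_ids if i in stats]
    if not candidates:
        raise LookupError("no usable cache instance among %s" % allowed_ids)
    return max(candidates,
               key=lambda i: (stats[i]["free_bytes"],
                              -stats[i]["num_volumes"]))
```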

A compute capability is mapped to the trait (e.g. COMPUTE_SUPPORT_VOLUME_CACHE)
and the libvirt driver sets this capability to true if a cache instance id
is configured in the nova conf. To have a volume cached, the volume should
first belong to a volume type with the "cacheable" property. Then select a
flavor whose extra specs contain this trait, so the guest lands on a host
machine with the cache capability. To keep a volume uncached, just select a
flavor without this trait.
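
Putting the conditions together, the decision of whether a given volume on a
given host ends up cached can be sketched as follows. The helper name and
volume representation are illustrative, not the actual Nova code; the
``trait:...=required`` extra spec format is the standard flavor syntax for
required traits:

```python
def should_cache_volume(volume, flavor_extra_specs, conf_driver):
    """A volume is cached only when all three parties agree:

    - cinder marked the volume "cacheable" (never set for multiattach),
    - the flavor required the trait, so the scheduler picked a capable host,
    - the host actually has a cache driver configured in the nova conf.
    """
    wants_trait = flavor_extra_specs.get(
        "trait:COMPUTE_SUPPORT_VOLUME_CACHE") == "required"
    return (volume.get("cacheable", False)
            and wants_trait
            and bool(conf_driver))
```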

If a failure happens while setting up caching, e.g. the cache device is
broken, then the request is re-scheduled.

Final architecture would be something like::

     Compute Node

  +---------------------------------------------------------+
  |                                                         |
  |                  +-----+    +-----+       +-----+       |
  |                  | VM1 |    | VM2 |  ...  | VMn |       |
  |                  +--+--+    +--+--+       +-----+       |
  |                     |          |                        |
  +---------------------------------------------------------+
  |                     |          |                        |
  |  +---------+  +-----+----------+-------------+          |
  |  |  Nova   |  |         QEMU Virtio          |          |
  |  +-+-------+  +-----+----------+----------+--+          |
  |    |                |          |          |             |
  |    | attach/detach  |          |          |             |
  |    |          +-----+----------+------+   |             |
  |  +-+-------+  | /dev/cas1  /dev/cas2  |   |             |
  |  | osbrick +--+                       |   |             |
  |  +---------+  | casadm     open cas   |   |             |
  |               +-+---+----------+------+   |             |
  |                 |   |          |          |             |    Storage
  |                 | +-+--------+ |          |     rbd     | +---------+
  |                 | | /dev/sdd +-+----------+-------------+-+  Vol1   |
  |                 | +----------+ |          |             | +---------+
  |           +-----+-----+        |          |             | |  Vol2   |
  |           | Fast SSD  |      +-+--------+ | iscsi/fc/...| +---------+
  |           +-----------+      | /dev/sdc +-+-------------+-+  Vol3   |
  |                              +----------+ |             | +---------+
  |                                           | iscsi/fc/...| |  Vol4   |
  |                                         +-+--------+    | +---------+
  |                                         | /dev/sdb +----+-+  Vol5   |
  |                                         +----------+    | +---------+
  |                                                         | |  .....  |
  +---------------------------------------------------------+ +---------+

Changes would include:

* Cache the volume during connecting volume

  In function _connect_volume():

  - Check whether the volume should be cached. Cinder sets the "cacheable"
    property on the volume if caching is allowed. If "cacheable" is set and
    volume_local_cache_driver in CONF is not empty, then do caching;
    otherwise skip caching.

  - attach_cache is called before attach_encryptor, so the cache lays under
    the encryptor. This keeps encrypted volumes secure: no decrypted data is
    written to the cache device.

  - Call os-brick to cache the volume [2]_. os-brick will call the cache
    software to set up the cache, and then replace the path of the original
    volume with the emulated volume.

  - Nova goes ahead with _connect_volume using the newly emulated volume
    path.

  - If any failure happens while setting up caching, just ignore the failure
    and continue with the rest of _connect_volume().

* Release the cache during disconnecting volume

  In function _disconnect_volume():

  - Call os-brick to release the cache for the volume. os-brick will
    retrieve the path of the original volume from the emulated volume, and
    then replace the path in connection_info with the original volume path.

  - Nova goes ahead with _disconnect_volume using the original volume path.

* Add a switch in nova-cpu.conf to enable/disable local cache

  Suggested option names:

  - volume_local_cache_driver: Specifies which cache software to use.
    Currently only 'opencas' is supported. If it is empty, local cache is
    disabled.

  - volume_local_cache_instance_ids: Specifies the cache instances that can
    be used. Typically open-cas has only one cache instance in a single
    server, but it can have more than one cache instance, each bound to a
    different cache device. Nova needs to pass the instance IDs to os-brick
    and let os-brick find the best one, e.g. the one with the most free
    space or the fewest cached volumes. All this information can be
    retrieved from an instance ID via the cache admin tool, e.g. casadm.

  Suggested section: [compute]. Configuration would be like::

    [compute]
    volume_local_cache_driver = 'opencas'
    volume_local_cache_instance_ids = 1,15,222

  Instance IDs are separated by commas.
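
A minimal sketch of how these two options could be parsed and validated; the
option names come from this spec, while the parsing details (quote
stripping, integer IDs) are assumptions for illustration:

```python
def parse_cache_conf(driver, instance_ids_raw):
    """Parse the two suggested options. An empty driver disables the
    feature entirely; IDs are a comma-separated list of integers."""
    driver = (driver or "").strip().strip("'\"")
    if not driver:
        return None, []  # local cache disabled
    if driver != "opencas":
        raise ValueError(
            "unsupported volume_local_cache_driver: %s" % driver)
    ids = [int(part) for part in instance_ids_raw.split(",")
           if part.strip()]
    return driver, ids
```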

Nova calls os-brick to set up the cache for a volume only when the volume
has the "cacheable" property and the flavor requested such caching. Cinder
determines and sets the property, just like it does for volume encryption.
If the volume has the "multiattach" property, cinder will not set
"cacheable" on it. The code work flow would be like::

      Nova                                  osbrick

              +                               |
              |                               |
              v                               |
        attach_volume                         |
              +                               |
              |                               |
              +                               |
        attach_cache                          |
              +                               |
              |                               |
              +                               |
  +-----------+ volume_with_cache_property?   |
  |           +                               |
  | No        | Yes                           |
  |           +                               |
  |    +------+ Host_with_cache_capability?   |
  |    |      +                               |
  |    | No   | Yes                           |
  |    |      |                               |
  |    |      +------------------------------> attach_volume
  |    |      |                                     +
  |    |      |                                     |
  |    |      |                                     +
  |    |      |                          set_cache_via_casadm
  |    |      |                                     +
  |    |      |                                     |
  |    |      |                                     +
  |    |      |                        return emulated_dev_path
  |    |      |                                     +
  |    |      |                                     |
  |    |      +-------------------------------------+
  |    |      |
  |    |      v
  |    |   replace_device_path
  |    |      +
  |    |      |
  v    v      v

      attach_encryptor and
      rest of attach_volume
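
The work flow above can be expressed in Python; the helper names mirror the
boxes in the chart and are illustrative, not the actual Nova method names:

```python
def attach_cache(connection_info, volume, host_has_cache, brick_connector):
    """Mirror of the flow chart: fall through to the plain path unless
    both the volume property and the host capability are present."""
    if not volume.get("cacheable"):
        return connection_info          # No branch: volume not cacheable
    if not host_has_cache:
        return connection_info          # No branch: host cannot cache
    # Yes/Yes branch: os-brick sets up the cache (e.g. via casadm) and
    # returns the emulated device path, which replaces the original one
    # before attach_encryptor and the rest of attach_volume run.
    emulated = brick_connector(connection_info["device_path"])
    return dict(connection_info, device_path=emulated)
```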

* A volume local cache laying above the encryptor would have better
  performance, but it would expose decrypted data on the cache device. So
  for security reasons, the cache lays under the encryptor in the Nova
  implementation.

Code implementation can be found in [1]_ [2]_ [3]_.
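
The ordering can be demonstrated with a toy encryptor stacked above a
write-through cache: because encryption happens first, only ciphertext ever
reaches the cache device. The XOR cipher is purely illustrative, standing in
for the real encryptor layer:

```python
def xor_encrypt(data, key=0x5A):
    """Toy stand-in for the real encryptor layer (XOR is symmetric,
    so applying it twice restores the plaintext)."""
    return bytes(b ^ key for b in data)

def guest_write(plaintext, cache_device, backend):
    """Guest -> encryptor -> cache -> backend: the cache only ever sees
    encrypted bytes, so a stolen cache device leaks no plaintext."""
    ciphertext = xor_encrypt(plaintext)   # encryptor sits above the cache
    cache_device.append(ciphertext)       # write-through to local cache...
    backend.append(ciphertext)            # ...and to the backend volume
```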

Alternatives
------------

* Assign a local SSD to a specific VM. The VM can then use bcache internally
  against the ephemeral disk to cache its volume if it wants.

  The drawbacks may include:

  - Only one VM is accelerated. The fast SSD's capacity cannot be shared by
    other VMs. Unlike RAM, an SSD is normally at the TB level and large
    enough to cache for all the VMs on one node.

  - The owner of the VM has to set up the cache explicitly. Not all VM
    owners want to do this, and not all VM owners have the knowledge to do
    it; but they certainly want better volume performance by default.

* Create a dedicated cache cluster. Mount all the cache devices (NVMe SSDs)
  in the cache cluster as a big cache pool, then allocate a certain amount
  of cache to a specific volume. The allocated cache can be mounted on the
  compute node through the NVMe-oF protocol, and the same cache software can
  then be used in the same way.

  But this pits local PCIe against the remote network. The disadvantage of
  this approach is that the network of the storage server becomes the
  bottleneck.

  - Latency: a storage cluster typically provides volumes through the
    iscsi/fc protocol, or through librbd if ceph is used. The latency is at
    the millisecond level. Even with NVMe over TCP, the latency is hundreds
    of microseconds, depending on the network topology. By contrast, the
    latency of an NVMe SSD is around 10 us, taking the Intel Optane SSD
    P4800X as an example.

* A cache can be added on the backend storage side, e.g. in ceph. Storage
  servers normally have their own cache mechanisms, e.g. using memory or an
  NVMe SSD as cache.

  Similar to the above solution, latency is the disadvantage.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

* The cache software removes the cached volume data from the cache device
  when the volume is detached, but normally it does not erase the related
  sectors on the cache device. So in theory the volume data is still on the
  cache device until it is overwritten. Volumes with encryption do not have
  this security impact, because the encryptor lays above the volume local
  cache.

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

* The latency of VM volumes will decrease

Other deployer impact
---------------------

* The options volume_local_cache_driver and volume_local_cache_instance_ids
  should be set in nova-cpu.conf to enable this feature. The default value
  of volume_local_cache_driver is an empty string, which means local cache
  is disabled.

Developer impact
----------------

This is only for libvirt; other drivers such as VMware and Hyper-V will not
be changed. This is because open-cas only supports Linux, and libvirt is the
most used driver. Accordingly, this spec/implementation will only be tested
with libvirt.

Upgrade impact
--------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Liang Fang <liang.a.fang@intel.com>

Feature Liaison
---------------

Feature liaison:
  gibi

Work Items
----------

* Add the COMPUTE_SUPPORT_VOLUME_CACHE trait to os-traits

* Add a new compute capability that maps to this trait

* Enable this capability in the libvirt driver if a cache is configured

* Cache the volume during connecting volume

* Release the cache during disconnecting volume

* Add a switch to enable / disable this feature

* Add unit tests

Dependencies
============

* os-brick patch: [2]_
* cinder patch: [3]_

Testing
=======

* New unit tests should be added

* One of the tempest jobs should be changed to enable this feature, with
  open-cas, on a vanilla worker image

  - This can use open-cas with a local file as the NVMe device.

  - Check whether the emulated volume is created for the VM.

  - Check whether the emulated volume is released when deleting the VM.

Documentation Impact
====================

* Documentation needs to be changed to describe this feature and include the
  new options: volume_local_cache_driver and volume_local_cache_instance_ids.

References
==========

.. [1] https://review.opendev.org/#/c/663542/
.. [2] https://review.opendev.org/#/c/663549/
.. [3] https://review.opendev.org/#/c/700799/
.. [4] https://open-cas.github.io/
.. [5] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Ussuri
     - Introduced
   * - Victoria
     - Re-proposed