diff --git a/specs/newton/approved/libvirt-instance-storage.rst b/specs/newton/approved/libvirt-instance-storage.rst new file mode 100644 index 000000000..cfc08d145 --- /dev/null +++ b/specs/newton/approved/libvirt-instance-storage.rst @@ -0,0 +1,242 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================================= +Persist libvirt instance storage metadata +========================================= + +https://blueprints.launchpad.net/nova/+spec/libvirt-instance-storage + +Libvirt ephemeral storage layout is currently mostly inferred based on the +local configuration of the compute node. This is problematic in several cases. +In edge cases, it has been the recent cause of several severe security +vulnerabilities. It also makes storage configuration hard or impossible to vary +between compute nodes in the same installation, or over time after +installation. By persisting storage metadata of a particular instance +explicitly we make its configuration unambiguous and simple to understand, and +therefore less vulnerable to security issues. We also make it possible to +unambiguously represent multiple storage configurations within a single +installation. While we don't make the necessary changes to correctly handle +multiple storage configurations, by describing them unambiguously we lay a +foundation for future work to enable this. + + +Problem description +=================== + +There are several problems with the libvirt ephemeral storage code: + +- The code has been expanded beyond its original design with the addition of + each new backend type (LVM/RBD/ploop), but has never been redesigned to + accomodate these substantially different models. + +- Storage layout is inferred from 2 config variables on the compute node: + libvirt.images_type and use_cow_images. An instance which was created on + another compute node with different values for these config variables, or + which was created before the values of these config variables were changed, + will be incorrectly handled. This will lead to failure at best. At worst it + will create a security vulnerability. + +- The imagebackend code uses a single method, cache(), to create both disks + from glance images, and disks from templates (i.e. blank filesystems or swap + disks). These are then handled differently by different backends. Writing to + the image cache is done by the individual backends, which use the image cache + differently due to their different natures. To do this, backends must + differentiate between glance images and templates, but the interface does not + permit them to do this directly. The Raw backend greps 'image_id' from the + argument passed to its template function. The LVM backend uses + 'ephemeral_size'. The Ploop backend uses 'context' and 'image_id', and + independently fetches glance metadata. The cache() interface needs to be + changed to reflect its usage. + +- The cache() interface does not provide the backend with any metadata about + disk image it is being given to import. Consequently it must either infer it + heuristically or inspect it. Both methods are prone to error and potential + security bugs. The replacement for cache() must allow the backend to + determine in advance the format and size of the disk it is importing. + +- The Raw backend is badly named, as disks using the Raw backend may be either + raw or qcow2. The Raw backend first inspects the disk it is importing (see + problem above), then writes its format to a local file called disk.info. + Storing the format of the disk means that there is no need to inspect the + disk at boot time, which prevents a severe security flaw. However, we can do + much better than this. disk.info is used inconsistently between the Qcow2 and + Raw backends, and other backends do not use it at all. It is also local to a + single compute node, so cannot be used to determine storage layout during a + migration. + +Use Cases +--------- + +Developer: Reduce bugs, in particular security bugs, by creating a single, +canonical, persistent repository for disk metadata. + +Developer: Enable future backend development work by removing poorly understood +heuristics and tight coupling. + +Developer: By storing disk layout per instance rather than compute node, enable +the future development of features to: + +- migrate between disk layouts. + +- implement different per-instance storage policies (e.g. SSD vs spinning + rust). + +- track the process of upgrading disk layouts during an upgrade. + +Proposed change +=============== + +We need to make 2 changes: + +1 Split the cache() method into 2 separate methods: + create_from_image(image_id), and create_from_func(func, size). + +2 Create a persistent record of a disk's layout before creating the disk. This + includes at least: the backend in use, disk format, and size. This persistent + record must be extensible so that, for example, in the future we can specify + multiple local storage locations for qcow2 disks, and choose between them. + +The first change involves a substantial refactoring of libvirt's imagebackend +module. This should be achieved with a minimum of functional change. + +The second change depends on the first. With the first change in place we have +enough context when creating a disk to know how big it is, what format it is, +and where it should go. We will implement library code which persists this +information for the disk, and then calls out to the relevant backend. It will +be stored in a virt driver specific field we will add to BlockDeviceMapping: +driver_info. Higher level code will treat this as an opaque blob. The libvirt +driver will treat it as a serialised versioned object: LibvirtDiskMetadata. +LibvirtDiskMetadata will initially contain the metadata mentioned above. + +Alternatives +------------ + +There are undoubtedly other ways to achieve this. The important principal, +though, is that the driver should have a persistent, unambiguous record of how +an instance's storage is laid out. I have picked this one. + +Data model impact +----------------- + +A driver_info column will be added to the block_device_mapping table with type +Text. The column will be nullable, and will initially be unpopulated. It will +not be indexed, or have any associated constraints. + +The column will be populated by the driver when performing any operation on a +disk and it is found to be unpopulated. It will initially take values based on +the current behaviour of the driver. + +REST API impact +--------------- + +None + +Security impact +--------------- + +This change does not directly impact security, but by simplifying an area of +code which has been the source of several severe security bugs it should +indirectly improve security. Specifically, by ensuring we always know the +format of a virtual disk we should never have to perform insecure format +inspection. + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +When reading block device mappings for the driver, we will need to additionally +pull back driver_info, which will add some small overhead. More significantly, +with this change the driver will update the BlockDeviceMapping object for each +disk during boot and similar operations. We should be able to reduce the impact +of this update for instances with multiple disks by batching them in a single +update to a BlockDeviceMappingList object. This would require only one round +trip to conductor. + +Other deployer impact +--------------------- + +None + +Developer impact +---------------- + +The change adds a driver_info field to the BlockDeviceMapping object, and uses +it in the libvirt driver. Other drivers may also use this field, although this +change does not define how they should do that. + + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + mbooth-9 + +Other contributors: + None + +Work Items +---------- + +- Refactor libvirt.imagebackend to split up cache() + +- Add persistent metadata storage in BlockDeviceMapping + + +Dependencies +============ + +None + + +Testing +======= + +Primarily, this should introduce **no functional change**. Its purpose is to +enable future change. Consequently, to the greatest extent possible, all +existing tests should continue to run with a minimum of change. + +Tempest should require no changes. + +Unit tests will likely have significant churn due to changing internal +interfaces, but the scenarios covered should be at least the same as +previously. + +Note that Jenkins currently only tests the Qcow2 and Rbd(ceph) backends +in the gate. All current libvirt tempest jobs run by Jenkins use the +default Qcow2 backend except gate-tempest-dsvm-full-devstack-plugin-ceph, which +uses Rbd. We additionally coverage of the ploop backend in +check-dsvm-tempest-vz7-exe-minimal run by Virtuozzo CI. This means that we +currently have no gate coverage of the Raw and Lvm backends. + + +Documentation Impact +==================== + +None + +References +========== + +None + + +History +======= + +None