[Spec] NFS related improvement for filesystem driver

Change-Id: Ic9b316a284641f4c5f33fe4238e08cf1d0faf2a1
Abhishek Kekane 2024-04-29 06:39:47 +00:00
parent 3bdda0e98f
commit 52923f9fdd
2 changed files with 228 additions and 1 deletions


@@ -0,0 +1,221 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===========================================================
Improve filesystem store driver to utilize NFS capabilities
===========================================================
https://blueprints.launchpad.net/glance/+spec/improve-filesystem-driver
Problem description
===================
The filesystem backend of glance can be used with an NFS share mounted as a
local filesystem, so no NFS-specific configuration needs to be stored on the
glance side. Glance does not know the NFS server address or the NFS share
path at all; it simply assumes that each image is stored on the local
filesystem. The downside of this assumption is that glance is not aware
whether the NFS server is connected/available or whether the NFS share is
mounted, and it keeps performing add/delete operations on the local
filesystem directory, which can later cause synchronization problems when
NFS comes back online.
Use case: In a k8s environment where OpenStack Glance is installed on
top of OpenShift and the NFS share is mounted via the `Volume/VolumeMount`
interface, the Glance pod won't start if the NFS share isn't ready. If the
NFS share becomes unavailable after the Glance pod is already running, the
upload operation fails with the following error::
sh-5.1$ openstack image create --container-format bare --disk-format raw --file /tmp/cirros-0.5.2-x86_64-disk.img cirros
ConflictException: 409: Client Error for url: https://glance-default-public-openstack.apps-crc.testing/v2/images/0ce1f894-5af7-44fa-987d-f4c47c77d0cf/file, Conflict
Even though the Glance Pod is still up, the `liveness` and `readiness` probes
start failing and as a result the Glance Pods are marked as `Unhealthy`::
Normal Started 12m kubelet Started container glance-api
Warning Unhealthy 5m24s (x2 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s (x3 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s (x2 over 9m24s) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Later, once the failure threshold set for the Pod is reached, the kubelet
marks the Pod as Failed and, since the restart policy is supposed to
recreate it, we can see the failure::
glance-default-single-0 0/3 CreateContainerError 4 (3m39s ago) 28m
$ oc describe pod glance-default-single-0 | tail
Normal Started 29m kubelet Started container glance-api
Warning Unhealthy 10m (x3 over 26m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x4 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x5 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x2 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x3 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Failed 4m47s (x2 over 6m48s) kubelet Error: context deadline exceeded
This is unlike other deployments (deployment != k8s), where even if the NFS
share is not available the glance service keeps running and uploads or
deletes data on the local filesystem. In the k8s case we can definitely say
that when the NFS share is not available, Glance won't be able to upload any
image to the filesystem local to the container, the Pod will be marked as
failed, and it fails to be recreated.
Proposed change
===============
We propose to use the `statvfs(path)` function of the built-in `os` library
to record the `f_fsid` attribute at the start of the glance-api service,
when the filesystem store is initialized. If the local directory is mounted
on an NFS share, `statvfs(path)` returns a zero `f_fsid`; otherwise it
returns a non-zero value.
For example, if the local FS `/opt/stack/data/glance/images` is mounted
on an NFS share::
$ df -h
10.0.108.117:/mnt/nfsshare_glance 117G 44G 73G 38% /opt/stack/data/glance/images
>>> import os
>>> info = os.statvfs('/opt/stack/data/glance/images')
>>> print(info.f_fsid)
0
Whereas if the local FS `/opt/stack/data/glance/images` is not an NFS
mount::
>>> import os
>>> info = os.statvfs('/opt/stack/data/glance/images')
>>> print(info.f_fsid)
3294141091232417704
So we can record this `f_fsid` value at service startup: if it is zero, we
assume that an NFS share is configured for glance. We will retrieve the
value again while adding image data to the filesystem store or deleting data
from it. If the newly retrieved value is non-zero, we will simply abort the
operation and return HTTP 400 to the end user.
If the `f_fsid` is found to be non-zero at service start, we will ignore it,
considering that a local filesystem is used for Glance storage.
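A minimal sketch of how this could look in the filesystem driver, assuming
hypothetical class, method, and exception names chosen purely for
illustration::

import os


class NFSShareUnavailable(Exception):
    """Hypothetical error; the real driver would map this to an HTTP 400."""


class FilesystemStoreSketch(object):
    """Sketch of the proposed f_fsid handling, not the actual driver code."""

    def __init__(self, datadir):
        self.datadir = datadir
        self.is_nfs_configured = False

    def configure(self):
        # Runs once at glance-api startup when the store is initialized.
        # A zero f_fsid is taken to mean the datadir is an NFS mount; a
        # non-zero value means a plain local filesystem and no further
        # checks are performed.
        self.is_nfs_configured = (os.statvfs(self.datadir).f_fsid == 0)

    def _ensure_nfs_mounted(self):
        # Runs before each add/delete. If the share was mounted at startup
        # but f_fsid is now non-zero, the NFS share has been unmounted
        # underneath the service, so abort the operation.
        if self.is_nfs_configured and os.statvfs(self.datadir).f_fsid != 0:
            raise NFSShareUnavailable(
                "NFS share backing %s is not mounted" % self.datadir)

    def add(self, image_id, data):
        # Check the share before writing the image data under self.datadir
        # as the driver does today.
        self._ensure_nfs_mounted()

    def delete(self, image_id):
        # Check the share before removing the image file from self.datadir.
        self._ensure_nfs_mounted()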
Alternatives
------------
Introduce a few configuration options for the filesystem driver which will
help detect whether the NFS share has been unmounted from underneath the
Glance service. We propose to introduce the following new configuration
options:
* 'filesystem_is_nfs_configured' - boolean, verify if NFS is configured or not
* 'filesystem_nfs_host' - IP address of NFS server
* 'filesystem_nfs_share_path' - Mount path of NFS mapped with local filesystem
* 'filesystem_nfs_mount_options' - Mount options to be passed to NFS client
* 'rootwrap_config' - To run commands as root user
If 'filesystem_is_nfs_configured' is set, i.e. if NFS is configured, then the
deployer must specify the 'filesystem_nfs_host' and
'filesystem_nfs_share_path' config options in glance-api.conf; otherwise the
respective glance store will be disabled and will not be used for any
operation.
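As an illustration only, these options could be declared with oslo.config
roughly as follows; the option names come from the list above, while the
types, defaults, and help texts are assumptions::

from oslo_config import cfg

# Hypothetical option definitions for the alternative approach; only the
# option names are taken from the spec, everything else is assumed.
nfs_opts = [
    cfg.BoolOpt('filesystem_is_nfs_configured',
                default=False,
                help='Whether the filesystem store directory is backed '
                     'by an NFS share.'),
    cfg.HostAddressOpt('filesystem_nfs_host',
                       help='IP address or hostname of the NFS server.'),
    cfg.StrOpt('filesystem_nfs_share_path',
               help='Exported NFS path mapped to the local '
                    'filesystem store directory.'),
    cfg.StrOpt('filesystem_nfs_mount_options',
               help='Mount options to be passed to the NFS client.'),
    cfg.StrOpt('rootwrap_config',
               default='/etc/glance/rootwrap.conf',
               help='Path to the rootwrap configuration used to run '
                    'mount commands as root.'),
]

CONF = cfg.CONF
CONF.register_opts(nfs_opts, group='glance_store')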
We plan to use the existing os-brick library (already used by the cinder
driver of glance_store) to create an NFS client with the help of the above
configuration options and check whether the NFS share is available during
service initialization as well as before each image upload/import/delete
operation. If the NFS share is not available during service initialization,
the add and delete operations will be disabled; if NFS goes down afterwards,
we will return an HTTP 410 (HTTP GONE) response to the user.
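A rough sketch of such an availability check, assuming the `RemoteFsClient`
interface from os-brick that the cinder path already relies on; the host,
share, root helper, and mount options below are placeholder values::

import os

from os_brick.remotefs import remotefs
from oslo_concurrency import processutils

# Placeholder values; the real driver would build these from the new
# config options ('filesystem_nfs_host', 'filesystem_nfs_share_path',
# 'filesystem_nfs_mount_options', 'rootwrap_config') described above.
NFS_HOST = '10.0.108.117'
NFS_SHARE_PATH = '/mnt/nfsshare_glance'
MOUNT_POINT_BASE = '/opt/stack/data/glance'
ROOT_HELPER = 'sudo'

share = '%s:%s' % (NFS_HOST, NFS_SHARE_PATH)
client = remotefs.RemoteFsClient(
    'nfs', ROOT_HELPER,
    nfs_mount_point_base=MOUNT_POINT_BASE,
    nfs_mount_options='vers=4')


def nfs_share_available():
    # Mount the share (a no-op if it is already mounted) and verify the
    # mount point is really a mount; any failure means the share is
    # unreachable and add/delete operations should be refused.
    try:
        client.mount(share)
        return os.path.ismount(client.get_mount_point(share))
    except processutils.ProcessExecutionError:
        return False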
Glance still doesn't have the capability to check beforehand whether a
particular NFS store has enough capacity to hold a given image. It also
cannot detect a network failure that occurs during an upload/import
operation.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
The performance impact will be minimal: on each add/delete operation we will
call `os.statvfs()` and compare the `f_fsid` to validate whether the NFS
share is available.
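For a rough sense of that per-operation cost on a given deployment, a quick
measurement like the following could be used (the path is just an example)::

import timeit

# Quick, illustrative way to measure the per-operation overhead of the
# proposed check; results will vary per deployment and filesystem.
total = timeit.timeit(
    "os.statvfs('/opt/stack/data/glance/images').f_fsid == 0",
    setup="import os",
    number=1000)
print("average seconds per check: %.6f" % (total / 1000))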
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
abhishekk
Other contributors:
None
Work Items
----------
* Modify Filesystem Store init process to record `f_fsid`
* Modify add and delete operation to compare `f_fsid`
* Unit/Functional tests for coverage
Dependencies
============
None
Testing
=======
* Unit Tests
* Functional Tests
* Tempest Tests
Documentation Impact
====================
Need to document the new behavior of the filesystem driver when NFS is
configured.
References
==========
* os.statvfs - https://docs.python.org/3/library/os.html#os.statvfs


@@ -6,7 +6,13 @@
   :glob:
   :maxdepth: 1

TODO: fill this in once a new approved spec is added.

2024.2 approved specs for glance:

.. toctree::
   :glob:
   :maxdepth: 1

   glance_store/*