[Spec] NFS related improvement for filesystem driver

Change-Id: Ic9b316a284641f4c5f33fe4238e08cf1d0faf2a1
Abhishek Kekane 2024-04-29 06:39:47 +00:00
parent 3bdda0e98f
commit 52923f9fdd
2 changed files with 228 additions and 1 deletions


@@ -0,0 +1,221 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===========================================================
Improve filesystem store driver to utilize NFS capabilities
===========================================================
https://blueprints.launchpad.net/glance/+spec/improve-filesystem-driver
Problem description
===================
The filesystem backend of glance can be used with an NFS share mounted as a
local filesystem, so no NFS-specific configuration needs to be stored on the
glance side. Glance does not know the NFS server address or the NFS share
path at all; it simply assumes that each image is stored on the local
filesystem. The downside of this assumption is that glance is not aware
whether the NFS server is connected/available or whether the NFS share is
mounted, and it keeps performing add/delete operations on the local
filesystem directory, which can later cause synchronization problems when
NFS comes back online.
Use case: In a k8s environment where OpenStack Glance is installed on
top of OpenShift and the NFS share is mounted via the `Volume/VolumeMount`
interface, the Glance pod won't start if the NFS share isn't ready. If the
NFS share becomes unavailable after the Glance pod is already running, the
upload operation fails with the following error::
sh-5.1$ openstack image create --container-format bare --disk-format raw --file /tmp/cirros-0.5.2-x86_64-disk.img cirros
ConflictException: 409: Client Error for url: https://glance-default-public-openstack.apps-crc.testing/v2/images/0ce1f894-5af7-44fa-987d-f4c47c77d0cf/file, Conflict
Even though the Glance Pod is still up, the `liveness` and `readiness` probes
start failing and as a result the Glance Pods are marked as `Unhealthy`::
Normal Started 12m kubelet Started container glance-api
Warning Unhealthy 5m24s (x2 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s (x3 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s (x2 over 9m24s) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Later, once the failure threshold set for the Pod is reached, the kubelet
marks the Pod as Failed and, since the restart policy is supposed to
recreate it, we can see the failure::
glance-default-single-0 0/3 CreateContainerError 4 (3m39s ago) 28m
$ oc describe pod glance-default-single-0 | tail
Normal Started 29m kubelet Started container glance-api
Warning Unhealthy 10m (x3 over 26m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x4 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x5 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x2 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x3 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Failed 4m47s (x2 over 6m48s) kubelet Error: context deadline exceeded
This is unlike other deployments (deployment != k8s), where even if the NFS
share is not available the glance service keeps running and uploads or
deletes data on the local filesystem. In the k8s case we can definitely say
that when the NFS share is not available, Glance won't be able to upload any
image to the filesystem local to the container, the Pod will be marked as
failed, and it fails to be recreated.
Proposed change
===============
We propose to use the `statvfs(path)` function of the built-in `os` library
to record the `f_fsid` attribute at the start of the glance-api service,
when the filesystem store is initialized. If the local directory is mounted
on an NFS share, `statvfs(path)` returns a zero `f_fsid`; otherwise it
returns a non-zero value.
For example, if the local FS `/opt/stack/data/glance/images` is mounted
on an NFS share::
$ df -h
10.0.108.117:/mnt/nfsshare_glance 117G 44G 73G 38% /opt/stack/data/glance/images
>>> import os
>>> info = os.statvfs('/opt/stack/data/glance/images')
>>> print(info.f_fsid)
0
Whereas if the local FS `/opt/stack/data/glance/images` is not an NFS
mount::
>>> import os
>>> info = os.statvfs('/opt/stack/data/glance/images')
>>> print(info.f_fsid)
3294141091232417704
So we can record this `f_fsid` value at service startup: if it is zero, we
assume that an NFS share is configured for glance. We will retrieve the
value again while adding image data to the filesystem store or deleting data
from it. If the newly retrieved value is non-zero, we will simply abort the
operation and return HTTP 400 to the end user.
If the `f_fsid` is found to be non-zero at service start, we will ignore it,
considering that a local filesystem is used for Glance storage.
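A minimal sketch of how this could look in the filesystem driver, assuming
hypothetical class, method, and exception names chosen purely for
illustration::

import os


class NFSShareUnavailable(Exception):
    """Hypothetical error; the real driver would map this to an HTTP 400."""


class FilesystemStoreSketch(object):
    """Sketch of the proposed f_fsid handling, not the actual driver code."""

    def __init__(self, datadir):
        self.datadir = datadir
        self.is_nfs_configured = False

    def configure(self):
        # Runs once at glance-api startup when the store is initialized.
        # A zero f_fsid is taken to mean the datadir is an NFS mount; a
        # non-zero value means a plain local filesystem and no further
        # checks are performed.
        self.is_nfs_configured = (os.statvfs(self.datadir).f_fsid == 0)

    def _ensure_nfs_mounted(self):
        # Runs before each add/delete. If the share was mounted at startup
        # but f_fsid is now non-zero, the NFS share has been unmounted
        # underneath the service, so abort the operation.
        if self.is_nfs_configured and os.statvfs(self.datadir).f_fsid != 0:
            raise NFSShareUnavailable(
                "NFS share backing %s is not mounted" % self.datadir)

    def add(self, image_id, data):
        # Check the share before writing the image data under self.datadir
        # as the driver does today.
        self._ensure_nfs_mounted()

    def delete(self, image_id):
        # Check the share before removing the image file from self.datadir.
        self._ensure_nfs_mounted()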
Alternatives
------------
Introduce a few configuration options for the filesystem driver which will
help detect whether the NFS share has been unmounted from underneath the
Glance service. We propose to introduce the following new configuration
options:
* 'filesystem_is_nfs_configured' - boolean, verify if NFS is configured or not
* 'filesystem_nfs_host' - IP address of NFS server
* 'filesystem_nfs_share_path' - Mount path of NFS mapped with local filesystem
* 'filesystem_nfs_mount_options' - Mount options to be passed to NFS client
* 'rootwrap_config' - To run commands as root user
If 'filesystem_is_nfs_configured' is set, i.e. if NFS is configured, then the
deployer must specify the 'filesystem_nfs_host' and
'filesystem_nfs_share_path' config options in glance-api.conf; otherwise the
respective glance store will be disabled and will not be used for any
operation.
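As an illustration only, these options could be declared with oslo.config
roughly as follows; the option names come from the list above, while the
types, defaults, and help texts are assumptions::

from oslo_config import cfg

# Hypothetical option definitions for the alternative approach; only the
# option names are taken from the spec, everything else is assumed.
nfs_opts = [
    cfg.BoolOpt('filesystem_is_nfs_configured',
                default=False,
                help='Whether the filesystem store directory is backed '
                     'by an NFS share.'),
    cfg.HostAddressOpt('filesystem_nfs_host',
                       help='IP address or hostname of the NFS server.'),
    cfg.StrOpt('filesystem_nfs_share_path',
               help='Exported NFS path mapped to the local '
                    'filesystem store directory.'),
    cfg.StrOpt('filesystem_nfs_mount_options',
               help='Mount options to be passed to the NFS client.'),
    cfg.StrOpt('rootwrap_config',
               default='/etc/glance/rootwrap.conf',
               help='Path to the rootwrap configuration used to run '
                    'mount commands as root.'),
]

CONF = cfg.CONF
CONF.register_opts(nfs_opts, group='glance_store')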
We plan to use the existing os-brick library (already used by the cinder
driver of glance_store) to create an NFS client with the help of the above
configuration options and check whether the NFS share is available during
service initialization as well as before each image upload/import/delete
operation. If the NFS share is not available during service initialization,
the add and delete operations will be disabled; if NFS goes down afterwards,
we will return an HTTP 410 (HTTP GONE) response to the user.
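A rough sketch of such an availability check, assuming the `RemoteFsClient`
interface from os-brick that the cinder path already relies on; the host,
share, root helper, and mount options below are placeholder values::

import os

from os_brick.remotefs import remotefs
from oslo_concurrency import processutils

# Placeholder values; the real driver would build these from the new
# config options ('filesystem_nfs_host', 'filesystem_nfs_share_path',
# 'filesystem_nfs_mount_options', 'rootwrap_config') described above.
NFS_HOST = '10.0.108.117'
NFS_SHARE_PATH = '/mnt/nfsshare_glance'
MOUNT_POINT_BASE = '/opt/stack/data/glance'
ROOT_HELPER = 'sudo'

share = '%s:%s' % (NFS_HOST, NFS_SHARE_PATH)
client = remotefs.RemoteFsClient(
    'nfs', ROOT_HELPER,
    nfs_mount_point_base=MOUNT_POINT_BASE,
    nfs_mount_options='vers=4')


def nfs_share_available():
    # Mount the share (a no-op if it is already mounted) and verify the
    # mount point is really a mount; any failure means the share is
    # unreachable and add/delete operations should be refused.
    try:
        client.mount(share)
        return os.path.ismount(client.get_mount_point(share))
    except processutils.ProcessExecutionError:
        return False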
Glance still doesn't have the capability to check beforehand whether a
particular NFS store has enough capacity to hold a given image. It also
cannot detect a network failure that occurs during an upload/import
operation.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
The performance impact will be minimal: on each add/delete operation we will
call `os.statvfs()` and compare the `f_fsid` to validate whether the NFS
share is available.
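For a rough sense of that per-operation cost on a given deployment, a quick
measurement like the following could be used (the path is just an example)::

import timeit

# Quick, illustrative way to measure the per-operation overhead of the
# proposed check; results will vary per deployment and filesystem.
total = timeit.timeit(
    "os.statvfs('/opt/stack/data/glance/images').f_fsid == 0",
    setup="import os",
    number=1000)
print("average seconds per check: %.6f" % (total / 1000))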
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
abhishekk
Other contributors:
None
Work Items
----------
* Modify Filesystem Store init process to record `f_fsid`
* Modify add and delete operation to compare `f_fsid`
* Unit/Functional tests for coverage
Dependencies
============
None
Testing
=======
* Unit Tests
* Functional Tests
* Tempest Tests
Documentation Impact
====================
Need to document the new behavior of the filesystem driver when NFS is
configured.
References
==========
* os.statvfs - https://docs.python.org/3/library/os.html#os.statvfs


@@ -6,7 +6,13 @@
   :glob:
   :maxdepth: 1

TODO: fill this in once a new approved spec is added.

2024.2 approved specs for glance:

.. toctree::
   :glob:
   :maxdepth: 1

   glance_store/*