[Spec] NFS related improvement for filesystem driver

Change-Id: Ic9b316a284641f4c5f33fe4238e08cf1d0faf2a1
Abhishek Kekane 2024-04-29 06:39:47 +00:00
parent 3bdda0e98f
commit 5940c59d44
2 changed files with 237 additions and 1 deletion


@@ -0,0 +1,230 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode
===========================================================
Improve filesystem store driver to utilize NFS capabilities
===========================================================
https://blueprints.launchpad.net/glance/+spec/improve-filesystem-driver
Problem description
===================
The filesystem backend of Glance can be used to mount an NFS share as a local
filesystem, so no NFS-specific configuration needs to be stored on the Glance
side. Glance is not aware of the NFS server address or the NFS share path at
all; it simply assumes that each image is stored on the local filesystem. The
downside of this assumption is that Glance does not know whether the NFS
server is connected/available or whether the NFS share is actually mounted,
and it keeps performing add/delete operations on the local filesystem
directory, which can later cause synchronization problems when NFS comes back
online.
Use case: In a Kubernetes environment where OpenStack Glance is installed on
top of OpenShift and the NFS share is mounted via the `Volume/VolumeMount`
interface, the Glance pod won't start if the NFS share isn't ready. If the
NFS share becomes unavailable after the Glance pod is running, the upload
operation fails with the following error::
sh-5.1$ openstack image create --container-format bare --disk-format raw --file /tmp/cirros-0.5.2-x86_64-disk.img cirros
ConflictException: 409: Client Error for url: https://glance-default-public-openstack.apps-crc.testing/v2/images/0ce1f894-5af7-44fa-987d-f4c47c77d0cf/file, Conflict
Even though the Glance Pod is still up, the `liveness` and `readiness` probes
start failing and as a result the Glance Pods are marked as `Unhealthy`::
Normal Started 12m kubelet Started container glance-api
Warning Unhealthy 5m24s (x2 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s (x3 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s (x2 over 9m24s) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Later, once the failure threshold set for the Pod is reached, the kubelet
marks the Pod as failed and, because the restart policy requires it, attempts
to recreate it::
glance-default-single-0 0/3 CreateContainerError 4 (3m39s ago) 28m
$ oc describe pod glance-default-single-0 | tail
Normal Started 29m kubelet Started container glance-api
Warning Unhealthy 10m (x3 over 26m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x4 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x5 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x2 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x3 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Failed 4m47s (x2 over 6m48s) kubelet Error: context deadline exceeded
This differs from non-Kubernetes deployments, where the Glance service keeps
running even if the NFS share is not available and simply uploads to or
deletes data from the local filesystem. In the Kubernetes case we can
definitely say that when the NFS share is not available, Glance won't be able
to upload any image to the filesystem local to the container, the Pod will be
marked as failed, and recreation of the Pod will also fail.
Proposed change
===============
We plan to add a new plugin, `enable_by_files`, to the `healthcheck` WSGI
middleware in `oslo.middleware`. It can be used by all OpenStack components
to check whether the configured paths are present, reporting a `503 <REASON>`
error if any of them is missing and `200 OK` if everything is fine.
In Glance, this healthcheck middleware can be configured as an application
in glance-api-paste.ini:
.. code-block:: ini

   [app:healthcheck]
   paste.app_factory = oslo_middleware:Healthcheck.app_factory
   backends = enable_by_files (optional, default: empty)
   # used by the 'enable_by_files' backend
   enable_by_file_paths = /var/lib/glance/images/filename,/var/lib/glance/cache/filename (optional, default: empty)

   # Use this composite for keystone auth with caching and cache management
   [composite:glance-api-keystone+cachemanagement]
   paste.composite_factory = glance.api:root_app_factory
   /: api-keystone+cachemanagement
   /healthcheck: healthcheck
The middleware will return "200 OK" if everything is fine, or "503 <REASON>"
if not, with the reason why this API should not be used. "backends" is the
name of a stevedore extension in the "oslo.middleware.healthcheck" namespace.
In Glance, if the local filesystem path is mounted on an NFS share, we
propose adding a marker file named `.glance` to the NFS share and then using
that file path to configure the `enable_by_files` healthcheck middleware
plugin, as shown below:
.. code-block:: ini

   [app:healthcheck]
   paste.app_factory = oslo_middleware:Healthcheck.app_factory
   backends = enable_by_files
   enable_by_file_paths = /var/lib/glance/images/.glance
If NFS goes down, or the `/healthcheck` endpoint otherwise starts reporting
`503 <REASON>`, the administrator can take appropriate action to make the NFS
share available again.
Alternatives
------------
Introduce a few configuration options for the filesystem driver to help
detect whether the NFS share has been unmounted from underneath the Glance
service. We propose the following new configuration options:
* `filesystem_is_nfs_configured` - boolean, verify if NFS is configured or not
* `filesystem_nfs_host` - IP address of NFS server
* `filesystem_nfs_share_path` - Mount path of NFS mapped with local filesystem
* `filesystem_nfs_mount_options` - Mount options to be passed to NFS client
* `rootwrap_config` - To run commands as root user
If `filesystem_is_nfs_configured` is set, i.e. NFS is configured, then the
deployer must specify the `filesystem_nfs_host` and
`filesystem_nfs_share_path` options in glance-api.conf; otherwise the
respective glance store will be disabled and will not be used for any
operation.
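For illustration, a minimal sketch of how these proposed options might be
defined in the filesystem driver with oslo.config; the option types, defaults,
and help strings below are assumptions, since the exact definitions would be
settled during implementation.

.. code-block:: python

   # Illustrative sketch only: option names come from this spec, while the
   # types, defaults and help strings are assumptions.
   from oslo_config import cfg

   _FILESYSTEM_NFS_OPTS = [
       cfg.BoolOpt('filesystem_is_nfs_configured',
                   default=False,
                   help='True if the filesystem store data directory is '
                        'backed by an NFS share.'),
       cfg.StrOpt('filesystem_nfs_host',
                  help='IP address of the NFS server.'),
       cfg.StrOpt('filesystem_nfs_share_path',
                  help='Export path of the NFS share mapped to the local '
                       'filesystem store directory.'),
       cfg.StrOpt('filesystem_nfs_mount_options',
                  help='Mount options to be passed to the NFS client.'),
       cfg.StrOpt('rootwrap_config',
                  default='/etc/glance/rootwrap.conf',
                  help='Path to the rootwrap configuration file used to '
                       'run commands as root.'),
   ]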
We plan to use the existing os-brick library (already used by the cinder
driver of glance_store) to create the NFS client from the above configuration
options and to check whether the NFS share is available during service
initialization as well as before each image upload/import/delete operation.
If the NFS share is not available during service initialization, add and
delete operations will be disabled; if NFS goes down afterwards, an HTTP 410
(Gone) response will be returned to the user.
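For illustration, a rough sketch of how the driver could use os-brick to mount
and verify the share, modeled on how cinder-style drivers consume os-brick's
`RemoteFsClient`; the helper name and the exact parameter wiring are
assumptions.

.. code-block:: python

   # Illustrative sketch only: modeled on cinder-style usage of os-brick's
   # RemoteFsClient; helper name and parameter wiring are assumptions.
   from os_brick.remotefs import remotefs
   from oslo_concurrency import processutils


   def nfs_share_available(conf):
       """Try to mount the configured NFS share; return False if unreachable."""
       root_helper = 'sudo glance-rootwrap %s' % conf.rootwrap_config
       client = remotefs.RemoteFsClient(
           'nfs', root_helper,
           nfs_mount_point_base=conf.filesystem_store_datadir)
       share = '%s:%s' % (conf.filesystem_nfs_host,
                          conf.filesystem_nfs_share_path)
       flags = (conf.filesystem_nfs_mount_options or '').split()
       try:
           client.mount(share, flags)
       except processutils.ProcessExecutionError:
           return False
       return True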
Glance still has no capability to check beforehand whether a particular NFS
store has enough capacity to store a given image, nor can it detect a network
failure that occurs during an upload/import operation.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
Need to configure healthcheck middleware for glance.
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
abhishekk
Other contributors:
None
Work Items
----------
* Add `enable_by_files` healthcheck backend in oslo.middleware
* Document how to configure `enable_by_files` healthcheck middleware
* Unit/Functional tests for coverage
Dependencies
============
None
Testing
=======
* Unit Tests
* Functional Tests
* Tempest Tests
Documentation Impact
====================
Need to document the new behavior of the filesystem driver when NFS and the
healthcheck middleware are configured.
References
==========
* Oslo.Middleware Implementation - https://review.opendev.org/920055


@@ -6,7 +6,13 @@
   :glob:
   :maxdepth: 1

TODO: fill this in once a new approved spec is added.

2024.2 approved specs for glance:

.. toctree::
   :glob:
   :maxdepth: 1

   glance/*