Merge "Re-propose Unified Limits in Nova"

2020-06-04 13:12:03 +00:00 · 2020-06-04 13:12:03 +00:00 · d2fa897705
parent c542073b65 399eec7034
commit d2fa897705
1 changed files with 568 additions and 0 deletions
--- a/specs/victoria/approved/unified-limits-nova.rst
+++ b/specs/victoria/approved/unified-limits-nova.rst
@@ -0,0 +1,568 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==================================
+Unified Limits Integration in Nova
+==================================
+
+https://blueprints.launchpad.net/nova/+spec/unified-limits-nova
+
+This spec is about adopting Keystone's unified limits, including using
+oslo.limit to enforce the Nova-related limits set in Keystone.
+
+This spec proposes having unified limits in parallel with the existing
+quota system for at least one cycle, to allow for operators to transition
+from setting quotas via Nova to setting limits relating to the Nova API
+endpoint via Keystone.
+
+All per user quota support is dropped, in favor of hierarchical
+enforcement that will be supported by unified limits.
+
+Only server count limits and limits on Resource Class resources requested in
+the flavor will be supported with unified limits. All other existing quotas
+will no longer support per project or per user limits.
+
+Given this placement focused approach, we will depend on the work done here:
+http://specs.openstack.org/openstack/nova-specs/specs/train/approved/count-quota-usage-from-placement.html
+
+Problem description
+===================
+
+While much work has been done to simplify how quotas are implemented in
+Nova, there are still some major usability issues for operators with
+the current system:
+
+* We don't have consistent support for limit/quota hierarchy across OpenStack
+* Operators are required to set limits individually in each service
+  (e.g. Cinder, Nova, Neutron, etc.)
+* Nova's existing quotas don't work well with Ironic
+* No support for custom Resource Class quotas (includes "per flavor" quotas)
+* Confusion when API defined quota limits override any changes made to the
+  configuration
+* Some Nova quotas are unrelated to resource consumption, causing confusion
+
+Transitioning to use Keystone's unified limits, via oslo.limit, will help fix
+these issues.
+
+For more details on unified limits in keystone see:
+https://docs.openstack.org/keystone/stein/admin/identity-unified-limits.html
+
+Use Cases
+---------
+
+The key use cases driving this work are:
+
+* API User tries to understand why they got an Over Quota error
+* Operator migrates to unified limits from existing limits
+* Operator sets a default limit for a given endpoint via Keystone. Note there
+  can be different limits for each Region, even with a shared Keystone.
+* Operator sets specific limits for a given project via Keystone
+* Operator defines limits of a set of projects via non-flat enforcement
+  i.e. the feature formerly known as hierarchical quotas
+
+We will focus on adding unified limits relating to:
+
+* total number of servers
+* amounts of each Resource Class requested in the flavor
+
+Note, this includes things like DISK_GB which is not supported today,
+along with things like custom resource class resources that are requested
+in extra specs (e.g. Ironic flavors).
+
+We will now look at all quotas exposed via the API and what they map to
+when using unified limits:
+https://docs.openstack.org/api-ref/compute/?expanded=show-a-quota-detail#show-a-quota
+
+The following existing quota types move to unified limits, allowing for
+per endpoint defaults and per project overrides (and hierarchical limits)
+via the unified limits system:
+
+* ``cores`` -> ``resource:VCPU``
+* ``instances`` -> ``server_count``
+* ``ram`` -> ``resource:MEMORY_MB``
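+
+As an illustration only, the mapping above can be written as a small lookup
+table. The names ``LEGACY_TO_UNIFIED`` and ``to_unified_limit_name`` are
+hypothetical and are not part of Nova:
+
```python
# Hypothetical lookup table for the mapping listed above; not Nova code.
LEGACY_TO_UNIFIED = {
    "cores": "resource:VCPU",
    "instances": "server_count",
    "ram": "resource:MEMORY_MB",
}


def to_unified_limit_name(legacy_name):
    """Return the unified limit name, or None if the quota is not migrated."""
    return LEGACY_TO_UNIFIED.get(legacy_name)
```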
+
+The following existing quotas become defined only by a configuration
+option, and we no longer support any per project or per user overrides
+via the API. We just report the existing limit as defined in the
+configuration:
+
+* ``key_pairs`` (counted per user)
+* ``server_groups`` (counted per project)
+* ``server_group_members`` (counted per server group)
+* ``metadata_items`` (counted per server)
+
+The above are purely protecting against database bloat (i.e. to stop a denial
+of service attack that fills up the database). They are similar to the
+hardcoded limit of the number of tags you can attach to a server.
+
+While deprecated in the API, we will also treat these quotas in the
+same way as the quotas above, i.e. they will now be set via
+configuration with no per project overrides possible:
+
+* ``injected_file_content_bytes``
+* ``injected_file_path_bytes``
+* ``injected_files``
+
+Proposed change
+===============
+
+There are several parts to this spec:
+
+* Enforce Unified Limits
+* No per-user limits
+* No uncountable limits
+* Deprecate Nova's Quota APIs
+* Operator tooling to assist with the migration
+
+Enforcing Unified Limits
+------------------------
+
+We will support the following limits:
+
+* ``server_count``
+* ``resource:<RESOURCE_CLASS>``
+
+All the resource class usage will be counted using placement, but
+server count will make use of instance mappings. This only works if the
+queued for delete data migration has been completed. Due to no user
+based quotas, we don't need the ``user_id`` migration. If an operator
+tries to use unified limits before completing the migration, the code
+will block all new usage until the migration is completed. It is
+expected a blocking migration will be added before we turn on unified
+limits by default. For more details on this data migration see
+this point in the existing quota code:
+https://github.com/openstack/nova/blob/0d3aeb0287a0619695c9b9e17c2dec49099876a5/nova/quota.py#L1053
+
+To allow the new system to co-exist with the older quota system, we add
+the following configuration to allow operators to opt-into the new system
+when the operator has migrated to unified limits, and the default will be
+to use the older quota system:
+
+* ``quota.enforce_unified_limits = False``
+
+For further details on the transition, please see the migration section of
+this specification. Note the new unified limits code will have a hard
+dependency on counting usage via placement; as such it will ignore the value of
+``CONF.quota.count_usage_from_placement``.
+
+Looking at the existing quotas, ``instances`` becomes ``server_count``,
+``cores`` becomes ``resource:VCPU`` and ``ram`` becomes ``resource:MEMORY_MB``.
+
+This work will re-use a lot of the new logic to query placement for resource
+usage, and use the instance mapping table to count servers added in this spec:
+http://specs.openstack.org/openstack/nova-specs/specs/train/approved/count-quota-usage-from-placement.html
+
+To find out what resources a server will claim, we reuse this
+code to extract the resources from any given flavor:
+https://github.com/openstack/nova/blob/2e85453879533af0b4d0e1178797d26f026a9423/nova/scheduler/utils.py#L387
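+
+A simplified sketch of that extraction is shown here. It is not the Nova
+helper linked above: the function name ``resources_from_flavor_sketch`` and
+the plain dict layout of the flavor are assumptions made for illustration,
+and boot-from-volume handling is ignored:
+
```python
import math


def resources_from_flavor_sketch(flavor):
    """Return {resource_class: amount} for a flavor given as a plain dict.

    Simplified stand-in for the Nova helper; the dict layout and this
    function name are assumptions made for illustration.
    """
    resources = {
        "VCPU": flavor.get("vcpus", 0),
        "MEMORY_MB": flavor.get("memory_mb", 0),
        # Swap is expressed in MB on the flavor, so round it up to whole GB.
        "DISK_GB": (flavor.get("root_gb", 0)
                    + flavor.get("ephemeral_gb", 0)
                    + math.ceil(flavor.get("swap", 0) / 1024)),
    }
    # Extra specs of the form "resources:<CLASS>" override or add amounts.
    for key, value in flavor.get("extra_specs", {}).items():
        if key.startswith("resources:"):
            resources[key[len("resources:"):]] = int(value)
    # A zero amount means the class is not requested at all.
    return {rc: amount for rc, amount in resources.items() if amount > 0}
```
+
+With a flavor whose extra specs zero out the standard classes and request a
+custom resource class (as Ironic flavors do), only the custom class remains.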
+
+For server build, we use the above function to get the Resource Class
+resource amounts for the requested flavor. This will then be checked using
+oslo.limit, which ensures the additional usage will not push the associated
+project over any of its limits. The oslo.limit library is responsible for
+counting all the current resource usage using a callback we provide that makes
+use of placement to count the current resource usage.
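+
+The shape of that check can be sketched as follows. This is a self-contained
+stand-in rather than oslo.limit itself: ``enforce``, ``OverLimit`` and the
+``count_usage`` callback signature are hypothetical names, and the limits are
+passed in as a plain dict instead of being fetched from Keystone:
+
```python
class OverLimit(Exception):
    """Raised when a request would push a project over one of its limits."""


def enforce(limits, count_usage, project_id, deltas):
    """Check that current usage plus ``deltas`` stays within ``limits``.

    limits: {limit_name: max_allowed}; every name in ``deltas`` is assumed
        to have an entry here.
    count_usage: callback(project_id, names) -> {limit_name: used}; in Nova
        this would query placement and the instance mappings.
    deltas: {limit_name: extra amount requested}.
    """
    usage = count_usage(project_id, list(deltas))
    for name, delta in deltas.items():
        if usage.get(name, 0) + delta > limits[name]:
            raise OverLimit(name)
```
+
+For a server build, ``deltas`` would hold ``server_count: 1`` plus one
+``resource:<RESOURCE_CLASS>`` entry per resource class in the flavor.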
+
+Once resources are claimed in placement, we optionally recheck the limits
+to see if we were racing with other server builds to consume the last bits
+of available quota. The only change is using oslo.limit to do the recheck.
+That is, we will still respect the config: ``quota.recheck_quota``.
+Note: we do the first check of limits in nova-api, and the recheck in
+nova-conductor after resource allocation in placement succeeds.
+
+It is a similar story with resize. Except in this case, we check that we can
+claim resources for both the new flavor and old flavor at the same time.
+Note that this is quite different to the current quota system, even when
+counting usage via placement.
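+
+One reading of the resize check, sketched with a hypothetical helper
+``resize_fits``: the old flavor's resources stay in the counted usage and the
+full amount of the new flavor is requested on top, rather than only the
+difference between the two flavors:
+
```python
def resize_fits(limit, usage_including_old_flavor, new_flavor_amount):
    """Return True if the new flavor fits on top of all current usage.

    The old flavor's amount is deliberately not subtracted: during a resize
    both allocations are assumed to be held at the same time.
    """
    return usage_including_old_flavor + new_flavor_amount <= limit
```
+
+Under this reading, a project with a limit of 10 VCPU and 6 VCPU in use
+cannot resize a 4 VCPU server to 6 VCPU, because 6 + 6 exceeds 10, even
+though usage would drop to 8 VCPU once the resize is confirmed.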
+
+For further details on the semantic changes relating to counting with
+placement see:
+http://specs.openstack.org/openstack/nova-specs/specs/train/approved/count-quota-usage-from-placement.html
+
+Note baremetal instances no longer claim any VCPU or MEMORY_MB resources.
+With this method, baremetal instances can be limited using custom
+resource class resources they request in the flavor.
+
+Should we choose to allow additional custom inventory entries
+from hypervisor based compute nodes, such as ``{'CUSTOM_GPU_V100': 1}``,
+we will also be able to apply quotas on these resources.
+
+The oslo.limit library will likely add additional configuration options.
+In particular, operators will need to specify the Nova API's endpoint uuid
+to oslo.limit, so it knows what unified limits apply to each particular
+Nova API service.
+
+No per user limits
+------------------
+
+Nova currently supports "per user" limits. They will no longer be supported
+when ``quota.enforce_unified_limits = True``.
+
+There are no plans for migration tools, however it is expected that users
+that need a similar model can test out using the unified limits support for
+hierarchical limits, and report back on what could help others migrate.
+
+No uncountable limits
+---------------------
+
+As stated above, the focus for unified limits is the instance count and
+resource class allocations in placement. No other limits will be moved to
+unified limits, as agreed with operators in the Train Forum session.
+
+There are limits that are specific to nova-network. These are already
+deprecated. There are no plans to support these with unified limits turned on:
+
+* ``fixed_ips``
+* ``floating_ips``
+* ``security_group_rules``
+* ``networks``
+
+The remaining limits are all mainly used to protect the database from rogue
+users using up all available space in the database and/or misusing the API as
+some sort of storage system. As such, it is not expected that operators need
+per project overrides for any of these limits. As such, we propose to drop
+support for changing the limits via the API, and instead only allow changing
+of the limits via a single configuration option that applies to all
+projects in the system.
+
+The following limits will be changed to only be set via a single configuration
+option that applies equally to all projects:
+
+* ``metadata_items`` (counted per server)
+* ``injected_files``
+* ``injected_file_content_bytes``
+* ``injected_file_path_bytes``
+* ``key_pairs`` (counted per user)
+* ``server_groups`` (counted per project)
+* ``server_group_members`` (counted per server group)
+
+Note that ``server_group_members`` is currently counted per user, but this
+is frankly very confusing, so above we propose the simpler limit on the number
+of servers in the server group. This seems consistent with removing per user
+limits for all other project owned resources.
+
+Using a global configuration option only means:
+
+* no per project overrides
+* no per user overrides
+* no changing of limits via the API
+
+These are limits on the amount of data that can be stored in various
+Nova databases. There is no way to display a project's usage of these limits,
+which further demonstrates how these are different to the resource limits
+unified limits has been designed for.
+
+Currently we honor ``quota.recheck_quota`` for all of these quotas. This adds
+significant code complexity, however most users never hit these limits and
+they are all very soft limits. As such, when we transition to a single global
+configuration value for all of these, we also will stop doing any rechecks.
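+
+A minimal sketch of such a check, assuming a hypothetical helper
+``check_global_limit`` and a plain dict standing in for the ``[quota]``
+configuration values:
+
```python
def check_global_limit(name, current_count, requested, config_limits):
    """Return True if the request fits under the deployment-wide limit.

    config_limits stands in for values read from the [quota] config group;
    there is no per project or per user lookup and no recheck afterwards.
    """
    return current_count + requested <= config_limits[name]
```
+
+The check runs once, at request time. Two concurrent requests could both
+pass and slightly exceed the limit, which is the accepted trade-off for
+these soft limits.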
+
+In summary the impact on the configuration options is:
+
+* ``quota.recheck_quota`` will have an updated description, noting what
+  functionality is lost when ``quota.enforce_unified_limits = True``.
+* ``quota.floating_ips``, ``quota.fixed_ips``, ``quota.security_groups``,
+  ``quota.security_group_rules``: remain deprecated, and will be ignored when
+  ``quota.enforce_unified_limits = True``.
+* ``quota.metadata_items``, ``quota.injected_files``,
+  ``quota.injected_file_content_bytes``, ``quota.injected_file_path_length``,
+  ``quota.server_groups``, ``quota.server_group_members``,
+  ``quota.key_pairs``: these will all be kept, but the description will be
+  updated to note that when ``quota.enforce_unified_limits = True`` all
+  updates via the API are ignored.
+
+Deprecate Nova's Quota APIs
+---------------------------
+
+To query and set limits, Keystone's APIs should be used. To query a user's
+usage, the Placement API should be used, assuming placement is happy
+changing the default policy to allow users to query their usage.
+
+The one exception is server count can't currently be checked via
+Placement. When placement implements consumer records,
+or similar, then all usage could be queried via Placement. To avoid
+using a proxy API, users can do a server list API and count the number
+of servers returned.
+
+When ``quota.enforce_unified_limits = True`` a best effort will be made to
+keep the older micro-versions working by proxying API calls to Keystone and
+Placement as needed. No quota related DB tables will be accessed when
+``quota.enforce_unified_limits = True``.
+
+This includes the following API resources:
+
+* /limits
+* /os-quota-sets
+* /os-quota-class-sets
+
+Existing tooling to set quotas should continue to operate, as long as it only
+changes quotas relating to instances, cores and ram. Requests to change any
+other quotas will be silently ignored. As one example, this should allow
+Horizon to function as normal during the transition.
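+
+The intended handling of an update request can be sketched as below. The
+names ``PROXIED_QUOTAS`` and ``split_quota_update`` are hypothetical, and the
+input is assumed to be the ``quota_set`` body of an ``/os-quota-sets``
+update reduced to a plain dict:
+
```python
# Quotas that map onto unified limits; everything else is ignored on update.
PROXIED_QUOTAS = {
    "instances": "server_count",
    "cores": "resource:VCPU",
    "ram": "resource:MEMORY_MB",
}


def split_quota_update(quota_set):
    """Split an os-quota-sets style update into forwarded and ignored parts."""
    forwarded = {}
    ignored = []
    for name, value in quota_set.items():
        if name in PROXIED_QUOTAS:
            forwarded[PROXIED_QUOTAS[name]] = value
        else:
            ignored.append(name)
    return forwarded, sorted(ignored)
```
+
+Only the forwarded part would be sent on to Keystone. The ignored names are
+dropped without an error, so existing tooling keeps working.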
+
+When you list limits for quotas that are not supported in the new system, they
+will instead show the configuration based limit that replaces the DB and API
+based limits. For example, for keypairs you always see the config based value;
+no update via the API will ever be reflected back when
+``quota.enforce_unified_limits = True``.
+
+There are some trade-offs with this approach:
+
+* Proxy APIs suck, but Horizon must keep working, as must all current operator
+  tooling around these existing APIs.
+* We don't need a micro version to enable/disable this proxy
+  of the quota APIs, as it doesn't really change the API.
+* In a future release when unified limits becomes the default,
+  we should deprecate the APIs
+  ``/os-quota-sets`` and ``/os-quota-class-sets`` and tell users to talk to
+  the Keystone API instead. API based discovery of when Nova is enforcing
+  the limits set in Keystone is left for a future spec.
+* It is expected the above API deprecation will follow the pattern used
+  by nova-network proxy APIs, i.e. the APIs return 404 in new microversions
+  but continue to work in older microversions. It's possible that in the more
+  distant future the APIs could be removed by returning a 410 error.
+* Rejecting updates to quotas that we were previously able to set would be a
+  breaking change in behaviour, and require a microversion. Adding a new API
+  microversion that returns BadRequest for unsupported quotas would be a nice
+  addition if we were not planning on deprecating the API in favor of calling
+  Keystone instead.
+* Ideally we would also deprecate ``/limits`` in favor of a cross project
+  agreed direction that is aware of both flat and hierarchical limit
+  enforcement. However, we do not yet have consensus on what direction
+  we take. For this spec, we leave ``/limits`` in its current form, even
+  though it does not report on all the types of resource usage we now
+  support limits on, and even though it lists limits that can
+  now only be changed via the configuration file.
+* When hierarchical limits are added, the per project usage information
+  in ``/limits`` does not mention anything about parent limits.
+  As such quota APIs may claim resources are available, but you will be
+  unable to build any new resources.
+  It is not clear what action the user can make to be able to build those new
+  resources. Operators can avoid this confusion by not over allocating quota.
+  We could extend the API to include a boolean to say if the limit has been
+  exceeded in the parent project, and as such the user is unable to consume
+  more resources even though their own usage is not over their own limits.
+  We could consider extending the API to include the usage of the full tree.
+
+Migration to Unified Limits
+---------------------------
+
+The migration of all users to unified limits is happening in three phases:
+
+* enable unified limits as an option, with migration path from existing quotas
+* make unified limits the default, deprecate existing quota system
+* remove existing quota system
+
+To help with the transition we need operator tooling to:
+
+* Set registered limits in Keystone for each Nova endpoint in Keystone,
+  based on current limits in DB and/or configuration
+* Copy per-project quotas set in Nova into Keystone unified-limits
+* Operator confirms unified limits works for them
+* Drop all quota info from the DB to signal operator has completed transition
+* Upgrade status check to check there is no data left in quota DB tables
+
+Note the setting of project limits and registered limits in keystone will
+happen via files that are generated and passed to keystone-manage. This
+allows fast-forward upgrades where no APIs are available during the migration
+of limits from Nova to Keystone.
+
+There will be a new tool to setup the registered limits in keystone. It will
+read from the Nova DB and configuration and generate a file. That file can be
+used with keystone-manage to register the current endpoint defaults in
+keystone::
+
+  nova-manage limits generate_registered_limits --endpoint <endpoint-uuid>
+
+The following tool will generate the unified limits overrides (if any)
+that need to be added into Keystone for each project. Again, this tool
+produces a file that is handed to keystone-manage, which will update keystone::
+
+  nova-manage limits generate_project_limits [--project_id <project_id>]
+
+Once the operator sets ``quota.enforce_unified_limits = True``, the Nova DB is
+ignored, and limits are accessed from Keystone only.
+
+To complete the migration, there is an operation to remove all the
+DB entries relating to the quota overrides. The tool only works when
+``quota.enforce_unified_limits = True``. It also removes any per user limits
+associated with each project::
+
+  nova-manage limits remove_db_quota_entries [--project_id <project_id>]
+
+Note the last two tools allow operators to iterate per project, to limit the
+load on the running system. If these tools are used on a running system, it is
+recommended that operators don't change quotas via the API during the
+transition.
+
+The nova status command will warn users that have failed to remove all the
+quota information from the DB. This will become an error in the release when
+``quota.enforce_unified_limits`` defaults to ``True``.
+
+It is worth noting that the Nova database may contain entries for projects
+that have been deleted in keystone. As such, it is advisable to get a list
+of active projects from keystone, and only generate_project_limits for those
+particular projects.
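+
+That filtering step could look like the sketch below. The helper name
+``projects_to_migrate`` is hypothetical, and both inputs are assumed to be
+plain lists of project IDs that the operator has already collected:
+
```python
def projects_to_migrate(nova_project_ids, keystone_project_ids):
    """Return Nova project IDs that still exist in Keystone, in sorted order."""
    return sorted(set(nova_project_ids) & set(keystone_project_ids))
```
+
+Each returned ID can then be passed to
+``nova-manage limits generate_project_limits --project_id <project_id>``.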
+
+This transition leaves several configuration options redundant, in particular
+the following will all be deprecated once unified limits is on by default:
+
+* ``quota.instances``, ``quota.cores``, ``quota.ram``: deprecate all these as
+  the limit now comes from keystone for unified limits, which will default to
+  unlimited if there is no limit in keystone.
+* ``quota.driver`` is ignored and hard coded to the no-op driver when
+  ``quota.enforce_unified_limits = True``.
+
+The setting ``quota.recheck_quota`` will be kept, and will be used in the same
+way with unified limits to avoid races when multiple instances are built at
+the same time.
+
+Alternatives
+------------
+
+Ideally we would not add any more proxy APIs, however, operators pushed back
+at the Train Forum session, requesting that their tooling continue to work
+across the transition. No operators reported using limits other than the
+instances, cores and ram limits.
+
+We could implement hierarchical quotas in isolation, and not adopt unified
+limits.
+
+We could limit the types of resources we limit, but it will be hard to
+transition to supporting different kinds of resource limits in a clear
+and interoperable way.
+
+Data model impact
+-----------------
+
+See upgrades, no changes in Victoria due to having old and new quota systems
+side by side. Once we remove the old quota system, we could drop all the
+quota related DB tables.
+
+REST API impact
+---------------
+
+When ``quota.enforce_unified_limits = True`` Nova will proxy the requests
+to Keystone's unified limits API, where possible. The aim will be to keep
+horizon functioning correctly during the transition.
+
+Once using unified limits, operators should move to using Keystone's
+unified limit APIs to set and query limits. Usage information should be
+queried via Placement and the Servers API.
+
+Security impact
+---------------
+
+The removal of quota rechecks for some limits slightly reduces the protection
+provided, but really it encourages the proper implementation of API
+rate limiting as replacement protection.
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+Quota errors should appear the same before and after this change.
+
+Performance Impact
+------------------
+
+It is possible to have more complicated quota counts with hierarchical
+quotas, but the implementation of that is delegated to oslo.limit.
+
+Other deployer impact
+---------------------
+
+There are several tools to help ease the transition to unified limits noted
+above. However, it is expected that use of the feature will help inform the
+end direction.
+
+Developer impact
+----------------
+
+There will now be two limit systems to maintain for a few cycles during the
+transition. But this avoids the long term need to maintain complicated
+hierarchical limit code, while still getting the advantages, such as being able
+to tidy up API policy.
+
+Upgrade impact
+--------------
+
+To get the best experience, operators need to start using the unified limits
+API via Keystone. Users should start querying usage from Placement.
+
+The transition between the existing quota system and unified limits is
+detailed in the Proposed change section.
+
+It is expected that oslo.limit will limit versions of Keystone that can be
+used to Queens and newer, which is not expected to affect most users.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  johnthetubaguy
+
+Other contributors:
+  (TBC)
+
+Feature Liaison
+---------------
+
+Feature liaison:
+  melwitt
+
+Work Items
+----------
+
+* Add calls to oslo.limit, guarded by config to enable it
+* Move quota APIs to proxy to Keystone when unified limit quotas enabled
+* Add tools to migrate default and tenant limits from Nova into Keystone
+* Upgrade checks to ensure above tooling is used
+
+Dependencies
+============
+
+* http://specs.openstack.org/openstack/nova-specs/specs/train/approved/count-quota-usage-from-placement.html
+* keystone manage commands to add limits when keystone API not available
+
+Testing
+=======
+
+Grenade test that runs the migration of quota settings (after adding some
+quotas).
+
+Functional tests to ensure quotas are enforced based on unified limits
+correctly.
+
+Documentation Impact
+====================
+
+Building on the work to document quota usage from placement, we should
+describe how the new system operates. The admin guide needs to detail
+how to smoothly migrate to unified limits.
+
+References
+==========
+
+None
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Victoria
+     - Introduced