QoS minimum bandwidth allocation in Placement API

This spec describes how to model, from Neutron, a new resource provider
in the Placement API to describe the bandwidth allocation.

Based on a Rocky PTG discussion this is a re-work of the spec.

Co-Authored-By: Rodolfo Alonso Hernandez <rodolfo.alonso.hernandez@intel.com>
Co-Authored-By: Bence Romsics <bence.romsics@ericsson.com>
Co-Authored-By: Balazs Gibizer <balazs.gibizer@ericsson.com>

Related-Bug: #1578989
Change-Id: Ib995837f6161bcceb09735a5601d8b79a25a7354
See-Also: Ie7be551f4f03957ade9beb64457736f400560486
specs/rocky/minimum-bandwidth-allocation-placement-api.rst
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=================================================
QoS minimum bandwidth allocation in Placement API
=================================================

https://bugs.launchpad.net/neutron/+bug/1578989

This spec describes how to model, from Neutron, new resource providers
in the Placement API to describe bandwidth allocation.

Problem Description
===================

Currently there are several parameters, quantitative and qualitative,
that define a Nova server and are used to select the correct host
and network backend devices to run it. Network bandwidth is not yet
among these parameters. This allows situations where a physical network
device could be oversubscribed.

This spec addresses managing the bandwidth on the first physical device,
i.e. the physical interface closest to the nova server. Managing bandwidth
further away, for example on the backplane of a Top-of-Rack switch or
end-to-end, is out of scope here.

Guaranteeing bandwidth generally involves enforcement of constraints on
two levels:

* placement: Avoiding oversubscription when placing (scheduling) nova servers
  and their ports.

* data plane: Enforcing the guarantee on the physical network devices.

This spec addresses placement enforcement only. (Data plane enforcement
is covered by [4]_.) However, the design must respect that users are
interested in the joint use of these enforcements.

Since the placement enforcement itself is a Nova-Neutron cross-project
feature, this spec is meant to be read, commented on and maintained together
with its Nova counterpart: `Network bandwidth resource provider` [2]_.

This spec is based on the approved Neutron spec `Add a spec for strict
minimum bandwidth support` [3]_. The aim of the current spec is not
to redefine what is already approved in [3]_, but to specify how it is
going to be implemented in Neutron.

Use Cases
---------

The most straightforward use case is when a user, who has paid for a
premium service that guarantees a minimum network bandwidth, wants to
spawn a Nova server. The scheduler needs to know how much bandwidth is
already in use in each physical network device in each compute host and
how much bandwidth the user is requesting.

Data-plane-only enforcement was merged in Newton for SR-IOV egress
(see `Newton Release Notes` [6]_).

Placement-only enforcement may be a viable feature for users able
to control all traffic (e.g. in a single-tenant private cloud). Such
placement-only enforcement can also be used together with the bandwidth
limit rule. The admin can set two rules in a QoS policy, both with
the same bandwidth values, and then each server on the chosen compute
host will be able to use at most as much bandwidth as it is guaranteed.

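As an illustration of that combined use, the following sketch creates
such a QoS policy via the openstacksdk (the cloud name and the concrete
bandwidth value are assumptions of this example, not part of this spec):

.. code-block:: python

    # Illustrative sketch only: a QoS policy carrying both a minimum
    # bandwidth rule (placement enforced) and a bandwidth limit rule
    # (data plane enforced) with the same value.
    import openstack

    conn = openstack.connect(cloud='mycloud')  # assumed clouds.yaml entry

    policy = conn.network.create_qos_policy(name='guaranteed-1g')
    # Guarantee 1 Gbps egress via placement enforcement.
    conn.network.create_qos_minimum_bandwidth_rule(
        policy, min_kbps=1000000, direction='egress')
    # Cap the same traffic at 1 Gbps on the data plane.
    conn.network.create_qos_bandwidth_limit_rule(
        policy, max_kbps=1000000, direction='egress')
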
Proposed Change
===============

1. The user must be able to express the resource needs of a port.

   1. Extend ``qos_minimum_bandwidth_rule`` with the ingress direction.

      Unlike enforcement in the data plane, Placement can handle both
      directions with the same effort.

   2. Mark ``qos_minimum_bandwidth_rule`` as a supported QoS
      policy rule for each existing QoS driver.

      Placement enforcement is orthogonal to backend mechanisms. A user
      can have placement enforcement for drivers not having data plane
      enforcement (yet).

   Because we exposed (and likely want to expose further) partial
   results of this development effort to end users, the meaning of a
   ``qos_minimum_bandwidth_rule`` depends on the OpenStack version, the
   Neutron backend driver and the rule's direction. A rule may be enforced
   by placement and/or on the data plane. Therefore we must document, next
   to the already existing support matrix in the `QoS devref` [10]_, which
   combinations of versions, drivers, rule directions and (placement and/or
   data plane) enforcements are supported.

   Since Neutron's choice of backend is hidden from the cloud user, the
   deployer must also clearly document which subset of the above support
   matrix is applicable for a cloud user in a particular deployment.

2. Neutron must convey the resource needs of a port to Nova.

   Extend the port with the attribute ``resource_request`` according to
   section 'How required bandwidth for a Neutron port is modeled' below.
   This attribute is computed, read-only and admin-only.

   Information available at port create time (i.e. before the port
   is bound) must be sufficient to generate the ``resource_request``
   attribute.

   The port extension must be decoupled from ML2 and kept
   in the QoS service plugin. One way to do that is to use
   ``neutron.db._resource_extend`` like ``trunk_details`` uses it.

3. Neutron must populate the Placement DB with the available resources.

   Report information on available resources to the Placement service
   using the `Placement API` [1]_. That is information about the physical
   network devices, their physnets, available bandwidth and supported
   VNIC types.

   The cloud admin must be able to control (by configuration) what is
   reported to Placement. To ease the configuration work, autodiscovery
   of networking devices may be employed, but the admin must be able to
   override its results.

Which devices and parameters will be tracked
--------------------------------------------

Even inside a compute host many networking topologies are possible.
For example:

1. OVS agent: physical network - OVS bridge - single physical NIC (or a bond):
   1-to-1 mapping between physical network and physical interface

2. SR-IOV agent: physical network - one or more PFs:
   1-to-n mapping between physical network and physical interface(s)
   (See `Networking Guide: SR-IOV` [7]_.)

Each Neutron agent (Open vSwitch, Linux Bridge, SR-IOV) has a
configuration parameter to map a physical network to one or more
provider interfaces (SR-IOV) or a bridge connected to a provider interface
(Open vSwitch or Linux Bridge).

OVS agent configuration::

    [ovs]
    # bridge_mappings as it exists already.
    bridge_mappings = physnet0:br0,physnet1:br1

    # Each right hand side value in bridge_mappings:
    #   * will have a corresponding resource provider created in Placement
    #   * must be listed as a key in resource_provider_bandwidths

    resource_provider_bandwidths = br0:EGRESS:INGRESS,br1:EGRESS:INGRESS

    # Examples:

    # Resource provider created, no inventory reported.
    resource_provider_bandwidths = br0
    resource_provider_bandwidths = br0::

    # Report only egress inventory in kbps (same unit as in the QoS rule API).
    resource_provider_bandwidths = br0:1000000:

    # Report egress and ingress inventories in kbps.
    resource_provider_bandwidths = br0:1000000:1000000

    # Later we may introduce auto-discovery (for example via ethtool).
    # We reserve the option to make auto-discovery the default behavior
    # when it is implemented.
    resource_provider_bandwidths = br0:auto:auto

SR-IOV agent configuration::

    [sriov_nic]
    physical_device_mappings = physnet0:eth0,physnet0:eth1,physnet1:eth2

    resource_provider_bandwidths = eth0:EGRESS:INGRESS,eth1:EGRESS:INGRESS

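As a non-normative illustration, parsing the proposed
``resource_provider_bandwidths`` format could look like the following
sketch (the helper name is an assumption of this example):

.. code-block:: python

    def parse_rp_bandwidths(option_value):
        """Parse 'device:egress:ingress,...' into a dict.

        Empty or missing egress/ingress values mean no inventory is
        reported for that direction. ('auto' values, reserved for future
        auto-discovery, are not handled in this sketch.)
        """
        bandwidths = {}
        for entry in option_value.split(','):
            device, _, directions = entry.partition(':')
            egress, _, ingress = directions.partition(':')
            bandwidths[device] = {
                'egress': int(egress) if egress else None,
                'ingress': int(ingress) if ingress else None,
            }
        return bandwidths

    # parse_rp_bandwidths('br0:1000000:1000000,br1::') ==
    #     {'br0': {'egress': 1000000, 'ingress': 1000000},
    #      'br1': {'egress': None, 'ingress': None}}
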
How required bandwidth for a Neutron port is modeled
----------------------------------------------------

The required minimum network bandwidth needed for a port is modeled by
defining a QoS policy along with one or more QoS minimum bandwidth rules
[4]_. However, neither Nova nor Placement knows about any QoS policy
rule directly. Neutron translates the resource needs of a port into a
standard port attribute describing the needed resource classes, amounts
and traits.

In this spec we assume that a single port requests resources from a
single RP. Later we may allow a port to request resources from multiple RPs.

The resources needed by a port are expressed via the new attribute
``resource_request`` extending the port as follows.

Figure: resource_request in the port

.. code-block:: python

    {"port": {
        "status": "ACTIVE",
        "name": "port0",
        ...
        "device_id": "5e3898d7-11be-483e-9732-b2f5eccd2b2e",
        "resource_request": {
            "resources": {
                "NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND": 1000,
                "NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND": 1000 },
            "required": ["CUSTOM_PHYSNET_NET0", "CUSTOM_VNIC_TYPE_NORMAL"]}
    }}

The ``resource_request`` port attribute will be implemented by a new
API extension named ``port-resource-request``.

If a nova server boot request has a port defined and this port has a
``resource_request`` attribute, that means the Placement service must
enforce the minimum bandwidth requirements.

A host will satisfy the requirements if it has a physical network
interface RP with the following properties. First, it has an inventory
of the new ``NET_BANDWIDTH_*`` resource classes and there is enough
bandwidth available, as shown in the 'Networking RP model' section. If a
host doesn't have an inventory of the requested network bandwidth resource
class(es), it won't be a candidate for the scheduler. Second, the physical
network interface RP must have all the traits listed in the
``required`` field of the ``resource_request`` attribute.

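As a hedged illustration, the Placement query implied by the
``resource_request`` example above could be built along these lines
(the query construction and its placement in the code base are Nova's
responsibility, see [2]_; this only shows the query syntax):

.. code-block:: python

    # Sketch of the GET /allocation_candidates query implied by the
    # resource_request example above.
    from urllib.parse import urlencode

    resource_request = {
        "resources": {
            "NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND": 1000,
            "NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND": 1000},
        "required": ["CUSTOM_PHYSNET_NET0", "CUSTOM_VNIC_TYPE_NORMAL"]}

    query = urlencode({
        'resources': ','.join(
            '%s:%d' % (rc, amount)
            for rc, amount in sorted(resource_request['resources'].items())),
        'required': ','.join(resource_request['required']),
    })
    # GET /allocation_candidates?<query>
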
We propose two kinds of custom traits. The first kind expresses and
requests support for certain ``vnic_types``. These traits use the prefix
``CUSTOM_VNIC_TYPE_``. The ``vnic_type`` is then appended in all upper
case. For example:

* ``CUSTOM_VNIC_TYPE_NORMAL``
* ``CUSTOM_VNIC_TYPE_DIRECT``

The second kind of trait is used to decide whether a segment of a network
(identified by its physnet name) is connected on the compute host
considered in scheduling. These traits use the prefix ``CUSTOM_PHYSNET_``.
The physnet name is then appended in all upper case; any characters
prohibited in traits must be replaced with underscores. For example:

* ``CUSTOM_PHYSNET_PUBLIC``
* ``CUSTOM_PHYSNET_NET1``

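The derivation of both kinds of trait names from already known inputs is
mechanical; a minimal sketch (the function names are illustrative, not
part of this spec):

.. code-block:: python

    import re

    def physnet_trait(physnet):
        # Upper-case the physnet name and replace characters that are
        # not allowed in traits (anything but A-Z, 0-9 and underscore).
        return 'CUSTOM_PHYSNET_' + re.sub(r'[^A-Z0-9_]', '_', physnet.upper())

    def vnic_type_trait(vnic_type):
        return 'CUSTOM_VNIC_TYPE_' + vnic_type.upper()

    # physnet_trait('net-1') == 'CUSTOM_PHYSNET_NET_1'
    # vnic_type_trait('direct') == 'CUSTOM_VNIC_TYPE_DIRECT'
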
If a nova server boot request has a network defined and this network has
a ``qos_minimum_bandwidth_rule``, that boot request is going to fail as
documented in the 'Scoping' section of [2]_ until Nova is refactored to
create the port earlier (that is, before scheduling). See also `SPEC:
Prep work for Network aware scheduling (Pike)` [11]_.

For multi-segment Neutron networks each static segment's physnet trait
must be included in the ``resource_request`` attribute in a format that
we can only specify after Placement supports request matching logic
of ``any(traits)``. See `any-traits-in-allocation_candidates-query` [9]_.

Reporting Available Resources
-----------------------------

Some details of reporting are described in the following sections of [2]_:

* Neutron agent first start

* Neutron agent restart

* Finding the compute RP

Details internal to Neutron are the following:

Networking RP model
~~~~~~~~~~~~~~~~~~~

We made the following assumptions:

* Neutron supports the ``multi-provider`` extension, therefore a single
  logical network might map to more than one physnet. Physnets of
  non-dynamic segments are known before port binding. For the sake of
  simplicity, in this spec we assume each segment directly connected to a
  physical interface with a minimum bandwidth guarantee is a non-dynamic
  segment. Therefore those physnets can be included in the port's
  ``resource_request`` as traits.

* Multiple SR-IOV physical functions (PFs) can give access to the same
  physnet on a given compute host, but those PFs always implement the same
  ``vnic_type``. This means that using only physnet traits in Placement
  and in the port's resource request does not select one PF unambiguously,
  but that is not a problem as the PFs are equivalent from a resource
  allocation perspective.

* Two different backends (e.g. SR-IOV and OVS) can give access to the same
  physnet on the same compute host. In this case Neutron selects the
  backend based on the ``vnic_type`` of the Neutron port specified by the
  end user during port create. Therefore physical device selection during
  scheduling should consider the ``vnic_type`` of the port as well. This
  can be done via the ``vnic_type`` based traits previously described.

* Two different backends (e.g. OVS and Linux Bridge) can give access to
  the same physnet on the same compute host while they are also
  implementing the same ``vnic_type`` (e.g. ``normal``). In this
  case the backend selection in Neutron is done according to
  the order of ``mechanism_drivers`` configured by the admin in
  ``neutron.conf``. Therefore physical device selection during scheduling
  should consider the same preference order. As the backend order is
  just a preference and not a hard rule, supporting this behavior is *out
  of scope* in this spec, but in theory it can be done by a new weigher
  in nova-scheduler.

Based on these assumptions, Neutron will construct in Placement an RP tree
as follows:

Figure: networking RP model

.. code::

  Compute RP (name=hostname)
   +
   |
   +-------+Network agent RP (for OVS agent), uuid = agent_uuid
   |          inventory: # later, model number of OVS ports here
   |             +
   |             |
   |             +------+Physical network interface RP,
   |             |       uuid = uuid5(hostname:br0)
   |             |         traits: CUSTOM_PHYSNET_1, CUSTOM_VNIC_TYPE_NORMAL
   |             |         inventory:
   |             |         {NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
   |             |          NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}
   |             |
   |             +------+Physical network interface RP,
   |                     uuid = uuid5(hostname:br1)
   |                       traits: CUSTOM_PHYSNET_2, CUSTOM_VNIC_TYPE_NORMAL
   |                       inventory:
   |                       {NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
   |                        NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}
   |
   +-------+Network agent RP (for LinuxBridge agent), uuid = agent_uuid
   |             +
   |             |
   |             +------+Physical network interface RP,
   |                     uuid = uuid5(hostname:virbr0)
   |                       traits: CUSTOM_PHYSNET_1, CUSTOM_VNIC_TYPE_NORMAL
   |                       inventory:
   |                       {NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
   |                        NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}
   |
   +-------+Network agent RP (for SRIOV agent), uuid = agent_uuid
                 +
                 |
                 +------+Physical network interface RP,
                 |       uuid = uuid5(hostname:eth0)
                 |         traits: CUSTOM_PHYSNET_2, CUSTOM_VNIC_TYPE_DIRECT
                 |         inventory:
                 |         {VF: 8, # VF resource is out of scope
                 |          NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
                 |          NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}
                 |
                 +------+Physical network interface RP,
                 |       uuid = uuid5(hostname:eth1)
                 |         traits: CUSTOM_PHYSNET_2, CUSTOM_VNIC_TYPE_DIRECT
                 |         inventory:
                 |         {VF: 8, # VF resource is out of scope
                 |          NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
                 |          NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}
                 |
                 +------+Physical network interface RP,
                         uuid = uuid5(hostname:eth2)
                           traits: CUSTOM_PHYSNET_3, CUSTOM_VNIC_TYPE_DIRECT
                           inventory:
                           {VF: 8, # VF resource is out of scope
                            NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 10000,
                            NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 10000}

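To make the tree above concrete, here is a hedged sketch of the Placement
REST calls that would create one branch of it. The endpoint, token
handling and the use of ``requests`` are assumptions of this example
(Neutron would go through its Placement client); nested providers need a
sufficiently recent Placement microversion (1.14 or later):

.. code-block:: python

    import uuid

    import requests  # illustrative transport only

    PLACEMENT = 'http://placement.example.com/placement'  # assumed endpoint
    HEADERS = {'OpenStack-API-Version': 'placement 1.20',
               'X-Auth-Token': 'ADMIN_TOKEN'}  # placeholder token
    COMPUTE_RP_UUID = '...'  # the Nova-created compute RP, looked up by name
    AGENT_RP_UUID = '...'    # the agent's already existing Neutron UUID

    # Agent RP under the compute RP.
    requests.post(PLACEMENT + '/resource_providers', headers=HEADERS, json={
        'uuid': AGENT_RP_UUID, 'name': 'host0:OVS agent',
        'parent_provider_uuid': COMPUTE_RP_UUID})

    # Physical network interface RP under the agent RP.
    br0_rp = str(uuid.uuid5(uuid.NAMESPACE_DNS, 'host0:br0'))  # stable UUID
    requests.post(PLACEMENT + '/resource_providers', headers=HEADERS, json={
        'uuid': br0_rp, 'name': 'host0:br0',
        'parent_provider_uuid': AGENT_RP_UUID})

    # Bandwidth inventories and traits on the interface RP.
    requests.put(
        PLACEMENT + '/resource_providers/%s/inventories' % br0_rp,
        headers=HEADERS,
        json={'resource_provider_generation': 0,
              'inventories': {
                  'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': {'total': 10000},
                  'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': {'total': 10000}}})
    requests.put(
        PLACEMENT + '/resource_providers/%s/traits' % br0_rp,
        headers=HEADERS,
        json={'resource_provider_generation': 1,
              'traits': ['CUSTOM_PHYSNET_1', 'CUSTOM_VNIC_TYPE_NORMAL']})
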
Custom traits will be used to indicate which physical network a given
physical network interface RP is connected to, as previously described.

Custom traits will also be used to indicate which ``vnic_type`` a backend
supports, so different backend technologies can be distinguished, as
previously described.

The current purpose of agent RPs is to allow us to detect the deletion of
an RP. Later we may also start to model agent-level resources and
capabilities.

Report directly or indirectly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Considering only agent-based MechanismDrivers we have two options:

* direct: The agent reports resource providers, traits and inventories
  directly to the Placement API.

* indirect: The agent reports resource providers, traits and
  inventories to Neutron-server, which in turn reports the information
  to the Placement API.

Both have pros and cons. Direct reporting involves fewer components,
therefore it's more efficient and more reliable. On the other hand,
the freshness of the resource information may be important in itself.
Nova has the compute heartbeat mechanism to ensure the scheduler
considers only live Placement records. In case freshness of Neutron
resource information is needed, the only practical way is to build
on the Neutron-agent heartbeat mechanism. Otherwise the reporting
and heartbeat mechanisms would take different paths. If resource
information is reported through the agent heartbeat mechanism, then the
freshness of resource information is known by Neutron-server and other
components (for example a nova-scheduler filter) could query it from
Neutron-server.

When Placement and nova-scheduler choose to allocate the requested
bandwidth on a particular network resource provider (one that represents
a physical network interface), that choice has implications on:

* Neutron-server's choice of a Neutron backend for a port.
  (vif_type, vif_details)
* Neutron-agent's choice of a physical network interface.
  (Only in some cases, like when multiple SR-IOV PFs back one physnet.)

The later choices (of neutron-server and neutron-agent) must respect the
first (in the allocation), otherwise resources could be used somewhere
other than where they were allocated.

The choice in the allocation can be easily communicated to Neutron
using the chosen network resource provider UUID, if this UUID is known
to both Neutron-server and Neutron-agent. If available resources are
reported directly from Neutron-agent to Placement, then Neutron-server
may not know about resource provider UUIDs. Therefore indirect reporting
is recommended.

Even when reporting indirectly, we must keep the (Neutron-reported part
of the) content of the Placement DB under the most direct possible
control of the Neutron agents. It is best to keep Neutron-server in a
basically proxy-like role.

Content and format of resource information reported (from agent to server)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We propose to extend the ``configurations`` field of the agent heartbeat
RPC message.

Beyond the agent's hardcoded set of supported ``vnic_types``, the following
agent configuration options are the input to extend the heartbeat message:

* ``bridge_mappings`` or ``physical_device_mappings``
* ``resource_provider_bandwidths``
* If needed, further options controlling inventory attributes like:
  ``allocation_ratio``, ``min_unit``, ``max_unit``,
  ``step_size``, ``reserved``

Based on the input above, the ``configurations`` dictionary of the
heartbeat message shall be extended with the following keys:

* (custom) ``traits``
* ``resource_providers``
* ``resource_provider_inventories``
* ``resource_provider_traits``

The values must be (re-)evaluated after the agent configuration is
(re-)read. Each heartbeat message shall contain all items known by
the agent at that time. The extension of the ``configurations`` field
intentionally mirrors the structure of the Placement API (and does not
directly mirror the agent configuration format, though it can be derived
from it). The values of these fields shall be formatted so they can be
readily pasted into requests sent to the Placement API.

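A hedged illustration of what the extended ``configurations`` field of an
OVS agent heartbeat could carry; the key names follow the list above, but
the exact payload shapes shown here are assumptions mirroring the
Placement API:

.. code-block:: python

    RP_UUID_BR0 = '...'  # uuid5(hostname:br0), as described below

    configurations = {
        # ... already existing keys (bridge_mappings, etc.) ...
        'traits': ['CUSTOM_PHYSNET_PHYSNET0', 'CUSTOM_VNIC_TYPE_NORMAL'],
        'resource_providers': [
            {'uuid': RP_UUID_BR0, 'name': 'host0:br0'},
        ],
        'resource_provider_inventories': {
            RP_UUID_BR0: {
                'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': {'total': 10000},
                'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': {'total': 10000},
            },
        },
        'resource_provider_traits': {
            RP_UUID_BR0: ['CUSTOM_PHYSNET_PHYSNET0',
                          'CUSTOM_VNIC_TYPE_NORMAL'],
        },
    }
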
Agent resource providers shall be identified by their already existing
Neutron agent UUIDs, as shown in the 'Networking RP model' section above.

Neutron-agents shall generate UUIDs for physical network interface
resource providers. Version 5 (name-based) UUIDs should be used,
hashing names like ``HOSTNAME:OVS-BRIDGE-NAME`` for ovs-agent and
``HOSTNAME:PF-NAME`` for sriov-agent, since this way the UUIDs will
be stable through an agent restart.

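A minimal sketch of the stable UUID generation (the namespace constant is
an assumption of this example; any fixed namespace UUID would do):

.. code-block:: python

    import uuid

    # Any well-known, fixed namespace gives restart-stable results.
    NAMESPACE = uuid.NAMESPACE_DNS  # assumed choice for illustration

    def device_rp_uuid(hostname, device):
        # uuid5 is deterministic: the same name always yields the same UUID.
        return uuid.uuid5(NAMESPACE, '%s:%s' % (hostname, device))

    # device_rp_uuid('host0', 'br0') is identical across agent restarts.
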
Please note that the agent heartbeat message contains traits and their
associations with resource providers, but there are no traits directly
listed in the agent configuration. This is possible because both the
physnet and ``vnic_type`` traits we'll use can be inferred from already
known pieces of information.

Synchronization of resource information reported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ideally Neutron-agent, Neutron-server and Placement must have the same
view of resources. We propose the following synchronization mechanism
between Neutron-server and Placement:

Each time Neutron-server learns of a new agent, it diffs the heartbeat
message (for traits, providers, inventories and trait associations)
with all objects found in Placement under the agent RP. It creates
the objects missing from Placement. It deletes those missing from the
heartbeat. It updates the objects whose attributes are different in
Placement and the heartbeat.

At subsequent heartbeats Neutron-server diffs the new and
the previous heartbeats. If nothing changed, no Placement request
is sent. If a change in heartbeats is detected, Neutron sends the
appropriate Placement request based on the diff of heartbeats, using
the last seen Placement generation number. If the Placement request is
successful, Neutron stores the new generation number. If the request
fails with a generation conflict, Neutron falls back to diffing between
Placement and the heartbeat.

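A minimal sketch of the diffing step described above (the names are
illustrative; real code would diff inventories and trait associations the
same way):

.. code-block:: python

    def diff_resource_providers(previous, current):
        """Diff two heartbeat 'resource_providers' mappings (uuid -> attrs).

        Returns the providers to create, update and delete in Placement.
        """
        to_create = {u: rp for u, rp in current.items() if u not in previous}
        to_delete = set(previous) - set(current)
        to_update = {u: rp for u, rp in current.items()
                     if u in previous and previous[u] != rp}
        return to_create, to_update, to_delete

    # If nothing changed, all three results are empty and no Placement
    # request is sent.
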
Progress or block until the Compute host RP is created
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neutron-server cannot progress to report resource information until
the relevant Nova-compute host RP is created. (The reason is that the
Nova-compute host RP UUID is unpredictable to Neutron.) We believe
that while waiting for the Nova-compute host RP a Neutron-server can
progress with its other functions.

Port binding changes
--------------------

The order of the relevant operations is the following:

1. The Placement DB is populated with both compute and network resource
   information.

2. Triggered by a nova server boot, Placement selects a list of candidates.

3. The scheduler chooses exactly one candidate and allocates it in a single
   transaction. (In some complex nova server move cases the conductor
   may allocate, but that's unimportant here.)

4. Neutron binds the port.

In steps (2) and (3) the selection includes the choice of RPs representing
network backend objects (beyond the obvious choice of compute host). This
naturally conflicts with Neutron's current port binding mechanism.

To solve the conflict we must make sure that:

* Placement produces candidates whose ports can later be bound by
  Neutron. (At least with roughly the same probability as the scheduler is
  able to today.)

* The choices made by Placement and made by Neutron port binding are
  the same. Therefore the selection must be coordinated.

  If more than one Neutron backend can satisfy the resource requirements
  of a port on the same host, then it cannot happen that Placement chooses
  one, but Neutron binds another.

  One way to do that is for Neutron-server to read (from Placement)
  the allocation of the port currently being bound and let it influence
  the binding. However, this introduces a slow remote call in the middle
  of port binding, therefore it is not recommended.

  Another way is to pass down part of the allocation record in the
  call/message chain leading to the port binding PUT request. In the
  port binding PUT request we can use the ``binding_profile`` attribute.
  That way we would not need a new remote call, just to add an
  argument/payload to already existing calls/messages.

  The Nova spec ([2]_) proposes that the resources requested for a port
  are included in a numbered request group (see `Granular Resource Request
  Syntax` [8]_). A numbered request group is always satisfied by a single
  resource provider. In general Neutron needs to know which resource
  provider matched the numbered request group of the port.

  To express the choice made by Placement and nova-scheduler we propose
  to add an ``allocation`` entry to ``binding_profile``::

      {
        "name": "port with minimum bw being bound",
        "id": ...,
        "network_id": ...,
        "binding_profile": { "allocation": RP_UUID }
      }

  If a port has the ``resource_request`` attribute, then it must
  be bound with ``binding_profile.allocation`` supplied. Otherwise
  ``binding_profile.allocation`` must not be present.

  Usually ML2 port binding tries the mechanism drivers in their
  configuration order until one succeeds to set the binding. However,
  if a port being bound has ``binding_profile.allocation``, then only a
  single mechanism driver can be tried - the one implicitly identified by
  ``RP_UUID``. (A sketch of this inference follows this list.)

  In case of hierarchical port binding, ``binding_profile.allocation``
  is meant to drive the binding only on the binding level that represents
  the closest physical interface to the nova server.

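A hedged sketch of how neutron-server could map
``binding_profile.allocation`` back to a single mechanism driver, assuming
the reported agent configurations are queryable as proposed in the work
items (all names and the data structure are illustrative):

.. code-block:: python

    def drivers_for_allocation(rp_uuid, agents, mechanism_drivers):
        """Restrict binding to the driver owning the allocated RP.

        agents: mapping of agent type to the set of physical network
        interface RP UUIDs that agent reported (illustrative structure).
        """
        # Illustrative agent-type to mechanism-driver mapping.
        agent_to_driver = {'Open vSwitch agent': 'openvswitch',
                           'NIC Switch agent': 'sriovnicswitch'}
        for agent_type, rp_uuids in agents.items():
            if rp_uuid in rp_uuids:
                driver = agent_to_driver[agent_type]
                # Replace the configured driver order with a single driver.
                return [driver] if driver in mechanism_drivers else []
        return []
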
Out of Scope
------------

Minimum bandwidth rule update:
When a minimum bandwidth rule is updated, the ML2 plugin will list the bound
ports with this QoS policy and rule attached and will update the Allocation
value. The `consumer_id` of each Allocation is the `device_id` of the port.
This is out of scope in this spec and should be done during the work related
to `os-vif migration tasks` [5]_.

Trunk port:
Subports of a trunk port are unknown to Nova. Allocating resources for
subports is a task for Neutron only. This is out of scope too.

Testing
-------

* Unit tests.

* Functional tests.

  * Agent-server interactions.

* Fullstack.

  * Handling agent failure cases.

* Tempest API tests.

  * Port API extended with ``resource_request``.
  * Extensions of ``binding_profile``.

* Tempest scenario tests.

  * End-to-end feature test.

In test frameworks where we cannot depend on Nova we can mock it away
(a minimal sketch follows this list) by:

* Creating and binding the port as Nova would have done it, including:

  * Setting its ``binding_profile``.
  * Setting its ``binding_host_id`` as if Placement and the scheduler had
    chosen the host.

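For instance, a test could bind the port through the Neutron API along
these lines (a sketch; the host and profile values are examples):

.. code-block:: python

    # Update the port the way Nova would during binding.
    port_update = {
        'port': {
            'binding:host_id': 'host0',  # as if the scheduler chose host0
            'binding:profile': {'allocation': RP_UUID},  # the chosen RP
        }
    }
    # PUT /v2.0/ports/{port_id} with the body above, e.g. via
    # neutron_client.update_port(port_id, port_update).
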
Upgrade
-------

* When upgrading a system with ``minimum_bandwidth`` rules to support
  both data plane and placement enforcement, we see two options:

  1. It is the responsibility of the admin to create the
     allocations in Placement for all ports using ``minimum_bandwidth``
     rules. Please note: this assumes that bandwidth is not overallocated
     at the time of upgrade.

  2. Add tooling for 1. as described in the 'Upgrade impact' section of [2]_.

* The desired upgrade order of components is the following:
  Placement, Nova, Neutron.

  If for some reason the reverse Neutron-Nova order is desired, then the
  Neutron port API extension ``resource_request`` must not be turned on
  until both components are upgraded.

* Neutron-server must be able to handle agent heartbeats both with
  and without resource information in the ``configurations``.

Work Items
----------

These work items are designed so Neutron end-to-end behavior can be
prototyped and tested independently of the progress of related work in
Nova. But part of it depends on already available Placement features.

* Extend agent heartbeat configuration with resource provider information.
* (We already have it): Persist extended agent configuration reported via
  heartbeat.
* (We already have it): Placement client in neutron-lib for the use of
  neutron-server.
* Neutron-server initially diffs resource info reported by the agent against
  Placement.
* Neutron-server diffs consecutive agent heartbeat configurations.
* Neutron-server turns the diffs into Placement requests (with generation
  handling).
* Extend rule ``qos_minimum_bandwidth_rule`` with direction ``ingress``.
* Extend the port with ``resource_request`` based on the QoS rule
  minimum-bandwidth-placement.
* Make the reported agent configuration queryable so neutron-server can
  infer which backend is implied by the RP allocated (as in
  ``binding_profile.allocation``).
* In binding a port with ``binding_profile.allocation``, replace the list of
  tried mechanism drivers with the one-element list of the inferred backend.
* (We already have it): Send ``binding_profile`` to all agents.
* In sriov-agent force the choice of PF as implied by the RP allocated.

For each of the above:

* Tests.
* Documentation: api-ref, devref, networking guide.
* Release notes.

References
==========

.. [1] `Placement API`:
       https://docs.openstack.org/nova/latest/user/placement.html

.. [2] `SPEC: Network bandwidth resource provider`:
       https://review.openstack.org/502306

.. [3] `Add a spec for strict minimum bandwidth support`:
       https://review.openstack.org/396297

.. [4] `[RFE] Minimum bandwidth support (egress)`:
       https://bugs.launchpad.net/neutron/+bug/1560963

.. [5] `BP: os-vif migration tasks`:
       https://blueprints.launchpad.net/neutron/+spec/os-vif-migration

.. [6] `Newton Release Notes`:
       https://docs.openstack.org/releasenotes/neutron/newton.html

.. [7] `Networking Guide: SR-IOV`:
       https://docs.openstack.org/neutron/queens/admin/config-sriov.html#enable-neutron-sriov-agent-compute

.. [8] `Granular Resource Request Syntax`:
       https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/granular-resource-requests.html

.. [9] `any-traits-in-allocation_candidates-query`:
       https://blueprints.launchpad.net/nova/+spec/any-traits-in-allocation-candidates-query

.. [10] `QoS devref`:
        https://docs.openstack.org/neutron/latest/contributor/internals/quality_of_service.html#agent-backends

.. [11] `SPEC: Prep work for Network aware scheduling (Pike)`:
        https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/prep-for-network-aware-scheduling-pike.html