Add Support for Smart NIC

Story: #2003346

Change-Id: I68c7334e1e377a85694a9791295683ff4fcbee35
Moshe Levi 9 months ago
commit f358fbdde9
2 changed files with 413 additions and 0 deletions
  1. specs/approved/support-smart-nic.rst (+412, -0)
  2. specs/not-implemented/support-smart-nic.rst (+1, -0)

specs/approved/support-smart-nic.rst (+412, -0)

@@ -0,0 +1,412 @@
1
+..
2
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
3
+ License.
4
+
5
+ http://creativecommons.org/licenses/by/3.0/legalcode
6
+
7
+====================
8
+Smart NIC Networking
9
+====================
10
+
11
+https://storyboard.openstack.org/#!/story/2003346
12
+
13
+This spec describes proposed changes to Ironic to enable a generic,
14
+vendor-agnostic, baremetal networking service running on smart NICs,
15
+enabling baremetal networking with feature parity to the virtualization
16
+use-case.
17
+
18
+Problem description
19
+===================
20
+
21
+While Ironic today supports Neutron provisioned network connectivity for
22
+baremetal servers through an ML2 mechanism driver, the existing support
23
+is based largely on configuration of TORs through vendor-specific mechanism
24
+drivers, with limited capabilities.
25
+
26
+Proposed change
27
+===============
28
+
29
+There is a wide range of smart/intelligent NICs emerging on the market.
30
+These NICs generally incorporate one or more general purpose CPU cores along
31
+with data-plane packet processing acceleration, and can efficiently run
32
+virtual switches such as OVS, while maintaining the existing interfaces to the
33
+SDN layer.
34
+
35
+The proposal is to extend Ironic to enable use of smart NICs to implement
36
+generic networking services for Bare Metal servers. The goal is to enable
37
+running the standard Neutron Open vSwitch L2 agent, providing a generic,
38
+vendor-agnostic bare metal networking service with feature parity compared
39
+to the virtualization use-case. The Neutron Open vSwitch L2 agent manages the
40
+OVS bridges on the smart NIC.
41
+
42
+In this proposal, we address two use-cases:
43
+
44
+#. Neutron OVS L2 agent runs locally on the smart NIC.
45
+
46
+   This use case requires a smart NIC capable of running OpenStack control
47
+   services such as the Neutron OVS L2 agent. This use case strives to view
48
+   the smart NIC as an isolated hypervisor for the baremetal node, with the
49
+   smart NIC providing the services to the bare metal image running on the host
50
+   (as a hypervisor would provide services to a VM). While this spec initially
51
+   targets Neutron OVS L2 agent, the same implementation would naturally and
52
+   easily be extended to any other ML2 plugin as well as to additional
53
+   agents/services (for example exposing emulated NVMe storage devices
54
+   back-ended by a storage initiator on the smart NIC).
55
+
56
+#. Neutron OVS L2 agent(s) run remotely and manage
57
+   the OVS bridges for all the baremetal smart NICs.
58
+
59
+
60
+The enhancements for the Neutron OVS L2 agent are captured in [1]_, [2]_ and [3]_.
61
+
62
+* Set the smart NIC configuration
63
+
64
+  The smart NIC configuration includes the following:
65
+
66
+  #. ``is_smartnic`` - a new field extending the ironic port (defaults to False)
67
+  #. smart NIC hostname - the hostname of server/smart NIC where the Neutron
68
+     OVS agent is running. (required)
69
+  #. smart NIC port id - the port name that needs to be plugged to the
70
+     integration bridge. B in the diagram below (required)
71
+  #. smart NIC SSH public key - ssh public key of the smart NIC
72
+     (required only for remote)
73
+  #. smart NIC OVSDB SSL certificate - OVSDB SSL certificate of the OVS on the smart NIC
74
+     (required only for remote)
75
+
76
+  The OVS ML2 mechanism driver will determine if the Neutron OVS Agent runs
77
+  locally or remotely based on smart NIC configuration passed from ironic.
78
+  The config attribute will be stored in the local_link_information of the
79
+  baremetal port.
80
+
81
+  In the scope of this spec the smart NIC config will be set manually by
82
+  the admin.
83
+
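As a rough illustration of the configuration items listed above, the sketch below checks a smart NIC config and guesses whether the agent is local or remote. The key names (``hostname``, ``port_id``, ``ssh_public_key``, ``ovsdb_ssl_certificate``) and the rule "remote if the remote-only items are present" are assumptions made for this example; the spec does not define them.

```python
from typing import Mapping

# Hypothetical key names; the spec only names the items in prose.
REQUIRED_KEYS = ("hostname", "port_id")
REMOTE_ONLY_KEYS = ("ssh_public_key", "ovsdb_ssl_certificate")


def classify_smartnic_config(config: Mapping[str, str]) -> str:
    """Return "local" or "remote"; raise ValueError if incomplete."""
    missing = [k for k in REQUIRED_KEYS if not config.get(k)]
    if missing:
        raise ValueError("missing required keys: %s" % ", ".join(missing))
    present = [k for k in REMOTE_ONLY_KEYS if config.get(k)]
    if not present:
        # No remote-only items: assume the agent runs on the smart NIC.
        return "local"
    if len(present) != len(REMOTE_ONLY_KEYS):
        # Assumed rule: the remote case needs both remote-only items.
        raise ValueError("remote mode needs: %s" % ", ".join(REMOTE_ONLY_KEYS))
    return "remote"
```

How the OVS ML2 mechanism driver actually distinguishes the two cases is left to the Neutron specs referenced in this document.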
84
+* Deployment Interfaces
85
+
86
+  Extend the ramdisk, direct, iscsi and ansible deploy interfaces to support
87
+  the smart NIC use-cases.
88
+
89
+  The Deployment Interfaces call network interface methods such as:
90
+  add_provisioning_network, remove_provisioning_network,
91
+  configure_tenant_networks, unconfigure_tenant_networks, add_cleaning_network
92
+  and remove_cleaning_network.
93
+
94
+  These network methods are currently called when the baremetal is
95
+  powered down, ensuring proper network configuration on the TOR before booting
96
+  the bare metal.
97
+
98
+  Smart NICs share the power state with the baremetal, requiring the baremetal
99
+  to be powered up before configuring the network. This leads to a potential
100
+  race where the baremetal boots and accesses the network prior to the network
101
+  being properly configured on the OVS within the smart NIC.
102
+
103
+  To ensure proper network configuration prior to baremetal boot, the
104
+  deployment interfaces will temporarily boot the baremetal into the BIOS
105
+  shell, providing a state where the OVS on the smart NIC may be configured
106
+  properly before rebooting the bare metal into the actual guest image or
107
+  ramdisk.
108
+
109
+
110
+  The following logic wraps each configure/unconfigure network call:
111
+
112
+  .. code-block:: python
113
+
114
+      if task.driver.network.need_power_on(task):
115
+          old_power_state = task.driver.power.get_power_state(task)
116
+          if old_power_state == states.POWER_OFF:
117
+              # set next boot to BIOS to halt the baremetal boot
118
+              manager_utils.node_set_boot_device(task, boot_devices.BIOS,
119
+                                                 persistent=False)
120
+              manager_utils.node_power_action(task, states.POWER_ON)
121
+
122
+      # ...
123
+      # call task.driver.network method(s)
124
+      # ...
125
+
126
+      if task.driver.network.need_power_on(task):
127
+          manager_utils.node_power_action(task, old_power_state)
128
+
129
+  The following deployment interface methods call one or
130
+  more configure/unconfigure network methods and should be updated with the logic
131
+  above.
132
+
133
+  * iscsi Deploy Interface
134
+
135
+    - iscsi_deploy::prepare
136
+    - iscsi_deploy::deploy
137
+    - iscsi_deploy::tear_down
138
+
139
+  * ansible Deploy Interface
140
+
141
+    - ansible/deploy::reboot_and_finish_deploy
142
+    - ansible/deploy::prepare
143
+    - ansible/deploy::tear_down
144
+    - ansible/deploy::prepare_cleaning
145
+    - ansible/deploy::tear_down_cleaning
146
+
147
+  * direct Interface
148
+
149
+    - agent::prepare
150
+    - agent::tear_down
151
+    - agent::deploy
152
+    - agent::rescue
153
+    - agent::unrescue
154
+    - agent_base_vendor::reboot_and_finish_deploy
155
+    - agent_base_vendor::_finalize_rescue
156
+
157
+  * RAM Disk Interface
158
+
159
+    - pxe::deploy
160
+
161
+  * Common cleaning methods
162
+
163
+    - deploy_utils::prepare_inband_cleaning
164
+    - deploy_utils::tear_down_inband_cleaning
165
+
166
+* Network Interface
167
+
168
+  Extend the base ``network_interface`` with ``need_power_on``, which
169
+  returns True if any ironic port attached to the node is a smart NIC.
170
+
171
+  Extend the ironic.common.neutron add_ports_to_network/
172
+  remove_ports_from_network methods for the smart NIC case:
173
+
174
+  * In add_ports_to_network, for a smart NIC port, do the following:
175
+
176
+    - check neutron agent alive - verify that neutron agent is alive
177
+    - create neutron port
178
+    - check neutron port active - verify that neutron port is in active state
179
+
180
+  * In remove_ports_from_network, for a smart NIC port, do the following:
181
+
182
+    - check neutron agent alive - verify that neutron agent is alive
183
+    - delete neutron port
184
+    - check neutron port is removed
185
+
186
+
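The Network Interface steps above can be sketched as follows. This is a minimal standalone model, not Ironic code: the ``Port`` dataclass and the ``FakeNeutron`` client with its ``agent_alive``, ``create_port`` and ``port_status`` methods are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Port:
    """Hypothetical stand-in for an ironic port (illustration only)."""
    uuid: str
    is_smartnic: bool = False
    hostname: str = ""  # smart NIC hostname where the OVS agent runs


def need_power_on(ports: Iterable[Port]) -> bool:
    """Return True if any port attached to the node is a smart NIC."""
    return any(p.is_smartnic for p in ports)


class FakeNeutron:
    """Hypothetical in-memory client; a real one would call Neutron."""

    def __init__(self, alive_hosts):
        self.alive_hosts = set(alive_hosts)
        self.ports: Dict[str, str] = {}

    def agent_alive(self, hostname: str) -> bool:
        return hostname in self.alive_hosts

    def create_port(self, port: Port, network_id: str) -> str:
        # The fake marks the port active immediately.
        self.ports[port.uuid] = "ACTIVE"
        return port.uuid

    def port_status(self, port_id: str) -> str:
        return self.ports.get(port_id, "DOWN")


def add_smartnic_port(client: FakeNeutron, port: Port, network_id: str) -> str:
    """Follow the three steps listed for add_ports_to_network."""
    # 1. Verify that the Neutron agent on the smart NIC host is alive.
    if not client.agent_alive(port.hostname):
        raise RuntimeError("agent on %s is not alive" % port.hostname)
    # 2. Create the Neutron port.
    port_id = client.create_port(port, network_id)
    # 3. Verify that the Neutron port is in the active state.
    if client.port_status(port_id) != "ACTIVE":
        raise RuntimeError("port %s is not active" % port_id)
    return port_id
```

A real implementation would presumably poll the agent and port state with a timeout rather than check once, since the smart NIC only becomes reachable after the baremetal is powered on.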
187
+* Neutron ml2 OVS changes:
188
+
189
+  - Introduce a new vnic_type for ``smart-nic``.
190
+  - Update the Neutron ml2 OVS to bind smart-nic vnic_type with
191
+    ``binding:profile`` smart NIC config.
192
+
193
+* Neutron OVS agent changes:
194
+
195
+Example of smart NIC model::
196
+
197
+  +---------------------+
198
+  |      baremetal      |
199
+  | +-----------------+ |
200
+  | |    OS Server    | |
201
+  | |                 | |
202
+  | |       +A        | |
203
+  | +-------|---------+ |
204
+  |         |           |
205
+  | +-------|---------+ |
206
+  | |   OS SmartNIC   | |
207
+  | |     +-+B-+      | |
208
+  | |     |OVS |      | |
209
+  | |     +-+C-+      | |
210
+  | +-------|---------+ |
211
+  +---------|-----------+
212
+            |
213
+
214
+  A - port on the baremetal host.
215
+  B - port that represents the baremetal port in the smart NIC.
216
+  C - port that represents the physical port in the smart NIC.
217
+
218
+  Add/Remove Port B to the OVS br-int with external-ids
219
+
220
+  In our case we will use the neutron OVS agent to plug the port on update
221
+  port event with the following external-ids: iface-id, iface-status, attached-mac
222
+  and node-uuid
223
+
224
+
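To make the external-ids above concrete, the sketch below builds the ``ovs-vsctl`` command line that would plug port B into the integration bridge by hand. The spec has the Neutron OVS agent do this itself, so the CLI form is only an equivalent illustration; the helper name and the sample values are hypothetical.

```python
from typing import List


def build_plug_command(port_name: str, iface_id: str, mac: str,
                       node_uuid: str, bridge: str = "br-int") -> List[str]:
    """Build an ovs-vsctl argv that adds a port and sets its external-ids."""
    return [
        "ovs-vsctl", "--may-exist", "add-port", bridge, port_name,
        "--", "set", "Interface", port_name,
        # iface-id carries the Neutron port ID the agent matches on.
        "external-ids:iface-id=%s" % iface_id,
        "external-ids:iface-status=active",
        "external-ids:attached-mac=%s" % mac,
        # node-uuid identifies the baremetal node, as listed in the spec.
        "external-ids:node-uuid=%s" % node_uuid,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; all values are made up.
    print(" ".join(build_plug_command(
        "port-b", "neutron-port-id", "52:54:00:aa:bb:cc", "node-uuid-1")))
```

Removing the port would be the mirror operation, for example ``ovs-vsctl --if-exists del-port br-int <port>``.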
225
+Alternatives
226
+------------
227
+
228
+* Delay the Neutron port binding (port binding means setting all the
229
+  OVSDB/OpenFlow config on the SmartNIC) to be performed by Neutron
230
+  later (once the bare metal is powered up). The problem with this
231
+  approach is that we have no guarantee of if/when the rules will be
232
+  programmed, and thus may inadvertently boot the baremetal while
233
+  the smart NIC is still programmed on the old network.
234
+
235
+Data model impact
236
+-----------------
237
+
238
+A new ``is_smartnic`` boolean field will be added to the Port object.
239
+
240
+
241
+State Machine Impact
242
+--------------------
243
+
244
+None
245
+
246
+REST API impact
247
+---------------
248
+
249
+The port REST API will be modified to support the new ``is_smartnic``
250
+field.  The field will be readable by users with the baremetal observer role
251
+and writable by users with the baremetal admin role.
252
+
253
+Updates to the is_smartnic field of ports will be restricted in the
254
+same way as for other connectivity related fields (local link connection, etc.)
255
+- they will be restricted to nodes in the ``enroll``, ``inspecting`` and
256
+``manageable`` states.
257
+
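The state restriction described above amounts to a simple membership check, sketched here. The function and constant names are hypothetical and do not correspond to Ironic's actual API validation code.

```python
# Node provision states in which is_smartnic may be updated, per the text above.
ALLOWED_STATES = frozenset({"enroll", "inspecting", "manageable"})


def can_update_is_smartnic(node_provision_state: str) -> bool:
    """Return True if the port's is_smartnic field may be changed."""
    return node_provision_state in ALLOWED_STATES
```

For a node in any other state, such as ``active``, the check returns False and the API would be expected to reject the update.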
258
+Client (CLI) impact
259
+-------------------
260
+
261
+
262
+"ironic" CLI
263
+~~~~~~~~~~~~
264
+
265
+None
266
+
267
+"openstack baremetal" CLI
268
+~~~~~~~~~~~~~~~~~~~~~~~~~
269
+
270
+The openstack baremetal CLI will be updated to support getting and setting the
271
+``is_smartnic`` field on ports.
272
+
273
+RPC API impact
274
+--------------
275
+
276
+None
277
+
278
+Driver API impact
279
+-----------------
280
+
281
+None
282
+
283
+Nova driver impact
284
+------------------
285
+
286
+None
287
+
288
+Ramdisk impact
289
+--------------
290
+
291
+None
292
+
293
+Security impact
294
+---------------
295
+
296
+* Smart NIC Isolation
297
+
298
+Both use cases run infrastructure functionality on the smart NIC, with
299
+the first use case also running control plane functionality.
300
+
301
+This requires proper isolation between the untrusted bare metal host and the
302
+smart NIC, preventing any/all direct or indirect access, both through the
303
+network interface exposed to the host and through side channels such as the
304
+platform BMC.
305
+
306
+Such isolation is implemented by the smart NIC device and/or the hardware
307
+platform vendor. There are multiple approaches for such isolation,
308
+ranging from completely physical disconnection of the smart NIC from the
309
+platform BMC to a platform with a trusted BMC wherein the BMC considers
310
+the baremetal host an untrusted entity and restricts its capabilities/access
311
+to the platform.
312
+
313
+In the absence of such isolation, the untrusted baremetal tenant
314
+may be able to gain access to the provisioning network, and in the second
315
+use case may be able to compromise the control plane.
316
+
317
+Proper isolation is dependent on the platform hardware/firmware, and cannot
318
+be directly enforced/guaranteed by ironic. Users of the smart NIC use cases should
319
+be made well aware of this via explicit documentation, and should be guided
320
+to verify the proper isolation exists on their platform when enabling such
321
+use cases.
322
+
323
+* Security Groups
324
+
325
+This allows use of the Neutron OVS agent pipeline. One of the features in the
326
+pipeline is security groups which will enhance the security model when using
327
+baremetal in a cloud.
328
+
329
+* Security credentials
330
+
331
+The node running the Neutron OVS agent (smart NIC or remote, according to use
332
+case) should be configured with the message bus credentials for the Neutron
333
+server.
334
+
335
+In addition, for the second use case, the SSH public key and OVSDB SSL
336
+certificate should be configured for the smart NIC port.
337
+
338
+
339
+Other end user impact
340
+---------------------
341
+
342
+* Baremetal admin needs to update the SmartNIC config manually.
343
+
344
+Scalability impact
345
+------------------
346
+
347
+None
348
+
349
+Performance Impact
350
+------------------
351
+
352
+None
353
+
354
+Other deployer impact
355
+---------------------
356
+
357
+None
358
+
359
+Developer impact
360
+----------------
361
+
362
+None
363
+
364
+Implementation
365
+==============
366
+
367
+Assignee(s)
368
+-----------
369
+
370
+Primary assignee:
371
+  hamdyk  - hamdy@mellanox.com
372
+
373
+Work Items
374
+----------
375
+
376
+* Update the Neutron network interface to populate the Smart NIC config from
377
+  the ironic port to the Neutron port ``binding:profile`` attribute.
378
+* Update the network_interface and common.neutron as described above
379
+* Update deployment interfaces as described above
380
+* Documentation updates.
381
+
382
+
383
+Dependencies
384
+============
385
+
386
+None, but the Neutron specs [1]_, [2]_ and [3]_ depend on this spec.
387
+
388
+Testing
389
+=======
390
+
391
+* Mellanox CI Jobs testing with Bluefield SmartNIC
392
+
393
+Upgrades and Backwards Compatibility
394
+====================================
395
+
396
+None
397
+
398
+
399
+Documentation Impact
400
+====================
401
+
402
+* Update multitenancy.rst with instructions for setting the SmartNIC config
403
+* Document the security implications/guidelines under admin/security.rst
404
+
405
+References
406
+==========
407
+
408
+.. [1] https://review.openstack.org/#/c/619920/
409
+
410
+.. [2] https://review.openstack.org/#/c/595402/
411
+
412
+.. [3] https://review.openstack.org/#/c/595512/

specs/not-implemented/support-smart-nic.rst (+1, -0)

@@ -0,0 +1 @@
1
+../approved/support-smart-nic.rst
