From 8a9a07db863c860e474f8bc2a16a7f0af616a726 Mon Sep 17 00:00:00 2001
From: XiongQiu
Date: Tue, 12 Apr 2016 19:48:37 +0800
Subject: [PATCH] Add cross-pod L2 Networking spec file

1. What is the problem?
In the Tricircle, cross pod L2 networking automation is established after
the VM is plugged in. The simplest way to stretch one L2 network across
multiple OpenStack instances is to use the same VLAN network, but this has
several limitations: the number of VLAN segments is limited, and a VLAN
network is not well suited to spreading across multiple sites, although
gateways can be used to do so. In addition, there are many tenants in the
cloud and new tenants can be added dynamically, so a fixed physical network
configuration for dynamic tenant networking is hard to manage.

2. What is the solution to the problem?
To deal with the above problem, flexible tenant level L2 networking
automation across multiple OpenStack instances in one site or in multiple
sites is needed in the Tricircle.

3. What features need to be implemented in the Tricircle to realize the
solution?
Networking automation that supports more than one bottom pod in one AZ, or
in multiple AZs, for different use cases in the Tricircle.

Blueprint: https://blueprints.launchpad.net/tricircle/+spec/cross-site-connectivity

Change-Id: I616048c13d03f48aa16d9ff48572b0d5a49d6fb4
---
 specs/cross-pod-l2-networking.rst | 563 ++++++++++++++++++++++++++++++
 1 file changed, 563 insertions(+)
 create mode 100644 specs/cross-pod-l2-networking.rst

diff --git a/specs/cross-pod-l2-networking.rst b/specs/cross-pod-l2-networking.rst
new file mode 100644
index 00000000..1a2a5848
--- /dev/null
+++ b/specs/cross-pod-l2-networking.rst
@@ -0,0 +1,563 @@
+======================================
+Cross pod L2 networking in Tricircle
+======================================
+
+Background
+==========
+The Tricircle provides a unified OpenStack API gateway and networking
+automation functionality. Together these allow cloud operators to manage
+multiple OpenStack instances, running in one site or in multiple sites, as a
+single OpenStack cloud.
+
+Each bottom OpenStack instance which is managed by the Tricircle is also called
+a pod.
+
+The Tricircle has the following components:
+
+* Nova API-GW
+* Cinder API-GW
+* Neutron API Server with Neutron Tricircle plugin
+* Admin API
+* XJob
+* DB
+
+Nova API-GW provides the functionality to trigger automatic networking creation
+when new VMs are being provisioned. The Neutron Tricircle plugin provides the
+functionality to create cross OpenStack L2/L3 networking for new VMs. After the
+binding of tenant-id and pod is finished in the Tricircle, Cinder API-GW and
+Nova API-GW will pass the Cinder API or Nova API request to the appropriate
+bottom OpenStack instance.
+
+Please refer to the Tricircle design blueprint[1], especially
+'7. Stateless Architecture Proposal', for a detailed description of each
+component.
+
+
+Problem Description
+===================
+When a user creates a network in the Neutron API Server, the user can specify
+'availability_zone_hints' during network creation[5] (AZ or az is used below
+as shorthand for availability zone, and 'az_hints' for
+'availability_zone_hints'). In the Tricircle, 'az_hints' indicates which AZs
+the network should be spread into, which is a little different from the
+meaning of 'az_hints' in Neutron[5]. If no 'az_hints' is specified during
+network creation, the network can be spread into any AZ. If a list of
+'az_hints' is given during network creation, the network should be able to be
+spread into the AZs named in that list.
+
+When a user creates a VM or Volume, there is also a parameter called
+availability zone. The AZ parameter is used for Volume and VM co-location, so
+that the Volume and VM will be created in the same bottom OpenStack instance.
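The 'az_hints' rule described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name ``network_az_scope`` and its arguments are hypothetical and are not part of the Tricircle or Neutron code.

```python
from typing import Iterable, Optional, Set


def network_az_scope(az_hints: Optional[Iterable[str]],
                     all_azs: Iterable[str]) -> Set[str]:
    """Return the AZs a network may be spread into (illustrative only)."""
    hints = set(az_hints or ())
    if not hints:
        # No hints given: the network may be spread into any AZ.
        return set(all_azs)
    # Hints given: the network may only be spread into the hinted AZs.
    return hints


if __name__ == "__main__":
    print(sorted(network_az_scope(None, ["az1", "az2"])))
    print(sorted(network_az_scope(["az1"], ["az1", "az2"])))
```

The sketch deliberately does no validation of the hints against the known AZs, because the spec does not say how unknown AZ names should be handled.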
+
+When a VM is being attached to a network, the Tricircle will check whether the
+VM's AZ is inside the network's AZ scope. If the VM is not in the network's
+AZ scope, the VM creation will be rejected.
+
+Currently, the Tricircle only supports one pod in one AZ, and only supports a
+network associated with one AZ. That means a tenant's network is currently
+present in only one bottom OpenStack instance, and therefore all VMs
+connected to the network are located in one bottom OpenStack instance.
+For the case of more than one pod in one AZ, refer to the dynamic pod
+binding[6].
+
+There are lots of use cases where a tenant needs a network that can be
+spread out into multiple bottom OpenStack instances in one AZ or multiple AZs.
+
+* Capacity expansion: as tenants add more and more VMs, the capacity of one
+  OpenStack instance may not be enough, and a new OpenStack instance has to be
+  added to the cloud. But the tenant still wants to add new VMs into the same
+  network.
+
+* Cross OpenStack network service chaining: service chaining is based on
+  port-pairs. Leveraging the cross pod L2 networking capability provided by
+  the Tricircle, the chaining can also be done across sites. For example,
+  with vRouter1 in pod1 and vRouter2 in pod2, these two VMs can be chained.
+
+* Applications are often required to run in different availability zones to
+  achieve high availability. An application needs to be designed as
+  Active-Standby/Active-Active/N-Way to achieve high availability, and some
+  components inside one application are designed to work as a distributed
+  cluster. This design typically leads to state replication or heartbeat
+  among application components (directly, via replicated database
+  services, or via a privately designed message format). When this kind of
+  application is deployed in a distributed way into multiple OpenStack
+  instances, cross OpenStack L2 networking is needed to support heartbeat
+  or state replication.
+
+* When a tenant's VMs are provisioned in different OpenStack instances, there
+  is E-W (East-West) traffic between these VMs. The E-W traffic should be
+  visible only to the tenant, so isolation is needed. If the traffic goes
+  through N-S (North-South) via a tenant level VPN, the overhead is too high,
+  and the orchestration of multiple site to site VPN connections is also
+  complicated. Therefore cross OpenStack L2 networking to bridge the tenant's
+  routers in different OpenStack instances can provide more lightweight
+  isolation.
+
+* In hybrid cloud, there is a cross L2 networking requirement between the
+  private OpenStack and the public OpenStack. Cross pod L2 networking will
+  help VM migration in this case, since it is not necessary to change the
+  IP/MAC/Security Group configuration during VM migration.
+
+The spec[6] explains how one AZ can support more than one pod, and how
+to schedule a proper pod during VM or Volume creation.
+
+This spec deals with the cross OpenStack L2 networking automation in
+the Tricircle.
+
+The simplest way to spread out L2 networking to multiple OpenStack instances
+is to use the same VLAN. But this has several limitations: (1) the number of
+VLAN segments is limited, (2) a VLAN network is not well suited to spreading
+across multiple sites, although gateways can be used to do so.
+
+So flexible tenant level L2 networking across multiple OpenStack instances in
+one site or in multiple sites is needed.
+
+Proposed Change
+===============
+
+Cross pod L2 networking can be divided into three categories,
+``Shared VLAN``, ``Shared VxLAN`` and ``Mixed VLAN/VxLAN``.
+
+* Shared VLAN
+
+  Network in each bottom OpenStack is VLAN type and has the same VLAN ID.
+  If we want shared VLAN L2 networking to work in a multi-site scenario, i.e.,
+  multiple OpenStack instances in multiple sites, a physical gateway needs to
+  be manually configured to extend one VLAN network to other sites.
+
+  *Manual setup of the physical gateway is out of the scope of this spec*
+
+* Shared VxLAN
+
+  Network in each bottom OpenStack instance is VxLAN type and has the same
+  VxLAN ID.
+
+  Leverage L2GW[2][3] to implement this type of L2 networking.
+
+* Mixed VLAN/VxLAN
+
+  Network in each bottom OpenStack instance may have different types and/or
+  have different segment IDs.
+
+  Leverage L2GW[2][3] to implement this type of L2 networking.
+
+There is another network type called “Local Network”. A “Local Network” is
+present in only one bottom OpenStack instance and will not be present in
+different bottom OpenStack instances. If a VM in another pod tries to attach
+to the “Local Network”, the attachment should fail. This network type is
+useful for scenarios in which cross pod L2 networking is not required and one
+AZ does not include more than one bottom OpenStack instance.
+
+Cross pod L2 networking can be established dynamically while a tenant's VM
+is being provisioned.
+
+It is assumed here that only one type of L2 networking will be used in one
+cloud deployment.
+
+
+A Cross Pod L2 Networking Creation
+------------------------------------
+
+A cross pod L2 network can be created using the az_hint attribute of the
+network. If az_hint includes one or more AZs, the network will be present
+only in this AZ or these AZs. If az_hint contains no AZ, the network can be
+extended to any bottom OpenStack instance.
+
+There is a special use case for external network creation. For external
+network creation, you need to specify the pod_id rather than the AZ in the
+az_hint, so that the external network will be created in only one specified
+pod per AZ.
+
+  *Support of External network in multiple OpenStack instances in one AZ
+  is out of scope of this spec.*
+
+A pluggable L2 networking framework is proposed to deal with the three types
+of cross pod L2 networking, and it should be compatible with the
+``Local Network``.
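As a rough illustration of what "pluggable" could mean for the framework proposed above, the sketch below selects a driver from a configured network type. Everything in it is an assumption made for illustration: the class names, the ``register`` decorator and ``load_type_driver`` are hypothetical, and ``TypeDriver`` here is not the ML2 type driver interface. Only the four network type strings are taken from this spec.

```python
from typing import Dict

# Hypothetical registry: configured network type -> driver class.
_TYPE_DRIVERS: Dict[str, type] = {}


class TypeDriver:
    """Hypothetical base class; this is not the ML2 TypeDriver interface."""

    network_type = ""

    def describe(self) -> str:
        return "type driver for " + self.network_type


def register(cls):
    """Class decorator that makes a driver loadable by its network type."""
    _TYPE_DRIVERS[cls.network_type] = cls
    return cls


@register
class LocalNetworkDriver(TypeDriver):
    network_type = "local_network"


@register
class SharedVlanDriver(TypeDriver):
    network_type = "shared_vlan"


@register
class SharedVxlanDriver(TypeDriver):
    network_type = "shared_vxlan"


@register
class MixedVlanVxlanDriver(TypeDriver):
    network_type = "mixed_vlan_vxlan"


def load_type_driver(tenant_network_type: str) -> TypeDriver:
    """Instantiate the driver for the configured network type."""
    try:
        return _TYPE_DRIVERS[tenant_network_type]()
    except KeyError:
        raise ValueError("unsupported network type: %s" % tenant_network_type)


if __name__ == "__main__":
    print(load_type_driver("shared_vxlan").describe())
```

A registry keyed by the configured type keeps the ``Local Network`` case on the same code path as the three cross pod types, which is one way to satisfy the compatibility requirement.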
+
+1. Type Driver under Tricircle Plugin in Neutron API server
+
+* A type driver distinguishes the different types of cross pod L2 networking,
+  so the Tricircle plugin needs to load the type driver according to the
+  configuration. The Tricircle can reuse the ML2 type drivers with updates.
+
+* Type driver to allocate VLAN segment id for shared VLAN L2 networking.
+
+* Type driver to allocate VxLAN segment id for shared VxLAN L2 networking.
+
+* Type driver for mixed VLAN/VxLAN to allocate VxLAN segment id for the
+  network connecting L2GWs[2][3].
+
+* Type driver for Local Network, which only updates ``network_type`` for the
+  network in the Tricircle Neutron DB.
+
+When a network creation request is received in Neutron API Server in the
+Tricircle, the type driver will be called based on the configured network
+type.
+
+2. Nova API-GW to trigger the bottom networking automation
+
+Nova API-GW is aware that a new VM is being provisioned when a boot VM API
+request is received, therefore Nova API-GW is responsible for the network
+creation in the bottom OpenStack instances.
+
+Nova API-GW needs to get the network type from Neutron API server in the
+Tricircle, and deal with the networking automation based on the network type:
+
+* Shared VLAN
+  Nova API-GW creates the network in the bottom OpenStack instance in which
+  the VM will run, using the VLAN segment id, network name and type that are
+  retrieved from the Neutron API server in the Tricircle.
+
+* Shared VxLAN
+  Nova API-GW creates the network in the bottom OpenStack instance in which
+  the VM will run, using the VxLAN segment id, network name and type that are
+  retrieved from the Tricircle Neutron API server. After the network in the
+  bottom OpenStack instance is created successfully, Nova API-GW needs to add
+  this network in the bottom OpenStack instance as one of the segments of the
+  network in the Tricircle.
+
+* Mixed VLAN/VxLAN
+  Nova API-GW creates the network in each bottom OpenStack instance in which
+  a VM will run, with the VLAN or VxLAN segment id respectively, and the
+  network name and type that are retrieved from the Tricircle Neutron API
+  server. After the networks in the bottom OpenStack instances are created
+  successfully, Nova API-GW needs to update the network in the Tricircle with
+  the segmentation information of the bottom networks.
+
+3. L2GW driver under Tricircle Plugin in Neutron API server
+
+Tricircle plugin needs to support multi-segment network extension[4].
+
+For the Shared VxLAN or Mixed VLAN/VxLAN L2 network type, the L2GW driver will
+utilize the multi-segment network extension in Neutron API server to build the
+L2 network in the Tricircle. Each network in a bottom OpenStack instance will
+be a segment of the whole cross pod L2 network in the Tricircle.
+
+After the network in the bottom OpenStack instance is created successfully,
+Nova API-GW will call the Neutron server API to update the network in the
+Tricircle with a new segment from the network in the bottom OpenStack instance.
+
+If the network in the bottom OpenStack instance is removed successfully, Nova
+API-GW will call the Neutron server API to remove the corresponding segment
+from the network in the Tricircle.
+
+When the L2GW driver under the Tricircle plugin in Neutron API server receives
+the segment update request, the L2GW driver will start an async job to
+orchestrate the L2GW API for L2 networking automation[2][3].
+
+
+Data model impact
+-----------------
+
+In the database, we are considering setting physical_network in the top
+OpenStack instance as ``bottom_physical_network#bottom_pod_id`` to distinguish
+segmentation information in different bottom OpenStack instances.
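The ``bottom_physical_network#bottom_pod_id`` convention above can be sketched as a pair of helpers. This is a minimal sketch, assuming that ``#`` never appears in a bottom physical network name; the function names are hypothetical and the spec does not define how such values are parsed.

```python
SEPARATOR = "#"


def encode_physical_network(bottom_physical_network: str,
                            bottom_pod_id: str) -> str:
    """Build the top-level physical_network value for one bottom segment."""
    if SEPARATOR in bottom_physical_network:
        # Assumption: the separator must not occur in the network name,
        # otherwise the value could not be split back unambiguously.
        raise ValueError("physical network name must not contain '#'")
    return bottom_physical_network + SEPARATOR + bottom_pod_id


def decode_physical_network(value: str):
    """Split a top-level value into (bottom_physical_network, bottom_pod_id)."""
    name, sep, pod_id = value.partition(SEPARATOR)
    if not sep:
        raise ValueError("missing '#' separator in %r" % value)
    return name, pod_id


if __name__ == "__main__":
    top_value = encode_physical_network("physnet1", "pod1")
    print(top_value)
    print(decode_physical_network(top_value))
```

Splitting on the first ``#`` means a pod id may itself contain the separator, while the physical network name may not.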
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+
+Implementation
+==============
+
+**Local Network Implementation**
+
+For Local Network, L2GW is not required. In this scenario, no cross pod L2/L3
+networking is required.
+
+A user creates network ``Net1`` with a single AZ1 in az_hint. The Tricircle
+plugin checks the configuration, and if ``tenant_network_type`` equals
+``local_network``, it will invoke the Local Network type driver. The Local
+Network driver under the Tricircle plugin will update ``network_type`` in the
+database.
+
+For example, a user creates VM1 in AZ1, which has only one pod ``POD1``, and
+connects it to network ``Net1``. ``Nova API-GW`` will send a network creation
+request to ``POD1`` and the VM will be booted in AZ1 (there should be only one
+pod in AZ1).
+
+If a user wants to create VM2 in AZ2, or in ``POD2`` in AZ1, and connect it to
+network ``Net1`` in the Tricircle, the request will fail, because ``Net1`` is
+a local_network type network and is limited to ``POD1`` in AZ1 only.
+
+**Shared VLAN Implementation**
+
+For Shared VLAN, L2GW is not required. This is the simplest cross pod
+L2 networking, suited to limited scenarios. For example, with a small number
+of networks, all VLANs are extended through a physical gateway to support
+cross site VLAN networking, or all pods are connected under the same core
+switch and share the VLAN ranges supported by that core switch.
+
+When a user creates a network called ``Net1``, the Tricircle plugin checks the
+configuration. If ``tenant_network_type`` equals ``shared_vlan``, the
+Tricircle will invoke the Shared VLAN type driver. The Shared VLAN driver will
+create ``segment``, assign ``network_type`` as VLAN, and update ``segment``,
+``network_type`` and ``physical_network`` in the DB.
+
+A user creates VM1 in AZ1, and connects it to network ``Net1``. If VM1 will be
+booted in ``POD1``, ``Nova API-GW`` needs to get the network information and
+send a network creation message to ``POD1``. The network creation message
+includes ``network_type``, ``segment`` and ``physical_network``.
+
+Then the user creates VM2 in AZ2, and connects it to network ``Net1``. If VM2
+will be booted in ``POD2``, ``Nova API-GW`` needs to get the network
+information and send a network creation message to ``POD2``. The network
+creation message includes ``network_type``, ``segment`` and
+``physical_network``.
+
+**Shared VxLAN Implementation**
+
+A user creates network ``Net1``. The Tricircle plugin checks the
+configuration, and if ``tenant_network_type`` equals ``shared_vxlan``, it will
+invoke the Shared VxLAN driver. The Shared VxLAN driver will allocate
+``segment``, assign ``network_type`` as VxLAN, and update the network with
+``segment`` and ``network_type`` in the DB.
+
+A user creates VM1 in AZ1, and connects it to network ``Net1``. If VM1 will be
+booted in ``POD1``, ``Nova API-GW`` needs to get the network information and
+send a network creation message to ``POD1``. The network creation message
+includes ``network_type`` and ``segment``.
+
+``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
+information returned by ``POD1``.
+
+Then the user creates VM2 in AZ2, and connects it to network ``Net1``. If VM2
+will be booted in ``POD2``, ``Nova API-GW`` needs to get the network
+information and send a network creation message to ``POD2``. The network
+creation message includes ``network_type`` and ``segment``.
+
+``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
+information returned by ``POD2``.
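The create-then-record flow described above for shared segment networks can be sketched with in-memory stand-ins. This is an illustration only: ``FakePod``, ``plug_vm`` and the dictionary keys are hypothetical, and no real Nova or Neutron API is called.

```python
class FakePod:
    """In-memory stand-in for a bottom OpenStack instance (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.networks = {}

    def create_network(self, name, network_type, segment):
        self.networks[name] = {"network_type": network_type,
                               "segment": segment}
        return self.networks[name]


def plug_vm(top_net, pod):
    """Create the bottom network on first use and record it as a segment."""
    if top_net["name"] in pod.networks:
        return top_net  # the network already exists in this pod
    # The bottom network reuses the type and segment id held at the top.
    bottom = pod.create_network(top_net["name"], top_net["network_type"],
                                top_net["segment"])
    # Record the bottom network as one segment of the top network.
    top_net["segments"].append(dict(bottom, pod=pod.name))
    return top_net


if __name__ == "__main__":
    net1 = {"name": "Net1", "network_type": "vxlan", "segment": 5001,
            "segments": []}
    plug_vm(net1, FakePod("POD1"))
    plug_vm(net1, FakePod("POD2"))
    print([s["pod"] for s in net1["segments"]])
```

Booting a second VM in a pod that already has the network leaves the segment list unchanged, which matches the idea that the bottom network is only created the first time it is needed.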
+
+The Tricircle plugin detects that the network includes more than one segment,
+and calls the L2GW driver to start an async job for cross pod networking for
+``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and L2GW2 in ``POD2``.
+In ``POD1``, L2GW1 will connect the local ``Net1`` and create an L2GW remote
+connection to L2GW2, then populate the remote MAC/IP information which
+resides in ``POD2`` into L2GW1. In ``POD2``, L2GW2 will connect the local
+``Net1`` and create an L2GW remote connection to L2GW1, then populate the
+remote MAC/IP information which resides in ``POD1`` into L2GW2.
+
+The L2GW driver in the Tricircle will also detect new port creation/deletion
+API requests. If a port (MAC/IP) is created or deleted in ``POD1`` or
+``POD2``, the MAC/IP information in the L2GW of the other pod needs to be
+refreshed (L2GW2 for a change in ``POD1``, L2GW1 for a change in ``POD2``).
+
+Whether to populate the port (MAC/IP) information should be configurable
+according to L2GW capability. MAC/IP information should only be populated for
+the ports that do not reside in the same pod.
+
+**Mixed VLAN/VxLAN**
+
+To achieve cross pod L2 networking, L2GW will be used to connect the L2
+networks in different pods. Using L2GW should work for both the Shared VxLAN
+and the Mixed VLAN/VxLAN scenarios.
+
+When an L2GW is connected to a local network in the same OpenStack instance,
+no matter whether it is VLAN, VxLAN or GRE, the L2GW should be able to connect
+the local network. Because L2GW is an extension of Neutron, the network UUID
+alone should be enough for the L2GW to connect the local network.
+
+When an admin user creates a network in the Tricircle, he/she specifies the
+network type as one of the network types discussed above. In the phase of
+creating the network in the Tricircle, only one record is saved in the
+database, and no network is created in the bottom OpenStack.
+
+After the network in the bottom pod is created successfully, the network
+information such as segment id, network name and network type needs to be
+retrieved, and this network in the bottom pod is added as one of the segments
+of the network in the Tricircle.
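The rule stated for the L2GW driver, that each gateway connects to the gateways in the other pods and is populated only with MAC/IP entries of ports that do not reside in its own pod, can be sketched as a planning function. This is a sketch under assumptions: ``plan_l2gw`` and the shape of its input and output are hypothetical, and it does not call the networking-l2gw API.

```python
def plan_l2gw(ports_by_pod):
    """Plan remote connections and remote MAC/IP entries for each pod.

    ports_by_pod maps a pod name to a list of (mac, ip) tuples for the
    ports of one cross pod network that reside in that pod.
    """
    if len(ports_by_pod) < 2:
        # A network with a single segment needs no L2GW at all.
        return {}
    plan = {}
    for pod in ports_by_pod:
        remote_pods = [p for p in ports_by_pod if p != pod]
        plan[pod] = {
            # One remote connection per gateway in another pod.
            "remote_gateways": remote_pods,
            # Only ports residing in other pods are populated locally.
            "remote_ports": [port for p in remote_pods
                             for port in ports_by_pod[p]],
        }
    return plan


if __name__ == "__main__":
    ports = {
        "POD1": [("fa:16:3e:00:00:01", "10.0.0.3")],
        "POD2": [("fa:16:3e:00:00:02", "10.0.0.4")],
    }
    for pod, entry in plan_l2gw(ports).items():
        print(pod, entry)
```

Recomputing the plan after a port is created or deleted in one pod changes only the entries of the other pods, which corresponds to refreshing the remote gateway rather than the local one.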
+
+In the Tricircle, a network can be created by a tenant or by an admin. A
+tenant has no way to specify the network type and segment id, so the default
+network type will be used instead. When a user uses the network to boot a VM,
+``Nova API-GW`` checks the network type. For a Mixed VLAN/VxLAN network,
+``Nova API-GW`` first creates the network in the bottom OpenStack without
+specifying network type and segment ID, then updates the top network with the
+bottom network segmentation information returned by the bottom OpenStack.
+
+A user creates network ``Net1``. The plugin checks the configuration, and if
+``tenant_network_type`` equals ``mixed_vlan_vxlan``, it will invoke the mixed
+VLAN and VxLAN driver. The driver needs to do nothing, since the segment is
+allocated in the bottom pod.
+
+A user creates VM1 in AZ1, and connects it to the network ``Net1``. The VM is
+booted in bottom ``POD1``, and ``Nova API-GW`` creates the network in
+``POD1``, queries the detailed network segmentation information (using the
+admin role) to get the network type and segment id, and then adds this new
+segment to ``Net1`` in the Tricircle ``Neutron API Server``.
+
+Then the user creates another VM2 with AZ info AZ2, so the VM should be
+booted in bottom ``POD2``, which is located in AZ2. When VM2 is booted in AZ2,
+``Nova API-GW`` also creates a network in ``POD2``, queries the network
+information including segment and network type, and adds this new segment to
+``Net1`` in the Tricircle ``Neutron API Server``.
+
+The Tricircle plugin detects that ``Net1`` includes more than one network
+segment, and calls the L2GW driver to start an async job for cross pod
+networking for ``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and
+L2GW2 in ``POD2``. In ``POD1``, L2GW1 will connect the local ``Net1`` and
+create an L2GW remote connection to L2GW2, then populate the remote MAC/IP
+information which resides in ``POD2`` into L2GW1.
+In ``POD2``, L2GW2 will connect the local ``Net1`` and create an L2GW remote
+connection to L2GW1, then populate the remote MAC/IP information which
+resides in ``POD1`` into L2GW2.
+
+The L2GW driver in the Tricircle will also detect new port creation/deletion
+API calls. If a port (MAC/IP) is created or deleted in ``POD1``, the L2GW2
+MAC/IP information needs to be refreshed. If a port (MAC/IP) is created or
+deleted in ``POD2``, the L2GW1 MAC/IP information needs to be refreshed.
+
+Whether to populate MAC/IP information should be configurable according to
+L2GW capability. MAC/IP information should only be populated for the ports
+that do not reside in the same pod.
+
+**L3 bridge network**
+
+Current implementation without cross pod L2 networking.
+
+* A special bridge network is created and connected to the routers in
+  different bottom OpenStack instances. We configure the extra routes of the
+  routers to route the packets from one OpenStack to another. In the current
+  implementation, we create this special bridge network in each bottom
+  OpenStack with the same ``VLAN ID``, so we have an L2 network to connect
+  the routers.
+
+Difference between L2 networking for tenant's VM and for L3 bridging network.
+
+* The creation of the bridge network is triggered when attaching a router
+  interface or adding a router external gateway.
+
+* The L2 network for a VM is triggered by ``Nova API-GW``: when a VM is to be
+  created in a pod and ``Nova API-GW`` finds that there is no network there,
+  the network will be created before the VM is booted. A network or port
+  parameter is required to boot a VM. The IP/MAC for the VM is allocated in
+  the ``Tricircle`` top layer, to avoid the IP/MAC collisions that could occur
+  if they were allocated separately in bottom pods.
+
+After cross pod L2 networking is introduced, the L3 bridge network should
+be updated too.
+
+L3 bridge network N-S (North-South):
+
+* For each tenant, one cross pod N-S bridge network should be created for
+  router N-S inter-connection. Just replace the current shared VLAN N-S bridge
+  network with the corresponding Shared VxLAN or Mixed VLAN/VxLAN network.
+
+L3 bridge network E-W (East-West):
+
+* When a router interface is attached, for Shared VLAN, the current process
+  to establish the E-W bridge network is kept. For Shared VxLAN and Mixed
+  VLAN/VxLAN, if an L2 network is able to expand to the current pod, then just
+  expand the L2 network to the pod. All E-W traffic will go out from the
+  local L2 network, so no bridge network is needed.
+
+* For example, with (Net1, Router1) in ``Pod1`` and (Net2, Router1) in
+  ``Pod2``, if ``Net1`` is a cross pod L2 network and can be expanded to
+  ``Pod2``, then just expand ``Net1`` to ``Pod2``. After the ``Net1``
+  expansion (just like cross pod L2 networking spreading one network into
+  multiple pods), the layout is (Net1, Router1) in ``Pod1`` and
+  (Net1, Net2, Router1) in ``Pod2``. In ``Pod2``, there is no VM in ``Net1``;
+  it is only used for E-W traffic. Now the E-W traffic will look like this:
+
+from Net2 to Net1:
+
+Net2 in Pod2 -> Router1 in Pod2 -> Net1 in Pod2 -> L2GW in Pod2 ---> L2GW in
+Pod1 -> Net1 in Pod1.
+
+Note: The traffic from ``Net1`` in ``Pod2`` to ``Net1`` in ``Pod1`` can bypass
+the L2GW in ``Pod2``. That is, outbound traffic can bypass the local L2GW if
+the remote VTEP of the L2GW is known to the local compute node and the VxLAN
+encapsulated packet from the local compute node can be routed to the remote
+L2GW directly. It's up to the L2GW implementation. With the inbound traffic
+going through the L2GW, the inbound traffic to the VM will not be impacted by
+VM migration from one host to another.
+
+If ``Net2`` is a cross pod L2 network and can be expanded to ``Pod1`` too,
+then just expand ``Net2`` to ``Pod1``. After the ``Net2`` expansion (just like
+cross pod L2 networking spreading one network into multiple pods), the layout
+is (Net2, Net1, Router1) in ``Pod1`` and (Net1, Net2, Router1) in ``Pod2``.
+In ``Pod1``, there is no VM in ``Net2``; it is only used for E-W traffic. Now
+the E-W traffic will look like this:
+
+from ``Net1`` to ``Net2``:
+
+Net1 in Pod1 -> Router1 in Pod1 -> Net2 in Pod1 -> L2GW in Pod1 ---> L2GW in
+Pod2 -> Net2 in Pod2.
+
+To limit the complexity, a network's az_hint can only be specified at
+creation time and cannot be updated. If az_hint needs to be changed, you have
+to delete the network and create it again.
+
+If the network can't be expanded, then an E-W bridge network is needed. For
+example, Net1(AZ1, AZ2, AZ3), Router1; Net2(AZ4, AZ5, AZ6), Router1.
+Then a cross pod L2 bridge network has to be established:
+
+Net1(AZ1, AZ2, AZ3), Router1 --> E-W bridge network ---> Router1,
+Net2(AZ4, AZ5, AZ6).
+
+Assignee(s)
+------------
+
+Primary assignee:
+
+
+Other contributors:
+
+
+Work Items
+------------
+
+Dependencies
+============
+
+None
+
+
+Testing
+=======
+
+None
+
+
+Documentation Impact
+====================
+
+None
+
+
+References
+==========
+[1] https://docs.google.com/document/d/18kZZ1snMOCD9IQvUKI5NVDzSASpw-QKj7l2zNqMEd3g/
+
+[2] https://review.openstack.org/#/c/270786/
+
+[3] https://github.com/openstack/networking-l2gw/blob/master/specs/kilo/l2-gateway-api.rst
+
+[4] http://developer.openstack.org/api-ref-networking-v2-ext.html#networks-multi-provider-ext
+
+[5] http://docs.openstack.org/mitaka/networking-guide/adv-config-availability-zone.html
+
+[6] https://review.openstack.org/#/c/306224/
\ No newline at end of file