tricircle/specs/cross-pod-l2-networking.rst
XiongQiu 8a9a07db86 Add cross-pod L2 Networking spec file
1. What is the problem?
In the Tricircle, the cross pod L2 networking automation is
established after the VM is plugged into. The simplest way to
stretch one L2 network across multiple OpenStack instances is to
use a same VLAN network, but there is a lot of limitation: the
number of VLAN segment is limited, the VLAN network itself is not
good to spread across multiple sites, although you can use some
gateways to do so. But there are so many tenants in the cloud, and
new tenants could be added into the cloud dynamically, fixed physical
network configuration for dynamic tenant networking is hard to manage.

2. What is the solution to the problem?
To deal with the above problem, flexible tenant level L2 networking
automation across multiple OpenStack instances in one site or in
multiple sites is needed in the Tricirlce.

3. What the features need to be implemented to the Tricircle to
realize the solution?
To implement the features that networking automation supports more
than one bottom pod in AZ or multiple AZs for different use cases
in the Tricircle

Blueprint: https://blueprints.launchpad.net/tricircle/+spec/cross-site-connectivity
Change-Id: I616048c13d03f48aa16d9ff48572b0d5a49d6fb4
2016-06-16 22:05:54 +08:00

24 KiB
Raw Blame History

Cross pod L2 networking in Tricircle

Background

The Tricircle provides unified OpenStack API gateway and networking automation functionality. Those main functionalities allow cloud operators to manage multiple OpenStack instances which are running in one site or multiple sites as a single OpenStack cloud.

Each bottom OpenStack instance which is managed by the Tricircle is also called a pod.

The Tricircle has the following components:

  • Nova API-GW
  • Cinder API-GW
  • Neutron API Server with Neutron Tricircle plugin
  • Admin API
  • XJob
  • DB

Nova API-GW provides the functionality to trigger automatic networking creation when new VMs are being provisioned. Neutron Tricircle plug-in is the functionality to create cross OpenStack L2/L3 networking for new VMs. After the binding of tenant-id and pod finished in the Tricircle, Cinder API-GW and Nova API-GW will pass the cinder api or nova api request to appropriate bottom OpenStack instance.

Please refer to the Tricircle design blueprint[1], especially from '7. Stateless Architecture Proposal' for the detail description of each components.

Problem Description

When a user wants to create a network in Neutron API Server, the user can specify the 'availability_zone_hints'(AZ or az will be used for short for availability zone) during network creation[5], in the Tricircle, the 'az_hints' means which AZ the network should be spreaded into. The 'az_hints' meaning in Tricircle is a little different from the 'az_hints' meaning in Neutron[5]. If no 'az_hints' was specified during network creation, this created network will be spread into any AZ. If there is a list of 'az_hints' during the network creation, that means the network should be able to be spread into these AZs which are suggested by a list of 'az_hints'.

When a user creates VM or Volume, there is also one parameter called availability zone. The AZ parameter is used for Volume and VM co-location, so that the Volume and VM will be created into same bottom OpenStack instance.

When a VM is being attached to a network, the Tricircle will check whether a VM's AZ is inside in the network's AZs scope. If a VM is not in the network's AZs scope, the VM creation will be rejected.

Currently, the Tricircle only supports one pod in one AZ. And only supports a network associated with one AZ. That means currently a tenant's network will be presented only in one bottom OpenStack instance, that also means all VMs connected to the network will be located at one bottom OpenStack instance. If there are more than one pod in one AZ, refer to the dynamic pod binding[6].

There are lots of use cases where a tenant needs a network being able to be spread out into multiple bottom OpenStack instances in one AZ or multiple AZs.

  • Capacity expansion: tenants add VMs more and more, the capacity of one OpenStack may not be enough, then a new OpenStack instance has to be added to the cloud. But the tenant still wants to add new VMs into same network.
  • Cross OpenStack network service chaining. Service chaining is based on the port-pairs. Leveraging the cross pod L2 networking capability which is provided by the Tricircle, the chaining could also be done by across sites. For example, vRouter1 in pod1, but vRouter2 in pod2, these two VMs could be chained.
  • Applications are often required to run in different availability zones to achieve high availability. Application needs to be designed as Active-Standby/Active-Active/N-Way to achieve high availability, and some components inside one application are designed to work as distributed cluster, this design typically leads to state replication or heart beat among application components (directly or via replicated database services, or via private designed message format). When this kind of applications are distributedly deployed into multiple OpenStack instances, cross OpenStack L2 networking is needed to support heart beat or state replication.
  • When a tenant's VMs are provisioned in different OpenStack instances, there is E-W (East-West) traffic for these VMs, the E-W traffic should be only visible to the tenant, and isolation is needed. If the traffic goes through N-S (North-South) via tenant level VPN, overhead is too much, and the orchestration for multiple site to site VPN connection is also complicated. Therefore cross OpenStack L2 networking to bridge the tenant's routers in different OpenStack instances can provide more light weight isolation.
  • In hybrid cloud, there is cross L2 networking requirement between the private OpenStack and the public OpenStack. Cross pod L2 networking will help the VMs migration in this case and it's not necessary to change the IP/MAC/Security Group configuration during VM migration.

The spec[5] is to explain how one AZ can support more than one pod, and how to schedule a proper pod during VM or Volume creation.

And this spec is to deal with the cross OpenStack L2 networking automation in the Tricircle.

The simplest way to spread out L2 networking to multiple OpenStack instances is to use same VLAN. But there is a lot of limitations: (1) A number of VLAN segment is limited, (2) the VLAN network itself is not good to spread out multiple sites, although you can use some gateways to do the same thing.

So flexible tenant level L2 networking across multiple OpenStack instances in one site or in multiple sites is needed.

Proposed Change

Cross pod L2 networking can be divided into three categories, Shared VLAN, Shared VxLAN and Mixed VLAN/VxLAN.

  • Shared VLAN

    Network in each bottom OpenStack is VLAN type and has the same VLAN ID. If we want shared VLAN L2 networking to work in multi-site scenario, i.e., Multiple OpenStack instances in multiple sites, physical gateway needs to be manually configured to make one VLAN networking be extended to other sites.

    Manual setup physical gateway is out of the scope of this spec

  • Shared VxLAN

    Network in each bottom OpenStack instance is VxLAN type and has the same VxLAN ID.

    Leverage L2GW[2][3] to implement this type of L2 networking.

  • Mixed VLAN/VxLAN

    Network in each bottom OpenStack instance may have different types and/or have different segment IDs.

    Leverage L2GW[2][3] to implement this type of L2 networking.

There is another network type called “Local Network”. For “Local Network”, the network will be only presented in one bottom OpenStack instance. And the network won't be presented in different bottom OpenStack instances. If a VM in another pod tries to attach to the “Local Network”, it should be failed. This use case is quite useful for the scenario in which cross pod L2 networking is not required, and one AZ will not include more than bottom OpenStack instance.

Cross pod L2 networking will be able to be established dynamically during tenant's VM is being provisioned.

There is assumption here that only one type of L2 networking will work in one cloud deployment.

A Cross Pod L2 Networking Creation

A cross pod L2 networking creation will be able to be done with the az_hint attribute of the network. If az_hint includes one AZ or more AZs, the network will be presented only in this AZ or these AZs, if no AZ in az_hint, it means that the network can be extended to any bottom OpenStack.

There is a special use case for external network creation. For external network creation, you need to specify the pod_id but not AZ in the az_hint so that the external network will be only created in one specified pod per AZ.

Support of External network in multiple OpenStack instances in one AZ is out of scope of this spec.

Pluggable L2 networking framework is proposed to deal with three types of L2 cross pod networking, and it should be compatible with the Local Network.

  1. Type Driver under Tricircle Plugin in Neutron API server
  • Type driver to distinguish different type of cross pod L2 networking. So the Tricircle plugin need to load type driver according to the configuration. The Tricircle can reuse the type driver of ML2 with update.
  • Type driver to allocate VLAN segment id for shared VLAN L2 networking.
  • Type driver to allocate VxLAN segment id for shared VxLAN L2 networking.
  • Type driver for mixed VLAN/VxLAN to allocate VxLAN segment id for the network connecting L2GWs[2][3].
  • Type driver for Local Network only updating network_type for the network to the Tricircle Neutron DB.

When a network creation request is received in Neutron API Server in the Tricircle, the type driver will be called based on the configured network type.

  1. Nova API-GW to trigger the bottom networking automation

Nova API-GW can be aware of when a new VM is provisioned if boot VM api request is received, therefore Nova API-GW is responsible for the network creation in the bottom OpenStack instances.

Nova API-GW needs to get the network type from Neutron API server in the Tricircle, and deal with the networking automation based on the network type:

  • Shared VLAN Nova API-GW creates network in bottom OpenStack instance in which the VM will run with the VLAN segment id, network name and type that are retrieved from the Neutron API server in the Tricircle.
  • Shared VxLAN Nova API-GW creates network in bottom OpenStack instance in which the VM will run with the VxLAN segment id, network name and type which are retrieved from Tricricle Neutron API server. After the network in the bottom OpenStack instance is created successfully, Nova API-GW needs to make this network in the bottom OpenStack instance as one of the segments in the network in the Tricircle.
  • Mixed VLAN/VxLAN Nova API-GW creates network in different bottom OpenStack instance in which the VM will run with the VLAN or VxLAN segment id respectively, network name and type which are retrieved from Tricricle Neutron API server. After the network in the bottom OpenStack instances is created successfully, Nova API-GW needs to update network in the Tricircle with the segmentation information of bottom netwoks.
  1. L2GW driver under Tricircle Plugin in Neutron API server

Tricircle plugin needs to support multi-segment network extension[4].

For Shared VxLAN or Mixed VLAN/VxLAN L2 network type, L2GW driver will utilize the multi-segment network extension in Neutron API server to build the L2 network in the Tricircle. Each network in the bottom OpenStack instance will be a segment for the whole cross pod L2 networking in the Tricircle.

After the network in the bottom OpenStack instance was created successfully, Nova API-GW will call Neutron server API to update the network in the Tricircle with a new segment from the network in the bottom OpenStack instance.

If the network in the bottom OpenStack instance was removed successfully, Nova API-GW will call Neutron server api to remove the segment in the bottom OpenStack instance from network in the Tricircle.

When L2GW driver under Tricircle plugin in Neutron API server receives the segment update request, L2GW driver will start async job to orchestrate L2GW API for L2 networking automation[2][3].

Data model impact

In database, we are considering setting physical_network in top OpenStack instance as bottom_physical_network#bottom_pod_id to distinguish segmentation information in different bottom OpenStack instance.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Local Network Implementation

For Local Network, L2GW is not required. In this scenario, no cross pod L2/L3 networking is required.

A user creates network Net1 with single AZ1 in az_hint, the Tricircle plugin checks the configuration, if tenant_network_type equals local_network, it will invoke Local Network type driver. Local Network driver under the Tricircle plugin will update network_type in database.

For exmaple, a user creates VM1 in AZ1 which has only one pod POD1, and connects it to network Net1. Nova API-GW will send network creation request to POD1 and the VM will be booted in AZ1 (There should be only one pod in AZ1).

If a user wants to create VM2 in AZ2 or POD2 in AZ1, and connect it to network Net1 in the Tricircle, it would be failed. Because the Net1 is local_network type network and it is limited to present in POD1 in AZ1 only.

Shared VLAN Implementation

For Shared VLAN, L2GW is not required. This is the most simplest cross pod L2 networking for limited scenario. For example, with a small number of networks, all VLANs are extended through physical gateway to support cross site VLAN networking, or all pods under same core switch with same visible VLAN ranges that supported by the core switch are connected by the core switch.

when a user creates network called Net1, the Tricircle plugin checks the configuration. If tenant_network_type equals shared_vlan, the Tricircle will invoke Shared VLAN type driver. Shared VLAN driver will create segment, and assign network_type with VLAN, update segment and network_type and physical_network with DB

A user creates VM1 in AZ1, and connects it to network Net1. If VM1 will be booted in POD1, Nova API-GW needs to get the network information and send network creation message to POD1. Network creation message includes network_type and segment and physical_network.

Then the user creates VM2 in AZ2, and connects it to network Net1. If VM will be booted in POD2, Nova API-GW needs to get the network information and send create network message to POD2. Create network message includes network_type and segment and physical_network.

Shared VxLAN Implementation

A user creates network Net1, the Tricircle plugin checks the configuration, if tenant_network_type equals shared_vxlan, it will invoke shared VxLAN driver. Shared VxLAN driver will allocate segment, and assign network_type with VxLAN, and update network with segment and network_type with DB

A user creates VM1 in AZ1, and connects it to network Net1. If VM1 will be booted in POD1, Nova API-GW needs to get the network information and send create network message to POD1, create network message includes network_type and segment.

Nova API-GW should update Net1 in Tricircle with the segment information got by POD1.

Then the user creates VM2 in AZ2, and connects it to network Net1. If VM2 will be booted in POD2, Nova API-GW needs to get the network information and send network creation massage to POD2, network creation message includes network_type and segment.

Nova API-GW should update Net1 in the Tricircle with the segment information get by POD2.

The Tricircle plugin detects that the network includes more than one segment network, calls L2GW driver to start async job for cross pod networking for Net1. The L2GW driver will create L2GW1 in POD1 and L2GW2 in POD2. In POD1, L2GW1 will connect the local Net1 and create L2GW remote connection to L2GW2, then populate the information of MAC/IP which resides in L2GW1. In POD2, L2GW2 will connect the local Net1 and create L2GW remote connection to L2GW1, then populate remote MAC/IP information which resides in POD1 in L2GW2.

L2GW driver in the Tricircle will also detect the new port creation/deletion API request. If port (MAC/IP) created or deleted in POD1 or POD2, it needs to refresh the L2GW2 MAC/IP information.

Whether to populate the information of port (MAC/IP) should be configurable according to L2GW capability. And only populate MAC/IP information for the ports that are not resides in the same pod.

Mixed VLAN/VxLAN

To achieve cross pod L2 networking, L2GW will be used to connect L2 network in different pods, using L2GW should work for Shared VxLAN and Mixed VLAN/VxLAN scenario.

When L2GW connected with local network in the same OpenStack instance, no matter it's VLAN or VxLAN or GRE, the L2GW should be able to connect the local network, and because L2GW is extension of Neutron, only network UUID should be enough for L2GW to connect the local network.

When admin user creates network in Tricircle, he/she specifies the network type as one of the network type as discussed above. In the phase of creating network in Tricircle, only one record is saved in the database, no network will be created in bottom OpenStack.

After the network in the bottom created successfully, need to retrieve the network information like segment id, network name and network type, and make this network in the bottom pod as one of the segments in the network in Tricircle.

In the Tricircle, network could be created by tenant or admin. For tenant, no way to specify the network type and segment id, then default network type will be used instead. When user uses the network to boot a VM, Nova API-GW checks the network type. For Mixed VLAN/VxLAN network, Nova API-GW first creates network in bottom OpenStack without specifying network type and segment ID, then updates the top network with bottom network segmentation information returned by bottom OpenStack.

A user creates network Net1, plugin checks the configuration, if tenant_network_type equals mixed_vlan_vxlan, it will invoke mixed VLAN and VxLAN driver. The driver needs to do nothing since segment is allocated in bottom.

A user creates VM1 in AZ1, and connects it to the network Net1, the VM is booted in bottom POD1, and Nova API-GW creates network in POD1 and queries the network detail segmentation information (using admin role), and gets network type, segment id, then updates this new segment to the Net1 in Tricircle Neutron API Server.

Then the user creates another VM2, and with AZ info AZ2, then the VM should be able to be booted in bottom POD2 which is located in AZ2. And when VM2 should be able to be booted in AZ2, Nova API-GW also creates a network in POD2, and queries the network information including segment and network type, updates this new segment to the Net1 in Tricircle Neutron API Server.

The Tricircle plugin detects that the Net1 includes more than one network segments, calls L2GW driver to start async job for cross pod networking for Net1. The L2GW driver will create L2GW1 in POD1 and L2GW2 in POD2. In POD1, L2GW1 will connect the local Net1 and create L2GW remote connection to L2GW2, then populate information of MAC/IP which resides in POD2 in L2GW1. In POD2, L2GW2 will connect the local Net1 and create L2GW remote connection to L2GW1, then populate remote MAC/IP information which resides in POD1 in L2GW2.

L2GW driver in Tricircle will also detect the new port creation/deletion api calling, if port (MAC/IP) created or deleted in POD1, then needs to refresh the L2GW2 MAC/IP information. If port (MAC/IP) created or deleted in POD2, then needs to refresh the L2GW1 MAC/IP information,

Whether to populate MAC/IP information should be configurable according to L2GW capability. And only populate MAC/IP information for the ports that are not resides in the same pod.

L3 bridge network

Current implementation without cross pod L2 networking.

  • A special bridge network is created and connected to the routers in different bottom OpenStack instances. We configure the extra routes of the routers to route the packets from one OpenStack to another. In current implementation, we create this special bridge network in each bottom OpenStack with the same VLAN ID, so we have an L2 network to connect the routers.

Difference between L2 networking for tenant's VM and for L3 bridging network.

  • The creation of bridge network is triggered during attaching router interface and adding router external gateway.
  • The L2 network for VM is triggered by Nova API-GW when a VM is to be created in one pod, and finds that there is no network, then the network will be created before the VM is booted, network or port parameter is required to boot VM. The IP/Mac for VM is allocated in the Tricircle, top layer to avoid IP/mac collision if they are allocated separately in bottom pods.

After cross pod L2 networking is introduced, the L3 bridge network should be updated too.

L3 bridge network N-S (North-South):

  • For each tenant, one cross pod N-S bridge network should be created for router N-S inter-connection. Just replace the current shared VLAN N-S bridge network to corresponding Shared VxLAN or Mixed VLAN/VxLAN.

L3 bridge network E-W (East-West):

  • When attaching router interface happened, for Shared VLAN, it will keep current process to establish E-W bridge network. For Shared VxLAN and Mixed VLAN/VxLAN, if a L2 network is able to expand to the current pod, then just expand the L2 network to the pod, all E-W traffic will go out from local L2 network, then no bridge network is needed.
  • For example, (Net1, Router1) in Pod1, (Net2, Router1) in Pod2, if Net1 is a cross pod L2 network, and can be expanded to Pod2, then will just expand Net1 to Pod2. After the Net1 expansion ( just like cross pod L2 networking to spread one network in multiple pods ), itll look like (Net1, Router1) in Pod1, (Net1, Net2, Router1) in Pod2, In Pod2, no VM in Net1, only for E-W traffic. Now the E-W traffic will look like this:

from Net2 to Net1:

Net2 in Pod2 -> Router1 in Pod2 -> Net1 in Pod2 -> L2GW in Pod2 ---> L2GW in Pod1 -> Net1 in Pod1.

Note: The traffic for Net1 in Pod2 to Net1 in Pod1 can bypass the L2GW in Pod2, that means outbound traffic can bypass the local L2GW if the remote VTEP of L2GW is known to the local compute node and the packet from the local compute node with VxLAN encapsulation cloud be routed to remote L2GW directly. It's up to the L2GW implementation. With the inbound traffic through L2GW, the inbound traffic to the VM will not be impacted by the VM migration from one host to another.

If Net2 is a cross pod L2 network, and can be expanded to Pod1 too, then will just expand Net2 to Pod1. After the Net2 expansion(just like cross pod L2 networking to spread one network in multiple pods ), itll look like (Net2, Net1, Router1) in Pod1, (Net1, Net2, Router1) in Pod2, In Pod1, no VM in Net2, only for E-W traffic. Now the E-W traffic will look like this: from Net1 to Net2:

Net1 in Pod1 -> Router1 in Pod1 -> Net2 in Pod1 -> L2GW in Pod1 ---> L2GW in Pod2 -> Net2 in Pod2.

To limit the complexity, one networks az_hint can only be specified when creating, and no update is allowed, if az_hint need to be updated, you have to delete the network and create again.

If the network cant be expanded, then E-W bridge network is needed. For example, Net1(AZ1, AZ2,AZ3), Router1; Net2(AZ4, AZ5, AZ6), Router1. Then a cross pod L2 bridge network has to be established:

Net1(AZ1, AZ2, AZ3), Router1 --> E-W bridge network ---> Router1, Net2(AZ4, AZ5, AZ6).

Assignee(s)

Primary assignee:

Other contributors:

Work Items

Dependencies

None

Testing

None

Documentation Impact

None

References

[1] https://docs.google.com/document/d/18kZZ1snMOCD9IQvUKI5NVDzSASpw-QKj7l2zNqMEd3g/

[2] https://review.openstack.org/#/c/270786/

[3] https://github.com/openstack/networking-l2gw/blob/master/specs/kilo/l2-gateway-api.rst

[4] http://developer.openstack.org/api-ref-networking-v2-ext.html#networks-multi-provider-ext

[5] http://docs.openstack.org/mitaka/networking-guide/adv-config-availability-zone.html

[6] https://review.openstack.org/#/c/306224/