
Merge "Floating IP for routed networks: network:routed API"

Zuul 5 months ago
committed by Gerrit Code Review
9 changed files with 551 additions and 6 deletions

+ 464
- 0
doc/source/admin/config-bgp-floating-ip-over-l2-segmented-network.rst

@ -0,0 +1,464 @@
.. _config-bgp-floating-ip-over-l2-segmented-network:
BGP floating IPs over L2 segmented network
The general principle is that L2 connectivity will be bound to a single rack.
Everything outside the switches of the rack will be routed using BGP. To
perform the BGP announcement, neutron-dynamic-routing is leveraged.
To achieve this, on each rack, servers are set up with a different management
network using a VLAN ID per rack (light green and orange network below).
Note that a unique VLAN ID per rack isn't mandatory; it's also possible to
use the same VLAN ID on all racks. The point here is only to isolate L2
segments (typically, routing between the switches of each rack will be done
over BGP, without L2 connectivity).
.. image:: figures/bgp-floating-ip-over-l2-segmented-network.png
On the OpenStack side, a provider network must be set up, which uses a
different subnet range and VLAN ID for each rack. This includes:
* an address scope
* some network segments for that network, which are attached to a named
physical network
* a subnet pool using that address scope
* one provider network subnet per segment (each subnet+segment pair matches
one rack physical network name)
A segment is attached to a specific VLAN and physical network name. In the
above figure, the provider network is represented by 2 subnets: the dark green
and the red ones. The dark green subnet is on one network segment, and the red
one on another. Both subnets are of the subnet service type
"network:floatingip_agent_gateway", so that they cannot be used by virtual
machines directly.
On top of all of this, a floating IP subnet without a segment is added, which
spans all of the racks. This subnet must have the following service types:
* network:routed
* network:floatingip
* network:router_gateway
Since the network:routed subnet isn't bound to a segment, it can be used on all
racks. As the service types network:floatingip and network:router_gateway are
set on this subnet, it can only be used for floating IPs and router gateways,
meaning that the subnets using segments will be used as floating IP gateways
(i.e. the next hop to reach these floating IPs / router external gateways).
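The effect of these service types can be pictured with a small Python sketch. It assumes the usual Neutron rule that a port only draws addresses from subnets whose service types include the port's device owner, falling back to subnets that declare no service type. The subnet names rack1-gateways and rack2-gateways are made up for illustration, and real Neutron has further special cases that are not modelled here.

```python
# Simplified model of subnet service types (not Neutron's actual code).
# Keys are subnet names; values are the service types set on each subnet.
SUBNETS = {
    "rack1-gateways": {"network:floatingip_agent_gateway"},  # hypothetical name
    "rack2-gateways": {"network:floatingip_agent_gateway"},  # hypothetical name
    "vm-fip": {
        "network:routed",
        "network:floatingip",
        "network:router_gateway",
    },
}


def eligible_subnets(subnets, device_owner):
    """Return the subnets a port with this device_owner could draw an IP from."""
    matching = [name for name, types in subnets.items() if device_owner in types]
    if matching:
        return matching
    # Assumed fallback: subnets that declare no service type at all.
    return [name for name, types in subnets.items() if not types]


for owner in ("network:floatingip", "network:floatingip_agent_gateway",
              "compute:nova"):
    print(owner, "->", eligible_subnets(SUBNETS, owner))
```

In this model a VM port (device owner compute:nova) finds no eligible subnet on the provider network, which matches the intent that these subnets cannot be used by virtual machines directly.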
Configuring the Neutron API side
On the controller side (i.e. API and RPC server), only the Neutron Dynamic
Routing Python library must be installed (for example, in the Debian case,
that would be the neutron-dynamic-routing-common and
python3-neutron-dynamic-routing packages). On top of that, "segments" and
"bgp" must be added to the list of plugins in service_plugins. For example,
in neutron.conf:
.. code-block:: ini
The BGP agent
The neutron-bgp-agent must be installed. It is best to install it twice per
rack, on any machine (it doesn't matter much where). Then each of these BGP agents
will establish a session with one switch, and advertise all of the BGP
Setting up BGP peering with the switches
A peer that represents the network equipment must be created. Then a matching
BGP speaker needs to be created. Then, the BGP speaker must be
associated to a dynamic-routing-agent (in our example, the dynamic-routing
agents run on compute 1 and 4). Finally, the peer is added to the BGP speaker,
so the speaker initiates a BGP session to the network equipment.
.. code-block:: console
$ # Create a BGP peer to represent the switch 1,
$ # which runs FRR with AS 64601
$ openstack bgp peer create \
--peer-ip \
--remote-as 64601 \
$ # Create a BGP speaker on compute-1
$ BGP_SPEAKER_ID_COMPUTE_1=$(openstack bgp speaker create \
--local-as 64999 --ip-version 4 \
--format value -c id)
$ # Get the agent ID of the dragent running on compute 1
$ BGP_AGENT_ID_COMPUTE_1=$(openstack network agent list \
--host --agent-type bgp \
--format value -c ID)
$ # Add the BGP speaker to the dragent of compute 1
$ openstack bgp dragent add speaker \
$ # Add the BGP peer to the speaker of compute 1
$ openstack bgp speaker add peer \ rack1-switch-1
$ # Tell the speaker not to advertise tenant networks
$ openstack bgp speaker set \
It is possible to repeat this operation for a 2nd machine on the same rack,
if the deployment is using bonding (and then, LACP between both switches),
as per the figure above. It can also be done on each rack. One way to
deploy is to select two machines in each rack (for example, one compute
node and one network node), and install the neutron-dynamic-routing-agent
on each of them, so they can "talk" to both switches of the rack. All of
this depends on the configuration on the switch side. It may be
that you only need to talk to two ToR racks in the whole deployment. The
key point is that you can deploy as many dynamic-routing agents
as needed, and that one agent can talk to a single device.
Setting up physical network names
Before setting up the provider network, the physical network name must be set
on each host, according to the rack names. On the compute or network nodes,
this is done in /etc/neutron/plugins/ml2/openvswitch_agent.ini using the
bridge_mappings directive:
.. code-block:: ini
bridge_mappings = physnet-rack1:br-ex
All of the physical networks created this way must be added to the
configuration of the neutron-server as well (i.e. this is used by both
neutron-api and neutron-rpc-server). For example, with 3 racks,
here's what /etc/neutron/plugins/ml2/ml2_conf.ini should look like:
.. code-block:: ini
flat_networks = physnet-rack1,physnet-rack2,physnet-rack3
network_vlan_ranges = physnet-rack1,physnet-rack2,physnet-rack3
Once this is done, the provider network can be created, using physnet-rack1
as "physical network".
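A mismatch between the agent and server files is easy to introduce when adding a rack, so a small consistency check can help. The following is a minimal Python sketch, not part of Neutron. It assumes the options sit in the sections Neutron normally uses ([ovs], [ml2_type_flat] and [ml2_type_vlan]), which the snippets above do not show, and it embeds the file contents as strings so that it runs on its own.

```python
import configparser

# Sample contents mirroring the snippets above; the section headers are
# assumed, since the snippets only show the option lines.
AGENT_INI = """
[ovs]
bridge_mappings = physnet-rack1:br-ex
"""

SERVER_INI = """
[ml2_type_flat]
flat_networks = physnet-rack1,physnet-rack2,physnet-rack3

[ml2_type_vlan]
network_vlan_ranges = physnet-rack1,physnet-rack2,physnet-rack3
"""


def parse_bridge_mappings(value):
    """Turn 'physnet:bridge,physnet2:bridge2' into a dict."""
    mappings = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        physnet, bridge = item.split(":", 1)
        mappings[physnet.strip()] = bridge.strip()
    return mappings


def vlan_physnets(value):
    """Extract physical network names from 'name[:min:max],...' entries."""
    return {entry.strip().split(":")[0]
            for entry in value.split(",") if entry.strip()}


def missing_physnets(agent_ini, server_ini):
    """List physnets mapped on the agent but unknown to the server."""
    agent = configparser.ConfigParser()
    agent.read_string(agent_ini)
    server = configparser.ConfigParser()
    server.read_string(server_ini)
    local = parse_bridge_mappings(agent["ovs"]["bridge_mappings"])
    known = vlan_physnets(server["ml2_type_vlan"]["network_vlan_ranges"])
    return sorted(set(local) - known)


print(missing_physnets(AGENT_INI, SERVER_INI))
```

An empty list means every physical network mapped on the host is also declared on the server side.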
Setting up the provider network
Everything that is in the provider network's scope will be advertised through
BGP. Here is how to create the address scope:
.. code-block:: console
$ # Create the address scope
$ openstack address scope create --share --ip-version 4 provider-addr-scope
Then, the network can be created using the physical network name set above:
.. code-block:: console
$ # Create the provider network that spans all racks
$ openstack network create --external --share \
--provider-physical-network physnet-rack1 \
--provider-network-type vlan \
--provider-segment 11 \
This automatically creates a network and a segment. By default, this segment
has no name, which isn't convenient, but the name can be set afterwards:
.. code-block:: console
$ # Get the network ID:
$ PROVIDER_NETWORK_ID=$(openstack network show provider-network \
--format value -c id)
$ # Get the segment ID:
$ FIRST_SEGMENT_ID=$(openstack network segment list \
--format csv -c ID -c Network | \
$ # Set the 1st segment name, matching the rack name
$ openstack network segment set --name segment-rack1 ${FIRST_SEGMENT_ID}
Setting up the 2nd segment
The 2nd segment, which will be attached to our provider network, is created
this way:
.. code-block:: console
$ # Create the 2nd segment, matching the 2nd rack name
$ openstack network segment create \
--physical-network physnet-rack2 \
--network-type vlan \
--segment 13 \
--network provider-network \
Setting up the provider subnets for the BGP next hop routing
These subnets will be in use in different racks, depending on what physical
network is in use in the machines. In order to use the address scope, subnet
pools must be used. Here is how to create the subnet pool with the two ranges
to use later when creating the subnets:
.. code-block:: console
$ # Create the provider subnet pool which includes all ranges for all racks
$ openstack subnet pool create \
--pool-prefix \
--pool-prefix \
--address-scope provider-addr-scope \
--share \
Then, this is how to create the two subnets. In this example, we reserve
.1 for the gateway, .2 for the DHCP server, and .253 and .254 for the
switches, which use these addresses for the BGP announcements:
.. code-block:: console
$ # Create the subnet for physnet-rack1, using segment-rack1, and
$ # the subnet_service_type network:floatingip_agent_gateway
$ openstack subnet create \
--service-type 'network:floatingip_agent_gateway' \
--subnet-pool provider-subnet-pool \
--subnet-range \
--allocation-pool start=,end= \
--gateway \
--network provider-network \
--network-segment segment-rack1 \
$ # The same, for the 2nd rack
$ openstack subnet create \
--service-type 'network:floatingip_agent_gateway' \
--subnet-pool provider-subnet-pool \
--subnet-range \
--allocation-pool start=,end= \
--gateway \
--network provider-network \
--network-segment segment-rack2 \
Note the service types. network:floatingip_agent_gateway makes sure that these
subnets will be in use only as gateways (i.e. the next BGP hop). The above can
be repeated for each new rack.
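When repeating this for many racks, the reserved addresses and the allocation pool can be derived from the rack's range instead of being typed by hand. The sketch below uses Python's ipaddress module and makes several assumptions that are not stated above: each rack subnet is an IPv4 /24, the pool is whatever lies between the reserved addresses, and 192.0.2.0/24 is only a documentation prefix standing in for a real range.

```python
import ipaddress


def rack_subnet_plan(cidr):
    """Split a /24 rack subnet into reserved addresses and an allocation pool."""
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen != 24:
        raise ValueError("this sketch only handles IPv4 /24 subnets")
    base = net.network_address
    return {
        "gateway": str(base + 1),                        # .1
        "dhcp": str(base + 2),                           # .2
        "switches": [str(base + 253), str(base + 254)],  # .253 and .254
        # Assumption: the pool is what remains between the reserved addresses.
        "pool": (str(base + 3), str(base + 252)),
    }


# 192.0.2.0/24 is a placeholder range, not one taken from this guide.
plan = rack_subnet_plan("192.0.2.0/24")
print("--subnet-range 192.0.2.0/24")
print("--gateway", plan["gateway"])
print("--allocation-pool start={},end={}".format(*plan["pool"]))
```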
Adding a subnet for VM floating IPs and router gateways
This is to be repeated each time a new subnet must be created for floating IPs
and router gateways. First, the range is added in the subnet pool, then the
subnet itself is created:
.. code-block:: console
$ # Add a new prefix in the subnet pool for the floating IPs:
$ openstack subnet pool set \
--pool-prefix \
$ # Create the floating IP subnet
$ openstack subnet create vm-fip \
--service-type 'network:routed' \
--service-type 'network:floatingip' \
--service-type 'network:router_gateway' \
--subnet-pool provider-subnet-pool \
--subnet-range \
--network provider-network
The service type network:routed ensures we're using BGP through the provider
network to advertise the IPs. network:floatingip and network:router_gateway
limit the use of these IPs to floating IPs and router gateways.
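The order matters here: the prefix goes into the subnet pool first, then the subnet is carved out of it. A quick way to reason about it is to check whether a requested range is covered by one of the pool prefixes. This is a minimal sketch using Python's ipaddress module; the prefixes are documentation ranges chosen for illustration, and the check is a simplification of what Neutron actually enforces.

```python
import ipaddress


def covering_prefix(pool_prefixes, subnet_range):
    """Return the first pool prefix that contains subnet_range, or None."""
    subnet = ipaddress.ip_network(subnet_range)
    for prefix in pool_prefixes:
        net = ipaddress.ip_network(prefix)
        # subnet_of() needs both networks to be of the same IP version.
        if net.version == subnet.version and subnet.subnet_of(net):
            return prefix
    return None


# Placeholder prefixes (documentation ranges), not values from this guide.
pool = ["192.0.2.0/24", "198.51.100.0/24"]
wanted = "203.0.113.0/24"

print(covering_prefix(pool, wanted))   # not covered yet
pool.append("203.0.113.0/24")          # mirrors "subnet pool set --pool-prefix"
print(covering_prefix(pool, wanted))   # now covered
```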
Setting up BGP advertising
The provider network needs to be added to each of the BGP speakers. This means
each time a new rack is set up, the provider network must be added to the 2 BGP
speakers of that rack.
.. code-block:: console
$ # Add the provider network to the BGP speakers.
$ openstack bgp speaker add network \ provider-network
$ openstack bgp speaker add network \ provider-network
In this example, we've selected two compute nodes that are also running an
instance of the neutron-dynamic-routing-agent daemon.
Per project operation
This can be done by each customer. A subnet pool isn't mandatory, but it is
nice to have. Typically, the customer network will not be advertised through
BGP (but this can be done if needed).
.. code-block:: console
$ # Create the tenant private network
$ openstack network create tenant-network
$ # Self-service network pool:
$ openstack subnet pool create \
--pool-prefix \
--share \
$ # Self-service subnet:
$ openstack subnet create \
--network tenant-network \
--subnet-pool tenant-subnet-pool \
--prefix-length 24 \
$ # Create the router
$ openstack router create tenant-router
$ # Add the tenant subnet to the tenant router
$ openstack router add subnet \
tenant-router tenant-subnet-1
$ # Set the router's default gateway. This will use one public IP.
$ openstack router set \
--external-gateway provider-network tenant-router
$ # Create a first VM on the tenant subnet
$ openstack server create --image debian-10.5.0-openstack-amd64.qcow2 \
--flavor cpu2-ram6-disk20 \
--nic net-id=tenant-network \
--key-name yubikey-zigo \
$ # Finally, add a floating IP
$ openstack floating ip create provider-network
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2020-12-15T11:48:36Z                 |
| description         |                                      |
| dns_domain          | None                                 |
| dns_name            | None                                 |
| fixed_ip_address    | None                                 |
| floating_ip_address |                                      |
| floating_network_id | 859f5302-7b22-4c50-92f8-1f71d6f3f3f4 |
| id                  | 01de252b-4b78-4198-bc28-1328393bf084 |
| name                |                                      |
| port_details        | None                                 |
| port_id             | None                                 |
| project_id          | d71a5d98aef04386b57736a4ea4f3644     |
| qos_policy_id       | None                                 |
| revision_number     | 0                                    |
| router_id           | None                                 |
| status              | DOWN                                 |
| subnet_id           | None                                 |
| tags                | []                                   |
| updated_at          | 2020-12-15T11:48:36Z                 |
+---------------------+--------------------------------------+
$ openstack server add floating ip test-server-1
Cumulus switch configuration
Because of the way Neutron works, for each new port associated with an IP
address, a GARP is issued to inform the switch about the new MAC / IP
association. Unfortunately, this confuses the switches, which may then
use their local ARP table to route the packet rather than giving it to
the next hop to route. The definitive solution would be to patch Neutron to
make it stop sending GARP for any port on a subnet with the network:routed
service type. Such a patch would be hard to write, but luckily there's a fix
that works (at least with Cumulus switches). Here's how.
In /etc/cumulus/switchd.conf we change this:
.. code-block:: ini
# configure a route instead of a neighbor with the same ip/mask
#route.route_preferred_over_neigh = FALSE
route.route_preferred_over_neigh = TRUE
and then simply restart switchd:
.. code-block:: console
systemctl restart switchd
This restarts the switch ASIC, so it may be dangerous to do without switch
redundancy (so be careful when doing it). The completely safe procedure, with
2 switches per rack, looks like this:
.. code-block:: console
# save clagd priority
OLDPRIO=$(clagctl status | sed -r -n 's/.*Our.*Role: ([0-9]+) 0.*/\1/p')
# make sure that this switch is not the primary clag switch. otherwise the
# secondary switch will also shut down all interfaces when losing contact
# with the primary switch.
clagctl priority 16535
# tell neighbors to not route through this router
vtysh# router bgp 64999
vtysh# bgp graceful-shutdown
vtysh# exit
systemctl restart switchd
clagctl priority $OLDPRIO
If everything goes well, the floating IPs are advertised over BGP through the
provider network. Here is an example with 4 VMs deployed on 2 racks. Neutron
picks IPs on the segmented network as the next hop.
.. code-block:: console
$ # Check the advertised routes:
$ openstack bgp speaker list advertised routes \
| Destination | Nexthop |
| | |
| | |
| | |
| | |

+ 1
- 0
doc/source/admin/config.rst

@ -37,6 +37,7 @@ Configuration

doc/source/admin/figures/bgp-floating-ip-over-l2-segmented-network.png

+ 2
- 0
File diff suppressed because it is too large

+ 22
- 4
neutron/db/

@ -354,9 +354,24 @@ class IpamBackendMixin(db_base_plugin_common.DbBasePluginCommon):
def _validate_segment(self, context, network_id, segment_id, action=None,
segments = subnet_obj.Subnet.get_values(
context, 'segment_id', network_id=network_id)
old_segment_id=None, requested_service_types=None,
# NOTE(zigo): If we're creating a network:routed subnet (here written
# as: const.DEVICE_OWNER_ROUTED), then the created subnet must be
# removed from the segment list, otherwise its segment ID will be
# returned as None, and SubnetsNotAllAssociatedWithSegments will be
# raised.
if (action == 'create' and requested_service_types and
const.DEVICE_OWNER_ROUTED in requested_service_types):
to_create_subnet_id = subnet_id
to_create_subnet_id = None
segments = subnet_obj.Subnet.get_subnet_segment_ids(
context, network_id,
associated_segments = set(segments)
if None in associated_segments and len(associated_segments) > 1:
raise segment_exc.SubnetsNotAllAssociatedWithSegments(
@ -584,7 +599,10 @@ class IpamBackendMixin(db_base_plugin_common.DbBasePluginCommon):
# TODO(slaweq): when check is segment exists will be integrated in
# self._validate_segment() method, it should be moved to be done before
# subnet object is created
self._validate_segment(context, network['id'], segment_id)
self._validate_segment(context, network['id'], segment_id,
# NOTE(changzhi) Store DNS nameservers with order into DB one
# by one when create subnet with DNS nameservers

+ 27
- 0
neutron/objects/

@ -20,6 +20,7 @@ from neutron_lib.utils import net as net_utils
from oslo_utils import versionutils
from oslo_versionedobjects import fields as obj_fields
from sqlalchemy import and_, or_
from sqlalchemy.sql import exists
from neutron.db.models import dns as dns_models
from neutron.db.models import segment as segment_model
@ -490,6 +491,32 @@ class Subnet(base.NeutronDbObject):
if _target_version < (1, 1): # version 1.1 adds "dns_publish_fixed_ip"
primitive.pop('dns_publish_fixed_ip', None)
def get_subnet_segment_ids(cls, context, network_id,
query = context.session.query(cls.db_model.segment_id)
query = query.filter(cls.db_model.network_id == network_id)
# NOTE(zigo): Subnet who hold the type ignored_service_type should be
# removed from the segment list, as they can be part of a segmented
# network but they don't have a segment ID themselves.
if ignored_service_type:
service_type_model = SubnetServiceType.db_model
query = query.filter(~exists().where(and_( == service_type_model.subnet_id,
service_type_model.service_type == ignored_service_type)))
# (zigo): When a subnet is created, at this point in the code,
# its service_types aren't populated in the subnet_service_types
# object, so the subnet to create isn't filtered by the ~exists
# above. So we just filter out the subnet to create completely
# from the result set.
if subnet_id:
query = query.filter( != subnet_id)
return [segment_id for (segment_id,) in query.all()]
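The interplay between this query and the check in _validate_segment can be illustrated without a database. The following is a minimal pure-Python sketch, not the Neutron implementation: subnets are plain dicts, the exception class is a local stand-in for the Neutron one, and the sample subnet IDs and segment IDs are invented.

```python
ROUTED = "network:routed"  # what the NOTE above calls const.DEVICE_OWNER_ROUTED


class SubnetsNotAllAssociatedWithSegments(Exception):
    """Local stand-in for the Neutron exception of the same name."""


def subnet_segment_ids(subnets, ignored_service_type=None, subnet_id=None):
    """Collect segment IDs, skipping routed subnets and the one being created."""
    segment_ids = []
    for subnet in subnets:
        if ignored_service_type and ignored_service_type in subnet["service_types"]:
            continue
        if subnet_id and subnet["id"] == subnet_id:
            continue
        segment_ids.append(subnet["segment_id"])
    return segment_ids


def check_segments(segment_ids):
    """Reject a mix of segment-bound and segment-less subnets."""
    associated = set(segment_ids)
    if None in associated and len(associated) > 1:
        raise SubnetsNotAllAssociatedWithSegments()


# Invented sample data: two rack subnets plus one routed floating IP subnet.
SUBNETS = [
    {"id": "rack1", "segment_id": "seg-1", "service_types": []},
    {"id": "rack2", "segment_id": "seg-2", "service_types": []},
    {"id": "vm-fip", "segment_id": None, "service_types": [ROUTED]},
]

# With the routed subnet ignored, the check passes.
check_segments(subnet_segment_ids(SUBNETS, ignored_service_type=ROUTED))
```

Without the ignored_service_type filter, the segment-less vm-fip subnet would appear as None next to seg-1 and seg-2, and the check would raise.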
class NetworkSubnetLock(base.NeutronDbObject):

+ 3
- 1
neutron/tests/unit/db/

@ -27,6 +27,7 @@ from neutron.db import db_base_plugin_v2
from neutron.db import ipam_backend_mixin
from neutron.db import portbindings_db
from neutron.objects import subnet as subnet_obj
from import db as segments_db
from neutron.tests import base
from neutron.tests.unit.db import test_db_base_plugin_v2
@ -346,7 +347,8 @@ class TestIpamBackendMixin(base.BaseTestCase):
class TestPlugin(db_base_plugin_v2.NeutronDbPluginV2,
__native_pagination_support = True
__native_sorting_support = True

+ 21
- 1
neutron/tests/unit/extensions/

@ -130,7 +130,8 @@ class SegmentTestPlugin(db_base_plugin_v2.NeutronDbPluginV2,
__native_sorting_support = True
supported_extension_aliases = [seg_apidef.ALIAS, portbindings.ALIAS,
def get_plugin_description(self):
return "Network Segments"
@ -501,6 +502,25 @@ class TestSegmentSubnetAssociation(SegmentTestCase):
self.assertEqual(webob.exc.HTTPBadRequest.code, res.status_int)
def test_only_some_subnets_associated_allowed_with_routed_network(self):
with as network:
net = network['network']
segment = self._test_create_segment(network_id=net['id'],
segment_id = segment['segment']['id']
with self.subnet(network=network, segment_id=segment_id) as subnet:
subnet = subnet['subnet']
res = self._create_subnet(self.fmt,
self.assertEqual(webob.exc.HTTPCreated.code, res.status_int)
def test_association_to_dynamic_segment_not_allowed(self):
cxt = context.get_admin_context()
with as network:

+ 11
- 0
releasenotes/notes/network-routed-subnets-cf4874d97ddacd77.yaml

@ -0,0 +1,11 @@
- |
A new subnet of type ``network:routed`` has been added. If such a subnet is
used, the IPs of that subnet will be advertised with BGP over a provider
network, which itself can use segments. This basically achieves a
BGP-to-the-rack feature, where the L2 connectivity can be confined to a
rack only, and all external routing is done by the switches, using BGP.
In this mode, it is still possible to use VXLAN connectivity between the
compute nodes, and only floating IPs and router gateways are using BGP