From 9cbcaa13e34729831d878b5853a8b6a722217e63 Mon Sep 17 00:00:00 2001
From: XiaoYu Zhu <>
Date: Thu, 21 Jan 2021 02:10:23 -0800
Subject: [PATCH] L3 router support ECMP

This spec outlines the implementation plan of ECMP in Neutron.
Patch for this spec:

Related-Bug: #1880532
Change-Id: I67ebf642fbb130a7701792d66629dbab2d76181b
 specs/wallaby/l3-router-support-ecmp.rst | 410 +++++++++++++++++++++++
 1 file changed, 410 insertions(+)
 create mode 100644 specs/wallaby/l3-router-support-ecmp.rst

diff --git a/specs/wallaby/l3-router-support-ecmp.rst b/specs/wallaby/l3-router-support-ecmp.rst
new file mode 100644
index 000000000..a80591592
--- /dev/null
+++ b/specs/wallaby/l3-router-support-ecmp.rst
@@ -0,0 +1,410 @@
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+L3 router support ECMP
+Launchpad Bug:
+ECMP (Equal-Cost Multi-Path) is a routing technique that allows traffic to
+reach the same destination over multiple links. Neutron does not need to
+calculate the equal-cost paths itself; that work is left to the
+applications that use the ECMP API. Neutron only receives the parameters
+and configures the routers. Since Linux provides the "ip route" command
+through the iproute2 utility, Neutron can implement ECMP by using pyroute2
+to add route entries to the Neutron router namespace.
+This feature is currently designed to support Octavia's multi-active scheme,
+which allows a load balancer in Octavia to have multiple amphoras active at
+the same time. By configuring an ECMP route in the router, multiple amphoras
+can share one virtual IP to serve a set of functions that require high
+concurrency.
+.. _P2:
+.. note::
+   Items marked with [`P2`_] refer to lower priority features
+   to be designed / implemented only after initial release.
+[`P2`_] Currently an equal-cost route selects a nexthop with a simple hash
+of the 5-tuple. This means that if one <nexthop> becomes unreachable and is
+removed from the ECMP routes, all connections get redistributed. To avoid
+this, we intend to use consistent hashing instead of the original scheme.
+The consistent hashing scheme is based on HMARK, which is available in
+iptables 1.4.15 or later. See the iptables history file at [1]_.
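+The redistribution problem described above can be illustrated with a small
+Python sketch. It is not the HMARK-based scheme itself: it uses rendezvous
+hashing as a stand-in for a consistent hashing scheme, and the flow tuples,
+addresses and helper names are hypothetical. It only shows that with a naive
+hash-modulo selection, removing one nexthop moves flows that were served by
+still-reachable nexthops, whereas a consistent scheme leaves them in place.

```python
import hashlib


def _digest(*parts: str) -> int:
    """Return a stable 64-bit hash of the given strings."""
    data = "|".join(parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def pick_modulo(flow: tuple, nexthops: list) -> str:
    """Naive hash-mod-N selection: the result depends on len(nexthops)."""
    return nexthops[_digest(*map(str, flow)) % len(nexthops)]


def pick_rendezvous(flow: tuple, nexthops: list) -> str:
    """Rendezvous (highest-random-weight) hashing, one consistent scheme."""
    key = "|".join(map(str, flow))
    return max(nexthops, key=lambda nexthop: _digest(key, nexthop))


def moved_flows(pick, flows, before, after) -> int:
    """Count flows that changed nexthop although their old one survived."""
    return sum(
        1 for flow in flows
        if pick(flow, before) in after
        and pick(flow, before) != pick(flow, after)
    )


# Hypothetical 5-tuples: (src_ip, src_port, dst_ip, dst_port, protocol).
flows = [("198.51.100.%d" % (i % 250), 1024 + i, "192.0.2.10", 80, "tcp")
         for i in range(1000)]
before = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
after = ["10.0.0.11", "10.0.0.12"]  # 10.0.0.13 became unreachable

print("modulo moved:", moved_flows(pick_modulo, flows, before, after))
print("rendezvous moved:", moved_flows(pick_rendezvous, flows, before, after))
```

+With rendezvous hashing, a flow only changes nexthop when its own nexthop is
+removed, which is the property the [`P2`_] item is after.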
+The rest of this spec describes how to implement ECMP in Neutron.
+Problem Description
+Octavia has proposed an active-active load balancing design in [2]_.
+Topology Description
+                                                             Tenant Backend
+                                +----------------+              Network
+                                |                |                 +
+        Internet+-------------->+    router/gw   +----------------->
+                                |                |    ECMP         |
+                                +----------------+                 |
+                                                                   |
+        Management                                                 |
+         Network                                                   |
+            +                                                      |
+            |                                                      |         +----------+
+            |               +-----------------------+              |         |  Tenant  |
+            |          +----+                  +---------+         <---------+Service(1)|
+            |          |MGMT|  loadbalancer(1) | VIP|Back|         |         |          |
+            <----------+ IP |                  |    | IP +--------->         +----------+
+            |          +---------------------------------+         |
+            |                                          |           |         +----------+
+            |                                          |           |         |  Tenant  |
+            |                                          |  ICMP     <---------+Service(2)|
+            |                                          | DETECT    |         |          |
+            |                                          |           |         +----------+
+            |                                          |           |
+            |               +-----------------------+  v           |         +----------+
+            |          +----+                  +---------+         |         |  Tenant  |
+            |          |MGMT|  loadbalancer(2) | VIP|Back|         <---------+service(3)|
+            <----------+ IP |                  |    | IP +--------->         |          |
+            |          +---------------------------------+         |         +----------+
+            |                                          |           |
+            |                                          |           |
+            |         +-------------+                  |           |           ● ● ●
+            |         |Octavia Lbaas|                  |           |
+            <---------+ Controller  |   ● ● ●          |  ICMP     |
+            |         +-------------+                  | DETECT    |         +----------+
+            |                                          |           |         |  Tenant  |
+            |                                          |           <---------+Service(M)|
+            |                                          |           |         |          |
+            |               +-----------------------+  v           |         +----------+
+            |          +----+                  +---------+         |
+            |          |MGMT|   loadbalancer(n)| VIP|Back|         |
+            <----------+ IP |                  |    | IP +--------->
+            |          +---------------------------------+         |
+            +                                                      +
+Octavia's design proposes the following scheme:
+* Multiple load balancing servers reside in a vip-subnet, sharing one virtual
+  IP and one or more back-end pools to respond to client requests. Each
+  load balancer also has its own IP address.
+* Clients send requests to the VIP, then the router distributes each
+  request to a load balancing server that has the VIP configured
+  on it.
+* Finally, the load balancing server distributes the request to a back end.
+  The load balancers and tenant service VMs can be in the same subnet or in
+  different networks.
+In this situation, Octavia needs the router to support ECMP to distribute
+requests. Octavia sends a request to Neutron to create an ECMP route, and
+the Neutron L3 agent then executes a command in the router's namespace to
+create an ECMP entry, using the VIP as the destination of the route entry
+and the load balancers' IPs as the nexthops. Requests with the VIP as their
+destination can then be distributed across the load balancers.
+The whole process implements two levels of load balancing: load balancing
+between multiple load balancers, and load balancing between the back-end
+real servers.
+[`P2`_] In current public cloud production deployments, tenants usually
+only see IPs in the same network. To serve clients in the same broadcast
+domain, the router needs to enable proxy ARP on the corresponding
+interface. (Users need to disable the proxy ARP capability of the VMs used
+as nexthops themselves.)
+User Workflow
+Generally, users can use the ECMP function for their own purposes.
+To put an ECMP entry into the router namespace, a user can set routes
+with the same destination by using the command::
+  openstack router add route \
+  --route destination=,gateway= \
+  --route destination=,gateway= router-ecmp
+And withdraw the ECMP entry with::
+  openstack router remove route \
+  --route destination=,gateway= \
+  --route destination=,gateway= router-ecmp
+For more information about router related OSC, please read [3]_.
+An integrated sequence diagram of the Octavia's use case is here:
+  +------+      +--------+     +-------+   +--------+ +-------+ +------------+
+  |client|      |Octavia |     |Neutron|   |LB Node | |qrouter| |service pool|
+  +------+      +---+----+     +---+---+   +---+----+ +---+---+ +------+-----+
+    |create LB      |              |           |          |            |
+    +-------------> | create ecmp  |           |          |            |
+    |service        +-------------->           |          |            |
+    |               | LB server boot           |          |            |
+    |               +--------------+---------->+          |            |
+    |               |              | set ecmp route       |            |
+    |               | ecmp done    +-----------+--------->+            |
+    |               +<-------------|           |          |            |
+    |               | LB server boot done      |          |            |
+    |LB service done+<-------------+-----------+          |            |
+    +<--------------+              |           |          |            |
+    |               |              |           |          |            |
+    |               |              |           |          |            |
+    |sending request|              |           |          |            |
+    +---------------------------------------------------->|            |
+    |               |              |           |  pick a LB node       |
+    |               |              |           +<---------|            |
+    |               |              |           | pick a service node   |
+    |               |              |           +---------------------->+
+    |               |              |           |          |response    |
+    |               |              |           +<----------------------+
+    |               |  response    |           |          |            |
+    +<-----------------------------------------+          |            |
+    |               |              |           |          |            |
+    |               |              |           |          |            |
+    v               v              +           v          v            v
+Suppose a user has a set of services that require a multi-active
+load-balancing scheme. The user sends a request to Octavia to create a
+load balancer, specifying the topology as multi-active, and either posts a
+vip-subnet from which Octavia assigns an IP or directly posts a virtual
+port, which is defined by Octavia. The user then submits parameters such as
+pool, member, listener, etc. These are irrelevant to Neutron;
+see the Octavia documentation for details.
+While Octavia is creating a load balancer, it will also send an
+`update_router` request or an `add_extraroutes` request to Neutron, posting
+several `routes` entries with the same `destination` parameter and the load
+balancers' IPs as the `nexthop` parameter.
+Neutron receives the request from Octavia and determines whether to add an
+ECMP route by checking whether there are multiple routes with the same
+destination address. This makes sure the router will distribute packets
+that have the VIP as their destination.
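+The check for "multiple routes with the same destination" could be sketched
+as follows. This is only an illustration under assumptions: the helper names
+are hypothetical and are not existing Neutron functions, and the addresses
+are example values. The route dictionaries use the `destination` and
+`nexthop` keys of the extraroute API.

```python
from collections import defaultdict


def group_routes(routes):
    """Map each destination to its ordered, de-duplicated nexthops."""
    grouped = defaultdict(list)
    for route in routes:
        nexthops = grouped[route["destination"]]
        if route["nexthop"] not in nexthops:
            nexthops.append(route["nexthop"])
    return dict(grouped)


def ecmp_destinations(routes):
    """Return only the destinations that have more than one nexthop."""
    return {
        destination: nexthops
        for destination, nexthops in group_routes(routes).items()
        if len(nexthops) > 1
    }


# Hypothetical routes: one VIP with two load balancer nexthops, plus one
# ordinary single-path route.
routes = [
    {"destination": "192.0.2.10/32", "nexthop": "10.0.0.11"},
    {"destination": "192.0.2.10/32", "nexthop": "10.0.0.12"},
    {"destination": "203.0.113.0/24", "nexthop": "10.0.0.1"},
]
print(ecmp_destinations(routes))
```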
+Those ECMP routes will be removed when the user deletes the multi-active
+load balancer, and they can be modified when load balancing nodes are added
+or removed.
+Data flow
+* [`P2`_] (If on the same network, use proxy ARP.) A client requests the MAC
+  address of the VIP and accesses the service based on this MAC address.
+  The router responds with the gateway MAC address.
+* The client's datagram is transmitted to the router first.
+* The router gateway checks the ECMP routing entries, then forwards the
+  client's packets to the load balancers.
+* A load balancer accepts connections from clients, receives traffic, then
+  distributes it to the back-end server pool.
+* The reply traffic from the back-end server pool goes through the load
+  balancers and then reaches the router (or goes directly back to intranet
+  clients if on the same network). These packets are eventually forwarded
+  back by the router.
+Proposed Change
+In Server Side
+* No changes are required on the server side.
+In Agent Side
+Modify the logic that processes the router_update event in the L3 agent to
+support adding ECMP routes to routers.
+The `routes_updated` function in RouterInfo will behave as follows:
+* When more than one route has the same destination, the L3
+  agent should make a pyroute2 call that looks like
+  ip.route('replace', dst='<destination_ip>',
+           multipath=[{"gateway": "<nexthop1>"}, {"gateway": "<nexthop2>"}])
+* Then there will be an ip route entry in the namespace, which looks like
+  <vip> proto static
+      nexthop via <nexthop_ip1> dev qr-xxxxxxxx-nn weight 1
+      nexthop via <nexthop_ip2> dev qr-xxxxxxxx-nn weight 1
+The router will then pick a <nexthop_ip> based on a hash of the packet's flow
+and write its MAC address into the packet's destination MAC address when
+forwarding traffic to the <destination_ip>.
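+A minimal sketch of how the agent side might issue the pyroute2 call shown
+above. The function names and the recording class are hypothetical; only the
+`route('replace', dst=..., gateway=...)` and `route('replace', dst=...,
+multipath=[...])` call shapes come from pyroute2. In a real agent the `ipr`
+argument would presumably be a pyroute2 object bound to the router
+namespace; here a stand-in records the calls so the sketch runs without
+privileges.

```python
def multipath_kwargs(destination, nexthops):
    """Build keyword arguments for a multipath route to `destination`."""
    return {
        "dst": destination,
        "multipath": [{"gateway": nexthop} for nexthop in nexthops],
    }


def replace_route(ipr, destination, nexthops):
    """Install a single-path or ECMP route through an IPRoute-like object."""
    if not nexthops:
        raise ValueError("at least one nexthop is required")
    if len(nexthops) == 1:
        # A single nexthop stays an ordinary static route.
        ipr.route("replace", dst=destination, gateway=nexthops[0])
    else:
        # Several nexthops for one destination become one ECMP route.
        ipr.route("replace", **multipath_kwargs(destination, nexthops))


class RecordingIPRoute:
    """Stand-in for a pyroute2 IPRoute-like object; it only records calls."""

    def __init__(self):
        self.calls = []

    def route(self, command, **kwargs):
        self.calls.append((command, kwargs))


# Hypothetical VIP and load balancer addresses.
ipr = RecordingIPRoute()
replace_route(ipr, "192.0.2.10/32", ["10.0.0.11", "10.0.0.12"])
print(ipr.calls)
```

+Using 'replace' rather than 'add' keeps the call idempotent when the set of
+nexthops for an existing destination changes.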
+[`P2`_] To keep connections alive while removing a load balancing node, use
+iptables instead of a plain ip route entry:
+- Use `HMARK` to mark flows in the mangle table; the `fwmark` values are
+  determined by the source address.
+- Distribute flows to different tables by `fwmark` values.
+- There is a mapping between the `fwmark` values and the table values.
+- For each table, give it a default nexthop ip.
+- Modify the mapping between `fwmark` values and table values
+  when a `nexthop` is unreachable.
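+The bookkeeping implied by the steps above can be modelled in plain Python.
+This sketch does not touch iptables or routing tables; the class name, the
+number of mark buckets and the table numbering starting at 100 are all
+assumptions made for illustration. It shows that only the `fwmark` values
+mapped to the failed nexthop's table are remapped, so flows on healthy
+nexthops keep their path.

```python
class MarkTableMap:
    """Model of the fwmark -> routing table -> default nexthop mapping."""

    def __init__(self, nexthops, buckets, first_table=100):
        # One routing table per nexthop, each with that nexthop as default.
        self.table_nexthop = {
            first_table + index: nexthop
            for index, nexthop in enumerate(nexthops)
        }
        tables = sorted(self.table_nexthop)
        # Spread the fwmark buckets evenly over the tables.
        self.mark_table = {
            mark: tables[mark % len(tables)] for mark in range(buckets)
        }

    def nexthop_for(self, mark):
        """Return the nexthop currently serving the given fwmark."""
        return self.table_nexthop[self.mark_table[mark]]

    def fail(self, nexthop):
        """Remap marks of an unreachable nexthop; return how many moved."""
        dead = [t for t, nh in self.table_nexthop.items() if nh == nexthop]
        for table in dead:
            del self.table_nexthop[table]
        alive = sorted(self.table_nexthop)
        if not alive:
            raise RuntimeError("no reachable nexthop left")
        moved = 0
        for mark, table in list(self.mark_table.items()):
            if table in dead:
                self.mark_table[mark] = alive[moved % len(alive)]
                moved += 1
        return moved


# Hypothetical nexthops and 12 fwmark buckets.
mapping = MarkTableMap(["10.0.0.11", "10.0.0.12", "10.0.0.13"], buckets=12)
before = {mark: mapping.nexthop_for(mark) for mark in range(12)}
moved = mapping.fail("10.0.0.12")
after = {mark: mapping.nexthop_for(mark) for mark in range(12)}
print("marks remapped:", moved)
```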
+[`P2`_] In order to let traffic from the same network pass through the
+router, the L3 agent will also enable proxy ARP on the router by running::
+  sysctl -w net.ipv4.conf.<NIC_1>.proxy_arp_pvlan=1
+* <NIC_1> is the name of the router interface to which the destination
+  subnet is connected. For example, router `R1` is connected to a
+  subnet `sub-1` whose cidr is ``, so there is a
+  virtual network interface device `qr-abcdefgh` in the router's
+  namespace acting as the gateway for the subnet `sub-1`. When an
+  ECMP route is added with a destination like `` that is within the
+  scope of the subnet `sub-1`, the above command
+  is executed with <NIC_1> set to `qr-abcdefgh`.
+* To make proxy ARP optional, add a config option in l3_agent.ini::
+    [ECMP]
+    router_interface_arp_proxy = True
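+Selecting <NIC_1> for a given ECMP destination could look like the sketch
+below. The helper names, the second interface name and all addresses are
+hypothetical (the spec's own example CIDRs are not reproduced here); only
+the sysctl key and the `router_interface_arp_proxy` option come from the
+text above. The function returns the command as a list instead of running
+it.

```python
import ipaddress


def interface_for_destination(destination, interfaces):
    """Return the device whose subnet contains `destination`, or None.

    `interfaces` maps a router interface name to its subnet CIDR.
    """
    dest = ipaddress.ip_network(destination, strict=False)
    for device, cidr in interfaces.items():
        subnet = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IP versions, so compare first.
        if dest.version == subnet.version and dest.subnet_of(subnet):
            return device
    return None


def proxy_arp_command(destination, interfaces, arp_proxy_enabled=True):
    """Build the sysctl command for the matching interface, if any."""
    if not arp_proxy_enabled:  # mirrors router_interface_arp_proxy = False
        return None
    device = interface_for_destination(destination, interfaces)
    if device is None:
        return None
    return ["sysctl", "-w", "net.ipv4.conf.%s.proxy_arp_pvlan=1" % device]


# Hypothetical router interfaces and VIP.
interfaces = {
    "qr-abcdefgh": "192.0.2.0/24",
    "qr-ijklmnop": "198.51.100.0/24",
}
print(proxy_arp_command("192.0.2.10/32", interfaces))
```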
+Data Model Impact
+REST API Impact
+The following REST APIs will be affected::
+  PUT /v2.0/routers/<router_id>/add_extraroutes
+  PUT /v2.0/routers/<router_id>/remove_extraroutes
+  PUT /v2.0/routers/<router_id>
+The above three APIs are the current methods used to add/remove custom
+routes. See the usage of `extraroutes` on [4]_. (The third API
+`PUT /v2.0/routers/<router_id>` is not recommended for adding routes)
+Before the ECMP routing implementation, when the L3 agent receives several
+route entries with the same destination and different nexthops, it keeps
+only one of them, or replaces the existing route with a new one. After
+these changes, there will be an ECMP route in the router. You can add an
+ECMP route entry like this:
+  PUT /v2.0/routers/{router_id}/add_extraroutes
+  { "router":
+    { "routes":
+      [ { "destination": "",
+          "nexthop": "" },
+        { "destination": "",
+          "nexthop": "" }
+        ...
+      ]
+    }
+  }
+Then you can find the ECMP route in the router's namespace:
+  #ip route
+ proto static
+    nexthop via dev qr-9adb238b-c2 weight 1
+    nexthop via dev qr-9adb238b-c2 weight 1
+To make this behavior change discoverable, a shim extension called
+'ecmp_routes' will be added.
+[`P2`_] To make the proxy ARP behavior discoverable, a shim extension called
+'ecmp_arp' will be added. It will be removed dynamically when the related
+option `router_interface_arp_proxy` in the config file is `False`.
+* XiaoYu Zhu
+Work Items
+* L3 Agent Update
+* Tests
+* Documentation
+Tempest Tests
+* Tempest tests
+Functional Tests
+* New tests need to be written
+Documentation Impact
+User Documentation
+* User documentation
+* API reference
+Developer Documentation
+* Needs devref documentation
+.. [1]
+.. [2]
+.. [3]
+.. [4]