From 9623a1c850b139ee42f2facaa31b61aed18c0bbc Mon Sep 17 00:00:00 2001
From: Roberto Bartzen Acosta
Date: Tue, 24 Oct 2023 17:05:27 -0300
Subject: [PATCH] Add spec for resilient BGP speaker peer sessions - RFE

Depends-On: https://review.opendev.org/c/openstack/neutron-specs/+/914043
Related-bug: #2006145
Change-Id: Ib365b9641dd5e932df705bb263bad9e0f73c508b
---
 .../2024.2/l3bgp-resilient-peer-sessions.rst | 222 ++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 specs/2024.2/l3bgp-resilient-peer-sessions.rst

diff --git a/specs/2024.2/l3bgp-resilient-peer-sessions.rst b/specs/2024.2/l3bgp-resilient-peer-sessions.rst
new file mode 100644
index 000000000..483c2650a
--- /dev/null
+++ b/specs/2024.2/l3bgp-resilient-peer-sessions.rst
@@ -0,0 +1,222 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================================================
L3/BGP - Make BGP Speaker peer sessions resilient to infrastructure outage
==========================================================================

https://bugs.launchpad.net/neutron/+bug/2006145


This RFE intends to make the BGP peer sessions established by the BGP speaker
resilient to infrastructure outages. Today, the Neutron Dynamic Routing
architecture depends on RPC messaging between the DRAgent service and the
Neutron server. These communications rely on the availability of the
messaging service (such as RabbitMQ), so any transient or permanent failure
on OpenStack infrastructure nodes may affect prefix advertising via BGP.


Problem Description
===================

When dynamic routing with BGP is used to advertise network address prefixes
in a cloud infrastructure, all network traffic depends on the correct
operation of the BGP peer elements. The DRAgent plays a crucial role in this
scenario because it is responsible for advertising all L3 prefixes for
North/South communication, which makes it a single point of vulnerability for
the cloud operation.
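
The DRAgent behaviour described in the next paragraphs is driven by
interval-based periodic tasks. The snippet below is a generic, illustrative
example of the looping-call pattern such tasks typically use (assuming
oslo.service and oslo.config); it is not the DRAgent implementation, and the
report_state body is a placeholder.

.. code:: python

    # Generic illustration of the looping-call pattern behind agent periodic
    # tasks; this is not the DRAgent implementation itself.
    from oslo_config import cfg
    from oslo_service import loopingcall

    # Register the interval option locally so the example is self-contained;
    # in the real agent it comes from the AGENT section of the configuration.
    cfg.CONF.register_opts([cfg.IntOpt('report_interval', default=30)],
                           group='AGENT')


    def report_state():
        # Placeholder: the real task sends the state report RPC to
        # neutron-server and reacts to the agent_status in the reply
        # (for example 'revived').
        pass


    heartbeat = loopingcall.FixedIntervalLoopingCall(report_state)
    heartbeat.start(interval=cfg.CONF.AGENT.report_interval)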

The current DRAgent reference design [1]_ uses two separate periodic tasks,
as described below:

State Report periodic task (DrAgentWithStateReport class)::

    +------------------------------------------------+
    |DrAgentWithStateReport class                    |
    |                                                |
    |   +------------------------------------+       |
    |   |heartbeat = _report_state           |       |
    |   |interval=CONF.AGENT.report_interval |       |
    |   |  call _report_state                |       |
    |   +------------------+-----------------+       |
    |                      |                         |
    |   +------------------+-----------------+       |
    |   |_report_state polling task          |       |
    |   |                                    |       |
    |   | RPC processing                     |       |
    |   | if agent_status == revived         |       |
    |   |   call schedule_full_resync        |       |
    |   +------------------+-----------------+       |
    |                      |                         |
    |   +------------------+-----------------+       |
    |   |schedule_full_resync                |       |
    |   +------------------------------------+       |
    |                                                |
    +------------------------------------------------+

BGP DRAgent periodic task (BgpDrAgent class)::

    +---------------------------------------+
    |BgpDrAgent class                       |
    |                                       |
    |   +-------------------------------+   |
    |   |periodic_resync polling task   |   |
    |   |interval=CONF.periodic_interval|   |
    |   |  call _periodic_resync_helper |   |
    |   +---------------+---------------+   |
    |                   |                   |
    |   +---------------+---------------+   |
    |   |_periodic_resync_helper        |   |
    |   | if full_sync or resync/reason |   |
    |   |  call sync_state              |   |
    |   +---------------+---------------+   |
    |                   |                   |
    |   +---------------+---------------+   |
    |   |sync_state                     |   |
    |   |                               |   |
    |   | if bgp_speaker_id == None     |   |
    |   |  remove speaker from DRAgent  |   |
    |   |                               |   |
    |   | Exception (MessagingTimeout)  |   |
    |   |  call schedule_full_resync    |   |
    |   +--------------+----------------+   |
    |                  |                    |
    |   +--------------+----------------+   |
    |   |schedule_full_resync           |   |
    |   +-------------------------------+   |
    |                                       |
    +---------------------------------------+

These two periodic tasks have independently configured intervals:
`periodic_interval` for the BGP DRAgent task and `AGENT.report_interval` for
the heartbeat state report. However, a full resync can be requested from
either periodic task, and the sync_state method is then executed. When the
RPC heartbeat response carries the agent_status `revived`, the whole flow
described above is triggered until the speaker is removed from the DRAgent.
Likewise, when an exception occurs in the sync_state method, a full resync is
scheduled (following the flow described above) until the speaker is removed
from the DRAgent.

Regardless of the origin of the full resync request, the existing caching
mechanism does not help: although the agent was previously configured, after
receiving a `revived` status via RPC it simply removes the speaker from the
DRAgent and schedules a future resync so that the speaker is re-added later
(depending on the configured periodic intervals). When the speaker is removed
from the DRAgent, the BGP peer sessions are closed and are only reestablished
once the speaker is re-added.

This problem can manifest in two different ways (a simplified sketch of the
failing flow follows the list):

* When the messaging service (RabbitMQ) is offline and does not respond for a
  long interval (`AMQP server is unreachable`; note the agent-down timeouts
  configured for Neutron), and then RMQ/RPC becomes available again.

* When the messaging service (RabbitMQ) queues are under heavy pressure
  and/or RPC calls time out with exceptions such as
  `Timed out waiting for a reply`, and then RMQ/RPC becomes available again.
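
The following sketch condenses the flow above into minimal Python. It is
illustrative only and not the actual bgp_dragent.py code [1]_: the names
`sync_state`, `schedule_full_resync`, `get_bgp_speakers` and the `revived`
status come from the reference design described above, while the
SketchBgpDrAgent class, its cache attribute and the helper methods are
simplified placeholders.

.. code:: python

    # Minimal, illustrative sketch of the current behaviour (not the actual
    # bgp_dragent.py code): a broker outage or a 'revived' heartbeat ends
    # with the speaker being removed and its BGP peer sessions being closed.
    import oslo_messaging


    class SketchBgpDrAgent(object):

        def __init__(self, plugin_rpc, context):
            self.plugin_rpc = plugin_rpc   # RPC proxy towards neutron-server
            self.context = context
            self.cache = {}                # bgp_speaker_id -> speaker config
            self.needs_full_sync = True

        def schedule_full_resync(self, reason):
            self.needs_full_sync = True

        def agent_revived(self):
            # State report task: a 'revived' heartbeat reply schedules a
            # full resync even though the local configuration is still valid.
            self.schedule_full_resync(reason='agent revived')

        def sync_state(self):
            try:
                speakers = self.plugin_rpc.get_bgp_speakers(self.context)
            except oslo_messaging.MessagingTimeout:
                # RabbitMQ unreachable or reply timed out: retry later ...
                self.schedule_full_resync(reason='messaging timeout')
                return
            # ... but a speaker that is not returned by the server (for
            # example right after an outage) is removed immediately, which
            # tears down its established BGP peer sessions.
            returned_ids = {speaker['id'] for speaker in speakers or []}
            for speaker_id in list(self.cache):
                if speaker_id not in returned_ids:
                    self.remove_bgp_speaker_from_dragent(speaker_id)

        def remove_bgp_speaker_from_dragent(self, speaker_id):
            # Dropping the cached entry closes the BGP peer sessions; they
            # are only reestablished when the speaker is re-added on a
            # later periodic sync.
            self.cache.pop(speaker_id, None)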


Proposed Change
===============

To solve the problem described above, the proposal is to introduce a new
speaker cache logic so that the DRAgent can keep the speaker settings and the
established BGP peer sessions in case of RPC exceptions and/or while RPC
communication is being reestablished. In addition, so that the BGP speaker
configuration is not removed from the DRAgent because of transient RPC
failures, the handling of empty `get_bgp_speakers` returns must be changed to
first schedule a full sync and only allow the BGP speaker to be removed in
the next periodic sync.

To enable the new DRAgent speaker cache mechanism, a new config option should
be added to the [BGP] section of the bgp_dragent.ini file:

.. code:: ini

    [BGP]
    speaker_cache_timeout = 300

This configuration option enables the DRAgent to keep the speaker settings
and the related BGP peer sessions until the configured timeout (in seconds)
expires. The purpose of this cache timeout is to prevent errors in the resync
checking logic, so that no speaker is removed until the timeout condition is
satisfied.

The default value of `speaker_cache_timeout` must be zero, which keeps the
current DRAgent behavior. Any non-zero value changes the DRAgent behavior as
described below:

* State Report periodic task: on a `revived` RPC status, the DRAgent must
  start the cache timeout timer and set the `cache_out_of_sync` flag to True.
  If the DRAgent runs the sync_state method before `cache_out_of_sync`
  becomes False, the full sync must be skipped and deferred to the next
  periodic check.

* DRAgent periodic task: if the sync_state method raises any exception during
  operation, the DRAgent must start the cache timeout timer and set the
  `cache_out_of_sync` flag to True. As in the case above, if the DRAgent runs
  the sync_state method before `cache_out_of_sync` becomes False, the full
  sync must be skipped and deferred to the next periodic check.

* Cache timeout task: if another `revived` or exception event occurs while
  the timer is running, the cache task must reset the timeout interval and
  start counting again. Otherwise, the cache task must finish when the
  configured timeout expires and set the `cache_out_of_sync` flag to False.

* Default workflow: if `speaker_cache_timeout` is unset or set to zero, or if
  the configured cache timeout has already expired (the `cache_out_of_sync`
  flag is False), any periodic task may run a full resync.

DB Impact
---------

None

REST API Changes
----------------

None


Implementation
==============

Assignee(s)
-----------

* Primary assignee:
  Roberto Bartzen Acosta

Work Items
----------

* Implement the new cache logic in the DRAgent speaker resync.

* Implement relevant unit and functional tests using the existing facilities
  in Neutron Dynamic Routing.

* Write documentation.


Documentation Impact
====================

User Documentation
------------------

* Information about the DRAgent speaker caching support.


Testing
=======

* Unit/functional tests.


References
==========

.. [1] https://opendev.org/openstack/neutron-dynamic-routing/src/branch/master/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py