From 9623a1c850b139ee42f2facaa31b61aed18c0bbc Mon Sep 17 00:00:00 2001
From: Roberto Bartzen Acosta
Date: Tue, 24 Oct 2023 17:05:27 -0300
Subject: [PATCH] Add spec for resilient BGP speaker peer sessions - RFE

Depends-On: https://review.opendev.org/c/openstack/neutron-specs/+/914043
Related-bug: #2006145
Change-Id: Ib365b9641dd5e932df705bb263bad9e0f73c508b
---
 .../2024.2/l3bgp-resilient-peer-sessions.rst | 222 ++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 specs/2024.2/l3bgp-resilient-peer-sessions.rst

diff --git a/specs/2024.2/l3bgp-resilient-peer-sessions.rst b/specs/2024.2/l3bgp-resilient-peer-sessions.rst
new file mode 100644
index 000000000..483c2650a
--- /dev/null
+++ b/specs/2024.2/l3bgp-resilient-peer-sessions.rst
@@ -0,0 +1,222 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================================================
L3/BGP - Make BGP Speaker peer sessions resilient to infrastructure outage
==========================================================================

https://bugs.launchpad.net/neutron/+bug/2006145


This RFE intends to make the BGP peer sessions established by the BGP speaker
resilient to infrastructure outages. Today, the Neutron Dynamic Routing
architecture depends on RPC messaging between the DRAgent service and the
Neutron server. These communications rely on the availability of the
messaging service (such as RabbitMQ), so any transient or permanent failure
on OpenStack infrastructure nodes may affect prefix advertising via BGP.


Problem Description
===================

When dynamic routing with BGP is used to advertise network address prefixes
in a cloud infrastructure, all network traffic depends on the correct
operation of the BGP peer elements. The DRAgent plays a crucial role in this
scenario because it is responsible for advertising all L3 prefixes for
North/South communication, which makes it a single point of vulnerability for
the cloud operation.
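
The DRAgent behaviour described in the next paragraphs is driven by
interval-based periodic tasks. The snippet below is a generic, illustrative
example of the looping-call pattern such tasks typically use (assuming
oslo.service and oslo.config); it is not the DRAgent implementation, and the
report_state body is a placeholder.

.. code:: python

    # Generic illustration of the looping-call pattern behind agent periodic
    # tasks; this is not the DRAgent implementation itself.
    from oslo_config import cfg
    from oslo_service import loopingcall

    # Register the interval option locally so the example is self-contained;
    # in the real agent it comes from the AGENT section of the configuration.
    cfg.CONF.register_opts([cfg.IntOpt('report_interval', default=30)],
                           group='AGENT')


    def report_state():
        # Placeholder: the real task sends the state report RPC to
        # neutron-server and reacts to the agent_status in the reply
        # (for example 'revived').
        pass


    heartbeat = loopingcall.FixedIntervalLoopingCall(report_state)
    heartbeat.start(interval=cfg.CONF.AGENT.report_interval)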

The current DRAgent reference design [1]_ uses two separate periodic tasks,
as described below:

State Report periodic task (DrAgentWithStateReport class)::

    +------------------------------------------------+
    |DrAgentWithStateReport class                    |
    |                                                |
    |   +------------------------------------+       |
    |   |heartbeat = _report_state           |       |
    |   |interval=CONF.AGENT.report_interval |       |
    |   |  call _report_state                |       |
    |   +------------------+-----------------+       |
    |                      |                         |
    |   +------------------+-----------------+       |
    |   |_report_state polling task          |       |
    |   |                                    |       |
    |   | RPC processing                     |       |
    |   | if agent_status == revived         |       |
    |   |   call schedule_full_resync        |       |
    |   +------------------+-----------------+       |
    |                      |                         |
    |   +------------------+-----------------+       |
    |   |schedule_full_resync                |       |
    |   +------------------------------------+       |
    |                                                |
    +------------------------------------------------+

BGP DRAgent periodic task (BgpDrAgent class)::

    +---------------------------------------+
    |BgpDrAgent class                       |
    |                                       |
    |   +-------------------------------+   |
    |   |periodic_resync polling task   |   |
    |   |interval=CONF.periodic_interval|   |
    |   |  call _periodic_resync_helper |   |
    |   +---------------+---------------+   |
    |                   |                   |
    |   +---------------+---------------+   |
    |   |_periodic_resync_helper        |   |
    |   | if full_sync or resync/reason |   |
    |   |  call sync_state              |   |
    |   +---------------+---------------+   |
    |                   |                   |
    |   +---------------+---------------+   |
    |   |sync_state                     |   |
    |   |                               |   |
    |   | if bgp_speaker_id == None     |   |
    |   |  remove speaker from DRAgent  |   |
    |   |                               |   |
    |   | Exception (MessagingTimeout)  |   |
    |   |  call schedule_full_resync    |   |
    |   +--------------+----------------+   |
    |                  |                    |
    |   +--------------+----------------+   |
    |   |schedule_full_resync           |   |
    |   +-------------------------------+   |
    |                                       |
    +---------------------------------------+

These two periodic tasks have independently configured intervals:
`periodic_interval` for the BGP DRAgent task and `AGENT.report_interval` for
the heartbeat state report. However, a full resync can be requested from
either periodic task, and the sync_state method is then executed. When the
RPC heartbeat response carries the agent_status `revived`, the whole flow
described above is triggered until the speaker is removed from the DRAgent.
Likewise, when an exception occurs in the sync_state method, a full resync is
scheduled (following the flow described above) until the speaker is removed
from the DRAgent.

Regardless of the origin of the full resync request, the existing caching
mechanism does not help: although the agent was previously configured, after
receiving a `revived` status via RPC it simply removes the speaker from the
DRAgent and schedules a future resync so that the speaker is re-added later
(depending on the configured periodic intervals). When the speaker is removed
from the DRAgent, the BGP peer sessions are closed and are only reestablished
once the speaker is re-added.

This problem can manifest in two different ways (a simplified sketch of the
failing flow follows the list):

* When the messaging service (RabbitMQ) is offline and does not respond for a
  long interval (`AMQP server is unreachable`; note the agent-down timeouts
  configured for Neutron), and then RMQ/RPC becomes available again.

* When the messaging service (RabbitMQ) queues are under heavy pressure
  and/or RPC calls time out with exceptions such as
  `Timed out waiting for a reply`, and then RMQ/RPC becomes available again.
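
The following sketch condenses the flow above into minimal Python. It is
illustrative only and not the actual bgp_dragent.py code [1]_: the names
`sync_state`, `schedule_full_resync`, `get_bgp_speakers` and the `revived`
status come from the reference design described above, while the
SketchBgpDrAgent class, its cache attribute and the helper methods are
simplified placeholders.

.. code:: python

    # Minimal, illustrative sketch of the current behaviour (not the actual
    # bgp_dragent.py code): a broker outage or a 'revived' heartbeat ends
    # with the speaker being removed and its BGP peer sessions being closed.
    import oslo_messaging


    class SketchBgpDrAgent(object):

        def __init__(self, plugin_rpc, context):
            self.plugin_rpc = plugin_rpc   # RPC proxy towards neutron-server
            self.context = context
            self.cache = {}                # bgp_speaker_id -> speaker config
            self.needs_full_sync = True

        def schedule_full_resync(self, reason):
            self.needs_full_sync = True

        def agent_revived(self):
            # State report task: a 'revived' heartbeat reply schedules a
            # full resync even though the local configuration is still valid.
            self.schedule_full_resync(reason='agent revived')

        def sync_state(self):
            try:
                speakers = self.plugin_rpc.get_bgp_speakers(self.context)
            except oslo_messaging.MessagingTimeout:
                # RabbitMQ unreachable or reply timed out: retry later ...
                self.schedule_full_resync(reason='messaging timeout')
                return
            # ... but a speaker that is not returned by the server (for
            # example right after an outage) is removed immediately, which
            # tears down its established BGP peer sessions.
            returned_ids = {speaker['id'] for speaker in speakers or []}
            for speaker_id in list(self.cache):
                if speaker_id not in returned_ids:
                    self.remove_bgp_speaker_from_dragent(speaker_id)

        def remove_bgp_speaker_from_dragent(self, speaker_id):
            # Dropping the cached entry closes the BGP peer sessions; they
            # are only reestablished when the speaker is re-added on a
            # later periodic sync.
            self.cache.pop(speaker_id, None)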


Proposed Change
===============

To solve the problem described above, the proposal is to introduce a new
speaker cache logic so that the DRAgent can keep the speaker settings and the
established BGP peer sessions in case of RPC exceptions and/or while RPC
communication is being reestablished. In addition, so that the BGP speaker
configuration is not removed from the DRAgent because of transient RPC
failures, the handling of empty `get_bgp_speakers` returns must be changed to
first schedule a full sync and only allow the BGP speaker to be removed in
the next periodic sync.

To enable the new DRAgent speaker cache mechanism, a new config option should
be added to the [BGP] section of the bgp_dragent.ini file:

.. code:: ini

    [BGP]
    speaker_cache_timeout = 300

This configuration option enables the DRAgent to keep the speaker settings
and the related BGP peer sessions until the configured timeout (in seconds)
expires. The purpose of this cache timeout is to prevent errors in the resync
checking logic, so that no speaker is removed until the timeout condition is
satisfied.

The default value of `speaker_cache_timeout` must be zero, which keeps the
current DRAgent behavior. Any non-zero value changes the DRAgent behavior as
described below:

* State Report periodic task: on a `revived` RPC status, the DRAgent must
  start the cache timeout timer and set the `cache_out_of_sync` flag to True.
  If the DRAgent runs the sync_state method before `cache_out_of_sync`
  becomes False, the full sync must be skipped and deferred to the next
  periodic check.

* DRAgent periodic task: if the sync_state method raises any exception during
  operation, the DRAgent must start the cache timeout timer and set the
  `cache_out_of_sync` flag to True. As in the case above, if the DRAgent runs
  the sync_state method before `cache_out_of_sync` becomes False, the full
  sync must be skipped and deferred to the next periodic check.

* Cache timeout task: if another `revived` or exception event occurs while
  the timer is running, the cache task must reset the timeout interval and
  start counting again. Otherwise, the cache task must finish when the
  configured timeout expires and set the `cache_out_of_sync` flag to False.

* Default workflow: if `speaker_cache_timeout` is unset or set to zero, or if
  the configured cache timeout has already expired (the `cache_out_of_sync`
  flag is False), any periodic task may run a full resync.

DB Impact
---------

None

REST API Changes
----------------

None


Implementation
==============

Assignee(s)
-----------

* Primary assignee:
  Roberto Bartzen Acosta

Work Items
----------

* Implement the new cache logic in the DRAgent speaker resync.

* Implement relevant unit and functional tests using the existing facilities
  in Neutron Dynamic Routing.

* Write documentation.


Documentation Impact
====================

User Documentation
------------------

* Information about the DRAgent speaker caching support.


Testing
=======

* Unit/functional tests.


References
==========

.. [1] https://opendev.org/openstack/neutron-dynamic-routing/src/branch/master/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py