dd07ac64fc
Patch Set 2: (1 comment) Patch-set: 2 Attention: {"person_ident":"Gerrit User 35432 \u003c35432@4a232e18-c5a9-48ee-94c0-e04e7cca6543\u003e","operation":"ADD","reason":"\u003cGERRIT_ACCOUNT_29074\u003e replied on the change"} Attention: {"person_ident":"Gerrit User 29074 \u003c29074@4a232e18-c5a9-48ee-94c0-e04e7cca6543\u003e","operation":"REMOVE","reason":"\u003cGERRIT_ACCOUNT_29074\u003e replied on the change"}
163 lines
7.4 KiB
Plaintext
{
  "comments": [
    {
      "unresolved": false,
      "key": {
        "uuid": "4cfadc9c_335a117f",
        "filename": "/PATCHSET_LEVEL",
        "patchSetId": 2
      },
      "lineNbr": 0,
      "author": {
        "id": 16137
      },
      "writtenOn": "2023-11-08T08:28:30Z",
      "side": 1,
      "message": "I think this makes sense, as the impact on the dataplane due to RPC issues is really bad. I have a question about the timeout, though: at 300 seconds, I assume it is more to prevent flapping behaviour than to be resilient against a longer outage?\n\nI.e., if the RPC outage lasts more than 300 seconds, are we causing an outage on the dataplane?",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": false,
      "key": {
        "uuid": "032e2cb9_e5ba240c",
        "filename": "/PATCHSET_LEVEL",
        "patchSetId": 2
      },
      "lineNbr": 0,
      "author": {
        "id": 35432
      },
      "writtenOn": "2023-11-08T13:04:19Z",
      "side": 1,
      "message": "Thanks for the review, Tobias.\n\nThe main idea of this RFE is to prevent flapping that causes the removal of BGP peers. There is no link between this RFE and the total outage time. I mean, if RMQ is offline/unstable for a whole day, no problem: the dataplane will continue working after the proposed change is applied, because each time an exception or communication issue with RMQ occurs, a new cache timeout timer starts, and BGP peers will not be deleted until the timer expires.\n\nFurthermore, the timeout is configurable and can be adjusted according to the time it actually takes for RMQ to respond correctly again (the transient period after RMQ comes back: cluster convergence, mysql, etc.).",
      "parentUuid": "4cfadc9c_335a117f",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": true,
      "key": {
        "uuid": "aca320d0_e264f6c8",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 114,
      "author": {
        "id": 29074
      },
      "writtenOn": "2023-11-08T13:57:57Z",
      "side": 1,
      "message": "Isn't the goal to not remove the BGP sessions, rather than to not do a full sync?\nWhile that might be the same implementation-wise, it's a different goal from my perspective.",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": false,
      "key": {
        "uuid": "174b2882_cc58054f",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 114,
      "author": {
        "id": 35432
      },
      "writtenOn": "2023-11-08T14:40:03Z",
      "side": 1,
      "message": "The goal covers both cases, because:\n\n1) A full resync could remove BGP peers [1] given instabilities/inconsistencies in RPC responses via RMQ.\n2) Exceptions or timeouts (an empty list) could remove the BGP speaker [1] or [2] given instabilities/inconsistencies in RPC responses via RMQ.\n\nSo the proposed change is: \"keep the speaker settings and the BGP peer sessions in case of RPC exceptions and/or reestablishment of communication via RPC.\"\n\n[1] https://opendev.org/openstack/neutron-dynamic-routing/src/commit/00a83aa706225871fc9f3db5d79be324b4e84383/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py#L118\n[2] https://opendev.org/openstack/neutron-dynamic-routing/src/commit/00a83aa706225871fc9f3db5d79be324b4e84383/neutron_dynamic_routing/services/bgp/agent/bgp_dragent.py#L107",
      "parentUuid": "aca320d0_e264f6c8",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": true,
      "key": {
        "uuid": "0029ce76_40d16032",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 134,
      "author": {
        "id": 29074
      },
      "writtenOn": "2023-11-08T13:57:57Z",
      "side": 1,
      "message": "Would it maybe be easier to just not touch the existing state if we notice an exception or a broken connection?\nThen we could skip all the caching logic.",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": false,
      "key": {
        "uuid": "cffd3334_04e29554",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 134,
      "author": {
        "id": 35432
      },
      "writtenOn": "2023-11-08T14:40:03Z",
      "side": 1,
      "message": "It could be a good alternative, but we would need a cache timeout timer anyway, since after reconnecting to RMQ the messages received may be inconsistent until the cluster is operating correctly.\n\nI can change the spec to:\n1) Block the sync when a failure is observed (I need to see how to do this, because oslo_messaging catches and handles the exceptions...)\n2) Start the timer when the connection to RMQ is reestablished\n3) Release the sync after the cache timeout expires.\n\nDoes that make sense to you?",
      "parentUuid": "0029ce76_40d16032",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": true,
      "key": {
        "uuid": "6664b1f7_58fdd713",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 134,
      "author": {
        "id": 29074
      },
      "writtenOn": "2023-11-08T14:52:31Z",
      "side": 1,
      "message": "Sounds valid.\nI'm not sure whether oslo.messaging includes a \"sender\" timestamp in the messages. If that were the case, we could just drop all messages from before the reconnect and do a full sync.",
      "parentUuid": "cffd3334_04e29554",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": false,
      "key": {
        "uuid": "fded0d11_a075d9ba",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 134,
      "author": {
        "id": 35432
      },
      "writtenOn": "2023-11-08T22:06:52Z",
      "side": 1,
      "message": "Just to give some context: from the oslo_messaging documentation, it doesn't seem possible to know when RMQ goes offline or comes back, because that is internal to the impl_rabbit driver. So the alternative of monitoring the RMQ connection to block/unblock the BGP sync does not seem feasible.",
      "parentUuid": "6664b1f7_58fdd713",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    },
    {
      "unresolved": false,
      "key": {
        "uuid": "461cfbce_0f0028d9",
        "filename": "specs/2024.1/l3bgp-resilient-peer-sessions.rst",
        "patchSetId": 2
      },
      "lineNbr": 134,
      "author": {
        "id": 29074
      },
      "writtenOn": "2023-11-09T10:15:55Z",
      "side": 1,
      "message": "OK, that's unfortunate.",
      "parentUuid": "fded0d11_a075d9ba",
      "revId": "0c42ee66a51f58c2362aceb39c7f20e2fe7b986f",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543"
    }
  ]
}
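The cache-timeout behaviour debated in this thread (restart a timer on each RPC failure, and keep the last known BGP peer state alive until it expires) can be sketched roughly as follows. This is a minimal illustration under assumed names, not the actual bgp_dragent implementation; `SyncCache`, `peers_to_keep`, and the callback names are hypothetical.

```python
import time


class SyncCache:
    """Sketch of the proposed cache-timeout idea (hypothetical names):
    after an RPC failure, trust the cached peer set until the timer expires
    instead of tearing BGP sessions down."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout     # seconds to keep trusting stale state
        self._clock = clock        # injectable clock, for testing
        self._failure_at = None    # time of the most recent RPC failure
        self._peers = set()        # last successfully fetched peer set

    def on_sync_success(self, peers):
        # A good RPC response clears the failure timer and refreshes the cache.
        self._failure_at = None
        self._peers = set(peers)

    def on_sync_failure(self):
        # Each new failure restarts the timer, so a flapping RMQ keeps the
        # cached state alive indefinitely, as described in the thread.
        self._failure_at = self._clock()

    def peers_to_keep(self):
        # Inside the timeout window, return the cached peers; outside it
        # (or with no recorded failure), return None: "trust the live sync".
        if self._failure_at is not None:
            if self._clock() - self._failure_at < self.timeout:
                return set(self._peers)
        return None
```

With a fake clock, one can check that a failure at t=299 keeps the cache valid until t=599, even though the first failure happened at t=0.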