When an HA router is created in "standby" mode, IPv6 forwarding is
disabled by default in its namespace.
But when the router transitions to "master" on a node, IPv6
forwarding should be enabled. This was fine for routers with a
configured gateway, but we missed the case when the router doesn't
have a gateway configured.
Because of that missing IPv6 forwarding setting, east-west IPv6
traffic between two subnets was not working in the L3 HA case.
This patch fixes it by always configuring ipv6_forwarding on the
"all" interface in the router's namespace, even if the router
doesn't have a gateway configured.
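As a rough illustration (not the actual agent code; the namespace name
and helper are placeholders), enabling the "all" knob boils down to:

    import subprocess

    def enable_ipv6_forwarding(namespace):
        # Enable forwarding for every interface in the namespace, not
        # only the gateway port, so east-west IPv6 traffic works even
        # when no gateway is configured on the router.
        subprocess.check_call([
            'ip', 'netns', 'exec', namespace,
            'sysctl', '-w', 'net.ipv6.conf.all.forwarding=1'])

    # e.g. enable_ipv6_forwarding('qrouter-<uuid>')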
Conflicts:
neutron/tests/functional/agent/l3/framework.py
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: I8b1b2b426f7a26a4b2407a83f9bf29dd6e9ba7b0
Closes-Bug: #1818224
(cherry picked from commit b119247bea)
(cherry picked from commit 270912a8c7)
Sometimes with HA routers it may happen that
keepalived sets the status of a router to MASTER before
the neutron-keepalived-state-change daemon spawns "ip monitor"
to monitor changes of IPs in the router's namespace.
In such a case the neutron-keepalived-state-change process never
notices that keepalived set the router to MASTER, the L3 agent is
not notified about it, and the router is not configured properly.
To avoid this race condition, neutron-keepalived-state-change now
checks whether the VIP address is already configured on the HA
interface before it spawns "ip monitor". If it is already configured
by keepalived, it notifies the L3 agent that the router is set to
MASTER.
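The idea can be sketched as follows (helper and argument names are
illustrative, not the daemon's actual code):

    import subprocess

    def vip_already_configured(namespace, ha_iface, vip_cidr):
        # List addresses on the HA interface inside the router namespace
        # and check whether keepalived already configured the VIP there.
        out = subprocess.check_output(
            ['ip', 'netns', 'exec', namespace,
             'ip', '-o', 'addr', 'show', 'dev', ha_iface]).decode()
        return vip_cidr in out

    # If this returns True before "ip monitor" is spawned, notify the
    # L3 agent about MASTER right away instead of waiting for an address
    # change event that already happened.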
Change-Id: Ie3fe825d65408fc969c478767b411fe0156e9fbc
Closes-Bug: #1818614
(cherry picked from commit 8fec1ffc83)
In some cases our db migration tests which run on MySQL fail
with a timeout, which happens due to the slow VMs the job
runs on.
Sometimes it may also happen that the timeout exception is raised
in the middle of some sqlalchemy operations and
sqlalchemy.InterfaceError is raised as the last one.
Details about this exception can be found in [1].
To avoid many rechecks for this reason, this patch
introduces a new decorator which is very similar to "unstable_test"
but skips the test only if one of the exceptions mentioned above
is raised.
In all other cases it fails the test.
That should be a bit safer for us because we will not miss
other failures raised in those tests, while still avoiding rechecks
for this "well-known" reason described in the related bug.
[1] http://sqlalche.me/e/rvf5
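A minimal sketch of such a decorator (the name and exact exception
list are illustrative):

    import functools
    import unittest

    import fixtures
    from sqlalchemy import exc as sqlalchemy_exc

    def skip_if_timeout(f):
        # Unlike "unstable_test", only the well-known timeout-related
        # exceptions turn into a skip; any other failure still fails
        # the test.
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except fixtures.TimeoutException:
                raise unittest.SkipTest("DB migration test timed out")
            except sqlalchemy_exc.InterfaceError as e:
                # Timeout raised in the middle of a sqlalchemy
                # operation, see [1] above.
                raise unittest.SkipTest(
                    "InterfaceError after timeout: %s" % e)
        return wrapper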
Conflicts:
neutron/tests/functional/db/test_migrations.py
neutron/tests/base.py
Change-Id: Ie291fda7d23a696aaa1160d126a3cf72b08c522f
Related-Bug: #1687027
(cherry picked from commit c0fec67672)
(cherry picked from commit e6f22ce81c)
When the external gateway is plugged and we enable IPv6
forwarding on it, make sure the 'all' sysctl knob is also
enabled, else IPv6 packets will not be forwarded. This
seems to only affect HA routers that default to disabling
this 'all' knob on creation.
Also, when we are removing all the IPv6 addresses from an
HA router internal interface, set 'accept_ra' to zero so
it doesn't accidentally auto-configure an address. Set
it back to one when adding them back.
Re-homed newly added _wait_until_ipv6_forwarding_has_state()
accordingly.
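For reference, toggling 'accept_ra' on an internal interface is a
plain sysctl call; a hedged sketch (namespace and device names are
placeholders):

    import subprocess

    def set_accept_ra(namespace, device, enabled):
        # Disable RA acceptance while the interface has no IPv6
        # addresses so a backup HA router does not auto-configure one
        # via SLAAC; re-enable it when the addresses are added back.
        subprocess.check_call([
            'ip', 'netns', 'exec', namespace, 'sysctl', '-w',
            'net.ipv6.conf.%s.accept_ra=%d'
            % (device, 1 if enabled else 0)])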
Conflicts:
neutron/tests/functional/agent/l3/test_ha_router.py
Closes-bug: #1787919
Change-Id: Ia1f311ee31d1479089685367a97bf13cf170b342
(cherry picked from commit b847cd02c5)
(cherry picked from commit dfedafe5f6)
When a deployment has instance ports that are neutron trunk ports with
DPDK vhu in vhostuserclient mode, a reboot of the instance makes nova
delete the OVS port and then recreate it when the instance comes back
from the reboot. This quick transition can trigger a race condition
that causes the tbr trunk bridge to be deleted after the port has been
recreated. See the bug for more details.
This change mitigates the race condition by adding a check for active
service ports within the trunk port deletion function.
Change-Id: I70b9c26990e6902f8888449bfd7483c25e5bff46
Closes-Bug: #1807239
(cherry picked from commit bd2a1bc6c3)
It may happen that the L3 agent works in dvr_snat mode but
handles some router as a "normal" dvr router because snat
for this router is handled on another node.
In such a case we shouldn't try to get floating IP cidrs
from the snat namespace as it doesn't exist on the host.
Change-Id: Ib27dc223fcca56030ebb528625cc927fc60553e1
Related-Bug: #1717302
(cherry picked from commit 7d0e1ccd34)
With DVR routers, if a port is associated with a FloatingIP
before it is used by a VM, the FloatingIP will initially be
set up in the Network Node SNAT namespace, since the port
is not bound to any host.
Then when the port is attached to a VM, the port gets its
host binding, the FloatingIP setup should be migrated
to the compute host, and the original FloatingIP in the Network
Node SNAT namespace should be cleared.
But the original FloatingIP setup in the SNAT namespace was not
cleared by the agent.
This patch addresses the issue.
Change-Id: I55a16bcc0020087aa1abe76f5bc85cd64ccdaecd
Closes-Bug: #1796491
(cherry picked from commit cd0cc47a6a)
When two DVR routers are connected to each other via a
tenant network, those routers always need to be deployed
on the same compute nodes.
So this patch changes the DVR router scheduler so that it creates
a DVR router on each host on which there are VMs or other DVR
routers connected to the same subnets.
Co-Authored-By: Swaminathan Vasudevan <SVasudevan@suse.com>
Closes-Bug: #1786272
Conflicts:
neutron/agent/l3/agent.py
neutron/db/l3_dvr_db.py
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: I579c2522f8aed2b4388afacba34d9ffdc26708e3
(cherry picked from commit 5018d70241)
(cherry picked from commit b127433f38)
In the test_ha_router_namespace_has_ipv6_forwarding_disabled
functional test it may happen that the L3 agent has not yet changed
IPv6 forwarding, and the test fails because it checks the value only
once, just after the router state is changed to master.
This patch fixes that race by waiting up to 60 seconds for the
IPv6 forwarding change.
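The wait is essentially a polling loop; a sketch (helper name and
polling details are illustrative, neutron's functional tests use a
similar wait_until_true utility):

    import time

    def wait_until_ipv6_forwarding_has_state(get_state, expected,
                                             timeout=60):
        # Poll instead of asserting once: the L3 agent flips the sysctl
        # asynchronously after the router transitions to master.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if get_state() == expected:
                return
            time.sleep(1)
        raise AssertionError(
            "ipv6 forwarding did not reach %s" % expected)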
Change-Id: I85a602561ebe9b7ab135913af49a3f010b09f196
Closes-Bug: #1801930
(cherry picked from commit 916e774516)
For an L3 DVR HA router, the centralized floating IP NAT rules are not
installed in every HA node's snat namespace. So, install the rules in
the router's snat namespace on every scheduled HA router host.
Conflicts:
neutron/tests/common/l3_test_common.py
neutron/tests/functional/agent/l3/test_dvr_router.py
Conflicts:
neutron/tests/common/l3_test_common.py
Closes-Bug: #1793527
Change-Id: I08132510b3ed374a3f85146498f3624a103873d7
(cherry picked from commit ee7660f593)
(cherry picked from commit 2a1cdf01b5)
(cherry picked from commit b93ef2f7e8)
All floating IPs for DVR_NO_EXTERNAL agents are configured
in the SNAT namespace, so there is no need to configure the
address scope related routes in the router namespace when the
agent is configured as DVR_NO_EXTERNAL.
Change-Id: I009dae9e7f485641f2f19dce8dd575da04bfb044
Related-Bug: #1753434
(cherry picked from commit 7c4da6fb75)
Patch [1] increased the timeout for the test_walk_version functional
tests on the MySQL backend to 300 seconds to avoid failures due to
timeouts.
Unfortunately it looks like on nodes from some cloud providers used
in the gate, and with the number of migration scripts we have in
Neutron, those tests can sometimes take even around 400 seconds.
So let's increase this to 600 seconds to avoid such failures of the
functional tests job.
[1] https://review.openstack.org/#/c/610003/
Change-Id: I9d129f0e90a072ec980aadabb2c6b812c08e1618
Closes-Bug: #1687027
(cherry picked from commit c39afbd5fc)
Extra routes are not configured on router namespaces on a dvr_snat
node with a DVR-HA configuration.
This patch fixes the problem.
Change-Id: If620b23564479042aa6f58640bcd6705e5eb52cf
Closes-Bug: #1797037
(cherry picked from commit 81652cd939)
In Neutron we quite often hit the same issue as Manila, see [1] for
details.
It looks like the solution for this problem may be to increase the
timeout for the test_walk_version functional tests.
The higher timeout will be applied to tests for both the pgsql and
mysql backends, but it is mostly needed for mysql because 'pymysql'
works much slower on slow nodes than 'psycopg2'.
This patch also adds a new decorator to set an individual timeout
for tests.
[1] https://bugs.launchpad.net/manila/+bug/1501272
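A minimal sketch of a per-test timeout decorator (the name is
illustrative; it assumes a fixtures-based test case):

    import functools

    import fixtures

    def set_timeout(timeout):
        # Override the default per-test timeout for particularly slow
        # tests, e.g. test_walk_version on MySQL with pymysql.
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, *args, **kwargs):
                self.useFixture(fixtures.Timeout(timeout, gentle=True))
                return f(self, *args, **kwargs)
            return wrapper
        return decorator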
Change-Id: I5f344af6dc3e5a6ee5f52c250b6c719e1b43e02d
Closes-Bug: #1687027
(cherry picked from commit c2c37272bf)
Sometimes we have seen the 'fg-' ports within the fip namespace
either go down, not get created in time, or get deleted due to
some race conditions.
When this happens, the code tries to recover itself after a couple
of exceptions when there is a router_update message.
But after recovery we could see that the fip namespace is
recreated and the 'fg-' port is plugged and active, but the
'fpr-' and 'rfp-' ports are missing, which leads to the
FloatingIP failure.
This patch fixes the issue by checking for the missing devices
in all router_updates.
Change-Id: I78c7ea9f3b6a1cf5b208286eb372da05dc1ba379
Closes-Bug: #1776984
(cherry picked from commit 5a7c12f245)
By default the number of MAC addresses which OVS stores in memory
is quite low: 2048.
Any eviction of a MAC learning table entry triggers revalidation.
Such revalidation is very costly, so it causes high CPU usage by
the ovs-vswitchd process.
To work around this problem, a higher value of the mac-table-size
option can be set for the bridge. Then this revalidation happens
less often and CPU usage is lower.
This patch adds a config option for neutron-openvswitch-agent to allow
users to tune this setting on bridges managed by the agent.
By default this value is set to 50000, which should be enough for most
systems.
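The equivalent manual tuning on a bridge can be sketched like this
(bridge name and value are examples; the agent applies its option
through the OVSDB API):

    import subprocess

    def set_mac_table_size(bridge, size=50000):
        # A larger MAC learning table means entry eviction (and the
        # costly revalidation it triggers) happens far less often.
        subprocess.check_call([
            'ovs-vsctl', 'set', 'bridge', bridge,
            'other-config:mac-table-size=%d' % size])

    # e.g. set_mac_table_size('br-int')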
Change-Id: If628f52d75c2b5fec87ad61e0219b3286423468c
Closes-Bug: #1775797
(cherry picked from commit 1f8378e0ac)
For HA routers, IPv6 forwarding is now disabled by default and
enabled only on the master node.
Before this patch it was done the opposite way: forwarding was
enabled by default and then disabled on backup nodes.
When forwarding was enabled/disabled for the qg- port, MLDv2 packets
were sent, and that might lead to temporary packet loss as packets to
the FIP were sent to a backup node instead of the master one.
Related-Bug: #1771841
Change-Id: Ia6b772e91c1f94612ca29d7082eca999372e60d6
(cherry picked from commit 3e9e2a5b4b)
Agent OVS interface code adds ports without a VLAN tag; if
neutron-openvswitch-agent fails to set the tag, or takes
too long, the port will behave as a trunk port, receiving
traffic from the external network or any other port
sending traffic on br-int.
Also, those kinds of ports trigger a code path
on the ovs-vswitchd revalidator thread which can eventually
hog the CPU of the host (that's a bug under investigation [1]).
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1558336
Conflicts:
neutron/tests/functional/agent/test_ovs_lib.py
needed the addition of the following import:
from neutron.plugins.ml2.drivers.openvswitch.agent.common import (
constants as agent_const)
Co-Authored-By: Slawek Kaplonski <skaplons@redhat.com>
Change-Id: I024bbbdf7059835b2f23c264b48478c71633a43c
Closes-Bug: 1767422
(cherry picked from commit 88f5e11d8b)
(cherry picked from commit 2b1d413ee9)
A recent pep8 upgrade and the corresponding pycodestyle update break
the pep8 job due to the new rules.
This commit fixes the following new errors:
- E266 too many leading '#' for block comment
- E501 line too long
- H903 Windows style line endings not allowed in code
The following errors are added to the ignore list
as there are many errors:
- E402 module level import not at top of file
- E731 do not assign a lambda expression, use a def
- W503 line break before binary operator
Conflicts:
neutron/tests/unit/plugins/ml2/drivers/linuxbridge/agent/test_linuxbridge_neutron_agent.py
neutron/tests/tempest/api/test_timestamp.py
Change-Id: I1fd3357479bb2ba3d89de92739ffac99900761b6
(cherry picked from commit 7a714aeb13)
(cherry picked from commit 71b305cb9e)
Post-binding information about router ports is missing in the results
of RPC calls made by L3 agents. The sync_routers code ensures that
bindings are present; however, it does not refresh router objects
before returning them - for RPC clients ports remain unbound until the
next sync, and the address scope information necessary to create
routes from fip namespaces to qrouter namespaces is missing.
Conflicts:
neutron/api/rpc/handlers/l3_rpc.py
Change-Id: Ia135f0ed7ca99887d5208fa78fe4df1ff6412c26
Closes-Bug: #1759971
(cherry picked from commit ff5e8d7d6c)
When the l3 agent is restarted on a dvr_snat node that is configured
for L3_HA and has a centralized FloatingIP configured on the
qg- interface in the snat namespace, that FloatingIP is not
re-configured on the qg- interface when the agent starts.
The reason is that the cidr is not retrieved from the
keepalived instance but only from the
centralized_fip_cidr_set.
If 'L3_HA' is configured we need to retrieve it from the keepalived
instance.
This patch fixes the problem by retrieving the cidrs from the
keepalived instance for the qg- interface.
Change-Id: I848a20d06e2d344503a4cb1776dbe2617d91bc41
Closes-Bug: #1740450
(cherry picked from commit 64028a389f)
l3-agent checks the HA state of routers when a router is updated.
To ensure that the HA state is only checked on HA routers the following
check is performed: `if router.get('ha') and not is_dvr_only_agent`.
This check should ensure that the check is only performed on
DvrEdgeHaRouter and HaRouter objects.
Unfortunately, there are cases where we have DvrEdgeRouter objects
running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
neutron-lbaas in a landscape with 6 network nodes and
max_l3_agents_per_router set to 3, it may happen that the loadbalancer
is placed on a network node that does not have a DvrEdgeHaRouter running
on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
network node to serve the loadbalancer, just like it would deploy a
DvrEdgeRouter on a compute node when deploying a VM.
Under such circumstances each update to the router will lead to an
AttributeError, because the DvrEdgeRouter object does not have the
ha_state attribute.
This patch circumvents the issue by doing an additional check on the
router object to ensure that it actually has the ha_state attribute.
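The guard can be sketched roughly like this (function and argument
names are illustrative):

    def should_check_ha_state(router_dict, router_info,
                              is_dvr_only_agent):
        # A router flagged 'ha' in the DB can still be materialized
        # locally as a plain DvrEdgeRouter (no ha_state attribute) on a
        # dvr_snat agent, so guard on the attribute as well.
        return (router_dict.get('ha')
                and not is_dvr_only_agent
                and hasattr(router_info, 'ha_state'))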
Closes-Bug: #1755243
Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
(cherry picked from commit 8c2dae659a)
An allowed_address_pair IP, when associated with a network port, will
inherit the service's MAC.
Right now the ARP entry is updated with the last MAC it was
associated with. But when allowed_address_pair IPs are used in
the context of VRRP, the MAC keeps switching between the MASTER
and the SLAVE. The VRRP instance sends out GARP, but the ARP entry
in the router namespace is not updated based on the GARP.
This might cause the VRRP IP and the service using the IP to fail.
Since we have been adding the ARP entry with NUD state
PERMANENT, the ARP entries are set forever and do not adopt the
GARP sent out by the VRRP instance.
This will cause instances associated with DVR routers to have a
service interruption.
So the proposed patch adds the ARP entry for the allowed address
pair with NUD state 'REACHABLE'.
This allows the allowed_address_pair IP's MAC to be updated on the
fly.
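The difference boils down to the neighbour entry's NUD state; a hedged
sketch using the ip CLI (namespace and device are placeholders):

    import subprocess

    def add_arp_entry(namespace, device, ip, mac, nud='reachable'):
        # 'reachable' entries can be updated when the VRRP instance
        # sends GARP after a failover; 'permanent' entries would pin
        # the allowed-address-pair IP to whichever MAC was seen first.
        subprocess.check_call([
            'ip', 'netns', 'exec', namespace,
            'ip', 'neigh', 'replace', ip, 'lladdr', mac,
            'nud', nud, 'dev', device])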
Change-Id: I43c3471f5d259e8c2ee1685398a06a4680c0bfcd
Closes-Bug: #1608400
(cherry-picked from commit fbe308bdc1)
Centralized floating IPs return to Error state when
the 'dvr_no_external' agent restarts.
The sync data received from the server was not handled
properly by the agent, and so the 'dvr_snat_bound'
flag was not updated.
This would put the floating IP in the Error state.
This patch fixes the issue mentioned above.
Closes-Bug: #1741411
Change-Id: Id1cf26ffba8262ba7b3e5f41faa4cb28ba9dcb7d
(cherry picked from commit 477d4135ba)
Previously, running neutron_ovs_cleanup on an installation with
5000 ports would time out even after setting the timeout to 3
hours. The code would do a bridge and port "by name" lookup for
every port due to being based off the ovs-vsctl implementation
where names are how everything is looked up. With this change,
the same test runs in ~1.5 mins.
This implementation adds a new OVSDB command that just looks up
the bridge, iterates over its ports, and deletes the ones that
should be deleted in a single transaction per bridge.
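Conceptually the new command batches the per-bridge cleanup into one
transaction; a rough sketch (the ovsdb wrapper and its method names
are placeholders, not the exact command that was added):

    def cleanup_bridge_ports(ovsdb, bridge, should_delete):
        # One bridge lookup, one pass over its ports, one transaction
        # per bridge: no per-port "by name" lookups as in the
        # ovs-vsctl based implementation.
        with ovsdb.transaction() as txn:
            for port in ovsdb.list_ports(bridge):
                if should_delete(port):
                    txn.add(ovsdb.del_port(port, bridge))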
Change-Id: I23c81813654596d61d8d930e0bfb0f016f91bc46
(cherry picked from commit fef374131b)
ovsdb maps accept only strings as values. This patch converts the
integer to a string before passing it to ovsdb when the
vxlan_udp_port config value is used.
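In practice the fix is a cast before the value reaches the OVSDB map
column; an illustrative fragment:

    def tunnel_options(remote_ip, vxlan_udp_port):
        # ovsdb map columns (e.g. the Interface "options" column) take
        # string values only, so cast the configured integer UDP port.
        return {'remote_ip': remote_ip,
                'dst_port': str(vxlan_udp_port)}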
Change-Id: Idba77939a80d80a4bc9625d10c8b37b23b91b9c5
Closes-bug: #1742931
(cherry picked from commit 622a137974)
If callers of get_devices_with_ip(), or
device.addr.list(to=address), pass an ip_cidr, it
could match any ip_cidr in that range on the interface.
Callers need to pass the IP without the prefix portion in
order to match it exactly. Added a helper utility to
strip the cidr part from an ip_cidr.
Determined that the unit test for this can't actually check
this case since we are mocking the return value from
/sbin/ip, so modified it to just make sure the dict
is correct.
Added a functional test that adds two IP addresses in
the same IP range to verify that we actually filter
correctly when a 'to=IP' is specified.
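The helper is essentially a prefix strip; a minimal sketch (name is
illustrative):

    def cidr_to_ip(ip_cidr):
        # '192.168.0.5/24' -> '192.168.0.5'; passing the bare IP to
        # device.addr.list(to=...) matches that address exactly instead
        # of matching anything inside the CIDR range.
        return ip_cidr.split('/')[0]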
Change-Id: I3a95b3bb72a43f322ad23892d8959398aac22a1c
Closes-bug: #1728080
(cherry picked from commit 7b8289253c)
As soon as we call router_info.initialize(), we could
possibly try and process a router. If it is HA, and
we have not fully initialized the HA port or keepalived
manager, we could trigger an exception.
Move the call to check_ha_state_for_router() into the
update notification code so it's done after the router
has been created. Updated the functional tests for this
since the unit tests are now invalid.
Also added a retry counter to the RouterUpdate object so
the l3-agent code will stop re-enqueuing the same update
in an infinite loop. We will delete the router if the
limit is reached.
Finally, have the L3 HA code verify that ha_port and
keepalived_manager objects are valid during deletion since
there is no need to do additional work if they are not.
Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
Closes-bug: #1726370
(cherry picked from commit d2b909f533)
With the current change in allowing the unbound fip
to be associated with the snat node, we are seeing
that all floating IPs that are associated with an
unbound port are created at the snat node.
This is also applicable for floating IPs that are
created just before associating the port to a VM.
We have seen such scenarios in the test cases.
This is the right behavior as per design. But when
the port is bound to a host, the floating IP should
be migrated to the respective host.
This patch fixes the issue by sending a notification to
the respective node when the port is bound, and also
clearing the fip from the snat node.
Closes-Bug: #1718788
Change-Id: I6b1f3ffc3c3336035632f6a82d3a87b3be57b403
(cherry picked from commit 27fcf86bcb)
When HA is enabled with DVR routers, the centralized floating
IPs are not configured properly in the DVR snat namespace
of the master router.
The reason is that we were not calling add_centralized_floatingip
and remove_centralized_floatingip in the DvrEdgeHaRouter
class.
This patch overrides add_centralized_floatingip and
remove_centralized_floatingip in dvr_edge_ha_router.py
to add the cidr to the VIPs.
Closes-Bug: #1716829
Change-Id: Icc8c5d4e22313448e2066a29dbe509e4345b364c
(cherry picked from commit b9ecb3804c)
With a recent change to the neutron server code, the server was
processing floating IPs that were not bound to the respective
agent during fullsync operation.
Change to always initialize floating IP host info so callers
can determine if info should be sent to an agent or not.
Also changed the logic that decides when the server should
send a floating IP to an agent to be easier to understand.
Closes-bug: #1713927
Change-Id: Ic916225e0a11c3fb8cd94437ca063e0d3295a569
(cherry picked from commit 7bff99ac4a)
Otherwise we don't see some of the configuration options for the
agent; for example, AGENT.root_helper is missing.
To make sure the logging is as early as possible, and to make sure that
options that may be registered by extensions are also logged, some
refactoring was applied to the code to move the extension manager
loading as early as possible, even before agent's __init__ is called.
Related-Bug: #1718767
Change-Id: I823150cf6406f709d1e4ffa74897d598e80f5329
(cherry picked from commit 45be804b40)
When router interfaces are added to a DVR router, if the router has
a gateway configured, then internal csnat ports are created for
the corresponding router interfaces.
We have seen recently that after the csnat port is created, if the
RouterPort table update fails, a DB retry happens
and that retry operation creates an additional csnat port.
This additional port is not removed automatically when the
router interfaces are deleted.
This issue is seen when testing with a simple heat template as
per the bug report.
This patch fixes the issue by calling the RouterPort create with the
delete_port_on_error context.
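Schematically, the csnat port creation is wrapped so a failed (and
retried) RouterPort insert cannot leave an orphaned port behind; a
sketch, not the exact plugin code:

    import contextlib

    @contextlib.contextmanager
    def delete_port_on_error(core_plugin, context, port_id):
        # If the body (e.g. the RouterPort DB insert) raises, remove
        # the just-created csnat port instead of leaking it.
        try:
            yield
        except Exception:
            core_plugin.delete_port(context, port_id)
            raise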
Change-Id: I916011f2200f02556ebb30bce30e349a8023602c
Closes-Bug: #1709774
(cherry picked from commit 8c3cb2e15b)
Remove duplicated and empty fields from user requests
in Pecan to preserve the old legacy API controller behavior.
Closes-Bug: #1714384
Change-Id: I1afc24b146a8fcc6c8ebae708f32dd7c1795292e
(cherry picked from commit 700d609ace)
Change [1] altered the behavior of the legacy API controller
to do the sane thing and return an HTTP 403 instead of a 404
whenever a user got a policy authorization failure when trying
to mutate a resource they have the permission to view.
This carries the same logic over to the pecan API.
This also adjusts the logic for GET requests to return 404s
instead of 403s to match the resource hiding behavior of the
old controller.
1. I7a5b0a9e89c8a71490dd74497794a52489f46cd2
Closes-Bug: #1714388
Change-Id: I9e0d288a42bc63c2927bebe9c581b83e6fbe010b
(cherry picked from commit fe8107a817)