When an HA router is created in "standby" mode, IPv6 forwarding is
disabled by default in its namespace.
But when the router transitions to "master" on a node, IPv6
forwarding should be enabled. This worked fine for routers with a
configured gateway, but we somehow missed the case where the router
doesn't have a gateway configured.
Because IPv6 forwarding was not enabled in that case, east-west IPv6
traffic between 2 subnets was not working in the L3 HA case.
This patch fixes it by always configuring ipv6_forwarding on the
"all" interface in the router's namespace, even if the router doesn't
have a gateway configured.
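The setting being toggled is roughly the following; this is a minimal
sketch using ip_lib, not necessarily the exact helper the agent uses:

    from neutron.agent.linux import ip_lib

    def enable_ipv6_forwarding(ns_name):
        # Equivalent to:
        #   ip netns exec <ns_name> sysctl -w net.ipv6.conf.all.forwarding=1
        cmd = ['sysctl', '-w', 'net.ipv6.conf.all.forwarding=1']
        ip_lib.IPWrapper(namespace=ns_name).netns.execute(cmd,
                                                          run_as_root=True)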
Conflicts:
neutron/tests/functional/agent/l3/framework.py
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: I8b1b2b426f7a26a4b2407a83f9bf29dd6e9ba7b0
Closes-Bug: #1818224
(cherry picked from commit b119247bea)
(cherry picked from commit 270912a8c7)
Since iptables-restore doesn't support --dport with protocol vrrp,
it errors out while setting the security groups on the hypervisor.
Marking this a partial fix, since we still need a change to prevent
adding those incompatible rules in the first place, but this
patch will stop the bleeding.
Change-Id: If5e557a8e61c3aa364ba1e2c60be4cbe74c1ec8f
Partial-Bug: #1818385
(cherry picked from commit 8c213e4590)
When the L3 agent is running in dvr_snat mode on a compute node, as is
the case e.g. in some of the gate jobs, it may happen that the same
router is scheduled in standby mode on the compute node while an
instance connected to it is running on that same node.
In such a case the metadata proxy needs to be spawned in the router
namespace even if the router is in standby mode.
Conflicts:
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: Id646ab2c184c7a1d5ac38286a0162dd37d72df6e
Closes-Bug: #1817956
Closes-Bug: #1606741
(cherry picked from commit 6ae228cc2e)
Need to pass centralized floating IPs as preserve_ips to
_external_gateway_added during a DVR router update.
Otherwise those IP addresses will be deleted from the gw device in a
certain case: when a router with active centralized floating IPs is
being scheduled to a new dvr_snat L3 agent (rescheduled from a down
one).
Please see the corresponding traces in the bug description.
Change-Id: Iaeb9fbed73144df6fcd9092c665ed19986e85f4d
Closes-bug: #1817306
(cherry picked from commit 1ee18775a9)
The firewall won't attempt to initialize a port on update in case the
port hasn't been initialized by sg_agent yet. This fixes a race where an
update RPC call arrives between wiring the tap device to the integration
bridge and the firewall initialization.
Change-Id: Ice0667df606ae23061acebceea23ab6e49dadbcf
Closes-bug: #1740885
(cherry picked from commit ed57c3de42)
The RouterInfo class has an internal_ports cache which is updated
in the _process_internal_ports() method.
There was an issue in this update logic: it iterated with enumerate()
over the local variable "internal_ports", which represents the
router's current ports, and when a current port was found in the
updated_ports list it was stored in the RouterInfo().internal_ports
variable under the same index at which it was found in the
"internal_ports" local variable.
This sometimes caused a problem, because the same port can be
stored under different indexes in the internal_ports and
RouterInfo().internal_ports lists, so the wrong port in
RouterInfo().internal_ports was overwritten.
This in turn broke generation of the radvd config file: the ports
cache list contained duplicate info about the same port,
so the radvd config file contained duplicate interface definitions too.
This should eventually be fixed properly by changing
RouterInfo.internal_ports to be a dict instead of a list of ports, but
such a patch would be much bigger and (possibly) harder to backport to
stable branches.
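A rough sketch of updating the cache entry by matching the port id
instead of reusing the enumerate() index (names as used above,
illustrative only):

    for updated_port in updated_ports:
        for index, cached_port in enumerate(self.internal_ports):
            if cached_port['id'] == updated_port['id']:
                # Overwrite the entry at the position the port actually has
                # in RouterInfo().internal_ports, not the position it has in
                # the local "internal_ports" list.
                self.internal_ports[index] = updated_port
                break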
Change-Id: I2e38457942518c8a3e07e606091bb6720317b77e
Closes-Bug: #1813279
(cherry picked from commit 21cddc47b4)
The Dnsmasq driver used by the DHCP agent has a restart() method which
calls disable() and then enable() to start the dnsmasq process again.
As observed in functional tests, from time to time the new dnsmasq
process may be started before the old process is really down. That
leads to an error that the IP address dnsmasq wants to bind to is
already in use, and it fails to start.
This patch adds the possibility to call the disable() method with a
block flag set to True. In that case the driver will ensure in
disable() that the process is really no longer active.
This blocking disable() is now used in the restart() method.
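A minimal sketch of the blocking part, assuming the existing stop logic
is factored into a helper (placeholder name below):

    from neutron.common import utils as common_utils

    def disable(self, retain_port=False, block=False):
        self._stop_dnsmasq(retain_port)  # placeholder for the existing logic
        if block:
            # Return only once the old process is really gone, so a following
            # enable() cannot race with it over the bind address.
            common_utils.wait_until_true(lambda: not self.active)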
Change-Id: I419a451633badbc3d32edcee1945fca3e3d9f6be
Closes-Bug: #1811126
(cherry picked from commit d471a85931)
Bug #1244589 re-appeared for IPv6.
This change adds an ip6tables rule to fix the checksum of DHCPv6
response packets. Those checksums were left unfilled by virtio (as a
hypervisor internal optimization), but some picky dhcp clients (AFAIU
particularly ISC dhclient) try verifying the checksums, so they fail
to acquire an address if the checksums are left incorrect.
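The added rule is roughly equivalent to the following (DHCPv6 clients
listen on UDP port 546; the exact chain used by the agent may differ):

    ip6tables -t mangle -A POSTROUTING -p udp -m udp --dport 546 \
        -j CHECKSUM --checksum-fill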
Change-Id: I4a045e0dcfcbd3c7959a78f1460d5bf7da0252ff
Closes-Bug: #1811639
Related-Bug: #1244589
(cherry picked from commit 26eb2509fe)
Currently any dhcp agent instance will work as an open resolver. For
deployments using publicly routed addresses for tenant networks, this
allows the agent to be abused in DDoS attacks, see [1].
By setting the `--local-service` option dnsmasq will filter DNS queries
and reply only to queries from directly attached networks.
[1] https://bugs.launchpad.net/neutron/+bug/1501206
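In the dnsmasq driver this boils down to one more option on the command
line that is built for the process, e.g. (surrounding options elided,
variable name illustrative):

    cmd.append('--local-service')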
Conflicts:
neutron/cmd/sanity_check.py
Closes-Bug: #1501206
Change-Id: I76d810aad2ce0f15a88bd798963012fa0efca74e
(cherry picked from commit 0fce3ca2c1)
If the DHCP agent's port cache is out of sync with the neutron server,
the dnsmasq entries are wrong and VMs may not acquire an IP because of
duplicate entries.
When the DHCP agent executes the port_create_end method, the port's
IP should be checked before being used; if there are duplicate IP
addresses in the same network in the cache, we should resync.
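A rough sketch of that check (attribute and helper names are
illustrative, not the exact agent code):

    created_ips = {ip['ip_address'] for ip in created_port['fixed_ips']}
    for cached_port in network.ports:
        if cached_port['id'] == created_port['id']:
            continue
        cached_ips = {ip['ip_address'] for ip in cached_port['fixed_ips']}
        if created_ips & cached_ips:
            # The cache disagrees with the server; trigger a resync rather
            # than writing duplicate dnsmasq host entries.
            self.schedule_resync('Duplicate IP addresses found', network.id)
            return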
Co-Authored-By: doreilly@suse.com
Closes-Bug: #1645835
Change-Id: Icc555050283420fddfb90bb67e02bc303e989e27
The AsyncProcess.stop() method now has an additional parameter,
kill_timeout. If it is set to a value different than None,
eventlet.green.subprocess.Popen.wait() will be called with this
timeout, so a TimeoutExpired exception will be raised if the
process is not killed within this "kill_timeout" time.
In that case the process will be killed "again" with the SIGKILL
signal to make sure that it is gone.
This should fix the problem with failing fullstack tests, where the
ovs_agent process was sometimes not killed and the test timeout was
reached in this wait() method.
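A minimal sketch of the described behaviour (illustrative, not the
exact AsyncProcess internals):

    import signal

    from eventlet.green import subprocess

    def stop_process(process, kill_signal=signal.SIGTERM, kill_timeout=None):
        process.send_signal(kill_signal)
        if kill_timeout is not None:
            try:
                process.wait(timeout=kill_timeout)
            except subprocess.TimeoutExpired:
                # The process ignored the first signal within kill_timeout
                # seconds; kill it "again" with SIGKILL to make sure it is
                # gone.
                process.kill()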
Conflicts:
neutron/agent/linux/async_process.py
Change-Id: I1e12255e5e142c395adf4e67be9d9da0f7a3d4fd
Closes-Bug: #1798472
(cherry picked from commit 9b23abbdb6)
The unit test
test_enable_dhcp_helper_enable_metadata_nonisolated_dist_network
modifies the global variables fake_port1 and fake_port2, creating
flakiness in unit tests that use those variables when they are
executed in environments with high concurrency.
Creating a deep copy of the variable avoids propagating those changes
to other unit tests.
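Illustrative only; the test now mutates a private deep copy instead of
the shared module-level fixture:

    import copy

    fake_port1_copy = copy.deepcopy(fake_port1)
    # Mutate fake_port1_copy as the test needs; fake_port1 itself stays
    # intact for other tests running concurrently.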
Closes-Bug: #1809643
Change-Id: Idfd0e99739952baf4d7b545b406cd1b251deb5f8
Signed-off-by: aojeagarcia <aojeagarcia@suse.com>
(cherry picked from commit e83e5618b7)
When a subnet's enable_dhcp attribute is updated, we must restart the
dhcp device. So, when deciding between 'restart' and
'reload_allocations' in the refresh_dhcp_helper function, we now only
compare the CIDRs of subnets which have dhcp enabled.
The previous logic only called 'restart' when a subnet was deleted or
added. This could leave the dhcp port not updated when a subnet's
enable_dhcp was updated to True.
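Roughly the comparison described above (variable names are
illustrative):

    old_cidrs = {s.cidr for s in old_network.subnets if s.enable_dhcp}
    new_cidrs = {s.cidr for s in network.subnets if s.enable_dhcp}
    if old_cidrs == new_cidrs:
        self.call_driver('reload_allocations', network)
    else:
        self.call_driver('restart', network)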
Change-Id: Ic547946ac786c5fab82b4ee7078bf86483f51eb5
Closes-Bug: #1805824
(cherry picked from commit 9aa7af8221)
With DVR routers, if a port is associated with a FloatingIP
before it is used by a VM, the FloatingIP will initially be
set up in the Network Node SNAT namespace, since the port
is not bound to any host.
Then, when the port is attached to a VM, the port gets its
host binding, the FloatingIP setup should be migrated
to the compute host, and the original FloatingIP in the Network
Node SNAT namespace should be cleared.
But the original FloatingIP setup in the SNAT namespace was not
cleared by the agent.
This patch addresses the issue.
Change-Id: I55a16bcc0020087aa1abe76f5bc85cd64ccdaecd
Closes-Bug: #1796491
(cherry picked from commit cd0cc47a6a)
In the case where 2 DVR routers are connected to each other with a
tenant network, those routers need to always be deployed
on the same compute nodes.
So this patch changes the DVR router scheduler so that it creates the
DVR router on each host on which there are VMs or other DVR routers
connected to the same subnets.
Co-Authored-By: Swaminathan Vasudevan <SVasudevan@suse.com>
Closes-Bug: #1786272
Conflicts:
neutron/agent/l3/agent.py
neutron/db/l3_dvr_db.py
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: I579c2522f8aed2b4388afacba34d9ffdc26708e3
(cherry picked from commit 5018d70241)
(cherry picked from commit b127433f38)
Merge the system protocol assignments into the iptables name
to protocol mapping array, IPTABLES_PROTOCOL_NAME_MAP, so that
systems with updated or new values in /etc/protocols can
successfully install iptables rules.
This was done as an IptablesFirewallDriver() instance mapping,
since there is typically only a single instance per agent,
and it also allows us to more easily unit test it.
Conflicts:
neutron/tests/unit/agent/linux/test_iptables_firewall.py
Change-Id: Ib73def4e2a9e3644462fdee312768382fcb800a5
Closes-Bug: #1783378
(cherry picked from commit 034db863a0)
For an L3 DVR HA router, the centralized floating IP NAT rules are not
installed in every HA node's snat namespace. So, install the rules in
the router's snat namespace on every scheduled HA router host.
Conflicts:
neutron/tests/common/l3_test_common.py
neutron/tests/functional/agent/l3/test_dvr_router.py
Closes-Bug: #1793527
Change-Id: I08132510b3ed374a3f85146498f3624a103873d7
(cherry picked from commit ee7660f593)
(cherry picked from commit 2a1cdf01b5)
(cherry picked from commit b93ef2f7e8)
Since we know the IP and MAC addresses of both sides of the
fip/qrouter namespace veth pair device, just add permanent
ARP entries for them.
Change-Id: I6193b00681dfb79222eedfd00c89620321ac1b4f
Related-Bug: #1791989
(cherry picked from commit ac5815a110)
The issue scenario happens when we disassociate a floating IP
while the 'master' router host is restarted or powered off.
When the L3 agent is powered on again, the HA router state config
still remains 'master', but the HA port is down. And the message
queue still has one 'router_update' message (the floating IP
disassociate message), so the L3 agent will sync this router info
at least twice during the restart: once for the router_update and
once for the L3 agent full sync.
The first one will add the centralized FIP to the qg-device, because
the router state is 'master'. So for DVR HA routers, only add the
centralized floating IP to the qg-device in the snat namespace when
the HA port is up. For the restart procedure, if the HA port is up
but the router is set to 'backup', do not add the floating IP.
Closes-Bug: #1794305
Change-Id: Ib39fe7dcd437a867c69852885c461a594167f6a1
(cherry picked from commit 656a8f8729)
The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.
The problematic order of operations is as follows:
1) a network is scheduled to an agent; a network rpc is sent to the
agent
2) the agent queries the network data from the server
3) while that query is in progress a port on that network is deleted; a
port rpc is sent to the agent
4) that port delete rpc is received before the network query rpc
completes
5) the port delete results in no action because the port was not present
on the agent
6) the network query finishes and adds the port to the cache (even
though the port has already been deleted)
7) some time passes and a new port is configured with the same IP
address as the port that was deleted in (3)
8) the dhcp host file is corrupted with 2 entries for the same IP
address.
9) dhcp queries for the newest port are rejected because of the duplicate
entry in the dhcp host file.
The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
and the event processed serially with other network related events.
To keep newer agents backwards compatible with older servers, the
determination of which network_id value to use in the lock is handled
by a utility that falls back to the previous mode of operation whenever
the network_id attribute is not present in the *_delete_end RPC events.
That utility can be removed in the future when it is guaranteed
that the network_id attribute will be present in RPC messages from the
server.
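A rough sketch of that compatibility helper (name and payload keys are
illustrative):

    def _get_network_lock_id(self, payload):
        """Return the network id to lock on for a *_delete_end event."""
        if payload.get('network_id'):
            # Newer servers include network_id directly in the RPC payload.
            return payload['network_id']
        if payload.get('port_id'):
            # Older servers do not; fall back to the previous behaviour and
            # look the port up in the local cache to find its network.
            port = self.cache.get_port_by_id(payload['port_id'])
            return port.network_id if port else None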
Closes-Bug: #1732456
Change-Id: I735f8b1c9248b12e5feb6cbe970cf67f321e6ebc
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
(cherry picked from commit fa78b58010)
It turns out that in environments with a big number of VMs, the
neutron dhcp agent sometimes fails to read the dhcp lease file because
some lines with the ipv4/ipv6 entries don't have enough fields, which
causes the dhcp agent to fail.
When this happens the agent calls sync_state to
fully resync the agent state, which causes serious performance
problems in scale environments.
We need to be more robust when reading the file to handle these cases.
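Illustrative sketch of tolerant parsing of the leases file; the exact
field handling in the agent may differ:

    leases = set()
    for line in lease_file_contents.splitlines():
        fields = line.strip().split()
        if len(fields) != 5:
            # Skip malformed or truncated lines instead of raising and
            # forcing a full sync_state().
            continue
        expiry, mac_or_iaid, ip, hostname, client_id = fields
        leases.add((ip, mac_or_iaid, client_id))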
Co-authored-by: stephen-ma
Partial-Bug: #1788556
Change-Id: Ia681a5e929df5bf8c97ae9445876c306c34061b5
(cherry picked from commit 8a3ff8a19e)
Move the iptables metadata marking rule earlier in
router init, so that any stray metadata requests
that arrive before the filter metadata redirect rule is
installed will just be dropped. We do this regardless
of whether we will be running the metadata proxy.
Partial-bug: #1735724
Change-Id: I8982523dbb94a7c5b8a4db88a196fabc4dd2873f
(cherry picked from commit 6941977827)
When multiple ports are bound to a qos-policy with the same id, the
ovs-agent should check whether the cache already has the policy
information instead of fetching it over RPC every time it processes
a port.
Change-Id: I88f9f5af95439f1536799169390764c89109f467
Closes-Bug: #1783559
(cherry picked from commit 7a27e24447)
Without this flag, dnsmasq prefers to ask the servers that
are known to be up, rather than hitting servers that are either
down or known to be broken. This greatly reduces the impact of
broken upstream servers on responsiveness.
Closes-Bug: #1746000
Change-Id: Ieee4dafc578c3bda0935fcdb80faad6c342a10e9
(cherry picked from commit d3c69dc4f2)
Sometimes calls to dhcp_release(6) do not result in removal
of a lease from the leases file, for example, when the release
packet is not received by dnsmasq. Trying more than once is
recommended in this case.
Instead of blindly trying some number of times, we monitor
the lease file contents, and retry the dhcp_release(6) call
when an entry still remains. This is possible since
dhcp_release(6) is being run from the DHCP server itself.
We try three times and wait 0.3 seconds between tries.
We also now check for any stale leases in the leases file
that are unknown to neutron, also trying to remove them.
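A minimal sketch of the retry loop (constants and the leases-file check
helper are illustrative):

    import time

    DHCP_RELEASE_TRIES = 3
    DHCP_RELEASE_TRIES_SLEEP = 0.3

    def _release_lease_with_retries(self, mac, ip, client_id=None):
        for _ in range(DHCP_RELEASE_TRIES):
            self._release_lease(mac, ip, client_id)
            time.sleep(DHCP_RELEASE_TRIES_SLEEP)
            # dhcp_release(6) runs on the same host as dnsmasq, so we can
            # look at the leases file directly to see whether the entry is
            # gone.
            if ip not in self._read_leases_file_ips():  # illustrative helper
                return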
Change-Id: Ic1864f7efbc94db1369ac7f3e2879fda86f95a11
Closes-bug: #1764481
Closes-bug: #1783908
(cherry picked from commit fab032b426)
Patch [1] added configuration of a forward rule for trusted ports in
the iptables firewall driver.
This patch fixes an issue with many "duplicate iptables rule detected"
warning messages, caused by trying to add such a forward rule each
time a trusted port is updated.
Now the rule is added only once per port.
[1] https://review.openstack.org/#/c/525607/
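Illustrative sketch of adding the forward rules only once per trusted
port (the attribute and helper names are assumptions):

    def process_trusted_ports(self, port_ids):
        for port_id in port_ids:
            if port_id in self.trusted_ports:
                continue
            self._add_trusted_port_rules(port_id)  # illustrative helper
            self.trusted_ports.add(port_id)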
Change-Id: Ib816887f07f16b6ac865bb81d0f27f12d0b47dfb
Closes-Bug: #1754770
(cherry picked from commit 8be0c2a551)
The mock of ipv6_utils.is_enabled_and_bind_by_default() was missing
in the BridgeLibTest unit tests, and that caused some of the tests
from this module to fail when the tests were run on a host with IPv6
disabled.
Now it is mocked, the tests run properly, and they test what they
should test.
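Roughly the missing mock (the chosen return value here is just for
illustration):

    import mock

    from neutron.common import ipv6_utils

    mock.patch.object(ipv6_utils, 'is_enabled_and_bind_by_default',
                      return_value=True).start()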
Closes-Bug: #1773818
Change-Id: I9144450ce85e020c0e33c5214a2178acbbbf5f54
(cherry picked from commit 8930d33c71)
Sometimes we have seen the 'fg-' ports within the fip namespace
either go down, not get created in time, or get deleted due to
some race conditions.
When this happens, the code tries to recover itself after a couple
of exceptions when there is a router_update message.
But after recovery we could see that the fip namespace is
recreated and the 'fg-' port is plugged in and active, while the
'fpr-' and the 'rfp-' ports are missing, which leads to the
FloatingIP failure.
This patch fixes the issue by checking for the missing devices
in all router_updates.
Change-Id: I78c7ea9f3b6a1cf5b208286eb372da05dc1ba379
Closes-Bug: #1776984
(cherry picked from commit 5a7c12f245)