When we manually move a router from one dvr_snat node to
another, the snat namespace should be removed on the
originating node by its agent and re-created on the
destination node by the destination agent.
But when the agent dies, the router_update message only reaches
the agent after it restarts. At that point the agent should
remove the snat namespace, since the router is no longer hosted
by the current agent.
Even though the agent already has logic to clean up the snat
namespaces when the gw_port_host does not match the agent's
host, in this particular use case self.snat_namespace is always
set to None in the dvr_edge_router init call when the agent
restarts.
This patch fixes the issue by initializing the snat namespace
object during router init. With a valid snat namespace object
always present, the agent can clean up the namespace whenever
the gw_port_host does not match.
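A rough, self-contained sketch of that idea (class and method names are
simplified stand-ins, not the real DvrEdgeRouter/SnatNamespace API):

    class SnatNamespace(object):
        """Stand-in for the real snat namespace wrapper."""
        def __init__(self, router_id):
            self.name = 'snat-%s' % router_id

        def exists(self):
            return True  # pretend the namespace was left behind before the restart

        def delete(self):
            print('deleting %s' % self.name)


    class DvrEdgeRouter(object):
        def __init__(self, host, router_id):
            self.host = host
            # the fix: build the namespace object at init time instead of
            # leaving self.snat_namespace as None across an agent restart
            self.snat_namespace = SnatNamespace(router_id)

        def external_gateway_updated(self, gw_port_host):
            # with a namespace object always present, the host-mismatch
            # check can clean up after the router moved to another node
            if gw_port_host != self.host and self.snat_namespace.exists():
                self.snat_namespace.delete()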
Change-Id: I30524dc77b743429ef70941479c9b6cccb21c23c
Closes-Bug: #1557909
(cherry picked from commit 9dc70ed77e)
The DHCP rules in the fixed iptables firewall rules were too permissive:
they permitted any UDP traffic with a source port of 68 and a destination
port of 67. Care must be taken since these rules return before the IP
spoofing prevention rules. This patch splits the fixed DHCP rules in
two: one for the discovery and request messages, which take place before
the instance has bound an IP address, and a second to permit DHCP
renewals.
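Illustrative only (the exact match criteria and chain placement are
assumptions, written in the iptables argument style the agent uses):

    # Old rule: any UDP 68 -> 67 traffic returned before the anti-spoofing rules.
    OLD_DHCP_RULE = '-p udp -m udp --sport 68 --dport 67 -j RETURN'

    # Split rules (illustrative): broadcast DHCPDISCOVER/DHCPREQUEST sent
    # before the instance has an address, and renewals from the bound
    # address, which can sit after the IP spoofing prevention rules.
    DHCP_DISCOVERY_RULE = ('-s 0.0.0.0/32 -d 255.255.255.255/32 '
                           '-p udp -m udp --sport 68 --dport 67 -j RETURN')
    DHCP_RENEWAL_RULE = '-p udp -m udp --sport 68 --dport 67 -j RETURN'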
Conflicts:
neutron/tests/functional/agent/test_firewall.py
Change-Id: Ibc2b0fa80baf2ea8b01fa568cd1fe7a7e092e7a5
Partial-Bug: #1558658
(cherry picked from commit 6a93ee8ac1)
This fixes the iptables rules generated by the L3 agent
(SNAT, DNAT, set-mark and metadata) and the DHCP agent
(checksum-fill) to match the format returned by iptables-save,
preventing excessive extra replacement work by the iptables
manager.
It also fixes the iptables test that was not passing the
expected arguments (-p PROTO -m PROTO) for block rules.
A simple test was added to the L3 agent to ensure that the
rules have converged during the normal lifecycle tests.
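For example (the specific rule is made up), the agent has to emit rules in
the normalized form iptables-save reports back, otherwise the comparison
always sees a difference:

    # iptables-save inserts the explicit "-m tcp" match module even when
    # the rule was originally added without it.
    AGENT_RULE_OLD = '-p tcp --dport 80 -j ACCEPT'          # never matches the save output
    AGENT_RULE_NEW = '-p tcp -m tcp --dport 80 -j ACCEPT'   # matches, no replacement work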
Closes-Bug: #1566007
Change-Id: I5e8e27cdbf0d0448011881614671efe53bb1b6a1
(cherry picked from commit b8d520ffe2)
Right now we are seeing a race condition in the L3 agent
for DVR routers when a floating IP is deleted and created.
The agent tries to delete the floating IP namespace, and while
it is doing so another call arrives to add the namespace.
There is a timing window between these two calls where the call
to create the namespace sometimes succeeds but, when any command
is then executed in the namespace, it fails because the namespace
was deleted concurrently.
Since the fip namespace is associated with an external network,
and each node has only one fip namespace per external network,
we would like to delete the fip namespace only when the
external network is deleted.
The first step is to split the delete functionality in two.
The call to fip_ns.cleanup only removes the dependencies the
fip namespace has on the router namespace, such as the fpr and
rfp veth pairs.
The call to fip_ns.delete actually deletes the fip namespace
and the fg device.
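A rough sketch of the split (method bodies are placeholders; the names
mirror the commit message rather than the real FipNamespace API):

    class FipNamespace(object):
        def __init__(self, ext_net_id):
            self.name = 'fip-%s' % ext_net_id

        def cleanup(self, router_id):
            # per-router teardown only: unplug the fpr-/rfp- veth pair
            # linking the router namespace to this fip namespace
            print('removing rfp-%s / fpr-%s veth pair' % (router_id, router_id))

        def delete(self):
            # full teardown, done only when the external network is deleted:
            # remove the fg- device and the namespace itself
            print('removing fg device and namespace %s' % self.name)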
Partial-Bug: #1501873
(cherry picked from commit c874f6dada)
Change-Id: Ic94625d5a968f554af70c274b2b2c20ab64e2487
Currently, once the metadata proxy process is created for a network,
it is never removed unless the network is deleted. Even if the
user disables metadata for the network and restarts the DHCP agent,
the metadata proxy for the network is still there, wasting resources
on the Neutron host. This patch makes the DHCP agent delete useless
metadata proxies at startup.
Additional functional tests are added for the related scenarios.
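A simplified sketch of the startup pruning described above (the helper
names are hypothetical, not the agent's real API):

    def cleanup_stale_metadata_proxies(networks, is_proxy_running,
                                       metadata_needed, destroy_proxy):
        """Kill metadata proxies for networks that no longer need one."""
        for network in networks:
            if is_proxy_running(network.id) and not metadata_needed(network):
                destroy_proxy(network.id)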
Change-Id: Id867b211fe7c01a11ba73a5ebc275c595933becf
Closes-Bug: #1507950
(cherry picked from commit dc0c7b5588)
These warnings happen all of the time as networks are quickly
created/updated and then deleted. It's not an actionable
warning condition, so this patch downgrades them to debug.
Conflicts:
neutron/agent/dhcp/agent.py
neutron/tests/unit/agent/dhcp/test_agent.py
Change-Id: Idcfb185b9a0540c13101dceb3681132f38f1716c
Closes-Bug: #1555842
(cherry picked from commit 72e9fc9001)
During a burst of port deletions, the OVS agent builds up
many remote security group member updates for a single device.
Once the call to delete all of the removed remote IP conntrack
state is issued, there are many duplicated entries for the same
device in the devices_with_updated_sg_members dictionary
of lists.
This results in many duplicated calls to remove conntrack
entries that are just a waste of time. The longer it takes
to remove conntrack entries, the more of these duplicates
build up for other pending changes, to the point where there
can be hundreds of duplicate calls for a single device.
This just adjusts the conntrack manager clearing logic to
make sure it de-duplicates all of its delete commands before
it issues them.
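A self-contained illustration of the de-duplication (the data layout and
the command string are simplified):

    from collections import defaultdict

    # many port deletions queue the same entry repeatedly for one device
    devices_with_updated_sg_members = defaultdict(list)
    devices_with_updated_sg_members['tap1234'] = [
        {'ip': '10.0.0.5', 'zone': 1},
        {'ip': '10.0.0.5', 'zone': 1},  # duplicate from a later update
        {'ip': '10.0.0.6', 'zone': 1},
    ]

    for device, entries in devices_with_updated_sg_members.items():
        seen = set()
        for entry in entries:
            key = tuple(sorted(entry.items()))
            if key in seen:
                continue  # skip the duplicated delete entirely
            seen.add(key)
            print('conntrack -D -d %(ip)s -w %(zone)s' % entry)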
In a local test on a single host I had 11 threads create
11 ports each, plug them into OVS, and then delete them.
Here are the numbers of conntrack delete calls issued:
Before this patch - ~232000
With this patch - ~5200
While the remaining number still seems high, the agent is now
fast enough to keep up with all of the deletes.
Closes-Bug: #1513765
Depends-On: I4041478ca09bd124827782774b8520908ef07be0
Change-Id: Icba88ab47ee17bf5d6ccdfc0f78bec911987ca90
(cherry picked from commit d7aeb8dd4b)
Explicitly disable IPv6 on Neutron-created interfaces in the default
namespace before setting the link up. Since the default behavior of IPv6
is to bind to all interfaces, as opposed to IPv4 where an address must be
explicitly configured, we disable IPv6 on each interface before enabling
it. This avoids leaving a time window between when the
interface is enabled and when it is attached to the bridge device during
which the host could be accessed from a tenant network.
Move disable_ipv6() from BridgeDevice to the base IPDevice class so it is
usable by all interfaces. Then we explicitly disable IPv6 on veth
interfaces in the default namespace and on the VXLAN and VLAN interfaces
created by the LinuxBridge agent.
In addition, the vlan interface creation is moved from LinuxBridgeManager
to IPWrapper so it can return an IPDevice object.
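A minimal sketch of the ordering, outside of Neutron's own helpers (the
device name is made up):

    import subprocess

    def bring_up_without_ipv6(device):
        # disable IPv6 first so the interface never answers on a link-local
        # address before it is attached to its bridge
        subprocess.check_call(
            ['sysctl', '-w', 'net.ipv6.conf.%s.disable_ipv6=1' % device])
        subprocess.check_call(['ip', 'link', 'set', device, 'up'])

    # bring_up_without_ipv6('vxlan-101')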
Conflicts:
neutron/agent/linux/bridge_lib.py
neutron/tests/unit/plugins/ml2/drivers/linuxbridge/agent/test_linuxbridge_neutron_agent.py
Closes-Bug: #1534652
Change-Id: Id879075f2d5ee42f8ff153e813e7519a4424447b
(cherry picked from commit fc8ebae035)
This fixes an issue where the lb agent did not plug the
DHCP tap device into the bridge when VLAN networking is
set up, caused by the setting of the disable_ipv6 value.
Conflicts:
neutron/agent/linux/bridge_lib.py
neutron/tests/functional/agent/linux/test_bridge_lib.py
neutron/tests/unit/agent/linux/test_bridge_lib.py
Closes-Bug: #1520618
Change-Id: I0d21fad3a676d1fdd30501ea6a295f1e9b207a3a
Co-Authored-By: Brian Haley <brian.haley@hpe.com>
(cherry picked from commit cac2436f29)
Currently the 'force_gateway_on_subnet' configuration option is set to
True by default and enforces that the gateway is on the subnet. With this
fix 'force_gateway_on_subnet' can be changed to False, and a
gateway outside the subnet can be added.
Before adding the default route, a route to the gateway IP is
added. This applies to both external and internal networks.
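The resulting route programming looks roughly like this (the device and
addresses are made up):

    import subprocess

    def set_offlink_gateway(device, gateway_ip):
        # a host route to the gateway itself, so it is reachable even
        # though it is not on the interface's subnet...
        subprocess.check_call(
            ['ip', 'route', 'replace', gateway_ip, 'dev', device])
        # ...and only then the default route via that gateway
        subprocess.check_call(
            ['ip', 'route', 'replace', 'default', 'via', gateway_ip,
             'dev', device])

    # set_offlink_gateway('qg-1234abcd-56', '203.0.113.1')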
Change-Id: I3a942cf98d263681802729cf09527f06c80fab2b
Closes-Bug: #1335023
Closes-Bug: #1398768
(cherry picked from commit b6126bc0f1)
When the DHCP agent fails to create a namespace for the DHCP
service, we now release the DHCP port instead of failing silently.
This at least gives the user an indication that there is no DHCP
service, since no DHCP port will exist.
Change-Id: I59af745d3991e6deb424ecd9b916b03f146c246a
Closes-bug: #1544548
(cherry picked from commit 80cfec6625)
We don't want to create a bridge device with an IPv6 address because
it will see the Router Advertisement from Neutron.
Conflicts:
neutron/agent/linux/bridge_lib.py
Change-Id: If59a823804d3477c5d8877f46fcc4c018af57a5a
Closes-bug: 1302080
(cherry picked from commit 404eaead79)
Today static routes are added to the SNAT namespace
for DVR routers, but they are not added to the qrouter
namespace.
Also, while configuring static routes in the SNAT
namespace, the router is not checked for the existence
of a gateway.
When routes are added to a router without a gateway, the
routes are only configured in the router namespace; when
a gateway is set later, those routes have to be
populated in the snat namespace as well.
This patch addresses the above issues.
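A loose sketch of the intended behaviour (namespace naming and route
handling are simplified):

    def update_routes(router, routes):
        for route in routes:
            add_route('qrouter-%s' % router['id'], route)
            if router.get('gw_port'):
                # only touch the snat namespace when a gateway exists
                add_route('snat-%s' % router['id'], route)

    def gateway_added(router):
        # a gateway set later replays the routes into the snat namespace
        for route in router.get('routes', []):
            add_route('snat-%s' % router['id'], route)

    def add_route(namespace, route):
        print('ip netns exec %s ip route replace %s via %s'
              % (namespace, route['destination'], route['nexthop']))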
Closes-Bug: #1499785
Closes-Bug: #1499787
Conflicts:
neutron/agent/l3/dvr_edge_router.py
neutron/tests/functional/agent/test_l3_agent.py
neutron/tests/functional/agent/l3/framework.py
neutron/tests/functional/agent/l3/test_dvr_router.py
Change-Id: I37e0d0d723fcc727faa09028045b776957c75a82
(cherry picked from commit 158f9eabe2)
The DHCP agent reports an error when the 'dnsmasq_base_log_dir'
configuration option is enabled.
For example, if this option is set to '/tmp', the code checks
whether the '/tmp' directory exists. This is wrong; the right way
is to check for the existence of '/tmp/[network_id]'.
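The corrected check amounts to something like this (the helper is
illustrative; only the option name comes from the commit message):

    import os

    def ensure_network_log_dir(dnsmasq_base_log_dir, network_id):
        log_dir = os.path.join(dnsmasq_base_log_dir, network_id)
        if not os.path.isdir(log_dir):
            # check (and create) the per-network directory, not the base dir
            os.makedirs(log_dir, 0o755)
        return log_dir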
Change-Id: I0e060ca7c84f38bb0ccd55ac16da5446a3d015c5
Closes-Bug: #1538386
(cherry picked from commit 8ae32a6681)
Because _watch_process and failing_process are asynchronous,
there is a chance that failing_process exits before _watch_process
is executed.
If _watch_process is blocked, the method that is asserted on
will not be called. This fails the UT, but only intermittently.
Change-Id: Ic951c1b91c5a10462f548544a5e8d482c52ad665
Closes-Bug: #1519160
Related-Bug: #1543040
Related-Bug: #1506021
(cherry picked from commit dcd0498c17)
The commit 3686d035de caused a
regression for setups that support a number of DNS servers that
are injected via DHCP options 5 and 6.
If dnsmasq has a configured dns-server, it ignores the ones
injected by the admin, which is exactly what the commit
above did. This causes a number of problems, the main one being
that it requires the DHCP agent to have connectivity to the DNS
server.
The original code was added in commit
2afff147c4
Change-Id: Iae3e994533102a2b076cc2dc205cdd5caaee1206
Closes-bug: #1540960
(cherry picked from commit 13a4268062)
The procedure to update security group rules and members in the
firewall driver is called after the update_port_filter call.
Because of this, new rules and member updates are not applied
to the port.
With this change the call to update rules and members
is moved before the port update call, giving the
firewall drivers a chance to update their rule and member caches.
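Order-of-operations sketch (the function and objects are stand-ins for the
agent's security group RPC handling):

    def refresh_firewall_for_update(firewall, sg_id, member_ips, ports):
        # update the driver's rule/member caches first...
        firewall.update_security_group_members(sg_id, member_ips)
        # ...so the per-port filters rebuilt afterwards see the new data
        for port in ports:
            firewall.update_port_filter(port)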
Closes-Bug: #1511782
Change-Id: I457e17c34b86f861f6e15de7c3adcb3f2b79d14e
(cherry picked from commit a8e9cc848b)
Currently, when calling AsyncProcess.stop(), the code stops the stdout
and stderr readers and kills the process. There is an edge case (as
described in the bug report) in which, after the readers have been
stopped, the sub-process generates a substantial amount of output on
either fd. Since the sub-process is launched with
subprocess.PIPE as stdout/stderr, and since Linux pipes can fill up
to the point where writing new data to them blocks, this may cause a
deadlock if the sub-process has a signal handler for the signal (for
example, the process handles SIGTERM to produce a graceful exit of
the program).
Therefore, this patch proposes to kill the readers only AFTER
wait() has returned and the process has truly died. Also, relying on
_kill_event had to cease, since invoking its send() method caused a
logical loop back to _kill, causing eventlet errors.
A different possible solution is closing the stdout/stderr pipes. Alas,
this may raise an exception in the sub-process ("what? No stdout?!
Crash!") and defeats the 'graceful' part of the process.
Closes-Bug: #1506021
Change-Id: I506c41c634a8d656d81a8ad7963412b834bdfa5b
(cherry picked from commit ddaee9f060)
In big and busy clusters there can be a condition where the
RabbitMQ clustering mechanism synchronizes queues; during
this period agents connected to that RabbitMQ instance
can't communicate with the server, so the server considers them
dead and moves resources away. After the agent becomes active
again, it needs to clean up its state and synchronize it
with neutron-server.
The solution is to make agents aware of their state from the
neutron-server point of view. This is done by changing state
reports from a cast to a call that returns the agent's status.
When an agent that was considered dead becomes alive again, it
receives a special AGENT_REVIVED status indicating that it should
refresh its local data, which it would not do otherwise.
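A condensed sketch of the report-state change (constant and method names
are illustrative, not Neutron's exact RPC API):

    AGENT_REVIVED = 'revived'

    class Agent(object):
        def __init__(self, state_rpc, agent_state):
            self.state_rpc = state_rpc
            self.agent_state = agent_state
            self.fullsync = False

        def _report_state(self):
            # a call() that returns the server-side status, instead of a
            # fire-and-forget cast()
            status = self.state_rpc.report_state(self.agent_state)
            if status == AGENT_REVIVED:
                # the server had declared us dead: refresh all local data
                self.fullsync = True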
Conflicts:
neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py
neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
neutron/tests/unit/agent/dhcp/test_agent.py
neutron/tests/unit/plugins/ml2/drivers/linuxbridge/agent/test_linuxbridge_neutron_agent.py
neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py
Closes-Bug: #1505166
Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
(cherry picked from commit 3b6bd917e4)
We currently use garp_master_repeat and garp_master_refresh
to solve bug 1453855. We need to spawn keepalived only after
all of the qr/qg ports have been wired so that the
initial GARP is sent properly; otherwise you get a routing
black hole. In lieu of a proper sync method, we used those two keepalived
options to send GARPs repeatedly, but:
a) We did not realize that this never stops spamming the network.
b) It causes VMs to lose their IPv6 default gateway due to a keepalived
bug, which has since been fixed, but the fix would need to be backported
to every keepalived version on every distro. Here's the patch:
https://github.com/acassen/keepalived/pull/200
The solution this patch proposes is to drop the repeat and refresh
keepalived options. This fixes the IPv6 bug but re-introduces bug
1453855, so this patch uses the delay option instead. It turns
out keepalived sends a GARP when it transitions to MASTER, waits
a number of seconds determined by the delay option, and then
sends a GARP again. We'll use an aggressive 'delay' setting to make
sure that when the node boots and the L3/L2 agents start, the
L2 agent has enough time to wire the ports, as a stopgap solution.
Note that this only affects initial synchronization time, not failover
times. Failover times will continue to be fast because the ports
are wired ahead of time, so the initial GARP after the state transition
to MASTER is sent properly.
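The generated keepalived configuration then carries only a delay, roughly
like this (the 60 second value and instance name are assumptions):

    def build_garp_config(delay=60):
        # garp_master_repeat/garp_master_refresh are gone; a single
        # garp_master_delay makes keepalived send one more GARP <delay>
        # seconds after the MASTER transition
        return ('vrrp_instance VR_1 {\n'
                '    garp_master_delay %d\n'
                '}\n' % delay)

    print(build_garp_config())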
Conflicts:
neutron/tests/functional/agent/test_l3_agent.py
Change-Id: I7a086472b8742828dae08ffd915c45e94fb4b94e
Closes-Bug: #1520517
Related-Bug: #1453855
(cherry picked from commit 303cbc6b5b)
There seems to be a timing issue between the
ARP entries arriving from the server to the agent
and the internal qr- device being created by the
agent, so those unsuccessful ARP entries are dropped.
This patch makes sure that the early ARP entries
are cached in the agent and then applied once
the internal device is up.
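A self-contained sketch of the caching (the names and the neigh command
are illustrative):

    _pending_arp = {}  # device name -> list of (ip, mac) waiting for the device

    def add_arp_entry(device, ip, mac, device_exists):
        if not device_exists:
            # park the early entry instead of dropping it
            _pending_arp.setdefault(device, []).append((ip, mac))
            return
        _set_neigh(device, ip, mac)

    def internal_device_added(device):
        # replay whatever arrived before the qr- device was created
        for ip, mac in _pending_arp.pop(device, []):
            _set_neigh(device, ip, mac)

    def _set_neigh(device, ip, mac):
        print('ip neigh replace %s lladdr %s dev %s' % (ip, mac, device))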
Closes-Bug: #1501086
Change-Id: I9ec5412f14808de73e8dd86e3d51593946d312a0
(cherry picked from commit d9fb3a66b4)
The DHCP agent may be used by plugins that don't set an mtu value for
networks. Handle the case by not passing the DHCP option when the
network does not have the value set.
Most plugins do set the value, though, since it's enforced in the base
db plugin class.
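The conditional boils down to something like this (option 26 is the
standard DHCP interface-MTU option; the helper itself is illustrative):

    def mtu_opts(network):
        mtu = network.get('mtu')
        # only emit the option when the plugin actually provided an mtu
        return ['26,%d' % mtu] if mtu else []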
Closes-Bug: #1534197
Change-Id: I282b3d6b81f91eb8cea901d955cbcca6ecb2a95d
(cherry picked from commit 36effd6600)
If the L3 agent fails to configure a router, commit
4957b5b435 changed it so
that instead of performing an expensive full sync, only that
router is reconfigured. However, it tries to reconfigure the
cached router, which is a change of behavior from the full-sync
days. The retry is more likely to succeed if the
router is retrieved from the server instead of using
the locally cached version, in case the user or operator
fixed bad input, or in case the router was retrieved in a bad
state due to a server-side race condition.
Note that this is only relevant to full syncs, as those retrieve
routers from the server and queue updates with the router object.
Incremental updates queue up updates without router objects,
so if one of those fails it would always be resynced on a
second attempt.
Related-Bug: #1494682
Change-Id: Id0565e11b3023a639589f2734488029f194e2f9d
(cherry picked from commit 822ad5f06b)
While processing a router update in the _process_router_update method,
if an exception occurs, we try to do a full sync.
We only need to re-sync the router whose update failed.
This also addresses a TODO in the same method, which falls along similar lines.
Change-Id: I7c43a508adf46d8524f1cc48b83f1e1c276a2de0
Closes-Bug: #1494682
(cherry picked from commit 4957b5b435)
The ip netns list command adds additional id data in more recent
versions of iproute2, in the format:
qdhcp-35fc068a-750d-4add-b1d2-af392dbd8790 (id: 1)
Update the parsing to deal with both the old and new formats.
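The parsing change can be as small as taking the first whitespace-separated
token (a minimal sketch, not the exact helper):

    def parse_netns_output(output):
        # works for both "qdhcp-<uuid>" and "qdhcp-<uuid> (id: 1)" lines
        return [line.split()[0] for line in output.splitlines() if line.strip()]

    print(parse_netns_output(
        'qdhcp-35fc068a-750d-4add-b1d2-af392dbd8790 (id: 1)\n'
        'qrouter-4add750d-b1d2-4add-b1d2-af392dbd8790\n'))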
Change-Id: I0d3fc4262284172f5ad31e4f2f78ae1fb33b4228
Closes-Bug: 1497309
(cherry picked from commit 3aefdf4de7)
This patch changes our iptables logic to generate a delta of
iptables commands (inserts + deletes) to get from the current
iptables state to the new state. This will significantly reduce
the amount of data that we have to shell out to iptables-restore
on every call (and reduce the amount of data iptables-restore has
to parse).
We no longer have to worry about preserving counters since
we are adding and deleting specific rules, so the rule modification
code got a nice cleanup to get rid of the old rule matching.
This also gives us a new method of functionally testing that we are
generating rules in the correct manner. After applying new rules
once, a subsequent call should always have no work to do. The new
functional tests added leverage that property heavily and should
protect us from regressions in how rules are formed.
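A toy illustration of the delta idea (the real code also preserves rule
ordering and chain positions, which this ignores):

    def compute_delta(current_rules, new_rules):
        current, new = set(current_rules), set(new_rules)
        deletes = ['-D %s' % rule for rule in current - new]
        inserts = ['-A %s' % rule for rule in new - current]
        # only the difference is shelled out to iptables-restore
        return deletes + inserts

    print(compute_delta(
        ['INPUT -p udp -m udp --dport 67 -j ACCEPT'],
        ['INPUT -p udp -m udp --dport 67 -j ACCEPT',
         'INPUT -p tcp -m tcp --dport 22 -j DROP']))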
Performance metrics relative to HEAD~1:
+-------------------------------------+------------+--------+
| Scenario                            | This patch | HEAD~1 |
+-------------------------------------+------------+--------+
| 200 VMs*22 rules existing - startup |            |        |
|                        _modify_rules|      0.67s |  1.05s |
|                  _apply_synchronized|      1.87s |  2.89s |
+-------------------------------------+------------+--------+
| 200 VMs*22 rules existing - add VM  |            |        |
|                        _modify_rules|      0.68s |  1.05s |
|                  _apply_synchronized|      2.07s |  2.92s |
+-------------------------------------+------------+--------+
| 200 VMs*422 rules existing - startup|            |        |
|                        _modify_rules|      5.43s |  8.17s |
|                  _apply_synchronized|     12.77s | 28.00s |
+-------------------------------------+------------+--------+
| 200 VMs*422 rules existing - add VM |            |        |
|                        _modify_rules|      6.41s |  8.33s |
|                  _apply_synchronized|     33.09s | 33.80s |
+-------------------------------------+------------+--------+
The _apply_synchronized times seem to converge when dealing
with ~85k rules. In the profile I can see that both approaches
seem to wait on iptables-restore for approximately the same
amount of time so it could be hitting the performance limits
of iptables-restore.
DocImpact
Partial-Bug: #1502297
Change-Id: Ia6470c85b6b71979006ffe5da9095fdcce3122c1
(cherry picked from commit f066e46bb7)