Bug #1244589 re-appeared for IPv6.
This change adds an ip6tables rule to fix the checksum of DHCPv6
response packets. Those checksums were left unfilled by virtio (as a
hypervisor-internal optimization), but some picky DHCP clients (AFAIU
particularly ISC dhclient) verify the checksums and fail to acquire
an address if the checksums are left incorrect.
(cherry picked from commit 26eb2509fe)
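The fix amounts to a checksum-fill rule in the mangle table. A minimal sketch of the rule arguments, assuming the DHCPv6 client port (546) and the POSTROUTING chain; the actual chain and match used by the patch may differ:

```python
def dhcpv6_checksum_rule_args():
    """Build ip6tables arguments for a mangle-table rule that recomputes
    the UDP checksum of outgoing DHCPv6 replies.

    virtio may leave the checksum unfilled as an optimization; strict
    clients such as ISC dhclient verify it and drop the reply.  The
    DHCPv6 client port (546) and POSTROUTING chain are assumptions here,
    not taken from the patch itself.
    """
    return ['-A', 'POSTROUTING', '-p', 'udp', '-m', 'udp',
            '--dport', '546', '-j', 'CHECKSUM', '--checksum-fill']
```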
If the DHCP agent's port cache is out of sync with the neutron server,
dnsmasq entries are wrong and VMs may not acquire an IP because of
duplicate entries. When the DHCP agent executes the port_create_end
method, the port's IP should be checked before being used; if there are
duplicate IP addresses in the same network in the cache, we should resync.
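The duplicate check can be sketched as follows; the dict-shaped ports and the helper name are illustrative, not the agent's real objects:

```python
def needs_resync(cached_ports, new_port):
    """Return True when new_port carries an IP already present on another
    port of the same network in the agent's cache, a sign the cache is
    out of sync with the server and dnsmasq entries would be wrong."""
    cached_ips = {
        ip['ip_address']
        for port in cached_ports
        if port['id'] != new_port['id']
        for ip in port['fixed_ips']
    }
    return any(ip['ip_address'] in cached_ips
               for ip in new_port['fixed_ips'])
```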
The unit test modifies the global variables fake_port1 and fake_port2,
creating flakiness in unit tests that use those variables when executed
in environments with high concurrency.
Creating a deepcopy of the variables prevents those changes from
propagating to other unit tests.
Signed-off-by: aojeagarcia <email@example.com>
(cherry picked from commit e83e5618b7)
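A minimal sketch of the fixture pattern, with a hypothetical fake_port1 and helper name:

```python
import copy

# Shared module-level fixture (illustrative shape, not the real one).
fake_port1 = {'id': 'port-1', 'fixed_ips': [{'ip_address': '10.0.0.2'}]}

def make_test_port(**overrides):
    """Hand each test its own deep copy of the shared fixture so in-place
    mutation cannot leak into concurrently running tests."""
    port = copy.deepcopy(fake_port1)
    port.update(overrides)
    return port
```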
When a subnet's enable_dhcp attribute is updated, we must restart the
dhcp device. So, when deciding between 'restart' and
'reload_allocations' in the refresh_dhcp_helper function, we only
compare the CIDRs of subnets that have dhcp enabled.
The previous logic only called 'restart' when deleting or adding a
subnet. This could leave the dhcp port not updated when a subnet's
enable_dhcp was updated to True.
(cherry picked from commit 9aa7af8221)
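The comparison can be sketched like this; the helper name and dict-shaped subnets are illustrative:

```python
def dhcp_refresh_action(old_subnets, new_subnets):
    """Decide between a full dnsmasq 'restart' and the cheaper
    'reload_allocations'.  Comparing only the CIDRs of DHCP-enabled
    subnets means toggling enable_dhcp also changes the set and so
    triggers a restart."""
    old = {s['cidr'] for s in old_subnets if s['enable_dhcp']}
    new = {s['cidr'] for s in new_subnets if s['enable_dhcp']}
    return 'restart' if old != new else 'reload_allocations'
```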
The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.
The problematic order of operations is as follows:
1) a network is scheduled to an agent; a network rpc is sent to the
agent
2) the agent queries the network data from the server
3) while that query is in progress a port on that network is deleted; a
port rpc is sent to the agent
4) that port delete rpc is received before the network query rpc
5) the port delete results in no action because the port was not present
on the agent
6) the network query finishes and adds the port to the cache (even
though the port has already been deleted)
7) some time passes and a new port is configured with the same IP
address as the port that was deleted in (3)
8) the dhcp host file is corrupted with 2 entries for the same IP
9) dhcp queries for the newest port are rejected because of the
duplicate entry in the dhcp host file.
The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
so that it is processed serially with other network related events.
To ensure backwards compatibility with newer agents running against
older servers, the determination of which network_id value to use in the
lock is handled by a utility that falls back to the previous mode of
operation whenever the network_id attribute is not present in the
*_delete_end RPC events. That utility can be removed in the future when
it is guaranteed that the network_id attribute will be present in RPC
messages from the server.
Signed-off-by: Allain Legacy <firstname.lastname@example.org>
(cherry picked from commit fa78b58010)
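A sketch of the fallback utility, assuming a plain dict stands in for the agent's port cache:

```python
def network_id_for_delete_event(payload, port_cache):
    """Pick the network id used for _net_lock() serialization.

    Newer servers include 'network_id' in *_delete_end payloads; against
    older servers we fall back to looking the port up in the local
    cache, which was the previous mode of operation.  port_cache maps
    port id to network id here purely for illustration.
    """
    network_id = payload.get('network_id')
    if network_id is None:
        network_id = port_cache.get(payload.get('port_id'))
    return network_id
```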
When a network becomes isolated and isolated_metadata_enabled=True, the DHCP
agent won't spawn the required metadata proxy instance unless the agent gets
restarted. Similarly, it won't stop the proxy when the network is no longer
isolated.
This patch fixes it by updating the isolated metadata proxy on port_update_end
and port_delete_end methods which are invoked every time a router interface
port is added, updated or deleted.
Signed-off-by: Daniel Alvarez <email@example.com>
(cherry picked from 9362d4f1f2)
(cherry picked from commit b07aa19deb)
As reported in the bug, there may be a case where an empty
namespace file exists in /run/netns but the namespace does not
actually exist. In such a case the DHCP agent throws an error
when plugging the interface into the dhcp namespace.
This may also result in many tap interfaces
getting generated in the OVS bridge or Linux bridge.
This patch fixes the above bug by unplugging the tap device
from the bridge if an exception occurs, which prevents stale tap
devices from accumulating.
Co-Authored-By: Brian Haley <firstname.lastname@example.org>
(cherry picked from commit 38d058c2cf)
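The cleanup pattern can be sketched as follows; the driver object and its method names are illustrative stand-ins for the agent's interface driver:

```python
def plug_dhcp_interface(driver, network, port, namespace):
    """Plug the DHCP tap device, unplugging it again if any later setup
    step fails, so a half-configured tap is not left behind in the
    OVS/Linux bridge."""
    interface_name = driver.get_device_name(port)
    driver.plug(network, port, interface_name, namespace)
    try:
        driver.init_l3(interface_name, port, namespace)
    except Exception:
        # Roll back the plug before re-raising so the bridge stays clean.
        driver.unplug(interface_name, namespace)
        raise
    return interface_name
```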
Provisioning blocks merged in Newton so for Pike we can
safely assume we are not running with Liberty agents that
don't notify the server when the port is ready.
This also drops a block of logic in the agent that was providing
forward compatibility with servers that didn't support the
'dhcp_ready_on_ports' endpoint, since servers have been supporting
it for so long and we don't normally allow agents to be upgraded
before the servers.
The DHCP namespace used to always have its IPv6 default
route configured from a received Router Advertisement (RA).
A recent change disabled receipt of RAs, instead
relying on the network topology to configure the namespace.
Unfortunately the code only added an IPv4 default route,
which caused a regression with DNS resolution in some
circumstances where IPv6 was being used.
A default route is now added for both IP versions.
Without this commit, the run_as_root parameter is always True when
stopping a process, which leads to the usage of unnecessary sudo such as
in some functional tests, like the keepalived ones.
This commit fixes the aforementioned problem by taking run_as_root into
account when stopping a process. However, run_as_root will still always
be True if the process is spawned in a netns.
Signed-off-by: Hunt Xu <email@example.com>
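A sketch of the command construction under those rules; the helper name and the use of plain sudo (rather than neutron's root helper) are illustrative:

```python
def stop_command(pid, run_as_root=False, namespace=None):
    """Build the command used to stop a monitored process.  Root
    privileges are only requested when the caller asked for them or when
    the process lives in a network namespace, in which case they are
    unavoidable."""
    cmd = ['kill', '-15', str(pid)]
    if namespace is not None:
        cmd = ['ip', 'netns', 'exec', namespace] + cmd
        run_as_root = True
    if run_as_root:
        cmd = ['sudo'] + cmd
    return cmd
```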
Refactoring Neutron configuration options for agent common config to be
in neutron/conf/agent/common. This will allow centralization of all
configuration options and provide an easy way to import.
Due to the high memory footprint of the current Python ns-metadata-proxy,
it has to be replaced with a lighter process to avoid OOM conditions.
This patch spawns haproxy through a process monitor using a pidfile.
This allows tracking the process and respawn it if necessary as it was
done before. Also, it implements an upgrade path which consists of
detecting any running Python instance of ns-metadata-proxy and
replacing them by haproxy. Therefore, upgrades will take place by
simply restarting neutron-l3-agent and neutron-dhcp-agent.
According to /proc/<pid>/smaps, the memory footprint goes down
significantly from the ~50MB used previously.
Also, haproxy is added to bindep in order to ensure that it's installed.
When force_metadata=True and enable_isolated_metadata=False,
the namespace metadata proxy process might not be terminated
when the network is deleted because the subnets and ports
will have already been deleted, so we could incorrectly
determine it was started. Calling destroy_monitored_metadata_proxy() is
a noop when there is no process running.
Looking at the cache before acquiring a lock may cause the
agent to mistakenly think the network doesn't exist when it
is actually being wired in parallel.
Always acquiring the network-based semaphore will ensure that
the network isn't currently being setup in another coroutine.
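The locking pattern can be sketched with a per-network lock map; threading stands in for the agent's green-thread synchronization, and all names are illustrative:

```python
import threading
from collections import defaultdict

# One lock per network id; defaultdict creates locks lazily.
_net_locks = defaultdict(threading.Lock)

def configure_network(network_id, cache, wire_fn):
    """Always take the per-network lock before consulting the cache, so
    a network being wired up in parallel is not mistaken for a missing
    one.  cache is a dict and wire_fn builds the cache entry."""
    with _net_locks[network_id]:
        network = cache.get(network_id)
        if network is None:
            network = wire_fn(network_id)
            cache[network_id] = network
        return network
```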
During the DhcpAgent startup procedure, all of the following network
initialization is actually performed twice:
* killing old dnsmasq processes
* setting up and configuring all TAP interfaces
* building all dnsmasq config files (lease and host files)
* launching dnsmasq processes
The second iteration just cleans up and redoes exactly the same work
another time! This is really inefficient and dramatically increases
DHCP startup time (to nearly twice what is needed).
During the initialization process the 'sync_state' method is called twice:
* one time during init_host()
* another time during _report_state()
sync_state() call must stay in init_host() due to bug #1420042.
sync_state() is always called during startup in init_host()
and will be periodically called by periodic_resync()
to do reconciliation.
Hence it can safely be removed from the run() method.
When starting the dhcp-agent after an upgrade, there could
be stale IPv6 addresses in the namespace that had been
configured via SLAAC. These need to be removed, and the
same address added back statically, in order for the
agent to start up correctly.
To avoid the race condition where an IPv6 RA could arrive
while we are making this change, we must move the call
to disable RAs in the namespace from plug(), since devices
may already exist that are receiving packets.
Uncovered by the grenade tests.
When enabling metadata, we iterate through the subnets
on a network multiple times. Do it only once at the
beginning and return early if there are no candidates.
Follow-on to comments in an earlier review.
Had to fix a few tests that were creating "fake" subnets
without an ip_version attribute or passing a network
mock instead of a fake one.
All cache operations and dnsmasq process operations
are scoped to a network ID so we can always safely
perform concurrent actions on different network IDs.
This patch adjusts the DHCP agent to lock based on
network ID rather than having a global lock for every operation.
sync_state calls are still protected with a reader/writer
lock to ensure that when sync_state needs to run, all
other operations are blocked.
Currently the DHCP agent relies on the acceptance of an
RA to configure its IPv6 address with SLAAC or DHCPv6-stateless
network modes. It should explicitly assign addresses to the
agent based on the data model instead.
In order to do this we must disable RAs in the namespace so
that a static assignment doesn't conflict with a previously
created dynamically-generated address.
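The underlying knob is the per-interface accept_ra sysctl; this sketch builds the raw command, whereas neutron goes through its own IP/sysctl wrappers:

```python
def disable_ra_cmd(device, namespace):
    """Build the sysctl invocation that stops a namespace interface from
    accepting Router Advertisements, so a statically assigned address
    cannot conflict with a previously autoconfigured (SLAAC) one.  The
    device and namespace names are illustrative."""
    return ['ip', 'netns', 'exec', namespace, 'sysctl', '-w',
            'net.ipv6.conf.%s.accept_ra=0' % device]
```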
'refresh_dhcp_helper', which is called after subnet update/create
notifications in the DHCP agent, can end up retrieving ports that
the agent hadn't yet seen. It will then configure those ports but
not notify the server that they are ready.
Unless the port is subsequently updated on the server afterwards to
generate a new port update notification, the DHCP agent won't ever tell
the server that the port has had DHCP provisioned. This led to the
bug this closes. Another patch that removed excessive DHCP ready
notifications uncovered this bug.
This patch just adjusts refresh_dhcp_helper to ensure that all ports
are marked as ready after configuring them all.
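The fix can be sketched as follows; the callables are hypothetical stand-ins for the real per-port configuration step and the dhcp_ready_on_ports RPC:

```python
def refresh_and_mark_ready(network, configure_fn, notify_ready_fn):
    """After (re)configuring every port fetched for a network, tell the
    server that all of them are DHCP-ready, including ports the agent
    had never seen before the refresh."""
    for port in network['ports']:
        configure_fn(port)
    # Notify once for the whole batch rather than per port.
    notify_ready_fn([p['id'] for p in network['ports']])
```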
The DHCP agent was previously resending every single port to
the server whenever sync_state was called, even if it was just
for one network.
This led to sending way too much unnecessary data to the server
and also potentially resulted in sending a port to the server
that wasn't actually provisioned yet.
This patch corrects the behavior by only sending ports for networks
that are being synced if it's a conditional sync.
With the current code, if the first subnet of the network is an IPv6
subnet, the metadata proxy will not be spawned. If the user then adds an
IPv4 subnet with dhcp enabled, the metadata proxy will still not be
spawned. As a result, the metadata service will not be available for the
network.
This patch kills/spawns the metadata proxy when a subnet is added or
deleted. So even if the first subnet of the network is not an IPv4
subnet with dhcp enabled, the metadata proxy can still be spawned if the
network has subnets that need it.
There is a race condition server-side where, if a port request
containing a subnet_id is processed at the same time the subnet is being
deleted, the port operation may succeed without a fixed IP on the
requested subnet. This patch makes the DHCP agent resilient to this
bug by checking the port response and raising SubnetMismatchForPort
to trigger a resync if it doesn't have all of the requested subnet IDs.
Additionally, it avoids skipping assignment of IPv6 addresses to the
interface if they are stateless. The original logic to skip assignment
was only meant to be for SLAAC addresses.
Both of these issues were resulting in the KeyError observed in the
bug report.
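The response check can be sketched like this; the exception name comes from the commit, while the helper name and dict-shaped port are illustrative:

```python
class SubnetMismatchForPort(Exception):
    pass

def check_port_subnets(port, requested_subnet_ids):
    """Raise (to trigger a resync) when the port the server returned is
    missing a fixed IP on one of the requested subnets, which can happen
    if a subnet was deleted while the port request was in flight."""
    got = {ip['subnet_id'] for ip in port['fixed_ips']}
    missing = set(requested_subnet_ids) - got
    if missing:
        raise SubnetMismatchForPort(sorted(missing))
```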
Change I445974b0e0dabb762807c6f318b1b44f51b3fe15 updated the
'revision' field to 'revision_number' but it missed the DHCP
agent and subsequently broke its ability to detect stale updates.
This fixes the name in the agent.
This is marked as a partial for 1622616 because one of the reasons
the agent was frequently updating the DHCP port was in reaction
to stale port update messages for its own port.
Now that the agent will receive port update events for
all port changes, we need to avoid immediately restarting
when the subnets on the agent's port change. Otherwise
the restart may request ports on a subnet which is in the
process of being deleted. While the server is equipped to
handle this, it makes subnet deletion much more contentious
than it needs to be.
This alters the logic to schedule a resync for later if the
agent's port has had its subnets changed rather than restarting
right away. Then by the time the agent eventually syncs the
server should have finished deleting the subnet. Even if it hasn't,
it spaces out the request from the agent for the network far enough
that the operation will be much less frequent to avoid racing
with the server.
If the DHCP port setup process fails in the DHCP agent device
manager, it will throw a conflict exception, which will bubble
all of the way up to the main DHCP agent. The issue is that, during
a 'restart' call, the config files are wiped out while maintaining
the VIF before calling setup. This means that, if setup fails, there
is no reference to the interface name anymore so a subsequent destroy
will not first unplug the VIF before destroying the namespace.
This leaves a bunch of orphaned tap ports behind in the OVS case
that don't have an accessible namespace.
This patch addresses the issue by cleaning up all ports inside of
a namespace on a 'setup' failure before reraising the exception.
This ensures that the namespace is clear if destroy is called in the
future without another successful setup.
The previous logic was just ripping the interface out without
stopping dnsmasq. This would lead to a file handle remaining to the
interface which would cause OVS to completely freak out and assign
the same ofport to multiple ports.
This preserves the behavior introduced in
I40b85033d075562c43ce4d0e68296211b3241197 but fully disables
DHCP rather than relying on an exception to trigger the teardown.
that no longer exist. The DHCP agent already checks the return
value for None in case any of the other things went missing
(e.g. Subnet, Network), so checking for ports disappearing makes
sense. The corresponding agent-side log message for this has also
been downgraded to debug since this is a normal occurrence.
This also cleans up log noise from calling reload_allocations on
networks that have already been torn down due to all of their subnets
being removed.
The DHCP agent was using the same context for every RPC
request so it made it difficult to tell server side where one
RPC request began and where another one ended.
This patch has it generate a new context for each RPC request
so they can be tracked independently. In the long term it would
be better if the agent kept the context for server-initiated events
so actions could be tracked end-to-end under the same request-ID.
was given execute permission by mistake
in Change-Id: I57d7c242b2f2b63d71f7830fe355dbf857ffad58.
This patch removes the erroneous permission.
Refactoring neutron configuration options for dhcp agent to be in
neutron/conf/agent. This would allow centralization of all configuration
options and provide an easy way to import.
When a subnet is created and the network is scheduled to a dhcp agent,
the dhcp agent will request that the neutron server create the dhcp port.
The neutron server will create the port, mark it as BUILD, and wait for
the ready signal from the dhcp agent.
The dhcp agent will create the 'real' dhcp port after getting the
response from the neutron server. But after that, the dhcp agent will
not tell the neutron server that the dhcp port is ready. So the reported
bug can be observed.
If ports are created before dhcp is enabled for a network, the dhcp
agent will not mark the ports as 'ready' because there is no network
cache. This patch also marks all ports in the network as ready, in case
that happens.