Currently, DHCP provisioning of ports is the crucial bottleneck
when booting multiple VMs concurrently.
The root cause is that ports belonging to the same network are processed
one by one by the dhcp agent, and a 'Provisioning complete' port still
blocks the processing of other ports in other dhcp agents. This
patch aims to optimize the strategy for dispatching port messages to
agents in order to speed up the provisioning process.
On the server side, I classify messages into multiple levels. In particular, I
classify the port_update_end and port_create_end messages into two levels: the
high-level message is cast to only one agent, and the low-level message is cast
to all agents. On the agent side, I put these messages into
`resource_processing_queue`; with the queue, we can delete `_net_lock` and
process these messages in order of priority.
Additionally, I modified `resource_processing_queue` to support this. I changed
`_queue` from a list to a PriorityQueue in `ExclusiveResourceProcessor`; this
way, all messages cached in `ExclusiveResourceProcessor` are sorted by
priority.
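A minimal sketch of the list-to-PriorityQueue idea follows. It is not the code from this patch: the class name, the priority constants and the message strings are hypothetical and only illustrate how cached messages can be served by priority while keeping arrival order within one level.

```python
import itertools
import queue

# Hypothetical priority levels; the lower number is served first.
PRIORITY_HIGH = 0
PRIORITY_LOW = 1


class PriorityMessageQueue:
    """Orders cached messages by priority, then by arrival order."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        self._counter = itertools.count()

    def put(self, priority, message):
        # The counter breaks ties, so equal-priority messages stay FIFO
        # and the message payloads themselves are never compared.
        self._queue.put((priority, next(self._counter), message))

    def drain(self):
        while not self._queue.empty():
            _priority, _seq, message = self._queue.get()
            yield message


q = PriorityMessageQueue()
q.put(PRIORITY_LOW, "port_update_end (low)")
q.put(PRIORITY_HIGH, "port_create_end (high)")
q.put(PRIORITY_LOW, "port_delete_end (low)")
ordered = list(q.drain())
```

The sequence counter is a design choice of this sketch: without it, two messages with the same priority would be compared directly, which fails for payloads such as dicts.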
Related-Bug: #1760047
Change-Id: I255caa0571c42fb012fe882259ef181070beccef
If the DHCP agent's port cache is out of sync with the neutron server,
dnsmasq entries are wrong and VMs may not acquire an IP because of
duplicate entries.
When the DHCP agent executes the port_create_end method, the port's
IP should be checked before being used; if there are duplicate IP
addresses in the same network in the cache, we should resync.
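The duplicate check could look roughly like the sketch below. The exception name, the helper name and the dict layout of cached ports are assumptions made for illustration; the actual agent code may be structured differently.

```python
class DuplicateIPError(Exception):
    """Hypothetical signal that the network cache needs a resync."""


def check_no_duplicate_ip(cached_ports, new_port):
    """Raise if another cached port already holds one of new_port's IPs."""
    new_ips = {ip["ip_address"] for ip in new_port["fixed_ips"]}
    for port in cached_ports:
        if port["id"] == new_port["id"]:
            # An update for the same port is not a duplicate.
            continue
        for ip in port["fixed_ips"]:
            if ip["ip_address"] in new_ips:
                raise DuplicateIPError(ip["ip_address"])


cache = [{"id": "port-1", "fixed_ips": [{"ip_address": "10.0.0.5"}]}]
new_port = {"id": "port-2", "fixed_ips": [{"ip_address": "10.0.0.5"}]}
try:
    check_no_duplicate_ip(cache, new_port)
    needs_resync = False
except DuplicateIPError:
    needs_resync = True
```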
Co-Authored-By: doreilly@suse.com
Closes-Bug: #1645835
Change-Id: Icc555050283420fddfb90bb67e02bc303e989e27
When a subnet's enable_dhcp attribute is updated, we must restart the
dhcp device. So, when we decide between 'restart' and
'reload_allocations' in the refresh_dhcp_helper function, we only compare
the CIDRs of subnets which have dhcp enabled.
The previous logic only calls 'restart' when deleting or adding a
subnet. This may cause the dhcp port to not be updated when the subnet's
enable_dhcp is updated to True.
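As a rough illustration of the comparison described above, assuming subnets are plain dicts with 'cidr' and 'enable_dhcp' keys (the function name choose_action is hypothetical, not part of refresh_dhcp_helper):

```python
def choose_action(old_subnets, new_subnets):
    """Pick 'restart' when the set of DHCP-enabled CIDRs changed."""
    old_cidrs = {s["cidr"] for s in old_subnets if s["enable_dhcp"]}
    new_cidrs = {s["cidr"] for s in new_subnets if s["enable_dhcp"]}
    return "reload_allocations" if old_cidrs == new_cidrs else "restart"


before = [
    {"cidr": "10.0.0.0/24", "enable_dhcp": True},
    {"cidr": "10.0.1.0/24", "enable_dhcp": False},
]
# Same subnets, but enable_dhcp was flipped to True on the second one.
after = [
    {"cidr": "10.0.0.0/24", "enable_dhcp": True},
    {"cidr": "10.0.1.0/24", "enable_dhcp": True},
]
action = choose_action(before, after)
```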
Change-Id: Ic547946ac786c5fab82b4ee7078bf86483f51eb5
Closes-Bug: #1805824
The DHCP agent will resync its state with Neutron to recover from any
transient notification or RPC errors. Currently, the periodic resync
task waits on a timer to determine whether a re-sync is necessary. The
interval between attempts is 5 seconds by default and can be made longer
through configuration. This may cause a potentially long delay before an agent
gets new work via an agent_updated RPC call.
The idea of this RFE is to change the timer-based periodic resync task
into an event-driven one. It also proposes a new DHCP agent config
option "resync_throttle" that enforces a minimum interval between
resync state events to avoid resyncing too frequently. In this way, we
can force the agent to act on a resync request immediately, thereby
decreasing how much time is needed before DHCP services are available.
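The throttling part can be sketched as below. Only the option name "resync_throttle" comes from this message; the class, its methods and the injectable clock are assumptions used to keep the example small and testable, and the event wait itself is left out.

```python
import time


class ResyncThrottle:
    """Enforces a minimum interval between two resync runs."""

    def __init__(self, resync_throttle, clock=time.monotonic):
        self._interval = resync_throttle
        self._clock = clock
        self._last_run = None

    def delay_before_next(self):
        """Seconds to wait before the next resync may start."""
        if self._last_run is None:
            return 0.0
        elapsed = self._clock() - self._last_run
        return max(0.0, self._interval - elapsed)

    def mark_run(self):
        self._last_run = self._clock()


# A fake clock stands in for real time in this example.
fake_now = [100.0]
throttle = ResyncThrottle(1.0, clock=lambda: fake_now[0])
first_delay = throttle.delay_before_next()   # nothing has run yet
throttle.mark_run()
fake_now[0] = 100.25
second_delay = throttle.delay_before_next()  # 0.25s elapsed of 1.0s
```

In an event-driven loop, the agent would wait for a resync request, sleep for delay_before_next() seconds, run the sync, and then call mark_run().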
Co-authored-by: Allain Legacy <Allain.legacy@windriver.com>
Closes-Bug: #1780370
Change-Id: Ie9d758ba5f750a38dc19ea5ce8b2c6b414f9ef80
According to [1], when a network contains more than one IPv4
subnet, they are returned in the 'classless-static-routes'
DHCP option, regardless of whether DHCP is enabled for them
or not.
However, the get_active_networks_info() method used for
synchronizing networks after the dhcp agent restarts filters
subnets with "enable_dhcp=True", which differs from the
get_network_info() method. This will block VM access to
other VMs in the dhcp disabled subnets, even though they are
in the same network. This is visible by looking at the "opts"
file before and after a restart.
Change the dhcp agent to ask for all subnets in its
get_active_networks_info() RPC call by adding an
enable_dhcp_filter argument to toggle the behavior, with the
default being True to not break backwards compatibility.
Based on https://review.openstack.org/#/c/352530/ by Quan Tian.
[1] https://review.openstack.org/#/c/125043/
Change-Id: I11ca1d1a603d02587f3b8d4a5a52a96b0587d61f
Closes-Bug: #1652654
The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.
The problematic order of operations is as follows:
1) a network is scheduled to an agent; a network rpc is sent to the
agent
2) the agent queries the network data from the server
3) while that query is in progress a port on that network is deleted; a
port rpc is sent to the agent
4) that port delete rpc is received before the network query rpc
completes
5) the port delete results in no action because the port was not present
on the agent
6) the network query finishes and adds the port to the cache (even
though the port has already been deleted)
7) some time passes and a new port is configured with the same IP
address as the port that was deleted in (3)
8) the dhcp host file is corrupted with 2 entries for the same IP
address.
9) dhcp queries for the newest port are rejected because of the duplicate
entry in the dhcp host file.
The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
so that it is processed serially with other network related events.
To ensure backwards compatibility with newer agents running against older
servers, the determination of which network_id value to use in the lock is
handled using a utility that will fall back to the previous mode of operation
whenever the network_id attribute is not present in the *_delete_end RPC
events. That utility can be removed in the future when it is guaranteed
that the network_id attribute will be present in RPC messages from the
server.
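A sketch of such a fallback utility is shown below. The helper name and the placeholder lock key are hypothetical; this message does not say what the previous mode of operation locks on, so the fallback value here is only an assumption.

```python
# Hypothetical placeholder lock key used when the server is too old to
# send network_id; the real fallback value is not given in this message.
FALLBACK_LOCK_ID = "legacy-lock"


def get_network_lock_id(payload):
    """Return the lock key to hold while servicing an RPC event."""
    return payload.get("network_id", FALLBACK_LOCK_ID)


new_server_key = get_network_lock_id(
    {"port_id": "port-1", "network_id": "net-a"})
old_server_key = get_network_lock_id({"port_id": "port-1"})
```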
Closes-Bug: #1732456
Change-Id: I735f8b1c9248b12e5feb6cbe970cf67f321e6ebc
Signed-off-by: Allain Legacy <allain.legacy@windriver.com>
Fix W503 (line break before binary operator) pep8 warnings
and no longer ignore new failures.
Trivialfix
Change-Id: I7539f3b7187f2ad40681781f74b6e05a01bac474
The neutron.common.topics module was rehomed into neutron-lib with
commit Ie88b84949cbd55a4e7ad06341aab77b286cdc485.
This patch consumes it by removing the rehomed module from neutron
and using the module from neutron-lib instead.
NeutronLibImpact
Change-Id: Ia4a4604c259ce862597de80c6deeb3d408bf0e95
When a network becomes isolated and isolated_metadata_enabled=True, the DHCP
agent won't spawn the required metadata proxy instance unless the agent gets
restarted. Similarly, it won't stop them when the network is no longer
isolated.
This patch fixes it by updating the isolated metadata proxy on port_update_end
and port_delete_end methods which are invoked every time a router interface
port is added, updated or deleted.
Change-Id: I5c197a5755135357c6465dfe4803019a2ad52c14
Closes-Bug: #1753540
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
The neutron-lib commit I360545b6ee4291547e0c5c8e668ad03d3efa4725 moved
the externally consumed globals from neutron.common.constants into lib.
With the exception of PROVISIONAL_IPV6_PD_PREFIX all other constants
in neutron.common.constants should only be used in neutron, and will
hopefully remain that way. External consumers needing access to other
common constants should move them into lib first.
NeutronLibImpact
Change-Id: Ie4bcffccf626a6e1de84af01f3487feb825f8b65
Change the rpc_api.rst file path from doc/source/devref/rpc_api.rst
to doc/source/contributor/internals/rpc_api.rst, because that is
where the rpc_api.rst file is located.
Closes-Bug: #1722072
Change-Id: Ic243aab9e3428bfec69db61a94b4129cd768e233
neutron-lib contains the synchronized lockutils decorator as well as
the SYNCHRONIZED_PREFIX global. This patch consumes them from
neutron-lib and removes them from neutron.
NeutronLibImpact
Change-Id: I729da348e340509f2d09f8a6436716e2398f1583
When a user creates a network with an isolated subnet, the dhcp agent will
create an md-proxy with the vrouter id. This will conflict with the md-proxy
created by the l3 agent. This patch updates the dhcp agent to start the
md-proxy with the vrouter id only when the network has a metadata subnet.
Change-Id: I3288327bf9d0cdf759a6fdf365d1289e8b7442db
Closes-Bug: #1703059
Since Pike, log messages should not be translated.
This patch removes calls to i18n _LC, _LI, _LE, _LW from
logging logic throughout the code. Translators definition
from neutron._i18n is removed as well.
This patch also removes log translation verification from
ignore directive in tox.ini.
Change-Id: If9aa76fcf121c0e61a7c08088006c5873faee56e
In order to allow the DHCP agent to service other subnets on the
network in other segments via DHCP relay, we need to use the
'non_local_subnets' network attribute returned by rpc to set up dhcp
for off-link subnets.
Change-Id: I88e1c574bc429dc599ad7c956c03fa0688338186
Closes-Bug: #1692486
Provisioning blocks merged in Newton so for Pike we can
safely assume we are not running with Liberty agents that
don't notify the server when the port is ready.
This also drops a block of logic in the agent that was providing
forward compatibility with servers that didn't support the
'dhcp_ready_on_ports' endpoint since servers have been supporting
it for so long and we don't normally allow agents to be upgraded
first anyway.
Related-Bug: #1453350
Change-Id: Ia86547fb4601915d7dd852b6f7a11c120089d6f6
Without this commit, the run_as_root parameter is always True when
stopping a process, which leads to the usage of unnecessary sudo such as
in some functional tests, like the keepalived ones.
This commit fixes the aforementioned problem by taking run_as_root into
account when stopping a process. However, run_as_root will still always
be True if the process is spawned in a netns.
Closes-Bug: #1491581
Change-Id: Ib40e1e3357b9a38e760f4e552bf615cdfd54ee5a
Signed-off-by: Hunt Xu <mhuntxu@gmail.com>
When force_metadata=True and enable_isolated_metadata=False,
the namespace metadata proxy process might not be terminated
when the network is deleted because the subnets and ports
will have already been deleted, so we could incorrectly
determine it was started. Calling destroy_monitored_metadata_proxy() is
a noop when there is no process running.
Change-Id: I77ff545ce02f2dca4c38e587b37ea809ad6f072c
Closes-Bug: #1648095
Looking at the cache before acquiring a lock may cause the
agent to mistakenly think the network doesn't exist when it
is actually being wired in parallel.
Always acquiring the network-based semaphore will ensure that
the network isn't currently being set up in another coroutine.
Closes-Bug: #1659919
Change-Id: I99ae71e3c5b1cd91dca3f6c80b04d2ecb79de64f
During the DhcpAgent startup procedure, all of the following network
initialization steps are actually performed twice:
* killing old dnsmasq processes
* setting up and configuring all TAP interfaces
* building all Dnsmasq config files (lease and host files)
* launching dnsmasq processes
The second iteration just cleans up and redoes exactly the same
work another time. This is really inefficient and dramatically
increases DHCP startup time (to nearly twice what is needed).
The initialization method 'sync_state' is called twice:
* one time during init_host()
* another time during _report_state()
sync_state() call must stay in init_host() due to bug #1420042.
sync_state() is always called during startup in init_host()
and will be periodically called by periodic_resync()
to do reconciliation.
Hence it can safely be removed from the run() method.
Change-Id: Id6433598d5c833d2e86be605089d42feee57c257
Closes-bug: #1651368
Closes-Bug: #1650611
All cache operations and dnsmasq process operations
are scoped to a network ID so we can always safely
perform concurrent actions on different network IDs.
This patch adjusts the DHCP agent to lock based on
network ID rather than having a global lock for every
operation.
sync_state calls are still protected with a reader/writer
lock to ensure that when sync_state needs to run, all
other operations are blocked.
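A minimal sketch of locking per network ID, using plain threading locks instead of the agent's own primitives. The NetworkLocks class is hypothetical, and the reader/writer lock around sync_state is deliberately left out to keep the example short.

```python
import collections
import threading


class NetworkLocks:
    """Hands out one lock per network ID."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = collections.defaultdict(threading.Lock)

    def for_network(self, network_id):
        # The guard protects the dict itself while a lock is created.
        with self._guard:
            return self._locks[network_id]


locks = NetworkLocks()
with locks.for_network("net-a"):
    # A different network is not blocked while net-a is being processed.
    other_free = locks.for_network("net-b").acquire(blocking=False)
    # The same network is still serialized.
    same_free = locks.for_network("net-a").acquire(blocking=False)
if other_free:
    locks.for_network("net-b").release()
```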
Related-Bug: #1548190
Change-Id: I56010dc801d82be56f12e834c5164316872c2f8b
'refresh_dhcp_helper', which is called after subnet update/create
notifications in the DHCP agent, can end up retrieving ports that
the agent hadn't yet seen. It will then configure those ports but
not notify the server that they are ready.
Unless the port is subsequently updated on the server to
generate a new port update notification, the DHCP agent won't ever tell
the server that the port has had DHCP provisioned. This led to the
bug this closes. Another patch[1] that removed excessive DHCP ready
notifications uncovered this bug.
This patch just adjusts refresh_dhcp_helper to ensure that all ports
are marked as ready after configuring them all.
1. Ie7686837b18ff251baa315ef95dc511cda475672
Change-Id: I1fed60c1835c2ebed7c050c6fa114f89beec3190
Closes-Bug: #1639806
The DHCP agent was previously resending every single port to
the server whenever sync_state was called, even if it was just
for one network.
This led to sending way too much unnecessary data to the server
and also potentially resulted in sending a port to the server
that wasn't actually provisioned yet.
This patch corrects the behavior by only sending ports for networks
that are being synced if it's a conditional sync.
Closes-Bug: #1639086
Change-Id: Ie7686837b18ff251baa315ef95dc511cda475672
With the current code, if the first subnet of the network is an ipv6 subnet,
the metadata proxy will not be spawned. If the user then adds an ipv4 subnet
with dhcp enabled, the metadata proxy will still not be spawned. As a
result, the metadata service will not be available for the network.
This patch kills or spawns the metadata proxy when a subnet is added or
deleted. So, even if the first subnet of the network is not an ipv4 subnet
with dhcp enabled, the metadata proxy can still be spawned if the network has
subnets that need the metadata proxy.
Closes-bug: #1556991
Change-Id: I0b45af8f2b756732f45c13d7e2dbcd30653cc026
There is a race condition server-side: when a port request containing
a subnet_id is processed at the same time the subnet is being deleted,
the port operation may succeed without the port having a fixed IP on the
requested subnet. This patch makes the DHCP agent resilient to this
bug by checking the port response and raising a SubnetMismatchForPort
to trigger a resync if it doesn't have all of the requested subnet IDs.
Additionally, it avoids skipping assignment of IPv6 addresses to the
interface if they are stateless. The original logic to skip assignment
was only meant to be for SLAAC addresses.
Both of these issues were resulting in the KeyError observed in the
bug report.
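The subnet check on the port response can be sketched as follows. The exception class here is a local stand-in with the same name as the one mentioned above, and the helper name and the dict layout of the port are assumptions.

```python
class SubnetMismatchForPort(Exception):
    """Local stand-in for the exception named in this message."""


def check_port_subnets(requested_subnet_ids, port):
    """Raise if the returned port lacks a fixed IP on a requested subnet."""
    returned = {ip["subnet_id"] for ip in port["fixed_ips"]}
    missing = set(requested_subnet_ids) - returned
    if missing:
        raise SubnetMismatchForPort(sorted(missing))


# The server returned a port without an IP on the subnet being deleted.
port = {"id": "dhcp-port", "fixed_ips": [{"subnet_id": "subnet-1"}]}
try:
    check_port_subnets(["subnet-1", "subnet-2"], port)
    resync_needed = False
except SubnetMismatchForPort:
    resync_needed = True
```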
Related-Bug: #1627480
Closes-Bug: #1624079
Change-Id: I85ef1f4d60efd0309d6a0706e29fdbcc16f0b59d
Change I445974b0e0dabb762807c6f318b1b44f51b3fe15 updated the
'revision' field to 'revision_number' but it missed the DHCP
agent and subsequently broke its ability to detect stale updates.
This fixes the name in the agent.
This is marked as a partial for 1622616 because one of the reasons
the agent was frequently updating the DHCP port was in reaction
to stale port update messages for its own port.
Partial-Bug: #1622616
Closes-Bug: #1625867
Change-Id: Id41000127e1084f7ff243f8dc9c399999fbdaab4
Now that the agent will receive port update events for
all port changes[1], we need to avoid immediately restarting
when the subnets on the agent's port change. Otherwise
the restart may request ports on a subnet which is in the
process of being deleted. While the server is equipped to
handle this, it makes subnet deletion much more contentious
than it needs to be.
This alters the logic to schedule a resync for later if the
agent's port has had its subnets changed rather than restarting
right away. Then by the time the agent eventually syncs the
server should have finished deleting the subnet. Even if it hasn't,
it spaces out the request from the agent for the network far enough
that the operation will be much less frequent to avoid racing
with the server.
1. I607635601caff0322fd0c80c9023f5c4f663ca25
Partial-Bug: #1622616
Change-Id: I98761a7e3f4bce8d5485c885f03c6bfdde246802
The previous logic was just ripping the interface out without
stopping dnsmasq. This would lead to a file handle remaining to the
interface which would cause OVS to completely freak out and assign
the same ofport to multiple ports.
This preserves the behavior introduced in
I40b85033d075562c43ce4d0e68296211b3241197 but just fully disables
DHCP rather than relying on an exception generation to cause the
resync.
Closes-bug: #1624701
Change-Id: Icdd9ac136eeb3707c912853b134dbb58109e6940
Capture port not found exceptions from port updates of DHCP ports
that no longer exist. The DHCP agent already checks the return
value for None in case any of the other things went missing
(e.g. Subnet, Network), so checking for ports disappearing makes
sense. The corresponding agent-side log message for this has also
been downgraded to debug since this is a normal occurrence.
This also cleans up log noise from calling reload_allocations on
networks that have already been torn down due to all of the subnets
being removed.
Closes-Bug: #1621650
Change-Id: I495401d225c664b8f1cf7b3d51747f3b47c24fc0
The DHCP agent was using the same context for every RPC
request so it made it difficult to tell server side where one
RPC request began and where another one ended.
This patch has it generate a new context for each RPC request
so they can be tracked independently. In the long term it would
be better if the agent kept the context for server-initiated events
so actions could be tracked end-to-end under the same request-ID.
Change-Id: I1d6dc28ba4752d3f9f1020851af2960859aae520
Closes-Bug: #1618231
This utilizes the new revision number to discard stale
port updates on the DHCP agent.
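A rough sketch of the stale-update check, assuming ports are dicts and the field is called 'revision_number' (the name used by a later change in this history). The helper name is_stale is hypothetical.

```python
def is_stale(cached_port, incoming_port):
    """True if the incoming update is older than what is already cached."""
    if cached_port is None:
        return False
    incoming_rev = incoming_port.get("revision_number", 0)
    cached_rev = cached_port.get("revision_number", 0)
    return incoming_rev < cached_rev


cached = {"id": "port-1", "revision_number": 7}
late_update = {"id": "port-1", "revision_number": 5}
discard = is_stale(cached, late_update)
```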
Change-Id: I8c904b63515692039e7e2bf6cf8c48f3575c5bc1
Partially-Implements: bp/push-notifications
When a subnet is created and the network is scheduled to a dhcp agent, the
dhcp agent will request the neutron server to create a dhcp port.
The neutron server will create the port, mark it as BUILD, and wait for the
ready signal from the dhcp agent.
The dhcp agent will create the 'real' dhcp port after getting the response
from the neutron server. But after that, the dhcp agent will not tell the
neutron server that the dhcp port is ready, which causes the reported bug.
If ports are created before dhcp is enabled for the network, the dhcp agent
will not mark the ports as 'ready' because there is no network cache. This
patch also marks all ports in the network as ready, in case that happens.
Change-Id: I363d8727f7ef6e6e08be4b0022c6464d51692b85
Closes-bug: #1588906
When a new subnet is added to a network, the network cache
is updated with the list of subnets regardless of which ones
have DHCP enabled. This changes the index order of the subnet
list which means that the tags used for each subnet change.
This means we must restart the process because the opts file
will be using different tags than the process args. This patch
implements that change. It also sorts the subnets on the RPC
side so the agent indexes don't change if subnets aren't
added/deleted.
The previous logic was only restarting the process when DHCP
enabled subnets changed, which meant that adding a DHCP disabled
subnet would break the association between the opts file tags and
the process arg tags, which led to the reported bug.
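To illustrate why a stable ordering matters, the sketch below derives a tag from each subnet's index. The tag format and the sort key (subnet id) are assumptions for this example, not necessarily what the RPC side or the dnsmasq driver uses.

```python
def subnet_tags(subnets):
    """Map each subnet ID to an index-based tag using a stable order."""
    ordered = sorted(subnets, key=lambda s: s["id"])
    return {s["id"]: "tag%d" % i for i, s in enumerate(ordered)}


subnets = [{"id": "subnet-b"}, {"id": "subnet-a"}]
tags = subnet_tags(subnets)
# The same subnets in a different list order yield the same tags.
tags_reordered = subnet_tags(list(reversed(subnets)))
# Adding a subnet can shift indexes, which is why a restart is needed.
tags_after_add = subnet_tags(subnets + [{"id": "subnet-0"}])
```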
Closes-Bug: #1581918
Change-Id: If1452c0e8fe95eb94cd78c7a05b57aead75662b5
Sometimes an object requires multiple disjoint actors to complete
a set of tasks before the status of the object should be transitioned
to ACTIVE. The main example of this is when a port is being created.
The L2 agent has to do its business to wire up the VIF, but at the same
time the DHCP agent has to set up the DHCP reservation. This led to
Nova booting the VM when the L2 agent was done even though the DHCP
agent may have been nowhere near ready.
This patch introduces a provisioning blocks mechanism that allows the
entities to be tracked that need to be involved to make a transition
to ACTIVE happen. See the devref in the dependent patch for a high-level
view of how this works.
The ML2 code is updated to use this new mechanism to prevent updating
the port status to ACTIVE without both the DHCP agent and L2 agent
reporting that the port is ready.
The DHCP RPC API required a version bump to allow the port ready
notification.
This also adds a devref doc for the provisioning_blocks
module with a high-level overview of how it works in addition
to a detailed description of how it is used specifically with
ML2, the L2 agents, and the DHCP agents.
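The core idea can be sketched in a few lines. This is an in-memory illustration only: the class and method names are hypothetical and are not the API of the provisioning_blocks module, which is described in the devref.

```python
class ProvisioningBlocks:
    """Tracks which entities still have to finish work on an object."""

    def __init__(self):
        self._blocks = {}

    def add_block(self, object_id, entity):
        self._blocks.setdefault(object_id, set()).add(entity)

    def complete(self, object_id, entity):
        """Clear one block; return True when no blocks remain."""
        pending = self._blocks.get(object_id, set())
        pending.discard(entity)
        if not pending:
            self._blocks.pop(object_id, None)
            return True
        return False


blocks = ProvisioningBlocks()
blocks.add_block("port-1", "L2")
blocks.add_block("port-1", "DHCP")
active_after_l2 = blocks.complete("port-1", "L2")      # DHCP still pending
active_after_dhcp = blocks.complete("port-1", "DHCP")  # last block cleared
```

Only when complete() returns True would the caller transition the port to ACTIVE.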
Closes-Bug: #1453350
Change-Id: Id85ff6de1a14a550ab50baf4f79d3130af3680c8
These will happen all of the time as networks are quickly
created/updated and then deleted. It's not any kind of
actionable warning condition so this patch downgrades them
to debug.
Change-Id: Idcfb185b9a0540c13101dceb3681132f38f1716c
Closes-Bug: #1555842