9550e90b8e
Change-Id: I061c0cfc68965607ea3eefbc8751965db40e087f
175 lines
5.0 KiB
ReStructuredText
175 lines
5.0 KiB
ReStructuredText
======================================
|
|
Multiple L3 and DHCP agents in Neutron
|
|
======================================
|
|
|
|
https://blueprints.launchpad.net/fuel/+spec/fuel-multiple-l3-agents
|
|
|
|
In FUEL 5.1 and before HA network solution was based on one neutron-l3-agent
|
|
and one DHCP agent, which were switchable between controllers.
|
|
|
|
This blueprint describes a way of using multiple L3 and DHCP agents instead of
|
|
single. It is required for network scalability and neutron performance
|
|
improvements.
|
|
|
|
Problem description
|
|
===================
|
|
|
|
When we created virtual router in Neutron, it was scheduled to the L3-agent
|
|
(to one of alive if we had multiple agents using random selection).
|
|
Before Juno Neutron server didn't monitor life cycle of agent serving
|
|
this router. If the L3-agent service stopped on a node containing this agent or
|
|
connectivity was lost, Neutron server didn't reschedule this router to
|
|
another L3-agent. So there was no HA network solution.
|
|
|
|
Proposed change
|
|
===============
|
|
|
|
In Juno multiple solutions for this problem were introduced.
|
|
The easiest solution is to use the internal Neutron routers rescheduling
|
|
mechanism. In that case Neutron server automatically monitors L3 agents
|
|
lifecycle. If agent is marked as dead, all routers associated to the dead agent
|
|
will be safely moved by Neutron server to an alive agent on another node and
|
|
auxiliary resources created by the dead agent, such as additional interfaces
|
|
and iptables rules, will be removed.
|
|
There are some cases when auxiliary resources will be kept on the dead node and
|
|
potentially affect connection to instances. For example, when L3 agent is alive
|
|
but lost connection to a message broker. To avoid such problems additional
|
|
monitoring and clean up mechanism should be added. It must be easily usable
|
|
by Pacemaker. Current rescheduling script must be modified to match the
|
|
proposed changes.
|
|
|
|
This feature allows to have permanent and stable connection to instances
|
|
even in case of failure of one or more L3 agents. Also it allows to
|
|
effectively distribute routers between all available agents to improve
|
|
network performance.
|
|
|
|
For DHCP agent multiple-agent mode implemented as experimental feature
|
|
and disabled by default.
|
|
|
|
Alternatives
|
|
------------
|
|
|
|
In the Juno release DVR L3 agent is introduced. It looks like alternative
|
|
router solution. This solution serves only VMs with floating IP addresses and
|
|
doesn't change behavior for VMs without FIP.
|
|
Also this solution doesn't change behavior of DHCP agents.
|
|
|
|
There's another solution based on VRRP.
|
|
The problem is that this solution doesn't cover situation where both vrrp nodes
|
|
are down. This solution also needs external re-scheduling mechanism.
|
|
|
|
Data model impact
|
|
-----------------
|
|
|
|
None
|
|
|
|
REST API impact
|
|
---------------
|
|
|
|
None
|
|
|
|
Upgrade impact
|
|
--------------
|
|
|
|
None
|
|
|
|
Security impact
|
|
---------------
|
|
|
|
None
|
|
|
|
Notifications impact
|
|
--------------------
|
|
|
|
None
|
|
|
|
Other end user impact
|
|
---------------------
|
|
|
|
None
|
|
|
|
Performance Impact
|
|
------------------
|
|
|
|
* Time delays when neutron agents go down will decrease.
|
|
* Network scalability will grow.
|
|
* Load on a separate controller will be decreased.
|
|
* Customers will get a possibility to add any number of nodes with started
|
|
neutron agents and network scalability will grow.
|
|
|
|
Other deployer impact
|
|
---------------------
|
|
|
|
None
|
|
|
|
Developer impact
|
|
----------------
|
|
|
|
None
|
|
|
|
Implementation
|
|
==============
|
|
|
|
* In astute.yaml we have following options:
|
|
|
|
* quantum_settings/L3/multiple_dhcp_agents (default=false)
|
|
* quantum_settings/L3/dhcp_agents_per_network (default=3)
|
|
* quantum_settings/L3/multiple_l3_agents (default=true)
|
|
|
|
* OCF scripts for L3 and DHCP agents got "multiple_agents" option, that allows
|
|
run agents in non-singletone mode
|
|
* cluster::neutron::l3 and cluster::neutron::dhcp classes got "multiple_agents"
|
|
option, that allows configure agents for running in multiple-agent mode
|
|
* cluster::neutron::dhcp got "agents_per_net" option (by default = 3), that
|
|
describe amount of dhcp-agents for serve each network. This default
|
|
justifyed by performance reasons.
|
|
|
|
Backward compatibility
|
|
----------------------
|
|
|
|
Using "multiple_agents" option for OCF scripts we can manipulate behavior
|
|
of L3 and DHCP agents. Moreover, for using old-style behavior of L3/DHCP
|
|
agents we should decrease clone size for corresponded Pacemaker
|
|
resources to "1".
|
|
|
|
Work Items
|
|
-------------
|
|
|
|
- Update Puppet manifests to enable multiple L3 agents
|
|
- Add necessary patches to Neutron for additional agents monitoring
|
|
- Edit the rescheduling script and Pacemaker OCF script
|
|
to support multiple agents behavior
|
|
|
|
Assignee(s)
|
|
-----------
|
|
|
|
Sergey Vasilenko
|
|
Eugene Nikanorov
|
|
Oleg Bondarev
|
|
Sergey Kolekonov
|
|
|
|
Dependencies
|
|
============
|
|
|
|
None
|
|
|
|
Documentation Impact
|
|
====================
|
|
|
|
New Neutron-server behavior in case of dead L3 agents should be reflected in
|
|
documentation to correctly debug possible problems.
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
None
|
|
|
|
Testing
|
|
=======
|
|
|
|
- Deploy HA cluster
|
|
- All instances must be constantly available via floating ips and have Internet
|
|
access even in case of whole controller failure or particular cases such as
|
|
message broker failures
|