Blueprint: fuel-multiple-l3-agents
Change-Id: I061c0cfc68965607ea3eefbc8751965db40e087f

======================================
Multiple L3 and DHCP agents in Neutron
======================================

https://blueprints.launchpad.net/fuel/+spec/fuel-multiple-l3-agents

In Fuel 5.1 and earlier, the HA network solution was based on a single
neutron-l3-agent and a single DHCP agent, which could be switched between
controllers.

This blueprint describes a way of using multiple L3 and DHCP agents instead
of a single one. This is required for network scalability and Neutron
performance improvements.

Problem description
===================

When a virtual router was created in Neutron, it was scheduled to an L3 agent
(picked at random from the alive agents when several were available). Before
Juno, the Neutron server did not monitor the life cycle of the agent serving
the router. If the L3 agent service stopped on the node hosting it, or
connectivity to that node was lost, the Neutron server did not reschedule the
router to another L3 agent, so there was no HA network solution.

Proposed change
===============

In Juno, multiple solutions for this problem were introduced. The easiest one
is to use Neutron's internal router rescheduling mechanism. In that case the
Neutron server automatically monitors the L3 agents' life cycle. If an agent
is marked as dead, all routers associated with the dead agent are safely
moved by the Neutron server to an alive agent on another node, and auxiliary
resources created by the dead agent, such as additional interfaces and
iptables rules, are removed.
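
The same movement can be inspected or performed by hand with the standard
Neutron CLI; a minimal sketch, with the agent and router IDs as
placeholders::

    # See which L3 agent currently hosts a given router
    neutron l3-agent-list-hosting-router <router-id>

    # Move the router off the dead agent onto an alive one; these are
    # the same operations the automatic rescheduler performs
    neutron l3-agent-router-remove <dead-agent-id> <router-id>
    neutron l3-agent-router-add <alive-agent-id> <router-id>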

There are cases, however, when auxiliary resources remain on the dead node
and can potentially affect connectivity to instances, for example when the
L3 agent itself is alive but has lost its connection to the message broker.
To avoid such problems, an additional monitoring and cleanup mechanism should
be added. It must be easily usable by Pacemaker. The current rescheduling
script must be modified to match the proposed changes.
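
As an illustration only (not the actual cleanup mechanism), removing stale
router namespaces on a node whose agent has been declared dead could look
roughly like this::

    # Hypothetical cleanup sketch: run only on a node whose L3 agent has
    # been declared dead, after its routers were rescheduled elsewhere
    for ns in $(ip netns list | awk '/^qrouter-/ {print $1}'); do
        ip netns delete "$ns"
    done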

This feature provides a permanent and stable connection to instances even if
one or more L3 agents fail. It also allows routers to be distributed
effectively between all available agents to improve network performance.

For the DHCP agent, multiple-agent mode is implemented as an experimental
feature and is disabled by default.

Alternatives
------------

In the Juno release the DVR L3 agent was introduced, which looks like an
alternative router solution. However, it serves only VMs with floating IP
addresses and does not change the behavior for VMs without a FIP. It also
does not change the behavior of DHCP agents.

There is another solution based on VRRP. The problem is that it does not
cover the situation where both VRRP nodes are down, and it also needs an
external rescheduling mechanism.

Data model impact
-----------------

None

REST API impact
---------------

None

Upgrade impact
--------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

* Delays when Neutron agents go down will decrease.
* Network scalability will grow.
* Load on each individual controller will decrease.
* Customers will be able to add any number of nodes running Neutron agents,
  further improving network scalability.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

* astute.yaml contains the following options:

  * quantum_settings/L3/multiple_dhcp_agents (default=false)
  * quantum_settings/L3/dhcp_agents_per_network (default=3)
  * quantum_settings/L3/multiple_l3_agents (default=true)

* The OCF scripts for the L3 and DHCP agents get a "multiple_agents" option
  that allows running the agents in non-singleton mode.
* The cluster::neutron::l3 and cluster::neutron::dhcp classes get a
  "multiple_agents" option that allows configuring the agents to run in
  multiple-agent mode.
* cluster::neutron::dhcp gets an "agents_per_net" option (default = 3) that
  defines how many DHCP agents serve each network; this default is justified
  by performance reasons. A verification sketch follows this list.
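
A minimal verification sketch, assuming the standard Neutron CLI, with the
network and agent IDs as placeholders::

    # Check that a network is served by the expected number of DHCP agents
    # (dhcp_agents_per_network defaults to 3 here)
    neutron dhcp-agent-list-hosting-net <network-id>

    # Rebalance a network between DHCP agents by hand if needed
    neutron dhcp-agent-network-remove <agent-id> <network-id>
    neutron dhcp-agent-network-add <agent-id> <network-id>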

Backward compatibility
----------------------

The "multiple_agents" option of the OCF scripts controls whether the L3 and
DHCP agents run in multiple-agent mode. To restore the old-style single-agent
behavior of the L3/DHCP agents, the clone size of the corresponding Pacemaker
resources should additionally be decreased to "1", as sketched below.
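
A minimal sketch, assuming crmsh and that the agents run as the Pacemaker
clones clone_p_neutron-l3-agent and clone_p_neutron-dhcp-agent (adjust the
resource names to the actual deployment)::

    # Limit each agent clone to a single active copy (old-style behavior)
    crm resource meta clone_p_neutron-l3-agent set clone-max 1
    crm resource meta clone_p_neutron-dhcp-agent set clone-max 1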

Work Items
----------

- Update Puppet manifests to enable multiple L3 agents
- Add the necessary patches to Neutron for additional agent monitoring
- Edit the rescheduling script and the Pacemaker OCF scripts to support
  multiple-agent behavior

Assignee(s)
-----------

Sergey Vasilenko
Eugene Nikanorov
Oleg Bondarev
Sergey Kolekonov

Dependencies
============

None

Documentation Impact
====================

The new Neutron server behavior in the case of dead L3 agents should be
reflected in the documentation so that possible problems can be debugged
correctly.

References
==========

None

Testing
=======

- Deploy an HA cluster
- All instances must be constantly available via their floating IPs and have
  Internet access even in the case of a whole controller failure, or in
  particular failure cases such as a message broker failure
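
A rough manual check of the failover scenario, assuming crmsh and an instance
reachable via a placeholder floating IP <fip>::

    # Keep probing the instance's floating IP in one terminal
    ping <fip>

    # In another terminal, put one controller into standby to simulate a
    # whole-controller failure; the ping should recover after rescheduling
    crm node standby <controller-node>

    # Bring the controller back afterwards
    crm node online <controller-node>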