From 9550e90b8e6f148dfff68d7ed118d25a619fb00f Mon Sep 17 00:00:00 2001
From: Sergey Vasilenko <svasilenko@mirantis.com>
Date: Thu, 21 Aug 2014 17:14:37 +0400
Subject: [PATCH] Blueprint: fuel-multiple-l3-agents

Change-Id: I061c0cfc68965607ea3eefbc8751965db40e087f
---
 specs/6.0/neutron-multiple-l3-agents.rst | 174 +++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 specs/6.0/neutron-multiple-l3-agents.rst

diff --git a/specs/6.0/neutron-multiple-l3-agents.rst b/specs/6.0/neutron-multiple-l3-agents.rst
new file mode 100644
index 00000000..0c434a37
--- /dev/null
+++ b/specs/6.0/neutron-multiple-l3-agents.rst
@@ -0,0 +1,174 @@
+======================================
+Multiple L3 and DHCP agents in Neutron
+======================================
+
+https://blueprints.launchpad.net/fuel/+spec/fuel-multiple-l3-agents
+
+In Fuel 5.1 and earlier, the HA network solution was based on one
+neutron-l3-agent and one DHCP agent that could be moved between controllers.
+
+This blueprint describes how to use multiple L3 and DHCP agents instead of a
+single one. This is required to improve network scalability and Neutron
+performance.
+
+Problem description
+===================
+
+When a virtual router was created in Neutron, it was scheduled to one L3 agent
+(chosen at random among the live agents if there were several).
+Before Juno, the Neutron server didn't monitor the life cycle of the agent
+serving the router. If the L3 agent service stopped on its node, or
+connectivity to that node was lost, the Neutron server didn't reschedule the
+router to another L3 agent. So there was no HA network solution.
+
+Proposed change
+===============
+
+In Juno, multiple solutions to this problem were introduced.
+The easiest one is to use the internal Neutron router rescheduling
+mechanism. In that case the Neutron server automatically monitors the L3 agent
+life cycle. If an agent is marked as dead, all routers associated with it are
+safely moved by the Neutron server to a live agent on another node, and
+auxiliary resources created by the dead agent, such as additional interfaces
+and iptables rules, are removed.
+In some cases auxiliary resources are kept on the dead node and can
+potentially affect connectivity to instances, for example when the L3 agent is
+alive but has lost its connection to the message broker. To avoid such
+problems, an additional monitoring and cleanup mechanism should be added. It
+must be easily usable by Pacemaker. The current rescheduling script must be
+modified to match the proposed changes.
+
+This feature provides a permanent and stable connection to instances
+even if one or more L3 agents fail. It also distributes routers
+across all available agents, which improves
+network performance.
+
+For the DHCP agent, multiple-agent mode is implemented as an experimental
+feature and is disabled by default.
+
+Alternatives
+------------
+
+The Juno release introduces the DVR L3 agent, which looks like an alternative
+router solution. However, it serves only VMs with floating IP addresses and
+doesn't change the behavior for VMs without a floating IP (FIP).
+It also doesn't change the behavior of DHCP agents.
+
+There's another solution based on VRRP.
+The problem is that it doesn't cover the situation where both VRRP nodes
+are down. It also needs an external rescheduling mechanism.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Upgrade impact
+--------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+* Failover delays when Neutron agents go down will decrease.
+* Network scalability will grow.
+* Load on each individual controller will decrease.
+* Customers will be able to add any number of nodes running Neutron agents,
+  which further improves network scalability.
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+* In astute.yaml we have the following options:
+
+  * quantum_settings/L3/multiple_dhcp_agents (default=false)
+  * quantum_settings/L3/dhcp_agents_per_network (default=3)
+  * quantum_settings/L3/multiple_l3_agents (default=true)
+
+* OCF scripts for L3 and DHCP agents got a "multiple_agents" option that
+  allows running agents in non-singleton mode.
+* cluster::neutron::l3 and cluster::neutron::dhcp classes got a
+  "multiple_agents" option for configuring agents in multiple-agent mode.
+* cluster::neutron::dhcp got an "agents_per_net" option (default = 3) that
+  sets the number of DHCP agents serving each network. This default is
+  justified by performance reasons.
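+
+As an illustration only, assuming the slash-separated paths above map to
+nested YAML keys, the corresponding astute.yaml fragment with the default
+values might look like this:
+
+.. code-block:: yaml
+
+  # Sketch: the key nesting is inferred from the option paths listed above.
+  quantum_settings:
+    L3:
+      multiple_dhcp_agents: false
+      dhcp_agents_per_network: 3
+      multiple_l3_agents: true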
+
+Backward compatibility
+----------------------
+
+The "multiple_agents" option of the OCF scripts controls the behavior of the
+L3 and DHCP agents. In addition, to restore the old-style behavior of the
+L3/DHCP agents, the clone size of the corresponding Pacemaker resources should
+be decreased to "1".
+
+Work Items
+----------
+
+- Update Puppet manifests to enable multiple L3 agents
+- Add necessary patches to Neutron for additional agent monitoring
+- Edit the rescheduling script and Pacemaker OCF script
+  to support multiple-agent behavior
+
+Assignee(s)
+-----------
+
+- Sergey Vasilenko
+- Eugene Nikanorov
+- Oleg Bondarev
+- Sergey Kolekonov
+
+Dependencies
+============
+
+None
+
+Documentation Impact
+====================
+
+The new Neutron server behavior for dead L3 agents should be documented so
+that possible problems can be debugged correctly.
+
+
+References
+==========
+
+None
+
+Testing
+=======
+
+- Deploy an HA cluster
+- All instances must remain available via floating IPs and keep Internet
+  access even if a whole controller fails or in particular cases such as
+  message broker failures