From 9550e90b8e6f148dfff68d7ed118d25a619fb00f Mon Sep 17 00:00:00 2001
From: Sergey Vasilenko
Date: Thu, 21 Aug 2014 17:14:37 +0400
Subject: [PATCH] Blueprint: fuel-multiple-l3-agents

Change-Id: I061c0cfc68965607ea3eefbc8751965db40e087f
---
 specs/6.0/neutron-multiple-l3-agents.rst | 174 +++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 specs/6.0/neutron-multiple-l3-agents.rst

diff --git a/specs/6.0/neutron-multiple-l3-agents.rst b/specs/6.0/neutron-multiple-l3-agents.rst
new file mode 100644
index 00000000..0c434a37
--- /dev/null
+++ b/specs/6.0/neutron-multiple-l3-agents.rst
@@ -0,0 +1,174 @@
+======================================
+Multiple L3 and DHCP agents in Neutron
+======================================
+
+https://blueprints.launchpad.net/fuel/+spec/fuel-multiple-l3-agents
+
+In FUEL 5.1 and earlier, the HA network solution was based on one
+neutron-l3-agent and one DHCP agent, which were switchable between controllers.
+
+This blueprint describes a way of using multiple L3 and DHCP agents instead of
+a single one. This is required to improve network scalability and Neutron
+performance.
+
+Problem description
+===================
+
+When a virtual router was created in Neutron, it was scheduled to an L3 agent
+(chosen at random from the live agents if there were several).
+Before Juno, the Neutron server didn't monitor the life cycle of the agent
+serving the router. If the L3 agent service stopped on its node or
+connectivity was lost, the Neutron server didn't reschedule the router to
+another L3 agent, so there was no HA network solution.
+
+Proposed change
+===============
+
+In Juno, multiple solutions for this problem were introduced.
+The easiest solution is to use the internal Neutron router rescheduling
+mechanism. In that case the Neutron server automatically monitors the L3 agent
+lifecycle.
+If an agent is marked as dead, all routers associated with it
+will be safely moved by the Neutron server to a live agent on another node, and
+auxiliary resources created by the dead agent, such as additional interfaces
+and iptables rules, will be removed.
+In some cases auxiliary resources are kept on the dead node and can
+affect connectivity to instances, for example when the L3 agent is alive
+but has lost its connection to the message broker. To avoid such problems, an
+additional monitoring and cleanup mechanism should be added. It must be easily
+usable by Pacemaker. The current rescheduling script must be modified to match
+the proposed changes.
+
+This feature provides a permanent and stable connection to instances
+even if one or more L3 agents fail. It also allows routers to be
+distributed effectively between all available agents to improve
+network performance.
+
+For the DHCP agent, multiple-agent mode is implemented as an experimental
+feature and is disabled by default.
+
+Alternatives
+------------
+
+The Juno release introduces the DVR L3 agent, which looks like an alternative
+router solution. However, it serves only VMs with floating IP addresses (FIPs)
+and doesn't change the behavior for VMs without a FIP.
+It also doesn't change the behavior of DHCP agents.
+
+There's another solution based on VRRP.
+The problem is that it doesn't cover the situation where both VRRP nodes
+are down. It also needs an external rescheduling mechanism.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Upgrade impact
+--------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+* Delays caused by Neutron agents going down will decrease.
+* Network scalability will grow.
+* Load on each individual controller will decrease.
+* Customers will be able to add any number of nodes running Neutron agents,
+  which increases network scalability.
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+* In astute.yaml we have the following options:
+
+  * quantum_settings/L3/multiple_dhcp_agents (default=false)
+  * quantum_settings/L3/dhcp_agents_per_network (default=3)
+  * quantum_settings/L3/multiple_l3_agents (default=true)
+
+* OCF scripts for L3 and DHCP agents got a "multiple_agents" option that allows
+  agents to run in non-singleton mode.
+* cluster::neutron::l3 and cluster::neutron::dhcp classes got a new option,
+  "multiple_agents", that configures agents to run in multiple-agent mode.
+* cluster::neutron::dhcp got an "agents_per_net" option (default: 3) that
+  sets the number of DHCP agents serving each network. This default was
+  chosen for performance reasons.
+
+Backward compatibility
+----------------------
+
+The "multiple_agents" option of the OCF scripts controls the behavior
+of the L3 and DHCP agents. In addition, to get the old-style behavior of the
+L3/DHCP agents, the clone size of the corresponding Pacemaker
+resources should be reduced to "1".
+
+Work Items
+----------
+
+- Update Puppet manifests to enable multiple L3 agents
+- Add necessary patches to Neutron for additional agent monitoring
+- Edit the rescheduling script and Pacemaker OCF script
+  to support multiple-agent behavior
+
+Assignee(s)
+-----------
+
+Sergey Vasilenko
+Eugene Nikanorov
+Oleg Bondarev
+Sergey Kolekonov
+
+Dependencies
+============
+
+None
+
+Documentation Impact
+====================
+
+The new Neutron server behavior when L3 agents are dead should be described in
+the documentation so that possible problems can be debugged correctly.
+
+
+References
+==========
+
+None
+
+Testing
+=======
+
+- Deploy an HA cluster
+- All instances must remain available via floating IPs and have Internet
+  access even if a whole controller fails or in particular cases such as
+  message broker failures