fuel-specs/specs/6.0/neutron-multiple-l3-agents.rst

======================================
Multiple L3 and DHCP agents in Neutron
======================================

https://blueprints.launchpad.net/fuel/+spec/fuel-multiple-l3-agents

In FUEL 5.1 and before HA network solution was based on one neutron-l3-agent
and one DHCP agent, which were switchable between controllers.

This blueprint describes a way of using multiple L3 and DHCP agents instead of
single. It is required for network scalability and neutron performance
improvements.

Problem description
===================

When we created virtual router in Neutron, it was scheduled to the L3-agent
(to one of alive if we had multiple agents using random selection).
Before Juno Neutron server didn't monitor life cycle of agent serving
this router. If the L3-agent service stopped on a node containing this agent or
connectivity was lost, Neutron server didn't reschedule this router to
another L3-agent. So there was no HA network solution.

Proposed change
===============

In Juno multiple solutions for this problem were introduced.
The easiest solution is to use the internal Neutron routers rescheduling
mechanism. In that case Neutron server automatically monitors L3 agents
lifecycle. If agent is marked as dead, all routers associated to the dead agent
will be safely moved by Neutron server to an alive agent on another node and
auxiliary resources created by the dead agent, such as additional interfaces
and iptables rules, will be removed.
There are some cases when auxiliary resources will be kept on the dead node and
potentially affect connection to instances. For example, when L3 agent is alive
but lost connection to a message broker. To avoid such problems additional
monitoring and clean up mechanism should be added. It must be easily usable
by Pacemaker. Current rescheduling script must be modified to match the
proposed changes.

This feature allows to have permanent and stable connection to instances
even in case of failure of one or more L3 agents. Also it allows to
effectively distribute routers between all available agents to improve
network performance.

For DHCP agent multiple-agent mode implemented as experimental feature
and disabled by default.

Alternatives
------------

In the Juno release DVR L3 agent is introduced. It looks like alternative
router solution. This solution serves only VMs with floating IP addresses and
doesn't change behavior for VMs without FIP.
Also this solution doesn't change behavior of DHCP agents.

There's another solution based on VRRP.
The problem is that this solution doesn't cover situation where both vrrp nodes
are down. This solution also needs external re-scheduling mechanism.

Data model impact
-----------------

None

REST API impact
---------------

None

Upgrade impact
--------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

* Time delays when neutron agents go down will decrease.
* Network scalability will grow.
* Load on a separate controller will be decreased.
* Customers will get a possibility to add any number of nodes with started
  neutron agents and network scalability will grow.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

* In astute.yaml we have following options:

  * quantum_settings/L3/multiple_dhcp_agents (default=false)
  * quantum_settings/L3/dhcp_agents_per_network (default=3)
  * quantum_settings/L3/multiple_l3_agents (default=true)

* OCF scripts for L3 and DHCP agents got "multiple_agents" option, that allows
  run agents in non-singletone mode
* cluster::neutron::l3 and cluster::neutron::dhcp classes got "multiple_agents"
  option, that allows  configure agents for running in multiple-agent mode
* cluster::neutron::dhcp got "agents_per_net" option (by default = 3), that
  describe amount of dhcp-agents for serve each network. This default
  justifyed by performance reasons.

Backward compatibility
----------------------

Using "multiple_agents" option for OCF scripts we can manipulate behavior
of L3 and DHCP agents. Moreover, for using old-style behavior of L3/DHCP
agents we should decrease clone size for corresponded Pacemaker
resources to "1".

Work Items
-------------

- Update Puppet manifests to enable multiple L3 agents
- Add necessary patches to Neutron for additional agents monitoring
- Edit the rescheduling script and Pacemaker OCF script
  to support multiple agents behavior

Assignee(s)
-----------

Sergey Vasilenko
Eugene Nikanorov
Oleg Bondarev
Sergey Kolekonov

Dependencies
============

None

Documentation Impact
====================

New Neutron-server behavior in case of dead L3 agents should be reflected in
documentation to correctly debug possible problems.


References
==========

None

Testing
=======

- Deploy HA cluster
- All instances must be constantly available via floating ips and have Internet
  access even in case of whole controller failure or particular cases such as
  message broker failures