..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Agent child processes status
============================

https://blueprints.launchpad.net/neutron/+spec/agent-child-processes-status

Neutron agents spawn external detached processes which run unmonitored; if
anything happens to those processes, neutron won't take any action, failing
to provide those services reliably.

We propose monitoring those processes and taking a configurable action,
making neutron more resilient to external failures.

Problem Description
===================
When a ns-metadata-proxy dies inside an l3-agent [#liveness_bug]_, the
subnets served by that ns-metadata-proxy have no metadata until something
changes on the router, which triggers a recheck of the metadata proxy
liveness.

The same thing happens with the dhcp-agent [#dhcp_agent_bug]_, and also in
the lbaas and vpnaas agents.

This is a long-known bug, generally triggered by bugs in dnsmasq or the
ns-metadata-proxy, and especially critical on big clouds and HA
environments.

Proposed Change
===============
We propose to monitor the spawned processes using the
neutron.agent.linux.external_process.ProcessMonitor class, which relies
on ProcessManager to check liveness periodically.

If a process that should be active is not, the event will be logged, and
we could take any of the following admin-configured actions, in the order
specified in the configuration (a rough sketch of this flow is shown
below):

* Respawn the process: the failed external process will be respawned.

* Exit the agent: for use when an HA service manager is taking care of the
  agent and will respawn it, possibly on a different host. During the exit
  action, all other external processes are left running, as with any other
  agent stop, so there is no downtime for the unaffected tenant networks
  until the HA solution fails the agent over. In case of failover,
  responsibility for cleanup (processes and ports) lies with
  neutron-netns-cleanup and neutron-ovs-cleanup.

As a future follow-up, we plan to implement a notify action for the process
manager once the corresponding piece lands in oslo [#oslo_service_status]_.
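
The following is a rough, hedged sketch of the intended check/act cycle.
It is not the actual ProcessMonitor implementation; the class, the helper
names and the direct /proc check are illustrative assumptions only::

    import os
    import time


    class MonitoredProcess(object):
        """Hypothetical stand-in for a ProcessManager entry.

        It knows the pid file of the external process it tracks and how
        to respawn it.
        """

        def __init__(self, uuid, pid_file, respawn_cmd):
            self.uuid = uuid
            self.pid_file = pid_file
            self.respawn_cmd = respawn_cmd

        @property
        def active(self):
            # Liveness check: the pid recorded in the pid file must still
            # be present in /proc (this is the read benchmarked below).
            try:
                with open(self.pid_file) as f:
                    pid = int(f.read().strip())
                return os.path.exists('/proc/%d/cmdline' % pid)
            except (IOError, ValueError):
                return False

        def respawn(self):
            os.system(self.respawn_cmd)  # placeholder for the real respawn


    def check_child_processes(monitored, action, period):
        """Periodically verify the external children, log and act."""
        while period > 0:
            for process in monitored:
                if not process.active:
                    # Logging happens implicitly for every action.
                    print('child process for resource %s is not active'
                          % process.uuid)
                    if action == 'respawn':
                        process.respawn()
                    elif action == 'exit':
                        # Leave the remaining children running and let the
                        # HA service manager fail the agent over.
                        raise SystemExit(1)
            time.sleep(period)
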
Examples of configurations could be:

* Disabled, external processes are not polled for liveness::

      check_child_processes_period = 0

* Log (implicit) and respawn::

      check_child_processes_action = respawn
      check_child_processes_period = 60

* Log (implicit) and notify::

      check_child_processes_action = notify
      check_child_processes_period = 60

* Log (implicit) and exit::

      check_child_processes_action = exit
      check_child_processes_period = 60

This check will be disabled by default (check_child_processes_period = 0),
and the default action will be 'respawn'.
Data Model Impact
-----------------
None
REST API Impact
---------------
None
Security Impact
---------------
None
Notifications Impact
--------------------
None
Other End User Impact
---------------------
None
Performance Impact
------------------
Some extra periodic load will be added by checking the underlying children.
Blocking of other green threads is minimized by running the checks in a
green thread pool, and a semaphore is introduced to prevent several check
cycles from running concurrently.
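
As a hedged sketch of this concurrency arrangement (using eventlet, which
the agents already depend on; the function and variable names here are
illustrative, not the actual implementation)::

    from eventlet import greenpool
    from eventlet import semaphore

    _check_sem = semaphore.Semaphore()
    _pool = greenpool.GreenPool(size=50)


    def check_child_processes(monitored_processes):
        # Skip this cycle entirely if the previous one is still running,
        # so check cycles never pile up on each other.
        if not _check_sem.acquire(blocking=False):
            return
        try:
            for process in monitored_processes:
                # Each liveness check runs in its own green thread so a
                # slow check doesn't block the rest of the agent.
                _pool.spawn_n(_check_one, process)
            _pool.waitall()
        finally:
            _check_sem.release()


    def _check_one(process):
        if not process.active:
            # log and apply the configured action (respawn/exit/...)
            pass
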
As there were concerns about the cost of polling /proc/$pid/cmdline, I
implemented a simplistic benchmark::

    i = 10000000
    while i > 0:
        f = open('/proc/8125/cmdline', 'r')
        f.readlines()
        i = i - 1

Please note that the cmdline file is served by kernel functions
[#kernel_cmdline]_ directly from memory and does not involve any I/O on a
block device; that means there is no block cache speeding up the read of
this file that would invalidate this benchmark.
::

    root@ns316109:~# time python test.py

    real    0m59.836s
    user    0m23.681s
    sys     0m35.679s

That means ~170,000 reads/s using one core at 100% CPU on a 7400 bogomips
machine. If we had to check 1000 child processes we would need
1000/170000 = 0.0059 seconds, plus the overhead of the intermediate method
calls and the spawning of greenthreads.
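
For reference, the same arithmetic as a small snippet (the numbers are
taken from the benchmark above)::

    reads = 10000000
    wall_seconds = 59.836
    reads_per_second = reads / wall_seconds            # ~167,000 reads/s

    children = 1000
    seconds_per_cycle = children / reads_per_second    # ~0.006 s per cycle
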
I believe ~6 ms of CPU time to check 1000 children is acceptable. The check
interval is tunable, and the check is disabled by default, letting deployers
balance the performance impact against the failure detection latency.
Polling isn't ideal, but neither are the alternatives, and we need a
solution for this problem, especially for HA environments.
IPv6 Impact
-----------
No effect on IPv6 expected here.
Other Deployer Impact
---------------------
People implementing their own external monitoring of these subprocesses may
need to migrate to the new solution, taking advantage of the exit action,
or of the later notify action when that becomes available.
Developer Impact
----------------
Developers who spawn external processes may start using ProcessMonitor
instead of using ProcessManager directly.
Community Impact
----------------
This change has been discussed several times on the mailing list and IRC,
and it was previously accepted for Juno but didn't make the deadline. It's
something desired by the community, as it makes neutron agents more
resilient to external failures.
Alternatives
------------
* Use popen to start services in the foreground and wait on SIGCHLD
  instead of polling. It wouldn't be possible to reattach after we exit
  or restart an agent, because the parent detaches from the child and
  there is no way to reattach when the agent restarts (without using
  ptrace, which sounds too hackish). This is a POSIX limitation (see the
  short demonstration after this list).

  In our design, when an agent exits, all the underlying children stay
  alive, detached from the parent, and continue to run to make sure there
  is no service disruption during upgrades. When the agent starts again,
  it checks /var/neutron/{$resource}/ for the pid of the child that serves
  each resource and its configuration, and makes sure that it's running
  (or restarts it otherwise). This is the point where we can't re-attach,
  or wait [#waitpid]_ for a specific non-child PID [#waitpid_non_child]_.

* Changing the restart mechanism of agents to an execve from inside the
  agent itself (via signal capture). The execve system call retains the
  original PID and the parent/children relationship, so we could wait on
  the children PIDs. But this prevents the stop/start capability of
  agents, which can be handy during maintenance and development. If we
  decide to change this in the future, the ProcessMonitor implementation
  could easily be modified to a non-polling wait on pids without changing
  any of its API.

* Use an intermediate daemon to start long running processes and monitor
  them via SIGCHLD, as a workaround for the problems of the first
  alternative. This is very similar to the soon-to-be-available
  functionality in the oslo rootwrap daemon, but the rootwrap daemon won't
  support long running processes yet. The problem with this alternative is
  the case where the intermediate process manager itself dies or gets
  killed: then we lose control over the spawned children (which we would
  be monitoring via SIGCHLD).

* Instead of periodically checking all children at once, spread the load
  over several batches in time. That would be a more complicated
  implementation, which could be addressed in a second round, or as a last
  work item, if the initial implementation doesn't perform as expected for
  a high amount of resources (routers, dhcp services, lbaas..).

* Initially, the notification part was planned to be implemented within
  neutron itself, but the design has been modularized in oslo with drivers
  for different types (systemd, init.d, upstart..).
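
To make the POSIX limitation mentioned in the first alternative concrete,
here is a minimal illustrative snippet (not part of the proposal) showing
that waitpid refuses PIDs which are not children of the caller::

    import errno
    import os

    # PID 1 (init/systemd) is clearly not a child of this process; the
    # same error occurs for a child detached by a previous agent run.
    try:
        os.waitpid(1, 0)
    except OSError as e:
        assert e.errno == errno.ECHILD
        print('cannot wait on a non-child pid; polling is needed instead')
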
Implementation
==============
Assignee(s)
-----------
* https://launchpad.net/~mangelajo
* https://launchpad.net/~brian-haley
Adding brian-haley as I'm taking a few of his ideas and partly reusing his
work on [#check_metadata]_.
Work Items
----------
* ProcessMonitor and functional testing: done
* Implement in dhcp-agent, refactoring the code duplication with
  neutron.agent.linux.external_process [#dhcp_impl]_
* Implement in l3-agent [#l3_impl]_
* Implement in lbaas-agent
* Implement in vpnaas-agent
Note: a notify action was planned, but it depends on a new oslo feature;
this action can be added later via the bug process once the oslo feature
is accepted and implemented.
Dependencies
============
The notify action depends on the implementation of [#oslo_service_status]_,
but all the other features/actions can be accomplished without it.
Testing
=======
Tempest Tests
-------------
Tempest tests are not capable of arbitrary execution of commands on the
network nodes (killing processes, for example), so we can't use tempest to
check this without implementing some sort of fault injection in tempest.
Functional Tests
----------------
Functional testing is used to verify the ProcessMonitor class, which is in
charge of the core functionality of this spec.
API Tests
---------
None
Documentation Impact
====================
User Documentation
------------------
The new configuration options will have to be documented per agent.
These are the proposed defaults::

    check_child_processes_action = respawn
    check_child_processes_period = 0

Developer Documentation
-----------------------
None
References
==========
.. [#dhcp_impl] DHCP agent implementation:
   https://review.openstack.org/#/c/115935/
.. [#l3_impl] L3 agent implementation:
   https://review.openstack.org/#/c/114931/
.. [#dhcp_agent_bug] Dhcp agent dying children bug:
   https://bugs.launchpad.net/neutron/+bug/1257524
.. [#liveness_bug] L3 agent dying children bug:
   https://bugs.launchpad.net/neutron/+bug/1257775
.. [#check_metadata] Brian Haley's implementation for the l3 agent:
   https://review.openstack.org/#/c/59997/
.. [#oslo_service_status] Oslo service manager status notification spec:
   http://docs-draft.openstack.org/48/97748/3/check/gate-oslo-specs-docs/ef96358/doc/build/html/specs/juno/service-status-interface.html
.. [#oslo_sn_review] Oslo spec review:
   https://review.openstack.org/#/c/97748/
.. [#old_agent_service_status] Old agent service status blueprint:
   https://blueprints.launchpad.net/neutron/+spec/agent-service-status
.. [#waitpid] http://linux.die.net/man/2/waitpid
.. [#waitpid_non_child] http://stackoverflow.com/questions/1058047/wait-for-any-process-to-finish
.. [#kernel_cmdline] https://github.com/torvalds/linux/blob/master/fs/proc/cmdline.c#L8