Add a document that describes profiling eventlet services

Profiling services that use eventlet is a bit different from standard
situations, so here is a document that explains the basics of how to
get started doing it.

Change-Id: If8c34653285f07c5cc1abccabfec16f18daafdde
Chris Dent 2019-08-06 17:29:30 +01:00
parent 23e7930b35
commit 886142b17a
3 changed files with 246 additions and 0 deletions


@ -80,6 +80,10 @@ be Python code. All new code needs to be validated somehow.
* :doc:`/contributor/testing/down-cell`
* **Profiling Guides**: These are guides to profiling nova.
* :doc:`/contributor/testing/eventlet-profiling`
The Nova API
============


@ -0,0 +1,241 @@
=======================
Profiling With Eventlet
=======================
When performance of one of the Nova services is worse than expected, and other
sorts of analysis do not lead to candidate fixes, profiling is an excellent
tool for producing a detailed analysis of which methods in the code are called
most often and which consume the most time.
Because most Nova services use eventlet_, the standard profiling tool provided
with Python, cProfile_, will not work. Something is required to keep track of
changing tasks. Thankfully eventlet comes with
``eventlet.green.profile.Profile``, a mostly undocumented class that provides a
similar (but not identical) API to the one provided by Python's ``Profile``
while outputting the same format.
.. note:: The eventlet Profile outputs the ``prof`` format produced by
``profile``, which is not the same as that output by ``cProfile``.
Some analysis tools (for example, SnakeViz_) only read the latter
so the options for analyzing eventlet profiling are not always
deluxe (see below).
Setup
=====
This guide assumes the Nova service being profiled is running devstack, but
that is not necessary. What is necessary is that the code associated with the
service can be changed and the service restarted, in place.
Profiling the entire service will produce mostly noise and the output will be
confusing because different tasks will operate during the profile run. It is
better to begin the process with a candidate task or method *within* the
service that can be associated with an identifier. For example,
``select_destinations`` in the ``FilterScheduler`` can be associated with the
list of ``instance_uuids`` passed to it and it runs only once for that set of
instance uuids.
The process for profiling is:
#. Identify the method to be profiled.
#. Populate the environment with sufficient resources to exercise the code. For
example, you may wish to use the FakeVirtDriver_ to make nova aware of
multiple ``nova-compute`` processes. Or you may wish to launch many
instances if you are evaluating a method that loops over instances.
#. At the start of that method, change the code to instantiate a ``Profile``
object and ``start()`` it.
#. At the end of that method, change the code to ``stop()`` profiling and write
the data (with ``dump_stats()``) to a reasonable location (a minimal sketch
of these two changes follows the note below).
#. Restart the service.
#. Cause the method being evaluated to run.
#. Analyze the profile data with the pstats_ module.
.. note:: ``stop()`` and ``start()`` are two of the ways in which the eventlet
``Profile`` API differs from the stdlib. There the methods are
``enable()`` and ``disable()``.
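Putting those two code changes together, a minimal sketch (not actual Nova
code; the method name, helper and output path are purely illustrative) might
look like this:
.. code-block:: python

   from eventlet.green import profile

   def some_expensive_method(self, identifier):
       # Illustrative only: wrap the body of the method being investigated.
       pr = profile.Profile()
       pr.start()
       try:
           result = self._do_the_real_work(identifier)
       finally:
           pr.stop()
           # Write one profile per invocation, named after an identifier for
           # the task so that individual runs can be told apart.
           pr.dump_stats('/tmp/profiles/%s.prof' % identifier)
       return result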
Example
=======
For this example we will analyze ``select_destinations`` in the
``FilterScheduler``. A known problem is that it does excessive work when
presented with too many candidate results from the Placement service. We'd like
to know why.
We'll configure and run devstack_ with FakeVirtDriver_ so there are several
candidate hypervisors (the following ``local.conf`` is also useful for other
profiling and benchmarking scenarios so not all changes are relevant here):
.. code-block:: ini
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
USE_PYTHON3=True
VIRT_DRIVER=fake
# You may use different numbers of fake computes, but be careful: 100 will
# completely overwhelm a 16GB, 16VCPU server. In the test profiles below a
# value of 50 was used, on a 16GB, 16VCPU server.
NUMBER_FAKE_NOVA_COMPUTE=25
disable_service cinder
disable_service horizon
disable_service dstat
disable_service tempest
[[post-config|$NOVA_CONF]]
rpc_response_timeout = 300
# Disable filtering entirely. For some profiling this will not be what you
# want.
[filter_scheduler]
enabled_filters = '""'
# Send only one type of notifications to avoid notification overhead.
[notifications]
notification_format = unversioned
Change the code in ``nova/scheduler/filter_scheduler.py`` as follows:
.. code-block:: diff
diff --git a/nova/scheduler/filter_scheduler.py b/nova/scheduler/filter_scheduler.py
index 672f23077e..cb0f87fe48 100644
--- a/nova/scheduler/filter_scheduler.py
+++ b/nova/scheduler/filter_scheduler.py
@@ -49,92 +49,99 @@ class FilterScheduler(driver.Scheduler):
def select_destinations(self, context, spec_obj, instance_uuids,
alloc_reqs_by_rp_uuid, provider_summaries,
allocation_request_version=None, return_alternates=False):
"""Returns a list of lists of Selection objects, which represent the
hosts and (optionally) alternates for each instance.
:param context: The RequestContext object
:param spec_obj: The RequestSpec object
:param instance_uuids: List of UUIDs, one for each value of the spec
object's num_instances attribute
:param alloc_reqs_by_rp_uuid: Optional dict, keyed by resource provider
UUID, of the allocation_requests that may
be used to claim resources against
matched hosts. If None, indicates either
the placement API wasn't reachable or
that there were no allocation_requests
returned by the placement API. If the
latter, the provider_summaries will be an
empty dict, not None.
:param provider_summaries: Optional dict, keyed by resource provider
UUID, of information that will be used by
the filters/weighers in selecting matching
hosts for a request. If None, indicates that
the scheduler driver should grab all compute
node information locally and that the
Placement API is not used. If an empty dict,
indicates the Placement API returned no
potential matches for the requested
resources.
:param allocation_request_version: The microversion used to request the
allocations.
:param return_alternates: When True, zero or more alternate hosts are
returned with each selected host. The number
of alternates is determined by the
configuration option
`CONF.scheduler.max_attempts`.
"""
+ from eventlet.green import profile
+ pr = profile.Profile()
+ pr.start()
+
self.notifier.info(
context, 'scheduler.select_destinations.start',
dict(request_spec=spec_obj.to_legacy_request_spec_dict()))
compute_utils.notify_about_scheduler_action(
context=context, request_spec=spec_obj,
action=fields_obj.NotificationAction.SELECT_DESTINATIONS,
phase=fields_obj.NotificationPhase.START)
host_selections = self._schedule(context, spec_obj, instance_uuids,
alloc_reqs_by_rp_uuid, provider_summaries,
allocation_request_version, return_alternates)
self.notifier.info(
context, 'scheduler.select_destinations.end',
dict(request_spec=spec_obj.to_legacy_request_spec_dict()))
compute_utils.notify_about_scheduler_action(
context=context, request_spec=spec_obj,
action=fields_obj.NotificationAction.SELECT_DESTINATIONS,
phase=fields_obj.NotificationPhase.END)
+ pr.stop()
+ pr.dump_stats('/tmp/select_destinations/%s.prof' % ':'.join(instance_uuids))
+
return host_selections
Make a ``/tmp/select_destinations`` directory that is writable by the user
nova-scheduler will run as. This is where the profile output will go.
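On a typical devstack installation, where the services run as the ``stack``
user, something like the following should suffice (adjust the user and group
to match your environment)::

   mkdir /tmp/select_destinations
   sudo chown stack:stack /tmp/select_destinations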
Restart the scheduler service. Note that ``systemctl restart`` may not kill
things sufficiently dead, so::
sudo systemctl stop devstack@n-sch
sleep 5
sudo systemctl start devstack@n-sch
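If in doubt about whether the service came back up cleanly, its journal can
confirm it (assuming the usual devstack unit naming shown above)::

   sudo journalctl -u devstack@n-sch -n 20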
Create a server (which will call ``select_destinations``)::
openstack server create --image cirros-0.4.0-x86_64-disk --flavor c1 x1
In ``/tmp/select_destinations`` there should be a file with a name using the
uuid of the created server with a ``.prof`` extension.
Change to that directory and view the profile using the pstats
`interactive mode`_::
python3 -m pstats ef044142-f3b8-409d-9af6-c60cea39b273.prof
Sort stats by their cumulative time::
ef044142-f3b8-409d-9af6-c60cea39b273.prof% sort cumtime
ef044142-f3b8-409d-9af6-c60cea39b273.prof% stats 10
Tue Aug 6 17:17:56 2019 ef044142-f3b8-409d-9af6-c60cea39b273.prof
603477 function calls (587772 primitive calls) in 2.294 seconds
Ordered by: cumulative time
List reduced from 2484 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.957 1.957 profile:0(start)
1 0.000 0.000 1.911 1.911 /mnt/share/opt/stack/nova/nova/scheduler/filter_scheduler.py:113(_schedule)
1 0.000 0.000 1.834 1.834 /mnt/share/opt/stack/nova/nova/scheduler/filter_scheduler.py:485(_get_all_host_states)
1 0.000 0.000 1.834 1.834 /mnt/share/opt/stack/nova/nova/scheduler/host_manager.py:757(get_host_states_by_uuids)
1 0.004 0.004 1.818 1.818 /mnt/share/opt/stack/nova/nova/scheduler/host_manager.py:777(_get_host_states)
104/103 0.001 0.000 1.409 0.014 /usr/local/lib/python3.6/dist-packages/oslo_versionedobjects/base.py:170(wrapper)
50 0.001 0.000 1.290 0.026 /mnt/share/opt/stack/nova/nova/scheduler/host_manager.py:836(_get_instance_info)
50 0.001 0.000 1.289 0.026 /mnt/share/opt/stack/nova/nova/scheduler/host_manager.py:820(_get_instances_by_host)
103 0.001 0.000 0.890 0.009 /usr/local/lib/python3.6/dist-packages/sqlalchemy/orm/query.py:3325(__iter__)
50 0.001 0.000 0.776 0.016 /mnt/share/opt/stack/nova/nova/objects/host_mapping.py:99(get_by_host)
From this we can make a couple of useful inferences about ``get_by_host``:
* It is called once for each of the 50 ``FakeVirtDriver`` hypervisors
configured for these tests.
* It (and the methods it calls internally) consumes about 40% (``0.776 /
1.957``) of the entire time spent running the ``select_destinations`` method
(indicated by ``profile:0(start)``, above); see the ``callees`` example
below for one way to dig into this further.
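To explore where ``get_by_host`` itself spends its time, the interactive
browser's ``callees`` (and ``callers``) commands accept an optional name
filter; for example (output will vary)::

   ef044142-f3b8-409d-9af6-c60cea39b273.prof% callees get_by_host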
Several other sort modes can be used. List those that are available by entering
``sort`` without arguments.
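The same analysis can also be done non-interactively with a few lines of
Python using the pstats_ module; this is a sketch, with the filename being the
example profile from above:
.. code-block:: python

   from pstats import Stats

   # Load the profile written by dump_stats() and print the ten most
   # expensive entries by cumulative time.
   stats = Stats('ef044142-f3b8-409d-9af6-c60cea39b273.prof')
   stats.sort_stats('cumtime').print_stats(10)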
.. _eventlet: https://eventlet.net/
.. _cProfile: https://docs.python.org/3/library/profile.html
.. _SnakeViz: https://jiffyclub.github.io/snakeviz/
.. _devstack: https://docs.openstack.org/devstack/latest/
.. _FakeVirtDriver: https://docs.openstack.org/devstack/latest/guides/nova.html#fake-virt-driver
.. _pstats: https://docs.python.org/3/library/profile.html#pstats.Stats
.. _interactive mode: https://www.stefaanlippens.net/python_profiling_with_pstats_interactive_mode/


@ -217,6 +217,7 @@ looking parts of our architecture. These are collected below.
contributor/testing/serial-console
contributor/testing/zero-downtime-upgrade
contributor/testing/down-cell
contributor/testing/eventlet-profiling
contributor/how-to-get-involved
contributor/process
contributor/project-scope