Update performance notes doc: +cython, -numbers
Remove outdated, specific numbers. Having numbers for contrived workloads sometimes confuses people or leads to impressions of false claims. Unified benchmarks at DataStax will replace these.
@@ -2,290 +2,47 @@ Performance Notes
=================

The Python driver for Cassandra offers several methods for executing queries.
You can synchronously block for queries to complete using
:meth:`.Session.execute()`, you can obtain asynchronous request futures through
:meth:`.Session.execute_async()`, and you can attach a callback to the future
with :meth:`.ResponseFuture.add_callback()`. Each of these methods has different
performance characteristics and behaves differently when multiple threads are used.
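
As a minimal sketch of the three styles (assuming a node listening on 127.0.0.1;
the query is only illustrative):

.. code-block:: python

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    query = "SELECT release_version FROM system.local"

    # Synchronous: blocks until the result is available.
    rows = session.execute(query)

    # Asynchronous: returns a ResponseFuture immediately.
    future = session.execute_async(query)
    rows = future.result()  # block only when the result is actually needed

    # Callback: runs on the driver's event loop thread when the response arrives.
    def handle_result(rows):
        print(rows[0].release_version)

    future = session.execute_async(query)
    future.add_callback(handle_result)
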
Benchmark Notes
---------------

All benchmarks were executed using the
`benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_
in the driver repository. They were executed on a laptop with 16 GiB of RAM, an SSD,
and a 2 GHz, four-core CPU with hyper-threading. The Cassandra cluster was a three-node
`ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop
with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your
own cluster when tuning the driver for optimal throughput or latency. Examples of
multiple request patterns can be found in the benchmark scripts included in the
driver project.

The 1.0.0 version of the driver was used with all default settings. For these
benchmarks, the driver was configured to use the ``libev`` reactor. You can also run
the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`)
by using the ``--asyncore-only`` command line option.

The choice of execution pattern will depend on the application context. For applications
dealing with multiple requests in a given context, the recommended pattern is to use
concurrent asynchronous requests with callbacks. For many use cases, you don't need to
implement this pattern yourself: :meth:`cassandra.concurrent.execute_concurrent` and
:meth:`cassandra.concurrent.execute_concurrent_with_args` provide this pattern with a
synchronous API and tunable concurrency.

Each benchmark completes 100,000 small inserts. The replication factor for the
keyspace was three, so all nodes were replicas for the inserted rows.

The benchmarks require the Python driver C extensions as well as a few additional
Python packages. Follow these steps to install the prerequisites:

1. Install packages to support the Python driver C extensions:

   * Debian/Ubuntu: ``sudo apt-get install gcc python-dev libev4 libev-dev``
   * RHEL/CentOS/Fedora: ``sudo yum install gcc python-devel libev libev-devel``

2. Install Python packages: ``pip install scales twisted blist``
3. Re-install the Cassandra driver: ``pip install --upgrade cassandra-driver``

Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_)
--------------------------------------------------------------------------------------------------------------

Although this is the simplest way to make queries, it has low throughput
in single-threaded environments. This is basically what the benchmark
is doing:

.. code-block:: python

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
    session = cluster.connect()

    for i in range(100000):
        session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES ('a', 'b', 'c')")

.. code-block:: bash

    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 434.08/sec

This technique does scale reasonably well as we add more threads:

.. code-block:: bash

    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 830.49/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 1078.27/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 1275.20/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
    Average throughput: 1345.56/sec

In my environment, throughput is maximized at about 20 threads.

Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_)
----------------------------------------------------------------------------------------------------------------------------

This is a simple way to work with futures for higher throughput. Essentially,
we start 120 queries asynchronously at the same time and then wait for them
all to complete. We then repeat this process until all 100,000 operations
have completed:

.. code-block:: python

    import Queue

    # session and query are assumed to be set up as in the synchronous example
    futures = Queue.Queue(maxsize=121)
    for i in range(100000):
        if i % 120 == 0:
            # clear the existing queue
            while True:
                try:
                    futures.get_nowait().result()
                except Queue.Empty:
                    break

        future = session.execute_async(query)
        futures.put_nowait(future)

As expected, this improves throughput in a single-threaded environment:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3477.56/sec

However, adding more threads may actually harm throughput:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2360.52/sec
    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2293.21/sec
    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2244.85/sec

Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_)
---------------------------------------------------------------------------------------------------------------------------------------

This pattern is similar to batched futures. The main difference is that
every time we put a future on the queue, we pull the oldest future out
and wait for it to complete:

.. code-block:: python

    import Queue

    futures = Queue.Queue(maxsize=121)
    for i in range(100000):
        if i >= 120:
            old_future = futures.get_nowait()
            old_future.result()

        future = session.execute_async(query)
        futures.put_nowait(future)

This gets slightly better throughput than the Batched Futures pattern:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3635.76/sec

But it has the same throughput issues when multiple threads are used:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2213.62/sec
    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2707.62/sec
    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2462.42/sec

Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_)
--------------------------------------------------------------------------------------------------------------------------------------------

What happens if we don't throttle our async requests at all?

.. code-block:: python

    futures = []
    for i in range(100000):
        future = session.execute_async(query)
        futures.append(future)

    for future in futures:
        future.result()

Throughput is about the same as the previous pattern, but a lot of memory will
be consumed by the list of futures:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3474.11/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2389.61/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2371.75/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2165.29/sec

Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_)
------------------------------------------------------------------------------------------------------------------------------------------------

This pattern is very different from the previous patterns. Here we're taking
advantage of the :meth:`.ResponseFuture.add_callback()` function to start
another request as soon as one finishes. Furthermore, we're starting 120
of these callback chains, so we've always got about 120 operations in
flight at any time:

.. code-block:: python

    from itertools import count
    from threading import Event

    # session, query, and log are assumed to be defined elsewhere
    sentinel = object()
    num_queries = 100000
    num_started = count()
    num_finished = count()
    finished_event = Event()

    def insert_next(previous_result=sentinel):
        if previous_result is not sentinel:
            if isinstance(previous_result, BaseException):
                log.error("Error on insert: %r", previous_result)
            if num_finished.next() >= num_queries:
                finished_event.set()

        if num_started.next() <= num_queries:
            future = session.execute_async(query)
            # NOTE: this callback also handles errors
            future.add_callbacks(insert_next, insert_next)

    for i in range(min(120, num_queries)):
        insert_next()

    finished_event.wait()

This is a more complex pattern, but the throughput is excellent:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 7647.30/sec

Part of the reason why performance is so good is that everything is running on
a single thread: the internal event loop thread that powers the driver. The
downside to this is that adding more threads doesn't improve anything:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 7704.58/sec

What happens if we have more than 120 callback chains running?

With 250 chains:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 7794.22/sec

Things look pretty good with 250 chains. If we try 500 chains, we start to max out
all of the connections in the connection pools. The problem is that the current
version of the driver isn't very good at throttling these callback chains, so
a lot of time gets spent waiting for new connections and performance drops
dramatically:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 679.61/sec

When :attr:`.Cluster.protocol_version` is set to 1 or 2, you should limit the
number of callback chains you run to roughly 100 per node in the cluster.
When :attr:`~.Cluster.protocol_version` is 3 or higher, you can safely experiment
with higher numbers of callback chains.

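If you need more concurrency with protocol version 1 or 2, one option is to open
larger connection pools so chains spend less time waiting for new connections. A
minimal sketch, assuming three local nodes; the pool sizes are illustrative, not
recommendations:

.. code-block:: python

    from cassandra.cluster import Cluster
    from cassandra.policies import HostDistance

    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'], protocol_version=2)
    # Allow more connections per host so that many concurrent callback chains
    # are not stalled waiting for new connections to be opened.
    cluster.set_core_connections_per_host(HostDistance.LOCAL, 8)
    cluster.set_max_connections_per_host(HostDistance.LOCAL, 16)
    session = cluster.connect()
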
For many use cases, you don't need to implement this pattern yourself. You can
simply use :meth:`cassandra.concurrent.execute_concurrent` and
:meth:`cassandra.concurrent.execute_concurrent_with_args`, which implement
this pattern for you with a synchronous API.

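For example, a rough sketch using :meth:`~cassandra.concurrent.execute_concurrent_with_args`
with the same illustrative table used in the earlier examples:

.. code-block:: python

    from cassandra.concurrent import execute_concurrent_with_args

    insert = session.prepare(
        "INSERT INTO mykeyspace.mytable (key, b, c) VALUES (?, ?, ?)")
    parameters = [(str(i), 'b', 'c') for i in range(100000)]

    # Keeps roughly `concurrency` requests in flight at a time and returns a
    # list of (success, result_or_exception) pairs, one per statement.
    results = execute_concurrent_with_args(
        session, insert, parameters, concurrency=100)
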
Due to the GIL and limited concurrency, the driver can become CPU-bound pretty quickly.
The sections below discuss further runtime and design considerations for mitigating
this limitation.

PyPy
----

Almost all of these patterns become CPU-bound pretty quickly with CPython, the
standard implementation of Python. `PyPy <http://pypy.org>`_ is an alternative
Python runtime which uses a JIT compiler to reduce CPU consumption. This leads to
a huge improvement in driver performance, more than doubling throughput for many
workloads:

.. code-block:: bash

    ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
    Average throughput: 18782.00/sec

The C extensions described in the next section reduce CPU consumption under CPython
and help narrow the gap between CPython and PyPy.

Cython Extensions
-----------------

`Cython <http://cython.org/>`_ is an optimizing compiler and language that can be used
to compile the core files and optional extensions for the driver. Cython is not a strict
dependency, but the extensions will be built by default if Cython is present in the
Python path. To include Cython as a requirement, install the driver with the extra name
``cython``:

.. code-block:: bash

    $ pip install cassandra-driver[cython]

multiprocessing
---------------

All of the patterns discussed above may be used over multiple processes using the
`multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_
module. Multiple processes will scale significantly better than multiple threads,
so if high throughput is your goal, consider this option.

Be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
**or** :class:`~.ResponseFuture` **objects across multiple processes**. These
objects should all be created after forking the process, not before.

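A minimal sketch of one way to do this with a process pool (the host list and table
are the same illustrative ones used above):

.. code-block:: python

    from multiprocessing import Pool

    from cassandra.cluster import Cluster

    session = None

    def init():
        # Runs once in each worker process, after the fork, so every worker
        # builds its own Cluster and Session instead of sharing the parent's.
        global session
        cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
        session = cluster.connect()

    def insert(key):
        session.execute(
            "INSERT INTO mykeyspace.mytable (key, b, c) VALUES (%s, 'b', 'c')",
            (key,))

    if __name__ == '__main__':
        pool = Pool(processes=4, initializer=init)
        pool.map(insert, [str(i) for i in range(100000)], chunksize=1000)
        pool.close()
        pool.join()
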
For further discussion and simple examples using the driver with ``multiprocessing``,
see `this blog post <http://www.datastax.com/dev/blog/datastax-python-driver-multiprocessing-example-for-improved-bulk-data-throughput>`_.