Performance Notes
=================

The Python driver for Cassandra offers several methods for executing queries.
You can synchronously block for queries to complete using
:meth:`.Session.execute()`, you can use a future-like interface through
:meth:`.Session.execute_async()`, or you can attach a callback to the future
with :meth:`.ResponseFuture.add_callback()`. Each of these methods has
different performance characteristics and behaves differently when
multiple threads are used.

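For reference, the three styles look roughly like this. This is a minimal
sketch rather than the benchmark code; the ``SELECT`` against ``system.local``
is just an arbitrary example query:

.. code-block:: python

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()
    query = "SELECT release_version FROM system.local"

    # synchronous: block until the result is available
    rows = session.execute(query)

    # future-based: start the query now, block on the result later
    future = session.execute_async(query)
    rows = future.result()

    # callback-based: the callbacks run on the driver's event loop thread
    def on_success(rows):
        pass  # process the rows

    def on_error(exc):
        pass  # log or otherwise handle the error

    future = session.execute_async(query)
    future.add_callbacks(on_success, on_error)
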
Benchmark Notes
---------------

All benchmarks were executed using the
`benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_
in the driver repository. They were executed on a laptop with 16 GiB of RAM, an SSD,
and a 2 GHz, four-core CPU with hyperthreading. The Cassandra cluster was a three-node
`ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop
with version 1.2.13 of Cassandra. I suggest running these benchmarks against your
own cluster when tuning the driver for optimal throughput or latency.

The 1.0.0 version of the driver was used with all default settings. For these
benchmarks, the driver was configured to use the ``libev`` reactor. You can also run
the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`)
by using the ``--asyncore-only`` command line option.

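To use the ``libev`` reactor in your own code rather than through the benchmark
scripts' command line flags, the connection class can be set on the cluster
before connecting. This is a minimal sketch; it assumes the optional libev
support was installed alongside the driver:

.. code-block:: python

    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection

    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
    # use the libev event loop instead of the default asyncore event loop
    cluster.connection_class = LibevConnection
    session = cluster.connect()
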
Each benchmark completes 100,000 small inserts. The replication factor for the
keyspace was three, so all nodes were replicas for the inserted rows.

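The exact schema doesn't matter much for these results. For context, a keyspace
and table compatible with the inserts below could be created with something like
the following sketch; the column types are assumptions, and only the names come
from the insert statement used in the benchmarks:

.. code-block:: python

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
    session = cluster.connect()

    # replication factor of three, so every node holds a copy of each row
    session.execute("""
        CREATE KEYSPACE mykeyspace
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    session.execute("""
        CREATE TABLE mykeyspace.mytable (
            key text PRIMARY KEY,
            b text,
            c text
        )
    """)
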
Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_)
--------------------------------------------------------------------------------------------------------------------

Although this is the simplest way to make queries, it has low throughput
in single-threaded environments. This is basically what the benchmark
is doing:

.. code-block:: python

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
    session = cluster.connect()

    for i in range(100000):
        session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES ('a', 'b', 'c')")

.. code-block:: bash

    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 434.08/sec

This technique does scale reasonably well as we add more threads:

.. code-block:: bash

    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 830.49/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 1078.27/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 1275.20/sec
    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
    Average throughput: 1345.56/sec

In my environment, throughput is maximized at about 20 threads.

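The ``--threads`` option does roughly the following: a single :class:`~.Session`
(which is safe to share across threads) is used from several threads, each
issuing its share of the inserts synchronously. A simplified sketch, reusing
the ``session`` from the example above:

.. code-block:: python

    from threading import Thread

    num_threads = 4
    num_queries = 100000
    query = "INSERT INTO mykeyspace.mytable (key, b, c) VALUES ('a', 'b', 'c')"

    def run(n):
        # each thread blocks on its own queries, but the threads overlap with each other
        for i in range(n):
            session.execute(query)

    threads = [Thread(target=run, args=(num_queries // num_threads,))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
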
Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_)
------------------------------------------------------------------------------------------------------------------------------------

This is a simple way to work with futures for higher throughput. Essentially,
we start 120 queries asynchronously at the same time and then wait for them
all to complete. We then repeat this process until all 100,000 operations
have completed:

.. code-block:: python

    import Queue

    futures = Queue.Queue(maxsize=121)
    for i in range(100000):
        if i % 120 == 0:
            # drain the queue, waiting for each in-flight query to complete
            while True:
                try:
                    futures.get_nowait().result()
                except Queue.Empty:
                    break

        future = session.execute_async(query)
        futures.put_nowait(future)

As expected, this improves throughput in a single-threaded environment:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3477.56/sec

However, adding more threads may actually harm throughput:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2360.52/sec
    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2293.21/sec
    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2244.85/sec

Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_)
---------------------------------------------------------------------------------------------------------------------------------------------

This pattern is similar to batched futures. The main difference is that
every time we put a future on the queue, we pull the oldest future out
and wait for it to complete:

.. code-block:: python

    import Queue

    futures = Queue.Queue(maxsize=121)
    for i in range(100000):
        if i >= 120:
            # 120 queries are already in flight, so wait for the oldest to complete
            old_future = futures.get_nowait()
            old_future.result()

        future = session.execute_async(query)
        futures.put_nowait(future)

This gets slightly better throughput than the Batched Futures pattern:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3635.76/sec

But this has the same throughput issues when multiple threads are used:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2213.62/sec
    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2707.62/sec
    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2462.42/sec

Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_)
--------------------------------------------------------------------------------------------------------------------------------------------------

What happens if we don't throttle our async requests at all?

.. code-block:: python

    # start all 100,000 queries up front and only then wait for the results
    futures = []
    for i in range(100000):
        future = session.execute_async(query)
        futures.append(future)

    for future in futures:
        future.result()

Throughput is about the same as the previous pattern, but a lot of memory will
be consumed by the list of futures:

.. code-block:: bash

    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 3474.11/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 2389.61/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
    Average throughput: 2371.75/sec
    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
    Average throughput: 2165.29/sec

Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_)
----------------------------------------------------------------------------------------------------------------------------------------------------

This pattern is very different from the previous patterns. Here we're taking
advantage of the :meth:`.ResponseFuture.add_callback()` function to start
another request as soon as one finishes. Furthermore, we're starting 120
of these callback chains, so we've always got about 120 operations in
flight at any time:

.. code-block:: python

    import logging
    from itertools import count
    from threading import Event

    log = logging.getLogger(__name__)

    sentinel = object()
    num_queries = 100000
    num_started = count()
    num_finished = count()
    finished_event = Event()

    def insert_next(previous_result=sentinel):
        if previous_result is not sentinel:
            if isinstance(previous_result, BaseException):
                log.error("Error on insert: %r", previous_result)
            if num_finished.next() >= num_queries:
                finished_event.set()

        if num_started.next() <= num_queries:
            future = session.execute_async(query)
            # NOTE: this callback also handles errors
            future.add_callbacks(insert_next, insert_next)

    # start 120 callback chains, then wait for all of the inserts to finish
    for i in range(min(120, num_queries)):
        insert_next()

    finished_event.wait()

This is a more complex pattern, but the throughput is excellent:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 7647.30/sec

Part of the reason why performance is so good is that everything runs on a
single thread: the internal event loop thread that powers the driver. The
downside is that adding more threads doesn't improve anything:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
    Average throughput: 7704.58/sec

What happens if we have more than 120 callback chains running?

With 250 chains:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 7794.22/sec

Things look pretty good with 250 chains. If we try 500 chains, we start to max out
all of the connections in the connection pools. The problem is that the current
version of the driver isn't very good at throttling these callback chains, so
a lot of time gets spent waiting for new connections and performance drops
dramatically:

.. code-block:: bash

    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
    Average throughput: 679.61/sec

Until this is improved, you should limit the number of callback chains you run.

PyPy
----

Almost all of these patterns become CPU-bound pretty quickly with CPython, the
standard implementation of Python. `PyPy <http://pypy.org>`_ is an alternative
implementation of Python (written in Python) which uses a JIT compiler to
reduce CPU consumption. This leads to a huge improvement in driver
performance:

.. code-block:: bash

    ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
    Average throughput: 18782.00/sec

Eventually the driver may add C extensions to reduce CPU consumption, which
would probably narrow the gap between the performance of CPython and PyPy.

multiprocessing
---------------

All of the patterns here may be used over multiple processes using the
`multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_
module. Multiple processes will scale significantly better than multiple
threads will, so if high throughput is your goal, consider this option.

Just be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
**or** :class:`~.ResponseFuture` **objects across multiple processes**. These
objects should all be created after forking the process, not before.

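As a sketch of what this looks like in practice (the contact points, keyspace,
and table are placeholders), each worker process builds its own
:class:`~.Cluster` and :class:`~.Session` in the pool initializer, after the
fork:

.. code-block:: python

    from multiprocessing import Pool

    from cassandra.cluster import Cluster

    session = None

    def init_worker():
        # runs once in each worker process, after forking
        global session
        cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
        session = cluster.connect()

    def insert(i):
        session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES ('a', 'b', 'c')")

    if __name__ == '__main__':
        pool = Pool(processes=4, initializer=init_worker)
        pool.map(insert, range(100000), chunksize=1000)
        pool.close()
        pool.join()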