From f2e9bb36b96947739c29f9392daa69292f563df7 Mon Sep 17 00:00:00 2001
From: Adam Holmberg 
Date: Fri, 24 Jul 2015 16:26:46 -0500
Subject: [PATCH] Update performance notes doc: +cython, -numbers

remove outdated, specific numbers

Having numbers for contrived workloads sometimes confuses people or leads
to impressions of false claims. Unified benchmarks at DataStax will replace
these.
---
 docs/performance.rst | 299 ++++---------------------------------------
 1 file changed, 28 insertions(+), 271 deletions(-)

diff --git a/docs/performance.rst b/docs/performance.rst
index c7a6b3b2..cf11e737 100644
--- a/docs/performance.rst
+++ b/docs/performance.rst
@@ -2,290 +2,47 @@ Performance Notes
 =================
 The Python driver for Cassandra offers several methods for executing queries.
 You can synchronously block for queries to complete using
-:meth:`.Session.execute()`, you can use a future-like interface through
-:meth:`.Session.execute_async()`, or you can attach a callback to the future
-with :meth:`.ResponseFuture.add_callback()`. Each of these methods has
-different performance characteristics and behaves differently when
-multiple threads are used.
+:meth:`.Session.execute()`, you can obtain asynchronous request futures through
+:meth:`.Session.execute_async()`, and you can attach a callback to the future
+with :meth:`.ResponseFuture.add_callback()`.
 
-Benchmark Notes
----------------
-All benchmarks were executed using the
-`benchmark scripts `_
-in the driver repository. They were executed on a laptop with 16 GiB of RAM, an SSD,
-and a 2 GHz, four core CPU with hyper-threading. The Cassandra cluster was a three
-node `ccm `_ cluster running on the same laptop
-with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your
-own cluster when tuning the driver for optimal throughput or latency.
+Examples of multiple request patterns can be found in the benchmark scripts included in the driver project.
 
-The 1.0.0 version of the driver was used with all default settings. For these
-benchmarks, the driver was configured to use the ``libev`` reactor. You can also run
-the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`)
-by using the ``--asyncore-only`` command line option.
+The choice of execution pattern depends on the application context. For applications issuing many
+requests concurrently, the recommended pattern is concurrent asynchronous
+requests with callbacks. For many use cases, you don't need to implement this pattern yourself:
+:meth:`cassandra.concurrent.execute_concurrent` and :meth:`cassandra.concurrent.execute_concurrent_with_args`
+provide it with a synchronous API and tunable concurrency (a short sketch follows below).
 
-Each benchmark completes 100,000 small inserts. The replication factor for the
-keyspace was three, so all nodes were replicas for the inserted rows.
-
-The benchmarks require the Python driver C extensions as well as a few additional
-Python packages. Follow these steps to install the prerequisites:
-
-1. Install packages to support Python driver C extensions:
-
- * Debian/Ubuntu: ``sudo apt-get install gcc python-dev libev4 libev-dev``
 * RHEL/CentOS/Fedora: ``sudo yum install gcc python-dev libev4 libev-dev``
-
-2. Install Python packages: ``pip install scales twisted blist``
-3. Re-install the Cassandra driver: ``pip install --upgrade cassandra-driver``
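+
+As a rough sketch of this pattern (the keyspace, table, and concurrency value here are hypothetical
+examples, not recommendations), :meth:`cassandra.concurrent.execute_concurrent_with_args` might be
+used like this:
+
+.. code-block:: python
+
+    from cassandra.cluster import Cluster
+    from cassandra.concurrent import execute_concurrent_with_args
+
+    cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
+    session = cluster.connect()
+
+    # prepare once; the statement is reused for every concurrent request
+    insert = session.prepare("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (?, ?, ?)")
+    parameters = [(str(i), 'b', 'c') for i in range(10000)]
+
+    # runs up to `concurrency` requests at a time using callback chaining
+    # internally; returns a list of (success, result_or_exception) tuples
+    results = execute_concurrent_with_args(session, insert, parameters, concurrency=100)
+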
-
-Synchronous Execution (`sync.py `_)
--------------------------------------------------------------------------------------------------------------
-Although this is the simplest way to make queries, it has low throughput
-in single threaded environments. This is basically what the benchmark
-is doing:
-
-.. code-block:: python
-
-    from cassandra.cluster import Cluster
-
-    cluster = Cluster([127.0.0.1, 127.0.0.2, 127.0.0.3])
-    session = cluster.connect()
-
-    for i in range(100000):
-        session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (a, 'b', 'c')")
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 434.08/sec
-
-
-This technique does scale reasonably well as we add more threads:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 830.49/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 1078.27/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 1275.20/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
-    Average throughput: 1345.56/sec
-
-
-In my environment, throughput is maximized at about 20 threads.
-
-
-Batched Futures (`future_batches.py `_)
----------------------------------------------------------------------------------------------------------------------------
-This is a simple way to work with futures for higher throughput. Essentially,
-we start 120 queries asynchronously at the same time and then wait for them
-all to complete. We then repeat this process until all 100,000 operations
-have completed:
-
-.. code-block:: python
-
-    futures = Queue.Queue(maxsize=121)
-    for i in range(100000):
-        if i % 120 == 0:
-            # clear the existing queue
-            while True:
-                try:
-                    futures.get_nowait().result()
-                except Queue.Empty:
-                    break
-
-        future = session.execute_async(query)
-        futures.put_nowait(future)
-
-As expected, this improves throughput in a single-threaded environment:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3477.56/sec
-
-However, adding more threads may actually harm throughput:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2360.52/sec
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2293.21/sec
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2244.85/sec
-
-
-Queued Futures (`future_full_pipeline.py `_)
---------------------------------------------------------------------------------------------------------------------------------------
-This pattern is similar to batched futures. The main difference is that
-every time we put a future on the queue, we pull the oldest future out
-and wait for it to complete:
-
-.. code-block:: python
-
-    futures = Queue.Queue(maxsize=121)
-    for i in range(100000):
-        if i >= 120:
-            old_future = futures.get_nowait()
-            old_future.result()
-
-        future = session.execute_async(query)
-        futures.put_nowait(future)
-
-This gets slightly better throughput than the Batched Futures pattern:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3635.76/sec
-
-But this has the same throughput issues when multiple threads are used:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2213.62/sec
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2707.62/sec
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2462.42/sec
-
-Unthrottled Futures (`future_full_throttle.py `_)
--------------------------------------------------------------------------------------------------------------------------------------------
-What happens if we don't throttle our async requests at all?
-
-.. code-block:: python
-
-    futures = []
-    for i in range(100000):
-        future = session.execute_async(query)
-        futures.append(future)
-
-    for future in futures:
-        future.result()
-
-Throughput is about the same as the previous pattern, but a lot of memory will
-be consumed by the list of Futures:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3474.11/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2389.61/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2371.75/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2165.29/sec
-
-Callback Chaining (`callback_full_pipeline.py `_)
--------------------------------------------------------------------------------------------------------------------------------------------------
-This pattern is very different from the previous patterns. Here we're taking
-advantage of the :meth:`.ResponseFuture.add_callback()` function to start
-another request as soon as one finishes. Furthermore, we're starting 120
-of these callback chains, so we've always got about 120 operations in
-flight at any time:
-
-.. code-block:: python
-
-    from itertools import count
-    from threading import Event
-
-    sentinel = object()
-    num_queries = 100000
-    num_started = count()
-    num_finished = count()
-    finished_event = Event()
-
-    def insert_next(previous_result=sentinel):
-        if previous_result is not sentinel:
-            if isinstance(previous_result, BaseException):
-                log.error("Error on insert: %r", previous_result)
-            if num_finished.next() >= num_queries:
-                finished_event.set()
-
-        if num_started.next() <= num_queries:
-            future = session.execute_async(query)
-            # NOTE: this callback also handles errors
-            future.add_callbacks(insert_next, insert_next)
-
-    for i in range(min(120, num_queries)):
-        insert_next()
-
-    finished_event.wait()
-
-This is a more complex pattern, but the throughput is excellent:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 7647.30/sec
-
-Part of the reason why performance is so good is that everything is running on
-single thread: the internal event loop thread that powers the driver. The
-downside to this is that adding more threads doesn't improve anything:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 7704.58/sec
-
-
-What happens if we have more than 120 callback chains running?
-
-With 250 chains:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 7794.22/sec
-
-Things look pretty good with 250 chains. If we try 500 chains, we start to max out
-all of the connections in the connection pools. The problem is that the current
-version of the driver isn't very good at throttling these callback chains, so
-a lot of time gets spent waiting for new connections and performance drops
-dramatically:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 679.61/sec
-
-When :attr:`.Cluster.protocol_version` is set to 1 or 2, you should limit the
-number of callback chains you run to roughly 100 per node in the cluster.
-When :attr:`~.Cluster.protocol_version` is 3 or higher, you can safely experiment
-with higher numbers of callback chains.
-
-For many use cases, you don't need to implement this pattern yourself. You can
-simply use :meth:`cassandra.concurrent.execute_concurrent` and
-:meth:`cassandra.concurrent.execute_concurrent_with_args`, which implement
-this pattern for you with a synchronous API.
+Due to the GIL and limited concurrency, the driver can become CPU-bound relatively quickly. The sections below
+discuss runtime and design considerations for mitigating this limitation.
 
 PyPy
 ----
-Almost all of these patterns become CPU-bound pretty quickly with CPython, the
-normal implementation of python. `PyPy `_ is an alternative
-implementation of Python (written in Python) which uses a JIT compiler to
-reduce CPU consumption. This leads to a huge improvement in the driver
-performance:
+`PyPy `_ is an alternative Python runtime which uses a JIT compiler to
+reduce CPU consumption. This leads to a huge improvement in driver performance,
+more than doubling throughput for many workloads.
+
+Cython Extensions
+-----------------
+`Cython `_ is an optimizing compiler and language that can be used to compile the driver's core modules and
+optional extensions. Cython is not a strict dependency, but the extensions will be built by default
+if Cython is present on the Python path. To include Cython as an installation requirement, install the driver
+with the ``cython`` extra:
 
 .. code-block:: bash
 
-    ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
-    Average throughput: 18782.00/sec
-
-Eventually the driver may add C extensions to reduce CPU consumption, which
-would probably narrow the gap between the performance of CPython and PyPy.
+    $ pip install cassandra-driver[cython]
 
 multiprocessing
 ---------------
-All of the patterns here may be used over multiple processes using the
+All of the patterns discussed above may be used over multiple processes using the
 `multiprocessing `_
-module. Multiple processes will scale significantly better than multiple
-threads will, so if high throughput is your goal, consider this option.
+module. Multiple processes will scale better than multiple threads, so if high throughput is your goal,
+consider this option.
 
-Just be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
+Be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
 **or** :class:`~.ResponseFuture` **objects across multiple processes**. These
 objects should all be created after forking the process, not before.
+
+For further discussion and simple examples using the driver with ``multiprocessing``,
+see `this blog post `_.
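+
+As a rough sketch of this rule (the keyspace, table, chunk size, and pool size are hypothetical
+examples), each worker process creates its own :class:`~.Cluster` and :class:`~.Session` after the fork:
+
+.. code-block:: python
+
+    from multiprocessing import Pool
+
+    from cassandra.cluster import Cluster
+
+    def insert_rows(rows):
+        # create the Cluster and Session inside the child process,
+        # after forking; never share them with the parent
+        cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])
+        session = cluster.connect()
+        insert = session.prepare("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (?, ?, ?)")
+        for row in rows:
+            session.execute(insert, row)
+        cluster.shutdown()
+
+    if __name__ == '__main__':
+        rows = [(str(i), 'b', 'c') for i in range(10000)]
+        # one chunk per worker, so each process builds exactly one connection pool
+        chunks = [rows[i:i + 2500] for i in range(0, len(rows), 2500)]
+        pool = Pool(processes=4)
+        pool.map(insert_rows, chunks)
+        pool.close()
+        pool.join()
+
+Building a connection pool is relatively expensive, so each process should reuse its session for
+many operations rather than creating one per request.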