Update performance notes doc: +cython, -numbers

Remove outdated, specific numbers. Having numbers for contrived workloads sometimes confuses people or leads to impressions of false claims. Unified benchmarks at DataStax will replace these.
2015-07-24 16:26:46 -05:00
parent 4a79c74831
commit f2e9bb36b9
1 changed files with 28 additions and 271 deletions
--- a/docs/performance.rst
+++ b/docs/performance.rst
@@ -2,290 +2,47 @@ Performance Notes
 =================
 The Python driver for Cassandra offers several methods for executing queries.
 You can synchronously block for queries to complete using
-:meth:`.Session.execute()`, you can use a future-like interface through
-:meth:`.Session.execute_async()`, or you can attach a callback to the future
-with :meth:`.ResponseFuture.add_callback()`.  Each of these methods has
-different performance characteristics and behaves differently when
-multiple threads are used.
+:meth:`.Session.execute()`, you can obtain asynchronous request futures through
+:meth:`.Session.execute_async()`, and you can attach a callback to the future
+with :meth:`.ResponseFuture.add_callback()`.

-Benchmark Notes
---------------
-All benchmarks were executed using the
-`benchmark scripts <https://github.com/datastax/python-driver/tree/master/benchmarks>`_
-in the driver repository.  They were executed on a laptop with 16 GiB of RAM, an SSD,
-and a 2 GHz, four core CPU with hyper-threading.  The Cassandra cluster was a three
-node `ccm <https://github.com/pcmanus/ccm>`_ cluster running on the same laptop
-with version 1.2.13 of Cassandra. I suggest testing these benchmarks against your
-own cluster when tuning the driver for optimal throughput or latency.
+Examples of multiple request patterns can be found in the benchmark scripts included in the driver project.

-The 1.0.0 version of the driver was used with all default settings.  For these
-benchmarks, the driver was configured to use the ``libev`` reactor.  You can also run
-the benchmarks using the ``asyncore`` event loop (:class:`~.AsyncoreConnection`) 
-by using the ``--asyncore-only`` command line option. 
+The choice of execution pattern will depend on the application context. For applications dealing with multiple
+requests in a given context, the recommended pattern is to use concurrent asynchronous
+requests with callbacks. For many use cases, you don't need to implement this pattern yourself.
+:meth:`cassandra.concurrent.execute_concurrent` and :meth:`cassandra.concurrent.execute_concurrent_with_args`
+provide this pattern with a synchronous API and tunable concurrency.

-Each benchmark completes 100,000 small inserts. The replication factor for the
-keyspace was three, so all nodes were replicas for the inserted rows.
-
-The benchmarks require the Python driver C extensions as well as a few additional 
-Python packages. Follow these steps to install the prerequisites:
-
-1. Install packages to support Python driver C extensions:
-
-   * Debian/Ubuntu: ``sudo apt-get install gcc python-dev libev4 libev-dev``
-   * RHEL/CentOS/Fedora: ``sudo yum install gcc python-dev libev4 libev-dev``
-
-2. Install Python packages: ``pip install scales twisted blist``
-3. Re-install the Cassandra driver: ``pip install --upgrade cassandra-driver``
-
-Synchronous Execution (`sync.py <https://github.com/datastax/python-driver/blob/master/benchmarks/sync.py>`_)
-------------------------------------------------------------------------------------------------------------
-Although this is the simplest way to make queries, it has low throughput
-in single threaded environments.  This is basically what the benchmark
-is doing:
-
-.. code-block:: python
-
-    from cassandra.cluster import Cluster
-
-    cluster = Cluster([127.0.0.1, 127.0.0.2, 127.0.0.3])
-    session = cluster.connect()
-
-    for i in range(100000):
-        session.execute("INSERT INTO mykeyspace.mytable (key, b, c) VALUES (a, 'b', 'c')")
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 434.08/sec
-
-
-This technique does scale reasonably well as we add more threads:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 830.49/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 1078.27/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 1275.20/sec
-    ~/python-driver $ python benchmarks/sync.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=16
-    Average throughput: 1345.56/sec
-
-
-In my environment, throughput is maximized at about 20 threads.
-
-
-Batched Futures (`future_batches.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_batches.py>`_)
---------------------------------------------------------------------------------------------------------------------------
-This is a simple way to work with futures for higher throughput.  Essentially,
-we start 120 queries asynchronously at the same time and then wait for them
-all to complete. We then repeat this process until all 100,000 operations
-have completed:
-
-.. code-block:: python
-
-    futures = Queue.Queue(maxsize=121)
-    for i in range(100000):
-        if i % 120 == 0:
-            # clear the existing queue
-            while True:
-                try:
-                    futures.get_nowait().result()
-                except Queue.Empty:
-                    break
-
-        future = session.execute_async(query)
-        futures.put_nowait(future)
-
-As expected, this improves throughput in a single-threaded environment:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3477.56/sec
-
-However, adding more threads may actually harm throughput:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2360.52/sec
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2293.21/sec
-    ~/python-driver $ python benchmarks/future_batches.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2244.85/sec
-
-
-Queued Futures (`future_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_pipeline.py>`_)
--------------------------------------------------------------------------------------------------------------------------------------
-This pattern is similar to batched futures.  The main difference is that
-every time we put a future on the queue, we pull the oldest future out
-and wait for it to complete:
-
-.. code-block:: python
-
-        futures = Queue.Queue(maxsize=121)
-        for i in range(100000):
-            if i >= 120:
-                old_future = futures.get_nowait()
-                old_future.result()
-
-            future = session.execute_async(query)
-            futures.put_nowait(future)
-
-This gets slightly better throughput than the Batched Futures pattern:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3635.76/sec
-
-But this has the same throughput issues when multiple threads are used:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2213.62/sec
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2707.62/sec
-    ~/python-driver $ python benchmarks/future_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2462.42/sec
-
-Unthrottled Futures (`future_full_throttle.py <https://github.com/datastax/python-driver/blob/master/benchmarks/future_full_throttle.py>`_)
-------------------------------------------------------------------------------------------------------------------------------------------
-What happens if we don't throttle our async requests at all?
-
-.. code-block:: python
-
-    futures = []
-    for i in range(100000):
-        future = session.execute_async(query)
-        futures.append(future)
-
-    for future in futures:
-        future.result()
-
-Throughput is about the same as the previous pattern, but a lot of memory will
-be consumed by the list of Futures:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 3474.11/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 2389.61/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=4
-    Average throughput: 2371.75/sec
-    ~/python-driver $ python benchmarks/future_full_throttle.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=8
-    Average throughput: 2165.29/sec
-
-Callback Chaining (`callback_full_pipeline.py <https://github.com/datastax/python-driver/blob/master/benchmarks/callback_full_pipeline.py>`_)
-----------------------------------------------------------------------------------------------------------------------------------------------
-This pattern is very different from the previous patterns.  Here we're taking
-advantage of the :meth:`.ResponseFuture.add_callback()` function to start
-another request as soon as one finishes.  Furthermore, we're starting 120
-of these callback chains, so we've always got about 120 operations in
-flight at any time:
-
-.. code-block:: python
-
-    from itertools import count
-    from threading import Event
-
-    sentinel = object()
-    num_queries = 100000
-    num_started = count()
-    num_finished = count()
-    finished_event = Event()
-
-    def insert_next(previous_result=sentinel):
-        if previous_result is not sentinel:
-            if isinstance(previous_result, BaseException):
-                log.error("Error on insert: %r", previous_result)
-            if num_finished.next() >= num_queries:
-                finished_event.set()
-
-        if num_started.next() <= num_queries:
-            future = session.execute_async(query)
-            # NOTE: this callback also handles errors
-            future.add_callbacks(insert_next, insert_next)
-
-    for i in range(min(120, num_queries)):
-        insert_next()
-
-    finished_event.wait()
-
-This is a more complex pattern, but the throughput is excellent:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 7647.30/sec
-
-Part of the reason why performance is so good is that everything is running on
-single thread: the internal event loop thread that powers the driver.  The
-downside to this is that adding more threads doesn't improve anything:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=2
-    Average throughput: 7704.58/sec
-
-
-What happens if we have more than 120 callback chains running?
-
-With 250 chains:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 7794.22/sec
-
-Things look pretty good with 250 chains.  If we try 500 chains, we start to max out
-all of the connections in the connection pools.  The problem is that the current
-version of the driver isn't very good at throttling these callback chains, so
-a lot of time gets spent waiting for new connections and performance drops
-dramatically:
-
-.. code-block:: bash
-
-    ~/python-driver $ python benchmarks/callback_full_pipeline.py -n 100000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --libev-only --threads=1
-    Average throughput: 679.61/sec
-
-When :attr:`.Cluster.protocol_version` is set to 1 or 2, you should limit the
-number of callback chains you run to roughly 100 per node in the cluster.
-When :attr:`~.Cluster.protocol_version` is 3 or higher, you can safely experiment
-with higher numbers of callback chains.
-
-For many use cases, you don't need to implement this pattern yourself.  You can
-simply use :meth:`cassandra.concurrent.execute_concurrent` and
-:meth:`cassandra.concurrent.execute_concurrent_with_args`, which implement
-this pattern for you with a synchronous API.
+Because of the GIL and limited concurrency, the driver can quickly become CPU-bound. The sections below
+discuss further runtime and design considerations for mitigating this limitation.

 PyPy
 ----
-Almost all of these patterns become CPU-bound pretty quickly with CPython, the
-normal implementation of python. `PyPy <http://pypy.org>`_ is an alternative
-implementation of Python (written in Python) which uses a JIT compiler to
-reduce CPU consumption.  This leads to a huge improvement in the driver
-performance:
+`PyPy <http://pypy.org>`_ is an alternative Python runtime which uses a JIT compiler to
+reduce CPU consumption. This leads to a large improvement in driver performance,
+more than doubling throughput for many workloads.
+
+Cython Extensions
+-----------------
+`Cython <http://cython.org/>`_ is an optimizing compiler and language that can be used to compile the core files and
+optional extensions for the driver. Cython is not a strict dependency, but the extensions will be built by default
+if Cython is present in the Python path. To include Cython as a requirement, install the driver with the extra name ``cython``:

 .. code-block:: bash

-    ~/python-driver $ pypy benchmarks/callback_full_pipeline.py -n 500000 --hosts=127.0.0.1,127.0.0.2,127.0.0.3 --asyncore-only --threads=1
-    Average throughput: 18782.00/sec
-
-Eventually the driver may add C extensions to reduce CPU consumption, which
-would probably narrow the gap between the performance of CPython and PyPy.
+    $ pip install cassandra-driver[cython]

 multiprocessing
 ---------------
-All of the patterns here may be used over multiple processes using the
+All of the patterns discussed above may be used over multiple processes using the
 `multiprocessing <http://docs.python.org/2/library/multiprocessing.html>`_
-module.  Multiple processes will scale significantly better than multiple
-threads will, so if high throughput is your goal, consider this option.
+module.  Multiple processes will scale better than multiple threads, so if high throughput is your goal,
+consider this option.

-Just be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
+Be sure to **never share any** :class:`~.Cluster`, :class:`~.Session`,
 **or** :class:`~.ResponseFuture` **objects across multiple processes**. These
 objects should all be created after forking the process, not before.
+
+For further discussion and simple examples using the driver with ``multiprocessing``,
+see `this blog post <http://www.datastax.com/dev/blog/datastax-python-driver-multiprocessing-example-for-improved-bulk-data-throughput>`_.