===========
Usage Guide
===========

At its core, dogpile.core provides a locking interface around a "value
creation" function.

The interface supports several levels of usage, starting from one that is
very rudimentary and proceeding to more intricate patterns for particular
scenarios. The documentation here provides examples that use successively
more of these features, approaching how a fully featured caching system
might be constructed around dogpile.core.

Do I Need to Learn the dogpile.core API Directly?
=================================================

It's anticipated that most users of dogpile.core will be using it indirectly
via the `dogpile.cache <http://bitbucket.org/zzzeek/dogpile.cache>`_ caching
front-end. If you fall into this category, the short answer is no.

dogpile.core provides core internals to the
`dogpile.cache <http://bitbucket.org/zzzeek/dogpile.cache>`_
package, which provides a simple-to-use caching API, rudimentary
backends for Memcached and others, and easy hooks to add new backends.
Users of dogpile.cache don't need to know or access dogpile.core's APIs
directly, though a rough understanding of the general idea is always helpful.

Using the core dogpile.core APIs described here directly implies you're
building your own resource-usage system outside, or in addition to, the one
`dogpile.cache <http://bitbucket.org/zzzeek/dogpile.cache>`_ provides.

Rudimentary Usage
==================

A simple example::

    from dogpile.core import Dogpile

    # store a reference to a "resource", some
    # object that is expensive to create.
    the_resource = [None]

    def some_creation_function():
        # create the resource here
        the_resource[0] = create_some_resource()

    def use_the_resource():
        # some function that uses
        # the resource.  Won't reach
        # here until some_creation_function()
        # has completed at least once.
        the_resource[0].do_something()

    # create Dogpile with 3600 second
    # expiry time
    dogpile = Dogpile(3600)

    with dogpile.acquire(some_creation_function):
        use_the_resource()

Above, ``some_creation_function()`` will be called when :meth:`.Dogpile.acquire`
is first called. The remainder of the ``with`` block then proceeds. Concurrent
threads which call :meth:`.Dogpile.acquire` during this initial period will be
blocked until ``some_creation_function()`` completes.

Once the creation function has completed successfully the first time, new
calls to :meth:`.Dogpile.acquire` will call ``some_creation_function()`` each
time the "expiretime" has been reached, allowing only a single thread to call
the function. Concurrent threads which call :meth:`.Dogpile.acquire` during
this period will fall through, and not be blocked. It is expected that the
"stale" version of the resource remain available at this time while the new
one is generated.

By default, :class:`.Dogpile` uses Python's ``threading.Lock()`` to
synchronize among threads within a process. This can be altered to support
any kind of locking as we'll see in a later section.

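To see this blocking behavior concretely, here is a minimal, self-contained
sketch (hypothetical, not from the dogpile.core distribution) in which the
"resource" is just a string and ``time.sleep()`` stands in for an expensive
creation; every worker thread blocks until the first creation completes::

    import threading
    import time
    from dogpile.core import Dogpile

    the_resource = [None]

    def some_creation_function():
        time.sleep(1)   # simulate expensive work
        the_resource[0] = "the value"

    dogpile = Dogpile(3600)

    def worker():
        # each thread blocks here until the first
        # creation completes, then falls through
        with dogpile.acquire(some_creation_function):
            print(the_resource[0])

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
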
Using a Value Function with a Cache Backend
=============================================

The dogpile lock includes a more intricate mode of usage to optimize the
usage of a cache like Memcached. The difficulties :class:`.Dogpile` addresses
in this mode are:

* Values can disappear from the cache at any time, before our expiration
  time is reached. :class:`.Dogpile` needs to be made aware of this and
  possibly call the creation function ahead of schedule.
* There's no function in a Memcached-like system to "check" for a key without
  actually retrieving it. If we need to "check" for a key each time, we'd
  like to use the value thus retrieved, rather than calling ``get()`` twice.
* If we did end up generating the value on this access, we should return
  that value directly, instead of doing another cache round-trip.

To use this mode, the steps are as follows:

* Create the :class:`.Dogpile` lock with ``init=True``, to skip the initial
  "force" of the creation function. This is assuming you'd like to rely upon
  the "check the value" function for the initial generation. Leave ``init``
  at its default of ``False`` if you'd like the application to regenerate the
  value unconditionally when the :class:`.Dogpile` lock is first created
  (i.e. typically application startup).
* The "creation" function should return the value it creates.
* An additional "getter" function is passed to ``acquire()`` which should
  return the value to be passed to the context block. If the value isn't
  available, raise ``NeedRegenerationException``.

Example::

    from dogpile.core import Dogpile, NeedRegenerationException

    def get_value_from_cache():
        value = my_cache.get("some key")
        if value is None:
            raise NeedRegenerationException()
        return value

    def create_and_cache_value():
        value = my_expensive_resource.create_value()
        my_cache.put("some key", value)
        return value

    dogpile = Dogpile(3600, init=True)

    def retrieve():
        with dogpile.acquire(create_and_cache_value, get_value_from_cache) as value:
            return value

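The example assumes ``my_cache`` and ``my_expensive_resource`` exist
elsewhere. For experimentation, a hypothetical dict-based stand-in for
``my_cache`` (not a real backend; just enough to satisfy the ``get()`` and
``put()`` calls above) might look like::

    # hypothetical in-memory stand-in for my_cache
    class DictCache(object):
        def __init__(self):
            self.data = {}

        def get(self, key):
            # returns None when the key is absent,
            # as a memcached client would
            return self.data.get(key)

        def put(self, key, value):
            self.data[key] = value

    my_cache = DictCache()
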
Note that ``get_value_from_cache()`` should not raise
:class:`.NeedRegenerationException` a second time directly after
``create_and_cache_value()`` has been called.

.. _caching_decorator:

Using dogpile.core for Caching
===============================

dogpile.core is part of an effort to "break up" the Beaker package into
smaller, simpler components (which also work better). Here, we illustrate how
to approximate Beaker's "cache decoration" function, to decorate any function
and store the value in Memcached. We create a Python decorator function
called ``cached()`` which will provide caching for the output of a single
function. It's given the "key" which we'd like to use in Memcached, and
internally it makes use of its own :class:`.Dogpile` object that is dedicated
to managing this one function/key::

    import pylibmc
    mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client("localhost"))

    from dogpile.core import Dogpile, NeedRegenerationException

    def cached(key, expiration_time):
        """A decorator that will cache the return value of a function
        in memcached given a key."""

        def get_value():
            with mc_pool.reserve() as mc:
                value = mc.get(key)
            if value is None:
                raise NeedRegenerationException()
            return value

        dogpile = Dogpile(expiration_time, init=True)

        def decorate(fn):
            def gen_cached():
                value = fn()
                with mc_pool.reserve() as mc:
                    mc.set(key, value)
                return value

            def invoke():
                with dogpile.acquire(gen_cached, get_value) as value:
                    return value
            return invoke

        return decorate

Above we can decorate any function as::

    @cached("some key", 3600)
    def generate_my_expensive_value():
        return slow_database.lookup("stuff")

The :class:`.Dogpile` lock will ensure that only one thread at a time performs
``slow_database.lookup()``, and only every 3600 seconds, unless Memcached has
removed the value, in which case it will be called again as needed.

In particular, dogpile.core's system allows us to call the memcached ``get()``
function at most once per access, instead of Beaker's system which calls it
twice, and doesn't make us call ``get()`` when we just created the value.

.. _scaling_on_keys:

Scaling dogpile.core against Many Keys
=======================================

The patterns so far have illustrated how to use a single, persistently held
:class:`.Dogpile` object which maintains a thread-based lock for the lifespan
of some particular value. The :class:`.Dogpile` object is also responsible
for maintaining the last known "creation time" of the value; this is
available from a given :class:`.Dogpile` via the :attr:`.Dogpile.createdtime`
attribute.

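As a quick illustration, here is a hypothetical sketch reusing the
``dogpile`` object and creation function from the rudimentary example above,
and assuming (as the example later in this section suggests, where the
creation time is stored as ``time.time()``) that the attribute is an epoch
timestamp::

    import time

    # after a successful acquire, the creation time is recorded
    with dogpile.acquire(some_creation_function):
        pass

    print("value last created %d seconds ago" %
          (time.time() - dogpile.createdtime))
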
For an application that may deal with an arbitrary number of cache keys
retrieved from a remote service, this approach must be revised so that we
don't need to store a :class:`.Dogpile` object for every possible key in our
application's memory.

The two challenges here are:

* We need to create new :class:`.Dogpile` objects as needed, ideally sharing
  the object for a given key with all concurrent threads, but then not hold
  onto it afterwards.
* Since we aren't holding the :class:`.Dogpile` persistently, we need to
  store the last known "creation time" of the value somewhere else, i.e. in
  the cache itself, and ensure :class:`.Dogpile` uses it.

The approach is another one derived from Beaker, where we will use a
*registry* that can provide a unique :class:`.Dogpile` object given a
particular key, ensuring that all concurrent threads use the same object, but
then releasing the object to the Python garbage collector when this usage is
complete. The :class:`.NameRegistry` object provides this functionality,
again constructed around the notion of a creation function that is only
invoked as needed. We also will instruct the :meth:`.Dogpile.acquire` method
to use a "creation time" value that we retrieve from the cache, via the
``value_and_created_fn`` parameter, which supersedes the ``value_fn`` we used
earlier. ``value_and_created_fn`` expects a function that will return a tuple
of ``(value, created_at)``, where it's assumed both have been retrieved from
the cache backend::

    import pylibmc
    import time
    from dogpile.core import Dogpile, NeedRegenerationException, NameRegistry

    mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client("localhost"))

    def create_dogpile(key, expiration_time):
        return Dogpile(expiration_time)

    dogpile_registry = NameRegistry(create_dogpile)

    def get_or_create(key, expiration_time, creation_function):
        def get_value():
            with mc_pool.reserve() as mc:
                value_plus_time = mc.get(key)
            if value_plus_time is None:
                raise NeedRegenerationException()
            # return a tuple
            # (value, createdtime)
            return value_plus_time

        def gen_cached():
            value = creation_function()
            with mc_pool.reserve() as mc:
                # create a tuple
                # (value, createdtime)
                value_plus_time = (value, time.time())
                mc.set(key, value_plus_time)
            return value_plus_time

        dogpile = dogpile_registry.get(key, expiration_time)

        with dogpile.acquire(gen_cached, value_and_created_fn=get_value) as value:
            return value

Stepping through the above code:

* After the imports, we set up the memcached backend using the ``pylibmc``
  library's recommended pattern for thread-safe access.
* We create a Python function that will, given a cache key and an expiration
  time, produce a :class:`.Dogpile` object which will produce the dogpile
  mutex on an as-needed basis. The function here doesn't actually need the
  key, even though the :class:`.NameRegistry` will be passing it in. Later,
  we'll see the scenario for which we'll need this value.
* We construct a :class:`.NameRegistry`, using our dogpile creator function,
  that will generate for us new :class:`.Dogpile` locks for individual keys
  as needed.
* We define the ``get_or_create()`` function. This function will accept the
  cache key, an expiration time value, and a function that is used to create
  a new value if one does not exist or the current value is expired.
* The ``get_or_create()`` function defines two callables, ``get_value()`` and
  ``gen_cached()``. These two functions are exactly analogous to the
  functions of the same name in :ref:`caching_decorator` - ``get_value()``
  retrieves the value from the cache, raising
  :class:`.NeedRegenerationException` if not present; ``gen_cached()`` calls
  the creation function to generate a new value, stores it in the cache, and
  returns it. The only difference here is that instead of storing and
  retrieving the value alone from the cache, the value is stored along with
  its creation time; when we make a new value, we set this to
  ``time.time()``. While the value and creation time pair are stored here as
  a tuple, it doesn't actually matter how the two are persisted; only that
  the tuple value is returned from both functions.
* We acquire a new or existing :class:`.Dogpile` object from the registry
  using :meth:`.NameRegistry.get`. We pass the identifying key as well as the
  expiration time. A new :class:`.Dogpile` is created for the given key if
  one does not exist. If a :class:`.Dogpile` lock already exists in memory
  for the given key, we get that one back.
* We then call :meth:`.Dogpile.acquire` as we did in the previous cache
  examples, except we use the ``value_and_created_fn`` keyword for our
  ``get_value()`` function. :class:`.Dogpile` uses the "created time" value
  we pull from our cache to determine when the value was last created.

An example usage of the completed function::

    import urllib2

    def get_some_value(key):
        """retrieve a datafile from a slow site based on the given key."""
        def get_data():
            return urllib2.urlopen(
                "http://someslowsite.com/some_important_datafile_%s.json" % key
            ).read()
        return get_or_create(key, 3600, get_data)

    my_data = get_some_value("somekey")

Using a File or Distributed Lock with Dogpile
==============================================

The final twist on the caching pattern is to fix the issue of the Dogpile
mutex itself being local to the current process. When a handful of threads
all go to access some key in our cache, they will access the same
:class:`.Dogpile` object, which internally can synchronize their activity
using a Python ``threading.Lock``. But in this example we're talking to a
Memcached cache. What if we have many servers which all access this cache?
We'd like all of these servers to coordinate together so that we don't just
prevent the dogpile problem within a single process, we prevent it across
all servers.

To accomplish this, we need an object that can coordinate processes. In this
example we'll use a file-based lock as provided by the `lockfile
<http://pypi.python.org/pypi/lockfile>`_ package, which uses a unix-symlink
concept to provide a filesystem-level lock (and which has also been made
threadsafe). Another strategy may base itself directly off the Unix
``os.flock()`` call, and still another approach is to lock within Memcached
itself, using a recipe such as that described at `Using Memcached as a
Distributed Locking Service
<http://www.regexprn.com/2010/05/using-memcached-as-distributed-locking.html>`_.
The type of lock chosen here is based on a tradeoff between global
availability and reliable performance. The file-based lock will perform more
reliably than the memcached lock, but may be difficult to make accessible to
multiple servers (with NFS being the most likely option, which would
eliminate the possibility of the ``os.flock()`` call). The memcached lock,
on the other hand, will provide the perfect scope, being available from the
same memcached server that the cached value itself comes from; however the
lock may vanish in some cases, which means we could still get a
cache-regeneration pileup in that case.

What all of these locking schemes have in common is that unlike the Python
``threading.Lock`` object, they all need access to an actual key which acts
as the symbol that all processes will coordinate upon. This is where the
``key`` argument to our ``create_dogpile()`` function introduced in
:ref:`scaling_on_keys` comes in. The example can remain the same, except for
the changes below to just that function::

    import lockfile
    import os
    from hashlib import sha1

    # ... other imports and setup from the previous example

    def create_dogpile(key, expiration_time):
        lock_path = os.path.join("/tmp", "%s.lock" % sha1(key).hexdigest())
        return Dogpile(
                expiration_time,
                lock=lockfile.FileLock(lock_path)
            )

    # ... everything else from the previous example

Where above, the only change is the ``lock`` argument passed to the
constructor of :class:`.Dogpile`. For a given key "some_key", we generate a
hex digest of it first as a quick way to remove any filesystem-unfriendly
characters; we then use ``lockfile.FileLock()`` to create a lock against the
file ``/tmp/53def077a4264bd3183d4eb21b1f56f883e1b572.lock``. Any number of
:class:`.Dogpile` objects in various processes will now coordinate with each
other, using this common filename as the "baton" against which creation of a
new value proceeds.

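The ``os.flock()`` strategy mentioned earlier can be sketched the same way.
The following is a hypothetical lock class (not part of dogpile.core or any
library), assuming, as the ``threading.Lock`` and ``lockfile.FileLock``
examples suggest, that the ``lock`` argument needs only ``acquire()`` and
``release()`` methods (context-manager support is included as well, for
parity with ``threading.Lock``)::

    import fcntl
    import os

    class FlockLock(object):
        """A minimal flock-based mutex exposing acquire()/release(),
        mirroring the threading.Lock interface."""

        def __init__(self, path):
            self.path = path
            self.fd = None

        def acquire(self):
            # open (creating if needed) and take an exclusive lock;
            # blocks until the lock is available
            self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
            fcntl.flock(self.fd, fcntl.LOCK_EX)

        def release(self):
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None

        def __enter__(self):
            self.acquire()
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.release()

    # reuses the sha1 / Dogpile imports from the previous example
    def create_dogpile(key, expiration_time):
        lock_path = os.path.join("/tmp", "%s.lock" % sha1(key).hexdigest())
        return Dogpile(expiration_time, lock=FlockLock(lock_path))
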
Locking the "write" phase against the "readers"
================================================

A less prominent feature of Dogpile ported from Beaker is the ability to
provide a mutex against the actual resource being read and created, so that
the creation function can perform certain tasks only after all reader threads
have finished. An example of this is when the creation function has prepared
a new datafile to replace the old one, and would like to switch in the new
file only when other threads have finished using it.

To enable this feature, use :class:`.SyncReaderDogpile`.
:meth:`.SyncReaderDogpile.acquire_write_lock` then provides a safe-write lock
for the critical section where readers should be blocked::

    from dogpile.core import SyncReaderDogpile

    dogpile = SyncReaderDogpile(3600)

    def some_creation_function():
        create_expensive_datafile()
        with dogpile.acquire_write_lock():
            replace_old_datafile_with_new()

    # usage:
    with dogpile.acquire(some_creation_function):
        read_datafile()

With the above pattern, :class:`.SyncReaderDogpile` will allow concurrent
readers to read from the current version of the datafile as the
``create_expensive_datafile()`` function proceeds with its job of generating
the information for a new version. When the data is ready to be written, the
:meth:`.SyncReaderDogpile.acquire_write_lock` call will block until all
current readers of the datafile have completed (that is, they've finished
their own :meth:`.Dogpile.acquire` blocks). The ``some_creation_function()``
function then proceeds, as new readers are blocked until this function
finishes its work of rewriting the datafile.

Note that the :class:`.SyncReaderDogpile` approach is useful when working
with a resource that itself does not support concurrent access while being
written, such as flat files or some forms of DBM file. It is **not** needed
when dealing with a datasource that already provides a high level of
concurrency, such as a relational database, Memcached, or a NoSQL store.
Currently, the :class:`.SyncReaderDogpile` object only synchronizes within
the current process among multiple threads; it won't at this time protect
from concurrent access by multiple processes. Beaker did support this
behavior, however, using lock files, and this functionality may be re-added
in a future release.