Usage Guide
At its core, dogpile.core provides a locking interface around a "value creation" function.
The interface supports several levels of usage, starting from one that is very rudimentary, then providing more intricate usage patterns to deal with certain scenarios. The documentation here will attempt to provide examples that use successively more and more of these features, as we approach how a fully featured caching system might be constructed around dogpile.core.
Do I Need to Learn the dogpile.core API Directly?
It's anticipated that most users of dogpile.core will be using it indirectly via the dogpile.cache caching front-end. If you fall into this category, then the short answer is no.
dogpile.core provides core internals to the dogpile.cache package, which provides a simple-to-use caching API, rudimentary backends for Memcached and others, and easy hooks to add new backends. Users of dogpile.cache don't need to know or access dogpile.core's APIs directly, though a rough understanding of the general idea is always helpful.
Using the core dogpile.core APIs described here directly implies you're building your own resource-usage system outside, or in addition to, the one dogpile.cache provides.
Rudimentary Usage
A simple example:
from dogpile.core import Dogpile

# store a reference to a "resource", some
# object that is expensive to create.
the_resource = [None]

def some_creation_function():
    # create the resource here
    the_resource[0] = create_some_resource()

def use_the_resource():
    # some function that uses
    # the resource. Won't reach
    # here until some_creation_function()
    # has completed at least once.
    the_resource[0].do_something()

# create Dogpile with 3600 second
# expiry time
dogpile = Dogpile(3600)

with dogpile.acquire(some_creation_function):
    use_the_resource()
Above, some_creation_function() will be called when Dogpile.acquire() is first called. The remainder of the with block then proceeds. Concurrent threads which call Dogpile.acquire() during this initial period will be blocked until some_creation_function() completes.
Once the creation function has completed successfully the first time, new calls to Dogpile.acquire() will call some_creation_function() each time the "expiretime" has been reached, allowing only a single thread to call the function. Concurrent threads which call Dogpile.acquire() during this period will fall through, and not be blocked. It is expected that the "stale" version of the resource remain available at this time while the new one is generated.
By default, Dogpile uses Python's threading.Lock() to synchronize among threads within a process. This can be altered to support any kind of locking as we'll see in a later section.
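The blocking and fall-through behavior described above can be modeled with the standard library alone. The following is a minimal sketch and not dogpile.core's implementation: the ToyDogpile class and its ensure_fresh() method are hypothetical names invented for illustration, and the sketch omits the context-manager interface entirely.

```python
import threading
import time


class ToyDogpile:
    """Toy model of the behavior described above; not dogpile.core's code."""

    def __init__(self, expiretime):
        self.expiretime = expiretime
        self.createdtime = -1            # -1 means "never created"
        self._mutex = threading.Lock()

    def _is_expired(self):
        return time.time() - self.createdtime > self.expiretime

    def ensure_fresh(self, creator):
        if self.createdtime != -1 and not self._is_expired():
            return                       # value is current, nothing to do
        has_value = self.createdtime != -1
        # With no value yet, block until one exists. With a stale value
        # available, try the mutex without waiting and fall through.
        if not self._mutex.acquire(not has_value):
            return                       # another thread is regenerating
        try:
            # Re-check: another thread may have finished while we waited.
            if self.createdtime == -1 or self._is_expired():
                creator()
                self.createdtime = time.time()
        finally:
            self._mutex.release()


calls = []
toy = ToyDogpile(3600)
toy.ensure_fresh(lambda: calls.append("created"))   # first call creates
toy.ensure_fresh(lambda: calls.append("created"))   # still fresh, skipped
```

The re-check after acquiring the mutex is what keeps a group of threads that all arrive before the first value exists from each running the creation function in turn.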
Using a Value Function with a Cache Backend
The dogpile lock includes a more intricate mode of usage to optimize the usage of a cache like Memcached. The difficulties Dogpile addresses in this mode are:
- Values can disappear from the cache at any time, before our expiration time is reached. Dogpile needs to be made aware of this and possibly call the creation function ahead of schedule.
- There's no function in a Memcached-like system to "check" for a key without actually retrieving it. If we need to "check" for a key each time, we'd like to use that value instead of fetching it twice.
- If we did end up generating the value on this get, we should return that value instead of doing a cache round-trip.
To use this mode, the steps are as follows:
- Create the Dogpile lock with init=True, to skip the initial "force" of the creation function. This is assuming you'd like to rely upon the "check the value" function for the initial generation. Leave it at False if you'd like the application to regenerate the value unconditionally when the Dogpile lock is first created (i.e. typically application startup).
- The "creation" function should return the value it creates.
- An additional "getter" function is passed to acquire() which should return the value to be passed to the context block. If the value isn't available, raise NeedRegenerationException.
Example:
from dogpile.core import Dogpile, NeedRegenerationException

def get_value_from_cache():
    value = my_cache.get("some key")
    if value is None:
        raise NeedRegenerationException()
    return value

def create_and_cache_value():
    value = my_expensive_resource.create_value()
    my_cache.put("some key", value)
    return value

dogpile = Dogpile(3600, init=True)

def get_value():
    # "return" is only valid inside a function, so the
    # with block is wrapped in one here
    with dogpile.acquire(create_and_cache_value, get_value_from_cache) as value:
        return value
Note that get_value_from_cache() should not raise NeedRegenerationException a second time directly after create_and_cache_value() has been called.
Using dogpile.core for Caching
dogpile.core is part of an effort to "break up" the Beaker package into smaller, simpler components (which also work better). Here, we illustrate how to approximate Beaker's "cache decoration" function, to decorate any function and store the value in Memcached. We create a Python decorator function called cached() which will provide caching for the output of a single function. It's given the "key" which we'd like to use in Memcached, and internally it makes use of its own Dogpile object that is dedicated to managing this one function/key:
import pylibmc
from dogpile.core import Dogpile, NeedRegenerationException

# pylibmc.Client expects a list of server addresses
mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client(["localhost"]))

def cached(key, expiration_time):
    """A decorator that will cache the return value of a function
    in memcached given a key."""

    def get_value():
        with mc_pool.reserve() as mc:
            value = mc.get(key)
            if value is None:
                raise NeedRegenerationException()
            return value

    dogpile = Dogpile(expiration_time, init=True)

    def decorate(fn):
        def gen_cached():
            value = fn()
            with mc_pool.reserve() as mc:
                # pylibmc clients store values with set()
                mc.set(key, value)
            return value

        def invoke():
            with dogpile.acquire(gen_cached, get_value) as value:
                return value
        return invoke

    return decorate
Above we can decorate any function as:
@cached("some key", 3600)
def generate_my_expensive_value():
    return slow_database.lookup("stuff")
The Dogpile lock will ensure that only one thread at a time performs slow_database.lookup(), and only every 3600 seconds, unless Memcached has removed the value, in which case it will be called again as needed.
In particular, dogpile.core's system allows us to call the memcached get() function at most once per access, whereas Beaker's system calls it twice, and it doesn't make us call get() when we have just created the value.
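To make the "at most one get() per access" claim concrete, here is a small stand-alone sketch of the getter/creator flow with a cache that counts its reads. Everything in it is hypothetical: CountingCache is an in-memory stand-in for Memcached, NeedRegeneration stands in for NeedRegenerationException, and fetch() models a single access without any of the locking that Dogpile adds.

```python
class NeedRegeneration(Exception):
    """Stand-in for dogpile.core's NeedRegenerationException."""


class CountingCache:
    """Hypothetical in-memory cache that counts get() calls."""

    def __init__(self):
        self.data = {}
        self.gets = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


def fetch(getter, creator):
    """One access: try the getter once, fall back to the creator."""
    try:
        return getter()
    except NeedRegeneration:
        # The creator's return value is used directly; no second get().
        return creator()


cache = CountingCache()

def getter():
    value = cache.get("some key")
    if value is None:
        raise NeedRegeneration()
    return value

def creator():
    value = "expensive value"
    cache.put("some key", value)
    return value

first = fetch(getter, creator)    # miss: one get(), then the creator runs
second = fetch(getter, creator)   # hit: one get(), creator not called
```

After the two accesses, cache.gets is 2: one read per access, whether or not the value had to be regenerated.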
Scaling dogpile.core against Many Keys
The patterns so far have illustrated how to use a single, persistently held Dogpile object which maintains a thread-based lock for the lifespan of some particular value. The Dogpile also is responsible for maintaining the last known "creation time" of the value; this is available from a given Dogpile object from the Dogpile.createdtime attribute.
For an application that may deal with an arbitrary number of cache keys retrieved from a remote service, this approach must be revised so that we don't need to store a Dogpile object for every possible key in our application's memory.
The two challenges here are:
- We need to create new Dogpile objects as needed, ideally sharing the object for a given key with all concurrent threads, but then not hold onto it afterwards.
- Since we aren't holding the Dogpile persistently, we need to store the last known "creation time" of the value somewhere else, i.e. in the cache itself, and ensure Dogpile uses it.
The approach is another one derived from Beaker, where we will use a registry that can provide a unique Dogpile object given a particular key, ensuring that all concurrent threads use the same object, but then releasing the object to the Python garbage collector when this usage is complete. The NameRegistry object provides this functionality, again constructed around the notion of a creation function that is only invoked as needed. We also will instruct the Dogpile.acquire() method to use a "creation time" value that we retrieve from the cache, via the value_and_created_fn parameter, which supersedes the value_fn we used earlier. value_and_created_fn expects a function that will return a tuple of (value, created_at), where it's assumed both have been retrieved from the cache backend:
import pylibmc
import time
from dogpile.core import Dogpile, NeedRegenerationException, NameRegistry

# pylibmc.Client expects a list of server addresses
mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client(["localhost"]))

def create_dogpile(key, expiration_time):
    return Dogpile(expiration_time)

dogpile_registry = NameRegistry(create_dogpile)

def get_or_create(key, expiration_time, creation_function):
    def get_value():
        with mc_pool.reserve() as mc:
            value_plus_time = mc.get(key)
            if value_plus_time is None:
                raise NeedRegenerationException()
            # return a tuple
            # (value, createdtime)
            return value_plus_time

    def gen_cached():
        value = creation_function()
        with mc_pool.reserve() as mc:
            # create a tuple
            # (value, createdtime)
            value_plus_time = (value, time.time())
            # pylibmc clients store values with set()
            mc.set(key, value_plus_time)
            return value_plus_time

    dogpile = dogpile_registry.get(key, expiration_time)

    with dogpile.acquire(gen_cached, value_and_created_fn=get_value) as value:
        return value
Stepping through the above code:
- After the imports, we set up the memcached backend using the pylibmc library's recommended pattern for thread-safe access.
- We create a Python function that will, given a cache key and an expiration time, produce a Dogpile object which will produce the dogpile mutex on an as-needed basis. The function here doesn't actually need the key, even though the NameRegistry will be passing it in. Later, we'll see the scenario for which we'll need this value.
- We construct a NameRegistry, using our dogpile creator function, that will generate for us new Dogpile locks for individual keys as needed.
- We define the get_or_create() function. This function will accept the cache key, an expiration time value, and a function that is used to create a new value if one does not exist or the current value is expired.
- The get_or_create() function defines two callables, get_value() and gen_cached(). These two functions are exactly analogous to the functions of the same name in "Using dogpile.core for Caching": get_value() retrieves the value from the cache, raising NeedRegenerationException if not present; gen_cached() calls the creation function to generate a new value, stores it in the cache, and returns it. The only difference here is that instead of storing and retrieving the value alone from the cache, the value is stored along with its creation time; when we make a new value, we set this to time.time(). While the value and creation time pair are stored here as a tuple, it doesn't actually matter how the two are persisted; only that the tuple value is returned from both functions.
- We acquire a new or existing Dogpile object from the registry using NameRegistry.get(). We pass the identifying key as well as the expiration time. A new Dogpile is created for the given key if one does not exist. If a Dogpile lock already exists in memory for the given key, we get that one back.
- We then call Dogpile.acquire() as we did in the previous cache examples, except we use the value_and_created_fn keyword for our get_value() function. Dogpile uses the "created time" value we pull from our cache to determine when the value was last created.
An example usage of the completed function:
import urllib2

def get_some_value(key):
    """retrieve a datafile from a slow site based on the given key."""
    def get_data():
        return urllib2.urlopen(
            "http://someslowsite.com/some_important_datafile_%s.json" % key
        ).read()
    return get_or_create(key, 3600, get_data)

my_data = get_some_value("somekey")
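The registry behavior used in this section, where one object is shared per key while in use and released to the garbage collector afterwards, can be sketched with weak references. This is an illustration under assumptions and not the NameRegistry source: WeakRegistry and PerKeyLock are hypothetical names, and PerKeyLock merely stands in for a Dogpile object.

```python
import gc
import threading
import weakref


class WeakRegistry:
    """Hypothetical per-key object registry backed by weak references."""

    def __init__(self, creator):
        self._creator = creator
        self._values = weakref.WeakValueDictionary()
        self._mutex = threading.Lock()

    def get(self, key, *args, **kw):
        # The mutex ensures concurrent callers for the same key
        # receive the same object instead of creating two.
        with self._mutex:
            try:
                return self._values[key]
            except KeyError:
                value = self._values[key] = self._creator(key, *args, **kw)
                return value


class PerKeyLock:
    """Stand-in for the object created per key (e.g. a Dogpile)."""

    def __init__(self, key, expiration_time):
        self.key = key
        self.expiration_time = expiration_time


registry = WeakRegistry(PerKeyLock)
first = registry.get("some key", 3600)
second = registry.get("some key", 3600)
shared = first is second            # same object while references are held
del first, second                   # drop the only strong references
gc.collect()
remaining = len(registry._values)   # the registry no longer holds the entry
```

Because the registry holds only weak references, its memory use follows the number of keys currently being accessed rather than the number of keys ever seen.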
Using a File or Distributed Lock with Dogpile
The final twist on the caching pattern is to fix the issue of the Dogpile mutex itself being local to the current process. When a handful of threads all go to access some key in our cache, they will access the same Dogpile object which internally can synchronize their activity using a Python threading.Lock. But in this example we're talking to a Memcached cache. What if we have many servers which all access this cache? We'd like all of these servers to coordinate together so that we don't just prevent the dogpile problem within a single process, we prevent it across all servers.
To accomplish this, we need an object that can coordinate processes. In this example we'll use a file-based lock as provided by the lockfile package, which uses a unix-symlink concept to provide a filesystem-level lock (which also has been made threadsafe). Another strategy may base itself directly off the Unix flock() call, available in Python as fcntl.flock(), and still another approach is to lock within Memcached itself, using a recipe such as that described at "Using Memcached as a Distributed Locking Service". The type of lock chosen here is based on a tradeoff between global availability and reliable performance. The file-based lock will perform more reliably than the memcached lock, but may be difficult to make accessible to multiple servers (with NFS being the most likely option, which would eliminate the possibility of the flock() call). The memcached lock on the other hand will provide the perfect scope, being available from the same memcached server that the cached value itself comes from; however the lock may vanish in some cases, which means we still could get a cache-regeneration pileup in that case.
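As an illustration of the flock()-based strategy mentioned above, the following sketch builds a cross-process mutex on the standard library's fcntl.flock(). It is Unix-only, and it rests on an assumption: that a replacement lock needs the same acquire(wait) and release() methods as threading.Lock. FlockMutex is a hypothetical class name and is not part of dogpile.core or the lockfile package.

```python
import fcntl
import os
import tempfile
import threading


class FlockMutex:
    """Hypothetical cross-process mutex built on fcntl.flock() (Unix only)."""

    def __init__(self, path):
        self._path = path
        self._fd = None
        # flock() does not exclude threads that share one file descriptor,
        # so a thread lock guards access within this process.
        self._thread_lock = threading.Lock()

    def acquire(self, wait=True):
        if not self._thread_lock.acquire(wait):
            return False
        fd = os.open(self._path, os.O_CREAT | os.O_RDWR)
        flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError:
            # Another holder has the lock and we were asked not to wait.
            os.close(fd)
            self._thread_lock.release()
            return False
        self._fd = fd
        return True

    def release(self):
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
        self._fd = None
        self._thread_lock.release()


# Two independent mutex objects pointing at the same lock file, as two
# separate processes would have.
lock_path = os.path.join(tempfile.mkdtemp(), "some_key.lock")
first = FlockMutex(lock_path)
second = FlockMutex(lock_path)
```

As the paragraph above notes, flock() is generally not a safe choice on NFS, so a lock like this suits multiple processes on one host rather than multiple servers.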
What all of these locking schemes have in common is that unlike the Python threading.Lock object, they all need access to an actual key which acts as the symbol that all processes will coordinate upon. This is where the key argument to our create_dogpile() function introduced in "Scaling dogpile.core against Many Keys" comes in. The example can remain the same, except for the changes below to just that function:
import lockfile
import os
from hashlib import sha1

# ... other imports and setup from the previous example

def create_dogpile(key, expiration_time):
    # sha1() requires bytes on Python 3, so the key is encoded first
    lock_path = os.path.join("/tmp", "%s.lock" % sha1(key.encode("utf-8")).hexdigest())
    return Dogpile(
        expiration_time,
        lock=lockfile.FileLock(lock_path)
    )

# ... everything else from the previous example
Where above, the only change is the lock argument passed to the constructor of Dogpile. For a given key "some_key", we generate a hex digest of it first as a quick way to remove any filesystem-unfriendly characters, we then use lockfile.FileLock() to create a lock against the file /tmp/53def077a4264bd3183d4eb21b1f56f883e1b572.lock. Any number of Dogpile objects in various processes will now coordinate with each other, using this common filename as the "baton" against which creation of a new value proceeds.
Locking the "write" phase against the "readers"
A less prominent feature of Dogpile ported from Beaker is the ability to provide a mutex against the actual resource being read and created, so that the creation function can perform certain tasks only after all reader threads have finished. The example of this is when the creation function has prepared a new datafile to replace the old one, and would like to switch in the new file only when other threads have finished using the old one.
To enable this feature, use SyncReaderDogpile. SyncReaderDogpile.acquire_write_lock() then provides a safe-write lock for the critical section where readers should be blocked:
from dogpile.core import SyncReaderDogpile

dogpile = SyncReaderDogpile(3600)

def some_creation_function():
    # called by acquire() with no arguments; uses the
    # module-level dogpile object defined above
    create_expensive_datafile()
    with dogpile.acquire_write_lock():
        replace_old_datafile_with_new()

# usage:
with dogpile.acquire(some_creation_function):
    read_datafile()
With the above pattern, SyncReaderDogpile will allow concurrent readers to read from the current version of the datafile as the create_expensive_datafile() function proceeds with its job of generating the information for a new version. When the data is ready to be written, the SyncReaderDogpile.acquire_write_lock() call will block until all current readers of the datafile have completed (that is, they've finished their own Dogpile.acquire() blocks). The some_creation_function() function then proceeds, as new readers are blocked until this function finishes its work of rewriting the datafile.
Note that the SyncReaderDogpile approach is useful when working with a resource that itself does not support concurrent access while being written, namely flat files, possibly some forms of DBM file. It is not needed when dealing with a datasource that already provides a high level of concurrency, such as a relational database, Memcached, or NoSQL store. Currently, the SyncReaderDogpile object only synchronizes within the current process among multiple threads; it won't at this time protect from concurrent access by multiple processes. Beaker did support this behavior however using lock files, and this functionality may be re-added in a future release.
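The reader/writer coordination described in this section can be sketched with a threading.Condition. This is a simplified model and not the SyncReaderDogpile implementation: ToyReadWriteLock and its method names are hypothetical, and the sketch only blocks new readers once the writer actually holds the lock, so a steady stream of readers could delay a waiting writer.

```python
import threading


class ToyReadWriteLock:
    """Hypothetical readers-writer lock: many readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            # Readers wait only while a writer holds the lock.
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake a waiting writer

    def acquire_write(self, wait=True):
        with self._cond:
            # The writer needs exclusive access: no readers, no writer.
            while self._writing or self._readers:
                if not wait:
                    return False
                self._cond.wait()
            self._writing = True
            return True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()       # wake blocked readers


rw = ToyReadWriteLock()
rw.acquire_read()                              # a reader is active
writer_got_in = rw.acquire_write(wait=False)   # writer cannot proceed yet
rw.release_read()
writer_after = rw.acquire_write(wait=False)    # no readers left
rw.release_write()
```

In terms of the datafile example, readers would hold the read side while inside their acquire() blocks, and the creation function would take the write side only around the step that swaps in the new file.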