Local write affinity for object PUT requests.

The proxy can now be configured to prefer local object servers for PUT
requests, where "local" is governed by the "write_affinity" setting. The
"write_affinity_node_count" setting controls how many local object
servers to try before giving up and going on to remote ones.
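
For example, a proxy in San Francisco (region 1) might be configured with
(illustrative values; the same example appears in the admin guide below):

    [app:proxy-server]
    write_affinity = r1
    write_affinity_node_count = 2 * replicas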

I chose to simply re-order the object servers instead of filtering out
nonlocal ones so that, if all of the local ones are down, clients can
still get successful responses (just slower).

The goal is to trade availability for throughput. By writing to local
object servers across fast LAN links, clients get better throughput
than if the object servers were far away over slow WAN links. The
downside, of course, is that data availability (not durability) may
suffer when drives fail.

The default configuration has no write affinity in it, so the default
behavior is unchanged.

Added some words about these settings to the admin guide.

DocImpact

Change-Id: I09a0bd00524544ff627a3bccdcdc48f40720a86e
Samuel Merritt 2013-06-13 11:24:29 -07:00
parent 75660a1e9e
commit d9f2a76973
12 changed files with 485 additions and 13 deletions

View File

@@ -282,6 +282,127 @@ allows it to be more easily consumed by third party utilities::
{"object": {"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0}, "container": {"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected": 12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}
-----------------------------------
Geographically Distributed Clusters
-----------------------------------
Swift's default configuration is currently designed to work in a
single region, where a region is defined as a group of machines with
high-bandwidth, low-latency links between them. However, configuration
options exist that make running a performant multi-region Swift
cluster possible.
For the rest of this section, we will assume a two-region Swift
cluster: region 1 in San Francisco (SF), and region 2 in New York
(NY). Each region shall contain within it 3 zones, numbered 1, 2, and
3, for a total of 6 zones.
~~~~~~~~~~~~~
read_affinity
~~~~~~~~~~~~~
This setting makes the proxy server prefer local backend servers for
GET and HEAD requests over non-local ones. For example, it is
preferable for an SF proxy server to service object GET requests
by talking to SF object servers, as the client will receive lower
latency and higher throughput.
By default, Swift randomly chooses one of the three replicas to give
to the client, thereby spreading the load evenly. In the case of a
geographically-distributed cluster, the administrator is likely to
prioritize keeping traffic local over even distribution of results.
This is where the read_affinity setting comes in.
Example::
[app:proxy-server]
read_affinity = r1=100
This will make the proxy attempt to service GET and HEAD requests from
backends in region 1 before contacting any backends in region 2.
However, if no region 1 backends are available (due to replica
placement, failed hardware, or other reasons), then the proxy will
fall back to backend servers in other regions.
Example::
[app:proxy-server]
read_affinity = r1z1=100, r1=200
This will make the proxy attempt to service GET and HEAD requests from
backends in region 1 zone 1, then backends in region 1, then any other
backends. If a proxy is physically close to a particular zone or
zones, this can provide bandwidth savings. For example, if a zone
corresponds to servers in a particular rack, and the proxy server is
in that same rack, then setting read_affinity to prefer reads from
within the rack will result in less traffic between the top-of-rack
switches.
The read_affinity setting may contain any number of region/zone
specifiers; the priority number (after the equals sign) determines the
ordering in which backend servers will be contacted. A lower number
means higher priority.
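Example (this is the example value shipped in the sample proxy
configuration)::
[app:proxy-server]
read_affinity = r1z1=100, r1z2=200, r2=300
This will make the proxy prefer backends in region 1 zone 1, then
backends in region 1 zone 2, then backends in region 2, and finally any
other backends.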
Note that read_affinity only affects the ordering of primary nodes
(see ring docs for definition of primary node), not the ordering of
handoff nodes.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
write_affinity and write_affinity_node_count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This setting makes the proxy server prefer local backend servers for
object PUT requests over non-local ones. For example, it may be
preferable for an SF proxy server to service object PUT requests
by talking to SF object servers, as the client will receive lower
latency and higher throughput. However, if this setting is used, note
that a NY proxy server handling a GET request for an object that was
PUT using write affinity may have to fetch it across the WAN link, as
the object won't immediately have any replicas in NY. Eventually, though,
replication will move the object's replicas to their proper homes in
both SF and NY.
Note that only object PUT requests are affected by the write_affinity
setting; POST, GET, HEAD, DELETE, OPTIONS, and account/container PUT
requests are not affected.
This setting lets you trade data distribution for throughput. If
write_affinity is enabled, then object replicas will initially be
stored all within a particular region or zone, thereby decreasing the
quality of the data distribution, but the replicas will be distributed
over fast LAN links, giving higher throughput to clients. Note that
the replicators will eventually move objects to their proper,
well-distributed homes.
The write_affinity setting is useful only when you don't typically
read objects immediately after writing them. For example, consider a
workload of mainly backups: if you have a bunch of machines in NY that
periodically write backups to Swift, then odds are that you don't then
immediately read those backups in SF. If your workload doesn't look
like that, then you probably shouldn't use write_affinity.
The write_affinity_node_count setting is only useful in conjunction
with write_affinity; it governs how many local object servers will be
tried before falling back to non-local ones.
Example::
[app:proxy-server]
write_affinity = r1
write_affinity_node_count = 2 * replicas
Assuming 3 replicas, this configuration will make object PUTs try
storing the object's replicas on up to 6 disks ("2 * replicas") in
region 1 ("r1").
You should be aware that, if you have data coming into SF faster than
your link to NY can transfer it, then your cluster's data distribution
will get worse and worse over time as objects pile up in SF. If this
happens, it is recommended to disable write_affinity and simply let
object PUTs traverse the WAN link, as that will naturally limit the
object growth rate to what your WAN link can handle.
--------------------------------
Cluster Telemetry and Monitoring
--------------------------------

View File

@@ -147,6 +147,26 @@ use = egg:swift#proxy
# read_affinity = r1z1=100, r1z2=200, r2=300
# Default is empty, meaning no preference.
# read_affinity =
#
# Which backend servers to prefer on writes. Format is r<N> for region
# N or r<N>z<M> for region N, zone M. If this is set, then when
# handling an object PUT request, some number (see setting
# write_affinity_node_count) of local backend servers will be tried
# before any nonlocal ones.
#
# Example: try to write to regions 1 and 2 before writing to any other
# nodes:
# write_affinity = r1, r2
# Default is empty, meaning no preference.
# write_affinity =
#
# The number of local (as governed by the write_affinity setting)
# nodes to attempt to contact first, before any non-local ones. You
# can use '* replicas' at the end to have it use the number given
# times the number of replicas for the ring being used for the
# request.
# write_affinity_node_count = 2 * replicas
[filter:tempauth]
use = egg:swift#tempauth

View File

@@ -1663,8 +1663,8 @@ def affinity_key_function(affinity_str):
:param affinity_str: affinity config value, e.g. "r1z2=3"
or "r1=1, r2z1=2, r2z2=2"
:returns: single-argument function, or None if argument invalid
:returns: single-argument function
:raises: ValueError if argument invalid
"""
affinity_str = affinity_str.strip()
@@ -1701,6 +1701,56 @@ def affinity_key_function(affinity_str):
return keyfn
def affinity_locality_predicate(write_affinity_str):
"""
Turns a write-affinity config value into a predicate function for nodes.
The returned value will be a 1-arg function that takes a node dictionary
and returns a true value if it is "local" and a false value otherwise. The
definition of "local" comes from the affinity_str argument passed in here.
For example, if affinity_str is "r1, r2z2", then only nodes where region=1
or where (region=2 and zone=2) are considered local.
If affinity_str is empty or all whitespace, then None is returned, and
the caller should consider every node local.
:param affinity_str: affinity config value, e.g. "r1z2"
or "r1, r2z1, r2z2"
:returns: single-argument function, or None if affinity_str is empty
:raises: ValueError if argument invalid
"""
affinity_str = write_affinity_str.strip()
if not affinity_str:
return None
matchers = []
pieces = [s.strip() for s in affinity_str.split(',')]
for piece in pieces:
# matches r<number> or r<number>z<number>
match = re.match("r(\d+)(?:z(\d+))?$", piece)
if match:
region, zone = match.groups()
region = int(region)
zone = int(zone) if zone else None
matcher = {'region': region}
if zone is not None:
matcher['zone'] = zone
matchers.append(matcher)
else:
raise ValueError("Invalid write-affinity value: %r" % affinity_str)
def is_local(ring_node):
for matcher in matchers:
if (matcher['region'] == ring_node['region']
and ('zone' not in matcher
or matcher['zone'] == ring_node['zone'])):
return True
return False
return is_local
def get_remote_client(req):
# remote host for zeus
client = req.headers.get('x-cluster-client-ip')
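
To illustrate the predicate this function returns (a standalone sketch;
the node dictionaries below are made up for the example)::
from swift.common.utils import affinity_locality_predicate
is_local = affinity_locality_predicate('r1, r2z2')
is_local({'region': 1, 'zone': 3})   # True: matches "r1"
is_local({'region': 2, 'zone': 2})   # True: matches "r2z2"
is_local({'region': 2, 'zone': 1})   # False: no matcher covers r2z1
affinity_locality_predicate('')      # None: no write affinity configured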

View File

@@ -28,6 +28,7 @@ import os
import time
import functools
import inspect
import itertools
from urllib import quote
from eventlet import spawn_n, GreenPile
@@ -599,7 +600,7 @@ class Controller(object):
info['nodes'] = nodes
return info
def iter_nodes(self, ring, partition):
def iter_nodes(self, ring, partition, node_iter=None):
"""
Yields nodes for a ring partition, skipping over error
limited nodes and stopping at the configurable number of
@@ -615,9 +616,22 @@
:param ring: ring to get yield nodes from
:param partition: ring partition to yield nodes for
:param node_iter: optional iterable of nodes to try. Useful if you
want to filter or reorder the nodes.
"""
primary_nodes = self.app.sort_nodes(ring.get_part_nodes(partition))
part_nodes = ring.get_part_nodes(partition)
if node_iter is None:
node_iter = itertools.chain(part_nodes,
ring.get_more_nodes(partition))
num_primary_nodes = len(part_nodes)
# Use of list() here forcibly yanks the first N nodes (the primary
# nodes) from node_iter, so the rest of its values are handoffs.
primary_nodes = self.app.sort_nodes(
list(itertools.islice(node_iter, num_primary_nodes)))
handoff_nodes = node_iter
nodes_left = self.app.request_node_count(ring)
for node in primary_nodes:
if not self.error_limited(node):
yield node
@@ -625,8 +639,9 @@
nodes_left -= 1
if nodes_left <= 0:
return
handoffs = 0
for node in ring.get_more_nodes(partition):
for node in handoff_nodes:
if not self.error_limited(node):
handoffs += 1
if self.app.log_handoffs:
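
The islice-then-continue pattern above, which splits a single node
iterator into primaries and handoffs, can be shown in isolation (a
generic Python sketch, not part of the change)::
import itertools
node_iter = itertools.chain(['p1', 'p2', 'p3'], iter(['h1', 'h2']))
# list() forcibly consumes the first three values (the "primaries")...
primary_nodes = list(itertools.islice(node_iter, 3))   # ['p1', 'p2', 'p3']
# ...so whatever remains in node_iter is the handoffs.
handoff_nodes = list(node_iter)                        # ['h1', 'h2']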

View File

@@ -364,6 +364,44 @@ class ObjectController(Controller):
except ListingIterNotAuthorized:
pass
def iter_nodes_local_first(self, ring, partition):
"""
Yields nodes for a ring partition.
If the 'write_affinity' setting is non-empty, then this will yield N
local nodes (as defined by the write_affinity setting) first, then the
rest of the nodes as normal. It is a re-ordering of the nodes such
that the local ones come first; no node is omitted. The effect is
that the request will be serviced by local object servers first, but
nonlocal ones will be employed if not enough local ones are available.
:param ring: ring to get nodes from
:param partition: ring partition to yield nodes for
"""
primary_nodes = ring.get_part_nodes(partition)
num_locals = self.app.write_affinity_node_count(ring)
is_local = self.app.write_affinity_is_local_fn
if is_local is None:
return self.iter_nodes(ring, partition)
all_nodes = itertools.chain(primary_nodes,
ring.get_more_nodes(partition))
first_n_local_nodes = list(itertools.islice(
itertools.ifilter(is_local, all_nodes), num_locals))
# refresh it; it moved when we computed first_n_local_nodes
all_nodes = itertools.chain(primary_nodes,
ring.get_more_nodes(partition))
local_first_node_iter = itertools.chain(
first_n_local_nodes,
itertools.ifilter(lambda node: node not in first_n_local_nodes,
all_nodes))
return self.iter_nodes(
ring, partition, node_iter=local_first_node_iter)
def is_good_source(self, src):
"""
Indicates whether or not the request made to the backend found
@@ -881,7 +919,7 @@
delete_at_container = delete_at_part = delete_at_nodes = None
node_iter = GreenthreadSafeIterator(
self.iter_nodes(self.app.object_ring, partition))
self.iter_nodes_local_first(self.app.object_ring, partition))
pile = GreenPile(len(nodes))
te = req.headers.get('transfer-encoding', '')
chunked = ('chunked' in te)
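
The reordering done by iter_nodes_local_first can be mimicked with plain
lists (an illustrative sketch with made-up node dicts; the real method
works on ring iterators and uses write_affinity_node_count)::
nodes = [{'id': 0, 'region': 0}, {'id': 1, 'region': 1},
         {'id': 2, 'region': 0}, {'id': 3, 'region': 1},
         {'id': 4, 'region': 1}]
is_local = lambda node: node['region'] == 1
num_locals = 2
first_locals = [n for n in nodes if is_local(n)][:num_locals]  # ids 1, 3
rest = [n for n in nodes if n not in first_locals]             # ids 0, 2, 4
local_first = first_locals + rest  # ids 1, 3, 0, 2, 4 -- no node is omitted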

View File

@@ -35,7 +35,7 @@ from eventlet import Timeout
from swift.common.ring import Ring
from swift.common.utils import cache_from_env, get_logger, \
get_remote_client, split_path, config_true_value, generate_trans_id, \
affinity_key_function
affinity_key_function, affinity_locality_predicate
from swift.common.constraints import check_utf8
from swift.proxy.controllers import AccountController, ObjectController, \
ContainerController
@@ -133,6 +133,25 @@ class Application(object):
# make the message a little more useful
raise ValueError("Invalid read_affinity value: %r (%s)" %
(read_affinity, err.message))
try:
write_affinity = conf.get('write_affinity', '')
self.write_affinity_is_local_fn \
= affinity_locality_predicate(write_affinity)
except ValueError as err:
# make the message a little more useful
raise ValueError("Invalid write_affinity value: %r (%s)" %
(write_affinity, err.message))
value = conf.get('write_affinity_node_count',
'2 * replicas').lower().split()
if len(value) == 1:
value = int(value[0])
self.write_affinity_node_count = lambda r: value
elif len(value) == 3 and value[1] == '*' and value[2] == 'replicas':
value = int(value[0])
self.write_affinity_node_count = lambda r: value * r.replica_count
else:
raise ValueError(
'Invalid write_affinity_node_count value: %r' % ''.join(value))
def get_controller(self, path):
"""

View File

@@ -19,11 +19,11 @@ from httplib import HTTPException
class FakeRing(object):
def __init__(self, replicas=3):
def __init__(self, replicas=3, max_more_nodes=0):
# 9 total nodes (6 more past the initial 3) is the cap, no matter if
# this is set higher, or R^2 for R replicas
self.replicas = replicas
self.max_more_nodes = 0
self.max_more_nodes = max_more_nodes
self.devs = {}
def set_replicas(self, replicas):
@@ -46,17 +46,24 @@
{'ip': '10.0.0.%s' % x,
'port': 1000 + x,
'device': 'sd' + (chr(ord('a') + x)),
'zone': x % 3,
'region': x % 2,
'id': x}
return 1, devs
def get_part_nodes(self, part):
return self.get_nodes('blah')[1]
def get_more_nodes(self, nodes):
def get_more_nodes(self, part):
# replicas^2 is the true cap
for x in xrange(self.replicas, min(self.replicas + self.max_more_nodes,
self.replicas * self.replicas)):
yield {'ip': '10.0.0.%s' % x, 'port': 1000 + x, 'device': 'sda'}
yield {'ip': '10.0.0.%s' % x,
'port': 1000 + x,
'device': 'sda',
'zone': x % 3,
'region': x % 2,
'id': x}
class FakeMemcache(object):

View File

@@ -1663,6 +1663,50 @@ class TestAffinityKeyFunction(unittest.TestCase):
ids = [n['id'] for n in sorted(self.nodes, key=keyfn)]
self.assertEqual([3, 2, 0, 1, 4, 5, 6, 7], ids)
class TestAffinityLocalityPredicate(unittest.TestCase):
def setUp(self):
self.nodes = [dict(id=0, region=1, zone=1),
dict(id=1, region=1, zone=2),
dict(id=2, region=2, zone=1),
dict(id=3, region=2, zone=2),
dict(id=4, region=3, zone=1),
dict(id=5, region=3, zone=2),
dict(id=6, region=4, zone=0),
dict(id=7, region=4, zone=1)]
def test_empty(self):
pred = utils.affinity_locality_predicate('')
self.assert_(pred is None)
def test_region(self):
pred = utils.affinity_locality_predicate('r1')
self.assert_(callable(pred))
ids = [n['id'] for n in self.nodes if pred(n)]
self.assertEqual([0, 1], ids)
def test_zone(self):
pred = utils.affinity_locality_predicate('r1z1')
self.assert_(callable(pred))
ids = [n['id'] for n in self.nodes if pred(n)]
self.assertEqual([0], ids)
def test_multiple(self):
pred = utils.affinity_locality_predicate('r1, r3, r4z0')
self.assert_(callable(pred))
ids = [n['id'] for n in self.nodes if pred(n)]
self.assertEqual([0, 1, 4, 5, 6], ids)
def test_invalid(self):
self.assertRaises(ValueError,
utils.affinity_locality_predicate, 'falafel')
self.assertRaises(ValueError,
utils.affinity_locality_predicate, 'r8zQ')
self.assertRaises(ValueError,
utils.affinity_locality_predicate, 'r2d2')
self.assertRaises(ValueError,
utils.affinity_locality_predicate, 'r1z1=1')
class TestGreenthreadSafeIterator(unittest.TestCase):
def increment(self, iterable):
plus_ones = []

View File

@@ -29,7 +29,7 @@ class TestAccountController(unittest.TestCase):
self.app = proxy_server.Application(None, FakeMemcache(),
account_ring=FakeRing(),
container_ring=FakeRing(),
object_ring=FakeRing)
object_ring=FakeRing())
def test_account_info_in_response_env(self):
controller = proxy_server.AccountController(self.app, 'AUTH_bob')

View File

@@ -29,7 +29,7 @@ class TestContainerController(unittest.TestCase):
self.app = proxy_server.Application(None, FakeMemcache(),
account_ring=FakeRing(),
container_ring=FakeRing(),
object_ring=FakeRing)
object_ring=FakeRing())
def test_container_info_in_response_env(self):
controller = proxy_server.ContainerController(self.app, 'a', 'c')

View File

@@ -0,0 +1,67 @@
#!/usr/bin/env python
# Copyright (c) 2010-2012 OpenStack, LLC.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import mock
import unittest
from swift.proxy import server as proxy_server
from test.unit import fake_http_connect, FakeRing, FakeMemcache
class TestObjControllerWriteAffinity(unittest.TestCase):
def setUp(self):
self.app = proxy_server.Application(
None, FakeMemcache(), account_ring=FakeRing(),
container_ring=FakeRing(), object_ring=FakeRing(max_more_nodes=9))
self.app.request_node_count = lambda ring: 10000000
self.app.sort_nodes = lambda l: l # stop shuffling the primary nodes
def test_iter_nodes_local_first_noops_when_no_affinity(self):
controller = proxy_server.ObjectController(self.app, 'a', 'c', 'o')
self.app.write_affinity_is_local_fn = None
all_nodes = self.app.object_ring.get_part_nodes(1)
all_nodes.extend(self.app.object_ring.get_more_nodes(1))
local_first_nodes = list(controller.iter_nodes_local_first(
self.app.object_ring, 1))
fr = FakeRing()
self.maxDiff = None
self.assertEqual(all_nodes, local_first_nodes)
def test_iter_nodes_local_first_moves_locals_first(self):
controller = proxy_server.ObjectController(self.app, 'a', 'c', 'o')
self.app.write_affinity_is_local_fn = (lambda node: node['region'] == 1)
self.app.write_affinity_node_count = lambda ring: 4
all_nodes = self.app.object_ring.get_part_nodes(1)
all_nodes.extend(self.app.object_ring.get_more_nodes(1))
local_first_nodes = list(controller.iter_nodes_local_first(
self.app.object_ring, 1))
# the local nodes move up in the ordering
self.assertEqual([1, 1, 1, 1],
[node['region'] for node in local_first_nodes[:4]])
# we don't skip any nodes
self.assertEqual(sorted(all_nodes), sorted(local_first_nodes))
if __name__ == '__main__':
unittest.main()

View File

@@ -766,6 +766,78 @@ class TestObjectController(unittest.TestCase):
res = controller.PUT(req)
self.assertTrue(res.status.startswith('201 '))
def test_PUT_respects_write_affinity(self):
written_to = []
def test_connect(ipaddr, port, device, partition, method, path,
headers=None, query_string=None):
if path == '/a/c/o.jpg':
written_to.append((ipaddr, port, device))
with save_globals():
def is_r0(node):
return node['region'] == 0
self.app.object_ring.max_more_nodes = 100
self.app.write_affinity_is_local_fn = is_r0
self.app.write_affinity_node_count = lambda r: 3
controller = \
proxy_server.ObjectController(self.app, 'a', 'c', 'o.jpg')
set_http_connect(200, 200, 201, 201, 201,
give_connect=test_connect)
req = Request.blank('/a/c/o.jpg', {})
req.content_length = 1
req.body = 'a'
self.app.memcache.store = {}
res = controller.PUT(req)
self.assertTrue(res.status.startswith('201 '))
self.assertEqual(3, len(written_to))
for ip, port, device in written_to:
# this is kind of a hokey test, but in FakeRing, the port is even
# when the region is 0, and odd when the region is 1, so this test
# asserts that we only wrote to nodes in region 0.
self.assertEqual(0, port % 2)
def test_PUT_respects_write_affinity_with_507s(self):
written_to = []
def test_connect(ipaddr, port, device, partition, method, path,
headers=None, query_string=None):
if path == '/a/c/o.jpg':
written_to.append((ipaddr, port, device))
with save_globals():
def is_r0(node):
return node['region'] == 0
self.app.object_ring.max_more_nodes = 100
self.app.write_affinity_is_local_fn = is_r0
self.app.write_affinity_node_count = lambda r: 3
controller = \
proxy_server.ObjectController(self.app, 'a', 'c', 'o.jpg')
controller.error_limit(
self.app.object_ring.get_part_nodes(1)[0], 'test')
set_http_connect(200, 200, # account, container
201, 201, 201, # 3 working backends
give_connect=test_connect)
req = Request.blank('/a/c/o.jpg', {})
req.content_length = 1
req.body = 'a'
self.app.memcache.store = {}
res = controller.PUT(req)
self.assertTrue(res.status.startswith('201 '))
self.assertEqual(3, len(written_to))
# this is kind of a hokey test, but in FakeRing, the port is even when
# the region is 0, and odd when the region is 1, so this test asserts
# that we wrote to 2 nodes in region 0, then went to 1 non-r0 node.
self.assertEqual(0, written_to[0][1] % 2) # it's (ip, port, device)
self.assertEqual(0, written_to[1][1] % 2)
self.assertNotEqual(0, written_to[2][1] % 2)
def test_PUT_message_length_using_content_length(self):
prolis = _test_sockets[0]
sock = connect_tcp(('localhost', prolis.getsockname()[1]))
@@ -2188,6 +2260,25 @@ class TestObjectController(unittest.TestCase):
self.assertEquals(len(first_nodes), 6)
self.assertEquals(len(second_nodes), 7)
def test_iter_nodes_with_custom_node_iter(self):
controller = proxy_server.ObjectController(self.app, 'a', 'c', 'o')
node_list = [dict(id=n) for n in xrange(10)]
with nested(
mock.patch.object(self.app, 'sort_nodes', lambda n: n),
mock.patch.object(self.app, 'request_node_count',
lambda r: 3)):
got_nodes = list(controller.iter_nodes(self.app.object_ring, 0,
node_iter=iter(node_list)))
self.assertEqual(node_list[:3], got_nodes)
with nested(
mock.patch.object(self.app, 'sort_nodes', lambda n: n),
mock.patch.object(self.app, 'request_node_count',
lambda r: 1000000)):
got_nodes = list(controller.iter_nodes(self.app.object_ring, 0,
node_iter=iter(node_list)))
self.assertEqual(node_list, got_nodes)
def test_best_response_sets_etag(self):
controller = proxy_server.ObjectController(self.app, 'account',
'container', 'object')