# Copyright (c) 2010-2012 OpenStack Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import mock
import operator
import os
import unittest
import cPickle as pickle
from collections import defaultdict
from tempfile import mkdtemp
from shutil import rmtree

from swift.common import exceptions
from swift.common import ring
from swift.common.ring.builder import MAX_BALANCE


class TestRingBuilder(unittest.TestCase):

    def setUp(self):
        self.testdir = mkdtemp()

    def tearDown(self):
        rmtree(self.testdir, ignore_errors=1)

    def test_init(self):
        rb = ring.RingBuilder(8, 3, 1)
        self.assertEquals(rb.part_power, 8)
        self.assertEquals(rb.replicas, 3)
        self.assertEquals(rb.min_part_hours, 1)
        self.assertEquals(rb.parts, 2 ** 8)
        self.assertEquals(rb.devs, [])
        self.assertEquals(rb.devs_changed, False)
        self.assertEquals(rb.version, 0)

    def test_overlarge_part_powers(self):
        ring.RingBuilder(32, 3, 1)  # passes by not crashing
        self.assertRaises(ValueError, ring.RingBuilder, 33, 3, 1)

    def test_insufficient_replicas(self):
        ring.RingBuilder(8, 1.0, 1)  # passes by not crashing
        self.assertRaises(ValueError, ring.RingBuilder, 8, 0.999, 1)

    def test_negative_min_part_hours(self):
        ring.RingBuilder(8, 3, 0)  # passes by not crashing
        self.assertRaises(ValueError, ring.RingBuilder, 8, 3, -1)

    def test_get_ring(self):
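        # get_ring() should hand back the same RingData object until the
        # builder changes again; rebalancing invalidates the cached copy.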
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10004, 'device': 'sda1'})
        rb.remove_dev(1)
        rb.rebalance()
        r = rb.get_ring()
        self.assert_(isinstance(r, ring.RingData))
        r2 = rb.get_ring()
        self.assert_(r is r2)
        rb.rebalance()
        r3 = rb.get_ring()
        self.assert_(r3 is not r2)
        r4 = rb.get_ring()
        self.assert_(r3 is r4)

    def test_rebalance_with_seed(self):
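        # Two builders with identical devices rebalanced with the same seed
        # should produce identical rings; the unseeded rebalance here
        # produces a different result.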
        devs = [(0, 10000), (1, 10001), (2, 10002), (1, 10003)]
        ring_builders = []
        for n in range(3):
            rb = ring.RingBuilder(8, 3, 1)
            for idx, (zone, port) in enumerate(devs):
                rb.add_dev({'id': idx, 'region': 0, 'zone': zone, 'weight': 1,
                            'ip': '127.0.0.1', 'port': port, 'device': 'sda1'})
            ring_builders.append(rb)

        rb0 = ring_builders[0]
        rb1 = ring_builders[1]
        rb2 = ring_builders[2]

        r0 = rb0.get_ring()
        self.assertTrue(rb0.get_ring() is r0)

        rb0.rebalance()  # NO SEED
        rb1.rebalance(seed=10)
        rb2.rebalance(seed=10)

        r1 = rb1.get_ring()
        r2 = rb2.get_ring()

        self.assertFalse(rb0.get_ring() is r0)
        self.assertNotEquals(r0.to_dict(), r1.to_dict())
        self.assertEquals(r1.to_dict(), r2.to_dict())

    def test_rebalance_part_on_deleted_other_part_on_drained(self):
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.add_dev({'id': 3, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.add_dev({'id': 4, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10004, 'device': 'sda1'})
        rb.add_dev({'id': 5, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10005, 'device': 'sda1'})

        rb.rebalance(seed=1)
        # We want a partition where 1 replica is on a removed device, 1
        # replica is on a 0-weight device, and 1 on a normal device. To
        # guarantee we have one, we see where partition 123 is, then
        # manipulate its devices accordingly.
        zero_weight_dev_id = rb._replica2part2dev[1][123]
        delete_dev_id = rb._replica2part2dev[2][123]

        rb.set_dev_weight(zero_weight_dev_id, 0.0)
        rb.remove_dev(delete_dev_id)
        rb.rebalance()

    def test_set_replicas(self):
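        # set_replicas() should mark the builder as changed, but a
        # negligibly small change to the replica count should not.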
        rb = ring.RingBuilder(8, 3.2, 1)
        rb.devs_changed = False
        rb.set_replicas(3.25)
        self.assertTrue(rb.devs_changed)

        rb.devs_changed = False
        rb.set_replicas(3.2500001)
        self.assertFalse(rb.devs_changed)

    def test_add_dev(self):
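        # Adding the same device twice should raise DuplicateDeviceError,
        # and devices added without an 'id' should get sequential ids.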
        rb = ring.RingBuilder(8, 3, 1)
        dev = {'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
               'ip': '127.0.0.1', 'port': 10000}
        dev_id = rb.add_dev(dev)
        self.assertRaises(exceptions.DuplicateDeviceError, rb.add_dev, dev)
        self.assertEqual(dev_id, 0)
        rb = ring.RingBuilder(8, 3, 1)
        # test add new dev with no id
        dev_id = rb.add_dev({'zone': 0, 'region': 1, 'weight': 1,
                             'ip': '127.0.0.1', 'port': 6000})
        self.assertEquals(rb.devs[0]['id'], 0)
        self.assertEqual(dev_id, 0)
        # test add another dev with no id
        dev_id = rb.add_dev({'zone': 3, 'region': 2, 'weight': 1,
                             'ip': '127.0.0.1', 'port': 6000})
        self.assertEquals(rb.devs[1]['id'], 1)
        self.assertEqual(dev_id, 1)

    def test_set_dev_weight(self):
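        # Partition counts should track device weights: equal-weight devices
        # get equal shares, and changing a weight shifts partitions on the
        # next rebalance.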
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 0.5,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 0.5,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 128, 1: 128, 2: 256, 3: 256})
        rb.set_dev_weight(0, 0.75)
        rb.set_dev_weight(1, 0.25)
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 192, 1: 64, 2: 256, 3: 256})

    def test_remove_dev(self):
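        # After a device is removed, its partitions should be spread across
        # the remaining devices on the next rebalance.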
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 3, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 192, 1: 192, 2: 192, 3: 192})
        rb.remove_dev(1)
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 256, 2: 256, 3: 256})

    def test_remove_a_lot(self):
        rb = ring.RingBuilder(3, 3, 1)
        rb.add_dev({'id': 0, 'device': 'd0', 'ip': '10.0.0.1',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 1})
        rb.add_dev({'id': 1, 'device': 'd1', 'ip': '10.0.0.2',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 2})
        rb.add_dev({'id': 2, 'device': 'd2', 'ip': '10.0.0.3',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 3})
        rb.add_dev({'id': 3, 'device': 'd3', 'ip': '10.0.0.1',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 1})
        rb.add_dev({'id': 4, 'device': 'd4', 'ip': '10.0.0.2',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 2})
        rb.add_dev({'id': 5, 'device': 'd5', 'ip': '10.0.0.3',
                    'port': 6002, 'weight': 1000.0, 'region': 0, 'zone': 3})
        rb.rebalance()
        rb.validate()

        # this has to put more than 1/3 of the partitions in the
        # cluster on removed devices in order to ensure that at least
        # one partition has multiple replicas that need to move.
        #
        # (for an N-replica ring, it's more than 1/N of the
        # partitions, of course)
        rb.remove_dev(3)
        rb.remove_dev(4)
        rb.remove_dev(5)

        rb.rebalance()
        rb.validate()

    def test_shuffled_gather(self):
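        # The helper returns True when the gathered partitions come back in
        # (mostly) sorted order; two True results in a row almost certainly
        # mean the shuffle has been lost.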
        if self._shuffled_gather_helper() and \
                self._shuffled_gather_helper():
            raise AssertionError('It is highly likely the ring is no '
                                 'longer shuffling the set of partitions '
                                 'to reassign on a rebalance.')

    def _shuffled_gather_helper(self):
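        # Returns True if the longest ascending run of gathered partition
        # numbers covers more than half the list, i.e. the parts look
        # sorted rather than shuffled.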
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.rebalance()
        rb.add_dev({'id': 3, 'region': 0, 'zone': 3, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.pretend_min_part_hours_passed()
        parts = rb._gather_reassign_parts()
        max_run = 0
        run = 0
        last_part = 0
        for part, _ in parts:
            if part > last_part:
                run += 1
            else:
                if run > max_run:
                    max_run = run
                run = 0
            last_part = part
        if run > max_run:
            max_run = run
        return max_run > len(parts) / 2

    def test_initial_balance(self):
        # 2 boxes, 2 drives each in zone 1
        # 1 box, 2 drives in zone 2
        #
        # This is balanceable, but there used to be some nondeterminism in
        # rebalance() that would sometimes give you an imbalanced ring.
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'region': 1, 'zone': 1, 'weight': 4000.0,
                    'ip': '10.1.1.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'region': 1, 'zone': 1, 'weight': 4000.0,
                    'ip': '10.1.1.1', 'port': 10000, 'device': 'sdb'})

        rb.add_dev({'region': 1, 'zone': 1, 'weight': 4000.0,
                    'ip': '10.1.1.2', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'region': 1, 'zone': 1, 'weight': 4000.0,
                    'ip': '10.1.1.2', 'port': 10000, 'device': 'sdb'})

        rb.add_dev({'region': 1, 'zone': 2, 'weight': 4000.0,
                    'ip': '10.1.1.3', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'region': 1, 'zone': 2, 'weight': 4000.0,
                    'ip': '10.1.1.3', 'port': 10000, 'device': 'sdb'})

        _, balance = rb.rebalance(seed=2)

        # maybe not *perfect*, but should be close
        self.assert_(balance <= 1)

    def test_multitier_partial(self):
        # Multitier test, nothing full
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'id': 1, 'region': 1, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb'})
        rb.add_dev({'id': 2, 'region': 2, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc'})
        rb.add_dev({'id': 3, 'region': 3, 'zone': 3, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdd'})

        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            counts = defaultdict(lambda: defaultdict(int))
            for replica in xrange(rb.replicas):
                dev = rb.devs[rb._replica2part2dev[replica][part]]
                counts['region'][dev['region']] += 1
                counts['zone'][dev['zone']] += 1

            if any(c > 1 for c in counts['region'].values()):
                raise AssertionError(
                    "Partition %d not evenly region-distributed (got %r)" %
                    (part, counts['region']))
            if any(c > 1 for c in counts['zone'].values()):
                raise AssertionError(
                    "Partition %d not evenly zone-distributed (got %r)" %
                    (part, counts['zone']))

        # Multitier test, zones full, nodes not full
        rb = ring.RingBuilder(8, 6, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc'})

        rb.add_dev({'id': 3, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdd'})
        rb.add_dev({'id': 4, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sde'})
        rb.add_dev({'id': 5, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdf'})

        rb.add_dev({'id': 6, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sdg'})
        rb.add_dev({'id': 7, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sdh'})
        rb.add_dev({'id': 8, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sdi'})

        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            counts = defaultdict(lambda: defaultdict(int))
            for replica in xrange(rb.replicas):
                dev = rb.devs[rb._replica2part2dev[replica][part]]
                counts['zone'][dev['zone']] += 1
                counts['dev_id'][dev['id']] += 1
            if counts['zone'] != {0: 2, 1: 2, 2: 2}:
                raise AssertionError(
                    "Partition %d not evenly distributed (got %r)" %
                    (part, counts['zone']))
            for dev_id, replica_count in counts['dev_id'].iteritems():
                if replica_count > 1:
                    raise AssertionError(
                        "Partition %d is on device %d more than once (%r)" %
                        (part, dev_id, counts['dev_id']))

    def test_multitier_full(self):
        # Multitier test, #replicas == #devs
        rb = ring.RingBuilder(8, 6, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb'})

        rb.add_dev({'id': 2, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdd'})

        rb.add_dev({'id': 4, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sde'})
        rb.add_dev({'id': 5, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdf'})

        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            counts = defaultdict(lambda: defaultdict(int))
            for replica in xrange(rb.replicas):
                dev = rb.devs[rb._replica2part2dev[replica][part]]
                counts['zone'][dev['zone']] += 1
                counts['dev_id'][dev['id']] += 1
            if counts['zone'] != {0: 2, 1: 2, 2: 2}:
                raise AssertionError(
                    "Partition %d not evenly distributed (got %r)" %
                    (part, counts['zone']))
            for dev_id, replica_count in counts['dev_id'].iteritems():
                if replica_count != 1:
                    raise AssertionError(
                        "Partition %d is on device %d %d times, not 1 (%r)" %
                        (part, dev_id, replica_count, counts['dev_id']))

    def test_multitier_overfull(self):
        # Multitier test, #replicas > #devs + 2 (to prove even distribution)
        rb = ring.RingBuilder(8, 8, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb'})

        rb.add_dev({'id': 2, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdd'})

        rb.add_dev({'id': 4, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sde'})
        rb.add_dev({'id': 5, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdf'})

        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            counts = defaultdict(lambda: defaultdict(int))
            for replica in xrange(rb.replicas):
                dev = rb.devs[rb._replica2part2dev[replica][part]]
                counts['zone'][dev['zone']] += 1
                counts['dev_id'][dev['id']] += 1

            self.assertEquals(8, sum(counts['zone'].values()))
            for zone, replica_count in counts['zone'].iteritems():
                if replica_count not in (2, 3):
                    raise AssertionError(
                        "Partition %d not evenly distributed (got %r)" %
                        (part, counts['zone']))
            for dev_id, replica_count in counts['dev_id'].iteritems():
                if replica_count not in (1, 2):
                    raise AssertionError(
                        "Partition %d is on device %d %d times, "
                        "not 1 or 2 (%r)" %
                        (part, dev_id, replica_count, counts['dev_id']))

    def test_multitier_expansion_more_devices(self):
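        # After adding a second device to each zone and rebalancing a few
        # times, every partition should end up with one replica per device
        # and two replicas per zone.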
        rb = ring.RingBuilder(8, 6, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc'})

        rb.rebalance()
        rb.validate()

        rb.add_dev({'id': 3, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdd'})
        rb.add_dev({'id': 4, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sde'})
        rb.add_dev({'id': 5, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdf'})

        for _ in xrange(5):
            rb.pretend_min_part_hours_passed()
            rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            counts = dict(zone=defaultdict(int),
                          dev_id=defaultdict(int))
            for replica in xrange(rb.replicas):
                dev = rb.devs[rb._replica2part2dev[replica][part]]
                counts['zone'][dev['zone']] += 1
                counts['dev_id'][dev['id']] += 1

            self.assertEquals({0: 2, 1: 2, 2: 2}, dict(counts['zone']))
            self.assertEquals({0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
                              dict(counts['dev_id']))

    def test_multitier_part_moves_with_0_min_part_hours(self):
        rb = ring.RingBuilder(8, 3, 0)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
|
As-unique-as-possible partition replica placement.
This commit introduces a new algorithm for assigning partition
replicas to devices. Basically, the ring builder organizes the devices
into tiers (first zone, then IP/port, then device ID). When placing a
replica, the ring builder looks for the emptiest device (biggest
parts_wanted) in the furthest-away tier.
In the case where zone-count >= replica-count, the new algorithm will
give the same results as the one it replaces. Thus, no migration is
needed.
In the case where zone-count < replica-count, the new algorithm
behaves differently from the old algorithm. The new algorithm will
distribute things evenly at each tier so that the replication is as
high-quality as possible, given the circumstances. The old algorithm
would just crash, so again, no migration is needed.
Handoffs have also been updated to use the new algorithm. When
generating handoff nodes, first the ring looks for nodes in other
zones, then other ips/ports, then any other drive. The first handoff
nodes (the ones in other zones) will be the same as before; this
commit just extends the list of handoff nodes.
The proxy server and replicators have been altered to avoid looking at
the ring's replica count directly. Previously, with a replica count of
C, RingData.get_nodes() and RingData.get_part_nodes() would return
lists of length C, so some other code used the replica count when it
needed the number of nodes. If two of a partition's replicas are on
the same device (e.g. with 3 replicas, 2 devices), then that
assumption is no longer true. Fortunately, all the proxy server and
replicators really needed was the number of nodes returned, which they
already had. (Bonus: now the only code that mentions replica_count
directly is in the ring and the ring builder.)
Change-Id: Iba2929edfc6ece89791890d0635d4763d821a3aa
2012-04-23 10:41:44 -07:00
|
|
|
rb.rebalance()
|
|
|
|
rb.validate()
|
|
|
|
|
|
|
|
# min_part_hours is 0, so we're clear to move 2 replicas to
|
|
|
|
# new devs
|
2013-03-04 17:05:43 -08:00
|
|
|
rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 1,
|
|
|
|
'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb1'})
|
|
|
|
rb.add_dev({'id': 2, 'region': 0, 'zone': 0, 'weight': 1,
|
|
|
|
'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc1'})
|
As-unique-as-possible partition replica placement.
This commit introduces a new algorithm for assigning partition
replicas to devices. Basically, the ring builder organizes the devices
into tiers (first zone, then IP/port, then device ID). When placing a
replica, the ring builder looks for the emptiest device (biggest
parts_wanted) in the furthest-away tier.
In the case where zone-count >= replica-count, the new algorithm will
give the same results as the one it replaces. Thus, no migration is
needed.
In the case where zone-count < replica-count, the new algorithm
behaves differently from the old algorithm. The new algorithm will
distribute things evenly at each tier so that the replication is as
high-quality as possible, given the circumstances. The old algorithm
would just crash, so again, no migration is needed.
Handoffs have also been updated to use the new algorithm. When
generating handoff nodes, first the ring looks for nodes in other
zones, then other ips/ports, then any other drive. The first handoff
nodes (the ones in other zones) will be the same as before; this
commit just extends the list of handoff nodes.
The proxy server and replicators have been altered to avoid looking at
the ring's replica count directly. Previously, with a replica count of
C, RingData.get_nodes() and RingData.get_part_nodes() would return
lists of length C, so some other code used the replica count when it
needed the number of nodes. If two of a partition's replicas are on
the same device (e.g. with 3 replicas, 2 devices), then that
assumption is no longer true. Fortunately, all the proxy server and
replicators really needed was the number of nodes returned, which they
already had. (Bonus: now the only code that mentions replica_count
directly is in the ring and the ring builder.)
Change-Id: Iba2929edfc6ece89791890d0635d4763d821a3aa
2012-04-23 10:41:44 -07:00
|
|
|
rb.rebalance()
|
|
|
|
rb.validate()
|
|
|
|
|
|
|
|
for part in xrange(rb.parts):
|
|
|
|
devs = set()
|
|
|
|
for replica in xrange(rb.replicas):
|
|
|
|
devs.add(rb._replica2part2dev[replica][part])
|
|
|
|
|
|
|
|
if len(devs) != 3:
|
|
|
|
raise AssertionError(
|
|
|
|
"Partition %d not on 3 devs (got %r)" % (part, devs))
|
|
|
|
|
|
|
|
    def test_multitier_part_moves_with_positive_min_part_hours(self):
        rb = ring.RingBuilder(8, 3, 99)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.rebalance()
        rb.validate()

        # min_part_hours is >0, so we'll only be able to move 1
        # replica to a new home
        rb.add_dev({'id': 1, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdc1'})
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            devs = set()
            for replica in xrange(rb.replicas):
                devs.add(rb._replica2part2dev[replica][part])

            if len(devs) != 2:
                raise AssertionError(
                    "Partition %d not on 2 devs (got %r)" % (part, devs))

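    # Added note: the test below leans on the same rebalance behavior the
    # two tests above exercise -- even with min_part_hours at 0, a single
    # rebalance is expected to move at most one replica of any partition,
    # so the replicas already placed in zones 0 and 1 should stay put when
    # three new zones are added.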
    def test_multitier_dont_move_too_many_replicas(self):
        rb = ring.RingBuilder(8, 3, 0)
        # there'll be at least one replica in z0 and z1
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdb1'})
        rb.rebalance()
        rb.validate()

        # only 1 replica should move
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdd1'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 3, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sde1'})
        rb.add_dev({'id': 4, 'region': 0, 'zone': 4, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sdf1'})
        rb.rebalance()
        rb.validate()

        for part in xrange(rb.parts):
            zones = set()
            for replica in xrange(rb.replicas):
                zones.add(rb.devs[rb._replica2part2dev[replica][part]]['zone'])

            if len(zones) != 3:
                raise AssertionError(
                    "Partition %d not in 3 zones (got %r)" % (part, zones))
            if 0 not in zones or 1 not in zones:
                raise AssertionError(
                    "Partition %d not in zones 0 and 1 (got %r)" %
                    (part, zones))

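    # With three equally weighted devices, three replicas and a part power
    # of 8, each of the 2**8 = 256 partitions should land once on every
    # device, hence the {0: 256, 1: 256, 2: 256} check below; once a fourth
    # device joins, an even spread works out to 3 * 256 / 4 = 192 per device.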
    def test_rerebalance(self):
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 256, 1: 256, 2: 256})

        rb.add_dev({'id': 3, 'region': 0, 'zone': 3, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 192, 1: 192, 2: 192, 3: 192})

        rb.set_dev_weight(3, 100)
        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts[3], 256)

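    # The regression test below makes no assertions of its own: it replays
    # the add/rebalance/remove sequence from the referenced bug report and
    # passes as long as no exception is raised.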
    def test_add_rebalance_add_rebalance_delete_rebalance(self):
        # Test for https://bugs.launchpad.net/swift/+bug/845952
        # min_part of 0 to allow for rapid rebalancing
        rb = ring.RingBuilder(8, 3, 0)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})

        rb.rebalance()

        rb.add_dev({'id': 3, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})
        rb.add_dev({'id': 4, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10004, 'device': 'sda1'})
        rb.add_dev({'id': 5, 'region': 0, 'zone': 2, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10005, 'device': 'sda1'})

        rb.rebalance()

        rb.remove_dev(1)

        rb.rebalance()

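    # A fractional replica count applies to the ring as a whole: with
    # replicas = 2.1, the third replica row is expected to cover roughly
    # 10% of the 256 partitions (about 25 of them), which is what the
    # [256, 256, 25] assertion below checks.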
    def test_set_replicas_increase(self):
        rb = ring.RingBuilder(8, 2, 0)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.rebalance()
        rb.validate()

        rb.replicas = 2.1
        rb.rebalance()
        rb.validate()

        self.assertEqual([len(p2d) for p2d in rb._replica2part2dev],
                         [256, 256, 25])

        rb.replicas = 2.2
        rb.rebalance()
        rb.validate()
        self.assertEqual([len(p2d) for p2d in rb._replica2part2dev],
                         [256, 256, 51])

    def test_set_replicas_decrease(self):
        rb = ring.RingBuilder(4, 5, 0)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.rebalance()
        rb.validate()

        rb.replicas = 4.9
        rb.rebalance()
        rb.validate()

        self.assertEqual([len(p2d) for p2d in rb._replica2part2dev],
                         [16, 16, 16, 16, 14])

        # cross a couple of integer thresholds (4 and 3)
        rb.replicas = 2.5
        rb.rebalance()
        rb.validate()

        self.assertEqual([len(p2d) for p2d in rb._replica2part2dev],
                         [16, 16, 8])

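    # Building a ring with replicas = 2.5 from the start should behave the
    # same way: the last replica row covers only half of the 256 partitions,
    # giving the [256, 256, 128] layout asserted below.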
    def test_fractional_replicas_rebalance(self):
        rb = ring.RingBuilder(8, 2.5, 0)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.rebalance()  # passes by not crashing
        rb.validate()  # also passes by not crashing
        self.assertEqual([len(p2d) for p2d in rb._replica2part2dev],
                         [256, 256, 128])

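    # RingBuilder.load() is exercised below with pickle.load and the
    # passed-in open() replaced by mocks, covering three on-disk formats:
    # a pickled RingBuilder instance, an old-style dict from to_dict(),
    # and old device records that lack the 'meta' key.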
    def test_load(self):
        rb = ring.RingBuilder(8, 3, 1)
        devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                 'ip': '127.0.0.0', 'port': 10000, 'device': 'sda1',
                 'meta': 'meta0'},
                {'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                 'ip': '127.0.0.1', 'port': 10001, 'device': 'sdb1',
                 'meta': 'meta1'},
                {'id': 2, 'region': 0, 'zone': 2, 'weight': 2,
                 'ip': '127.0.0.2', 'port': 10002, 'device': 'sdc1',
                 'meta': 'meta2'},
                {'id': 3, 'region': 0, 'zone': 3, 'weight': 2,
                 'ip': '127.0.0.3', 'port': 10003, 'device': 'sdd1'}]
        for d in devs:
            rb.add_dev(d)
        rb.rebalance()

        real_pickle = pickle.load
        try:
            # test a legit builder
            fake_pickle = mock.Mock(return_value=rb)
            fake_open = mock.Mock(return_value=None)
            pickle.load = fake_pickle
            builder = ring.RingBuilder.load('fake.builder', open=fake_open)
            self.assertEquals(fake_pickle.call_count, 1)
            fake_open.assert_has_calls([mock.call('fake.builder', 'rb')])
            self.assertEquals(builder, rb)
            fake_pickle.reset_mock()
            fake_open.reset_mock()

            # test old style builder
            fake_pickle.return_value = rb.to_dict()
            pickle.load = fake_pickle
            builder = ring.RingBuilder.load('fake.builder', open=fake_open)
            fake_open.assert_has_calls([mock.call('fake.builder', 'rb')])
            self.assertEquals(builder.devs, rb.devs)
            fake_pickle.reset_mock()
            fake_open.reset_mock()

            # test old devs but no meta
            no_meta_builder = rb
            for dev in no_meta_builder.devs:
                del(dev['meta'])
            fake_pickle.return_value = no_meta_builder
            pickle.load = fake_pickle
            builder = ring.RingBuilder.load('fake.builder', open=fake_open)
            fake_open.assert_has_calls([mock.call('fake.builder', 'rb')])
            self.assertEquals(builder.devs, rb.devs)
            fake_pickle.reset_mock()
        finally:
            pickle.load = real_pickle

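    # Round-trip check: save() the builder to a real file under self.testdir,
    # load() it back, and compare the to_dict() forms of the two builders.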
    def test_save_load(self):
        rb = ring.RingBuilder(8, 3, 1)
        devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                 'ip': '127.0.0.0', 'port': 10000,
                 'replication_ip': '127.0.0.0', 'replication_port': 10000,
                 'device': 'sda1', 'meta': 'meta0'},
                {'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                 'ip': '127.0.0.1', 'port': 10001,
                 'replication_ip': '127.0.0.1', 'replication_port': 10001,
                 'device': 'sdb1', 'meta': 'meta1'},
                {'id': 2, 'region': 0, 'zone': 2, 'weight': 2,
                 'ip': '127.0.0.2', 'port': 10002,
                 'replication_ip': '127.0.0.2', 'replication_port': 10002,
                 'device': 'sdc1', 'meta': 'meta2'},
                {'id': 3, 'region': 0, 'zone': 3, 'weight': 2,
                 'ip': '127.0.0.3', 'port': 10003,
                 'replication_ip': '127.0.0.3', 'replication_port': 10003,
                 'device': 'sdd1', 'meta': ''}]
        for d in devs:
            rb.add_dev(d)
        rb.rebalance()
        builder_file = os.path.join(self.testdir, 'test_save.builder')
        rb.save(builder_file)
        loaded_rb = ring.RingBuilder.load(builder_file)
        self.maxDiff = None
        self.assertEquals(loaded_rb.to_dict(), rb.to_dict())

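    # save() itself is tested with open() and pickle.dump patched out, so
    # the expectation is a single write of the builder's dict form with
    # pickle protocol 2 to a file opened in 'wb' mode.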
    @mock.patch('__builtin__.open', autospec=True)
    @mock.patch('swift.common.ring.builder.pickle.dump', autospec=True)
    def test_save(self, mock_pickle_dump, mock_open):
        mock_open.return_value = mock_fh = mock.MagicMock()
        rb = ring.RingBuilder(8, 3, 1)
        devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                 'ip': '127.0.0.0', 'port': 10000, 'device': 'sda1',
                 'meta': 'meta0'},
                {'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                 'ip': '127.0.0.1', 'port': 10001, 'device': 'sdb1',
                 'meta': 'meta1'},
                {'id': 2, 'region': 0, 'zone': 2, 'weight': 2,
                 'ip': '127.0.0.2', 'port': 10002, 'device': 'sdc1',
                 'meta': 'meta2'},
                {'id': 3, 'region': 0, 'zone': 3, 'weight': 2,
                 'ip': '127.0.0.3', 'port': 10003, 'device': 'sdd1'}]
        for d in devs:
            rb.add_dev(d)
        rb.rebalance()
        rb.save('some.builder')
        mock_open.assert_called_once_with('some.builder', 'wb')
        mock_pickle_dump.assert_called_once_with(rb.to_dict(),
                                                 mock_fh.__enter__(),
                                                 protocol=2)

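    # search_devs() filters the device list by exact match on the supplied
    # keys; the cases below cover region, zone, id, ip/port, replication
    # ip/port, device name, and meta.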
    def test_search_devs(self):
        rb = ring.RingBuilder(8, 3, 1)
        devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                 'ip': '127.0.0.0', 'port': 10000, 'device': 'sda1',
                 'meta': 'meta0'},
                {'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                 'ip': '127.0.0.1', 'port': 10001, 'device': 'sdb1',
                 'meta': 'meta1'},
                {'id': 2, 'region': 1, 'zone': 2, 'weight': 2,
                 'ip': '127.0.0.2', 'port': 10002, 'device': 'sdc1',
                 'meta': 'meta2'},
                {'id': 3, 'region': 1, 'zone': 3, 'weight': 2,
                 'ip': '127.0.0.3', 'port': 10003, 'device': 'sdd1',
                 'meta': 'meta3'},
                {'id': 4, 'region': 2, 'zone': 4, 'weight': 1,
                 'ip': '127.0.0.4', 'port': 10004, 'device': 'sde1',
                 'meta': 'meta4', 'replication_ip': '127.0.0.10',
                 'replication_port': 20000},
                {'id': 5, 'region': 2, 'zone': 5, 'weight': 2,
                 'ip': '127.0.0.5', 'port': 10005, 'device': 'sdf1',
                 'meta': 'meta5', 'replication_ip': '127.0.0.11',
                 'replication_port': 20001},
                {'id': 6, 'region': 2, 'zone': 6, 'weight': 2,
                 'ip': '127.0.0.6', 'port': 10006, 'device': 'sdg1',
                 'meta': 'meta6', 'replication_ip': '127.0.0.12',
                 'replication_port': 20002}]
        for d in devs:
            rb.add_dev(d)
        rb.rebalance()
        res = rb.search_devs({'region': 0})
        self.assertEquals(res, [devs[0], devs[1]])
        res = rb.search_devs({'region': 1})
        self.assertEquals(res, [devs[2], devs[3]])
        res = rb.search_devs({'region': 1, 'zone': 2})
        self.assertEquals(res, [devs[2]])
        res = rb.search_devs({'id': 1})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'zone': 1})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'ip': '127.0.0.1'})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'ip': '127.0.0.1', 'port': 10001})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'port': 10001})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'replication_ip': '127.0.0.10'})
        self.assertEquals(res, [devs[4]])
        res = rb.search_devs({'replication_ip': '127.0.0.10',
                              'replication_port': 20000})
        self.assertEquals(res, [devs[4]])
        res = rb.search_devs({'replication_port': 20000})
        self.assertEquals(res, [devs[4]])
        res = rb.search_devs({'device': 'sdb1'})
        self.assertEquals(res, [devs[1]])
        res = rb.search_devs({'meta': 'meta1'})
        self.assertEquals(res, [devs[1]])

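    # validate() is expected to raise RingValidationError for inconsistent
    # builders (not yet rebalanced, miscounted partitions, non-numeric
    # ports, references to missing devices) and, with stats=True, to return
    # per-device usage and the worst balance figure.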
    def test_validate(self):
        rb = ring.RingBuilder(8, 3, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.add_dev({'id': 2, 'region': 0, 'zone': 2, 'weight': 2,
                    'ip': '127.0.0.1', 'port': 10002, 'device': 'sda1'})
        rb.add_dev({'id': 3, 'region': 0, 'zone': 3, 'weight': 2,
                    'ip': '127.0.0.1', 'port': 10003, 'device': 'sda1'})

        # Degenerate case: devices added but not rebalanced yet
        self.assertRaises(exceptions.RingValidationError, rb.validate)

        rb.rebalance()
        r = rb.get_ring()
        counts = {}
        for part2dev_id in r._replica2part2dev_id:
            for dev_id in part2dev_id:
                counts[dev_id] = counts.get(dev_id, 0) + 1
        self.assertEquals(counts, {0: 128, 1: 128, 2: 256, 3: 256})

        dev_usage, worst = rb.validate()
        self.assert_(dev_usage is None)
        self.assert_(worst is None)

        dev_usage, worst = rb.validate(stats=True)
        self.assertEquals(list(dev_usage), [128, 128, 256, 256])
        self.assertEquals(int(worst), 0)

        rb.set_dev_weight(2, 0)
        rb.rebalance()
        self.assertEquals(rb.validate(stats=True)[1], MAX_BALANCE)

        # Test not all partitions doubly accounted for
        rb.devs[1]['parts'] -= 1
        self.assertRaises(exceptions.RingValidationError, rb.validate)
        rb.devs[1]['parts'] += 1

        # Test non-numeric port
        rb.devs[1]['port'] = '10001'
        self.assertRaises(exceptions.RingValidationError, rb.validate)
        rb.devs[1]['port'] = 10001

        # Test partition on nonexistent device
        rb.pretend_min_part_hours_passed()
        orig_dev_id = rb._replica2part2dev[0][0]
        rb._replica2part2dev[0][0] = len(rb.devs)
        self.assertRaises(exceptions.RingValidationError, rb.validate)
        rb._replica2part2dev[0][0] = orig_dev_id

        # Tests that validate can handle 'holes' in .devs
        rb.remove_dev(2)
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        rb.validate(stats=True)

        # Test partition assigned to a hole
        if rb.devs[2]:
            rb.remove_dev(2)
        rb.pretend_min_part_hours_passed()
        orig_dev_id = rb._replica2part2dev[0][0]
        rb._replica2part2dev[0][0] = 2
        self.assertRaises(exceptions.RingValidationError, rb.validate)
        rb._replica2part2dev[0][0] = orig_dev_id

        # Validate that zero weight devices with no partitions don't count on
        # the 'worst' value.
        self.assertNotEquals(rb.validate(stats=True)[1], MAX_BALANCE)
        rb.add_dev({'id': 4, 'region': 0, 'zone': 0, 'weight': 0,
                    'ip': '127.0.0.1', 'port': 10004, 'device': 'sda1'})
        rb.pretend_min_part_hours_passed()
        rb.rebalance()
        self.assertNotEquals(rb.validate(stats=True)[1], MAX_BALANCE)

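    # get_part_devices() returns the device dicts holding a partition's
    # replicas; before any rebalance there are no assignments, so an empty
    # list is expected.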
    def test_get_part_devices(self):
        rb = ring.RingBuilder(8, 3, 1)
        self.assertEqual(rb.get_part_devices(0), [])

        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.rebalance()

        part_devs = sorted(rb.get_part_devices(0),
                           key=operator.itemgetter('id'))
        self.assertEqual(part_devs, [rb.devs[0], rb.devs[1]])

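    # With replicas = 2.5 the final replica row covers only half of the
    # partitions, so a partition such as 255 here is expected to have just
    # two assigned devices.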
    def test_get_part_devices_partial_replicas(self):
        rb = ring.RingBuilder(8, 2.5, 1)
        rb.add_dev({'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda1'})
        rb.add_dev({'id': 1, 'region': 0, 'zone': 1, 'weight': 1,
                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sda1'})
        rb.rebalance()

        # note: partition 255 will only have 2 replicas
        part_devs = sorted(rb.get_part_devices(255),
                           key=operator.itemgetter('id'))
        self.assertEqual(part_devs, [rb.devs[0], rb.devs[1]])


if __name__ == '__main__':
    unittest.main()