diff --git a/doc/source/index.rst b/doc/source/index.rst
index b8ac111..3038771 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -10,6 +10,7 @@ Contents:
    readme
    journey/index
    contributor/index
+   large-scale/index
 
 Indices and tables
 ==================
diff --git a/doc/source/journey/index.rst b/doc/source/journey/index.rst
index 1b791bb..5701e28 100644
--- a/doc/source/journey/index.rst
+++ b/doc/source/journey/index.rst
@@ -15,6 +15,5 @@ Contents:
    scale_up
    scale_out
    upgrade_and_maintain
-   large_scale_scaling_stories
 
 WIP: Transfer the content from https://wiki.openstack.org/wiki/Large_Scale_SIG
diff --git a/doc/source/journey/large_scale_scaling_stories.rst b/doc/source/journey/large_scale_scaling_stories.rst
deleted file mode 100644
index ce9b7c1..0000000
--- a/doc/source/journey/large_scale_scaling_stories.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-===========================
-Large Scale Scaling Stories
-===========================
-
-# WIP
diff --git a/doc/source/large-scale/index.rst b/doc/source/large-scale/index.rst
new file mode 100644
index 0000000..0e2d7b9
--- /dev/null
+++ b/doc/source/large-scale/index.rst
@@ -0,0 +1,11 @@
+================
+ Scaling Stories
+================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   stories
+   stories/2020-01-29
diff --git a/doc/source/large-scale/stories.rst b/doc/source/large-scale/stories.rst
new file mode 100644
index 0000000..3db82ce
--- /dev/null
+++ b/doc/source/large-scale/stories.rst
@@ -0,0 +1,21 @@
+===========================
+Large Scale Scaling Stories
+===========================
+
+As part of its goal of further pushing back scaling limits within a given cluster, the Large Scale SIG collects scaling stories from OpenStack users.
+
+There is a size/load limit for single clusters past which things in OpenStack start to break, and we need to start using multiple clusters or cells to scale out. The SIG is interested in hearing:
+
+* what broke first for you (is it RabbitMQ or something else?)
+* what were the first symptoms
+* at what size/load did it start to break
+* what you did to fix it
+
+This will be a great help in documenting expected limits and identifying where improvements should be focused.
+
+You can submit your story directly here, or on this `etherpad `_.
+
+Stories
+-------
+
+* :doc:`2020-01-29-AlbertBraden <stories/2020-01-29>`
diff --git a/doc/source/large-scale/stories/2020-01-29.rst b/doc/source/large-scale/stories/2020-01-29.rst
new file mode 100644
index 0000000..6bdcbce
--- /dev/null
+++ b/doc/source/large-scale/stories/2020-01-29.rst
@@ -0,0 +1,70 @@
+===================================================
+Large Scale Scaling Stories/2020-01-29-AlbertBraden
+===================================================
+
+Here are the scaling issues I've encountered recently at Synopsys, in reverse chronological order:
+
+
+Thursday 12/19/2019: openstack server list --all-projects does not return all VMs.
+-----------------------------------------------------------------------------------
+
+In /etc/nova/nova.conf we have the default: ``# max_limit = 1000``
+
+The recordset cleanup script depends on correct output from ``openstack server list --all-projects``.
+
+Fix: Increased max_limit to 2000.
+
+The recordset cleanup script runs ``openstack server list --all-projects | wc -l`` and compares the output to max_limit, refusing to run if max_limit is too low. If this happens, increase max_limit so that it is greater than the number of VMs in the cluster.
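+
+For reference, the resulting nova.conf change looks roughly like this (a sketch only; on recent releases ``max_limit`` lives under the ``[api]`` section, so adjust the section name to your release):
+
+.. code-block:: ini
+
+   # /etc/nova/nova.conf
+   [api]
+   # Must be larger than the number of VMs in the cluster
+   max_limit = 2000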
+
+As time permits we need to look into paging results: https://docs.openstack.org/api-guide/compute/paginated_collections.html
+
+
+Friday 12/13/2019: ARP table got full on pod2 controllers
+----------------------------------------------------------
+
+https://www.cyberciti.biz/faq/centos-redhat-debian-linux-neighbor-table-overflow/
+
+Fix: Increase the sysctl values:
+
+.. code-block:: console
+
+   --- a/roles/openstack/controller/neutron/tasks/main.yml
+   +++ b/roles/openstack/controller/neutron/tasks/main.yml
+   @@ -243,6 +243,9 @@
+        with_items:
+          - { name: 'net.bridge.bridge-nf-call-iptables', value: '1' }
+          - { name: 'net.bridge.bridge-nf-call-ip6tables', value: '1' }
+   +      - { name: 'net.ipv4.neigh.default.gc_thresh3', value: '4096' }
+   +      - { name: 'net.ipv4.neigh.default.gc_thresh2', value: '2048' }
+   +      - { name: 'net.ipv4.neigh.default.gc_thresh1', value: '1024' }
+
+
+12/10/2019: RPC workers were overloaded
+---------------------------------------
+
+http://lists.openstack.org/pipermail/openstack-discuss/2019-December/011465.html
+
+Fix: Increase the number of RPC workers. Modify /etc/neutron/neutron.conf on the controllers:
+
+.. code-block:: console
+
+   148c148
+   < #rpc_workers = 1
+   ---
+   > rpc_workers = 8
+
+
+October 2019: Rootwrap
+----------------------
+
+Neutron was timing out because rootwrap was taking too long to spawn.
+
+Fix: Run the rootwrap daemon.
+
+Add this line to /etc/neutron/neutron.conf on the controllers:
+
+root_helper_daemon = "sudo /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf"
+
+Add this line to /etc/sudoers.d/neutron_sudoers on the controllers:
+
+neutron ALL = (root) NOPASSWD: /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
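+
+For context, ``root_helper_daemon`` typically belongs under the ``[agent]`` section of neutron.conf; a minimal sketch of the resulting config (the section name may vary by release):
+
+.. code-block:: ini
+
+   # /etc/neutron/neutron.conf
+   [agent]
+   root_helper_daemon = "sudo /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf"
+
+After restarting the neutron agents, ``pgrep -af neutron-rootwrap-daemon`` on a controller should show the daemon once an agent issues its first privileged command.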