..
      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

===================
Scheduler Evolution
===================

Evolving the scheduler has been a priority item over several
releases: http://specs.openstack.org/openstack/nova-specs/#priorities

The scheduler has become tightly coupled with the rest of nova,
limiting its capabilities, accuracy, flexibility and maintainability.
The goal of scheduler evolution is to bring about a better separation of
concerns between scheduling functionality and the rest of nova.

Once this effort is complete, it is conceivable that the nova-scheduler
could become a separate git repo, outside of nova but within the compute
project. This is not the current focus.

Problem Use Cases
=================

Many users want to do more advanced things with the scheduler, but the
current architecture is not ready to support those use cases in a
maintainable way. A few examples will help to illustrate where the
scheduler falls short:

Cross Project Affinity
----------------------

It can be desirable, when booting from a volume, to use a compute node
that is close to the shared storage where that volume lives. Similarly,
for the sake of performance, it can be desirable to use a compute node
that is in a particular location relative to a pre-created port.

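For illustration only, a volume-affinity filter might look something
like the sketch below. The ``storage_backend`` attribute on the host
state and the ``volume_backend`` scheduler hint are hypothetical names,
not existing nova APIs; the ``BaseHostFilter`` interface shown is the
Kilo-era one.

.. code-block:: python

    from nova.scheduler import filters


    class VolumeAffinityFilter(filters.BaseHostFilter):
        """Sketch: prefer hosts close to a volume's storage backend."""

        def host_passes(self, host_state, filter_properties):
            hints = filter_properties.get('scheduler_hints') or {}
            wanted = hints.get('volume_backend')
            if not wanted:
                # No affinity was requested, so any host will do.
                return True
            # Only pass hosts attached to the same storage backend.
            return getattr(host_state, 'storage_backend', None) == wanted
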
Accessing Aggregates in Filters and Weights
-------------------------------------------

Any DB access in a filter or weight slows down the scheduler. Until the
end of Kilo, there was no way for the scheduler to access aggregate
information without querying the DB in every call to a filter's
host_passes() method.

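The difference is easiest to see in filter code. Below is a sketch that
assumes aggregate data has already been attached to the ``HostState``
object by the scheduler, so ``host_passes()`` itself stays DB-free; the
``example_key`` metadata check is purely illustrative.

.. code-block:: python

    from nova.scheduler import filters


    class AggregateExampleFilter(filters.BaseHostFilter):
        """Sketch: read aggregate data from host state, not the DB."""

        def host_passes(self, host_state, filter_properties):
            # Anti-pattern: calling the DB API here would run a query
            # once per candidate host for every scheduling request.
            # Preferred: use data the scheduler has already cached on
            # the host state.
            for aggregate in getattr(host_state, 'aggregates', []):
                if aggregate.metadata.get('example_key') == 'enabled':
                    return True
            return False
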
Filter Scheduler Alternatives
-----------------------------

For certain use cases, radically different schedulers may perform much
better than the filter scheduler. We should not block this innovation.
It is unreasonable to assume a single scheduler will work for all use
cases.

However, to enable this kind of innovation in a maintainable way, a
single strong scheduler interface is required.

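A minimal sketch of what such an interface could look like, simplified
from nova's scheduler driver class (the exact signature varies by
release), with a toy alternative implementation; ``list_all_hosts()``
is a hypothetical helper:

.. code-block:: python

    import abc
    import random


    class SchedulerDriver(abc.ABC):
        """Sketch of a single, strong scheduler interface."""

        @abc.abstractmethod
        def select_destinations(self, context, request_spec,
                                filter_properties):
            """Return the hosts chosen to satisfy the request."""


    class RandomScheduler(SchedulerDriver):
        """Toy alternative scheduler: picks hosts at random."""

        def select_destinations(self, context, request_spec,
                                filter_properties):
            hosts = list_all_hosts(context)  # hypothetical helper
            return random.sample(hosts,
                                 request_spec.get('num_instances', 1))
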
Project Scale Issues
--------------------

There are many interesting ideas for new schedulers, like the solver
scheduler, and frequent requests to add new filters and weights to the
scheduling system. The current nova team does not have the bandwidth to
deal with all of these requests. A dedicated scheduler team could work
on these items independently of the rest of nova.

The tight coupling that currently exists makes it impossible to work
on the scheduler in isolation. A stable interface is required before
the code can be split out.

Key areas we are evolving
=========================

Here we discuss, at a high level, areas that are being addressed as
part of the scheduler evolution work.

Fixing the Scheduler DB model
-----------------------------

We need the nova and scheduler data models to be independent of each
other.

The first step is breaking the link between the ComputeNode and Service
DB tables. In theory, where the Service information is stored should be
pluggable through the service group API, and should be independent of
the scheduler service. For example, it could be managed via ZooKeeper
rather than polled from the nova DB.

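For example, the service group driver is already selectable through
configuration; a ZooKeeper-backed driver existed around this era,
though which drivers are available varies by release:

.. code-block:: ini

    [DEFAULT]
    # Track service liveness via ZooKeeper ephemeral nodes instead of
    # heartbeats written to the nova database.
    servicegroup_driver = zk
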
There are also places where filters and weights call into the nova DB
to find out information about aggregates. This information needs to be
sent to the scheduler, rather than read directly from the nova database.

Versioning Scheduler Placement Interfaces
-----------------------------------------

At the start of Kilo, the scheduler is passed a set of dictionaries
across a versioned RPC interface. These dictionaries create problems
for the backwards compatibility needed for live upgrades.

Fortunately, we already have the oslo.versionedobjects infrastructure,
which we can use to model this data in a way that can be versioned
across releases.

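As a minimal sketch of the approach (the field names below are
illustrative, not the actual RequestSpec definition):

.. code-block:: python

    from oslo_versionedobjects import base
    from oslo_versionedobjects import fields


    @base.VersionedObjectRegistry.register
    class RequestSpec(base.VersionedObject):
        # Bump the minor version for compatible changes and the major
        # version for breaking ones, so old and new services can
        # interoperate during a live upgrade.
        VERSION = '1.0'

        fields = {
            'num_instances': fields.IntegerField(default=1),
            'flavor_id': fields.StringField(),
            'availability_zone': fields.StringField(nullable=True),
        }
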
This effort mostly focuses on the request_spec. See, for example,
`this spec`_.

Sending host and node stats to the scheduler
--------------------------------------------

Periodically, nova-compute updates the scheduler state stored in
the database.

We need a good way to model the data that is being sent from the
compute nodes to the scheduler, so that, over time, the scheduler can
move to having its own database.

This is linked to the work on the resource tracker.

Updating the Scheduler about other data
---------------------------------------

For data such as host aggregates, the scheduler needs to cache the
information and to know when it changes so the cache can be updated.

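One way to structure this is a scheduler-side cache that is refreshed
by change notifications rather than re-read from the DB on every
request. A minimal sketch, where the ``on_*`` methods are hypothetical
hooks wired up to whatever change notifications nova sends:

.. code-block:: python

    class AggregateCache(object):
        """Sketch: scheduler-side cache of host aggregate data."""

        def __init__(self):
            self._aggregates = {}  # aggregate id -> metadata dict

        def on_aggregate_updated(self, aggregate_id, metadata):
            # Invoked when a change notification arrives; no DB read.
            self._aggregates[aggregate_id] = metadata

        def on_aggregate_deleted(self, aggregate_id):
            self._aggregates.pop(aggregate_id, None)

        def lookup(self, aggregate_id):
            return self._aggregates.get(aggregate_id)
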
Over time, it is possible that we will need to send cinder and neutron
data too, so the scheduler can use that data to help pick a
nova-compute host.

Resource Tracker
----------------

The recent work to add support for NUMA and PCI passthrough has shown
that we have no good pattern for extending the resource tracker.
Ideally we want to keep the innovation inside the nova tree, but we
also need extending the tracker to be easier.

This is closely related to the effort to rethink how we model
resources, as covered by the discussion about `resource providers`_.

Parallelism and Concurrency
---------------------------

The current design of the nova-scheduler is very racy, and can lead to
excessive numbers of build retries before the correct host is found.
The recent NUMA features are particularly affected by how the scheduler
works. All this has led many people to run only a single nova-scheduler
process configured to use a very small greenthread pool.

The work on cells v2 means that the scheduler will soon need to scale
to much larger problems. The current scheduler works best with fewer
than 1k nodes, but we will need it to work with at least 10k nodes.

Various ideas have been discussed to reduce races when running multiple
nova-scheduler processes. One idea is to use two-phase-commit "style"
resource tracker claims. Another is to use incremental updates, so it
is more efficient to keep the scheduler's state up to date, potentially
using Kafka.

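To illustrate the two-phase, compare-and-swap flavour of these ideas:
a claim commits only if the resource view it was computed against is
still current, and otherwise the scheduler refreshes its view and
retries. This is a sketch of the concept, not nova code:

.. code-block:: python

    def claim_resources(host, requested, generation):
        """Sketch of a compare-and-swap style resource claim.

        ``generation`` is the version of the host's resource view the
        scheduler based its decision on; ``host`` is assumed to carry
        a lock, a free-resource count and a generation counter.
        """
        with host.lock:
            if host.generation != generation:
                # Another scheduler claimed in the meantime; the
                # caller must refresh its view and retry.
                return False
            if host.free < requested:
                return False  # the instance no longer fits here
            host.free -= requested
            host.generation += 1  # invalidate other in-flight claims
            return True
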
For more details, see the `backlog spec`_ that explores this problem.

.. _this spec: http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/sched-select-destinations-use-request-spec-object.html
.. _resource providers: https://blueprints.launchpad.net/nova/+spec/resource-providers
.. _backlog spec: http://specs.openstack.org/openstack/nova-specs/specs/backlog/approved/parallel-scheduler.html