Merge "Consolidate capacity planning and scaling."
commit 4ffa62763f
@@ -1,9 +1,244 @@
=============================
Capacity planning and scaling
=============================

An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization of resources, the average rate of
adoption of cloud services through normal usage also needs to be carefully
monitored.
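
As a rough illustration of this kind of trend projection, the following
sketch fits a simple linear trend to hypothetical monthly utilization
samples and estimates how long the pool has until it reaches a chosen
threshold. The sample values, the threshold, and the helper name are
assumptions for the example only, not output from any particular telemetry
system.

.. code-block:: python

   # Minimal capacity-trend sketch: fit a linear trend to hypothetical
   # monthly utilization samples and estimate when usage crosses a
   # threshold. All figures are made up for illustration.

   def months_until_threshold(samples, threshold):
       """Months until a least-squares trend line crosses ``threshold``,
       or None if utilization is flat or shrinking."""
       n = len(samples)
       xs = range(n)
       mean_x = sum(xs) / n
       mean_y = sum(samples) / n
       num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
       den = sum((x - mean_x) ** 2 for x in xs)
       slope = num / den
       if slope <= 0:
           return None
       intercept = mean_y - slope * mean_x
       return (threshold - intercept) / slope - (n - 1)

   # Hypothetical vCPU utilization (percent of the pool) over six months.
   vcpu_used_pct = [42, 45, 49, 52, 57, 61]
   print(months_until_threshold(vcpu_used_pct, threshold=80))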

General storage considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A wide variety of operator-specific requirements dictates the nature of the
storage back end. Examples of such requirements are as follows:

* Public or private cloud, and associated SLA requirements
* The need for encryption-at-rest for data on storage nodes
* Whether live migration will be offered

We recommend that data be encrypted both in transit and at rest.
If you plan to use live migration, a shared storage configuration is highly
recommended.

Block Storage
~~~~~~~~~~~~~

Configure Block Storage resource nodes with advanced RAID controllers
and high-performance disks to provide fault tolerance at the hardware
level.

Deploy high-performing storage solutions such as SSD drives or
flash storage systems for applications requiring additional performance out
of Block Storage devices.

In environments that place substantial demands on Block Storage, we
recommend using multiple storage pools. In this case, each pool of
devices should have a similar hardware design and disk configuration
across all hardware nodes in that pool. This allows for a design that
provides applications with access to a wide variety of Block Storage
pools, each with their own redundancy, availability, and performance
characteristics. When deploying multiple pools of storage, it is also
important to consider the impact on the Block Storage scheduler, which is
responsible for provisioning storage across resource nodes. Ideally,
ensure that applications can schedule volumes in multiple regions, each with
their own network, power, and cooling infrastructure. This will give tenants
the option of building fault-tolerant applications that are distributed
across multiple availability zones.

In addition to the Block Storage resource nodes, it is important to
design for high availability and redundancy of the APIs and related
services that are responsible for provisioning and providing access to
storage. We recommend designing a layer of hardware or software load
balancers in order to achieve high availability of the appropriate REST
API services to provide uninterrupted service. In some cases, it may
also be necessary to deploy an additional layer of load balancing to
provide access to back-end database services responsible for servicing
and storing the state of Block Storage volumes. It is imperative that a
highly available database cluster is used to store the Block
Storage metadata.

In a cloud with significant demands on Block Storage, the network
architecture should take into account the amount of East-West bandwidth
required for instances to make use of the available storage resources.
The selected network devices should support jumbo frames for
transferring large blocks of data, and utilize a dedicated network for
providing connectivity between instances and Block Storage.
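
Before selecting network devices, it can help to put a rough number on
that East-West requirement. The following back-of-the-envelope sketch
multiplies hypothetical per-instance I/O figures into an aggregate
throughput; the instance density, IOPS, and I/O size are assumptions for
illustration and should be replaced with measured values.

.. code-block:: python

   # Back-of-the-envelope East-West bandwidth estimate for Block Storage
   # traffic. All inputs are hypothetical; use measured values in practice.

   instances_per_rack = 200       # assumed instance density
   avg_iops_per_instance = 150    # assumed sustained IOPS per instance
   avg_io_size_bytes = 64 * 1024  # assumed average I/O size (64 KiB)

   bytes_per_second = (instances_per_rack * avg_iops_per_instance
                       * avg_io_size_bytes)
   gbit_per_second = bytes_per_second * 8 / 10**9

   # Compare this figure against the planned storage network uplinks.
   print(f"Estimated storage traffic per rack: {gbit_per_second:.1f} Gbit/s")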

Scaling Block Storage
---------------------

You can upgrade Block Storage pools to add storage capacity without
interrupting the overall Block Storage service. Add nodes to the pool by
installing and configuring the appropriate hardware and software and
then allowing that node to report in to the proper storage pool through the
message bus. Block Storage nodes generally report in to the scheduler
service, advertising their availability. As a result, after the node is
online and available, tenants can make use of those storage resources
instantly.

In some cases, the demand on Block Storage may exhaust the available
network bandwidth. As a result, design network infrastructure that
services Block Storage resources in such a way that you can add capacity
and bandwidth easily. This often involves the use of dynamic routing
protocols or advanced networking solutions to add capacity to downstream
devices. Both the front-end and back-end storage network designs should
encompass the ability to quickly and easily add capacity and bandwidth.

.. note::

   Sufficient monitoring and data collection should be in place from the
   start, so that timely decisions regarding capacity, input/output
   operations per second (IOPS), or storage-associated bandwidth can be
   made.

Object Storage
~~~~~~~~~~~~~~

While consistency and partition tolerance are both inherent features of
the Object Storage service, it is important to design the overall
storage architecture to ensure that the implemented system meets those
goals. The OpenStack Object Storage service places a specific number of
data replicas as objects on resource nodes. Replicas are distributed
throughout the cluster, based on a consistent hash ring also stored on
each node in the cluster.
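
To make the hash-ring idea concrete, the following is a deliberately
simplified sketch of consistent hashing: each device contributes several
virtual points on a ring, and an object is assigned to the next point
encountered clockwise from its own hash. It illustrates the concept only
and does not reproduce swift's actual ring builder, partitions, or
replica placement.

.. code-block:: python

   # Simplified consistent-hash ring: each device gets several virtual
   # points on the ring, and an object maps to the next point clockwise.
   # Illustration only -- not swift's actual ring implementation.
   import bisect
   import hashlib

   def _hash(key):
       return int(hashlib.md5(key.encode()).hexdigest(), 16)

   class Ring:
       def __init__(self, devices, vnodes=64):
           self._points = sorted(
               (_hash("%s-%d" % (dev, i)), dev)
               for dev in devices for i in range(vnodes))
           self._keys = [point for point, _ in self._points]

       def device_for(self, obj_name):
           idx = bisect.bisect(self._keys, _hash(obj_name))
           return self._points[idx % len(self._points)][1]

   ring = Ring(["node1/sdb", "node2/sdb", "node3/sdb"])
   print(ring.device_for("AUTH_demo/container/object.png"))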

Design the Object Storage system with a sufficient number of zones to
provide quorum for the number of replicas defined. For example, with
three replicas configured in the swift cluster, the recommended number
of zones to configure within the Object Storage cluster in order to
achieve quorum is five. While it is possible to deploy a solution with
fewer zones, the implied risk of doing so is that some data may not be
available and API requests to certain objects stored in the cluster
might fail. For this reason, ensure you properly account for the number
of zones in the Object Storage cluster.
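
The arithmetic behind that recommendation can be made explicit with a
small helper. The majority quorum for a replica count is
``replicas // 2 + 1``; the zone rule of thumb below, ``2 * replicas - 1``,
is only an assumption chosen because it matches the three-replica,
five-zone example above, not an official sizing formula.

.. code-block:: python

   # Quorum and zone-count arithmetic for an Object Storage design.
   # The zone rule of thumb (2 * replicas - 1) is an assumption that
   # simply reproduces the three-replica / five-zone example above.

   def quorum(replicas):
       """Smallest majority of the configured replicas."""
       return replicas // 2 + 1

   def recommended_zones(replicas):
       return 2 * replicas - 1

   for replicas in (3, 5):
       print(replicas, quorum(replicas), recommended_zones(replicas))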

Each Object Storage zone should be self-contained within its own
availability zone. Each availability zone should have independent access
to network, power, and cooling infrastructure to ensure uninterrupted
access to data. In addition, a pool of Object Storage proxy servers
providing access to data stored on the object nodes should service each
availability zone. Object proxies in each region should leverage local
read and write affinity so that local storage resources facilitate
access to objects wherever possible. We recommend deploying upstream
load balancing to ensure that proxy services are distributed across the
multiple zones and, in some cases, it may be necessary to make use of
third-party solutions to aid with geographical distribution of services.

A zone within an Object Storage cluster is a logical division. Any of
the following may represent a zone:

* A disk within a single node
* An entire node
* A collection of nodes
* Multiple racks
* Multiple data centers

Selecting the proper zone design is crucial for allowing the Object
Storage cluster to scale while providing an available and redundant
storage system. It may be necessary to configure storage policies that
have different requirements with regard to replicas, retention, and
other factors that could heavily affect the design of storage in a
specific zone.

Scaling Object Storage
----------------------

Adding back-end storage capacity to an Object Storage cluster requires
careful planning and forethought. In the design phase, it is important
to determine the maximum partition power required by the Object Storage
service, which determines the maximum number of partitions that can
exist. Object Storage distributes data among all available storage, but
a partition cannot span more than one disk, so the maximum number of
disks can only be as high as the number of partitions.

For example, a system that starts with a single disk and a partition
power of 3 can have 8 (2^3) partitions. Adding a second disk means that
each has 4 partitions. The one-disk-per-partition limit means that this
system can never have more than 8 disks, limiting its scalability.
However, a system that starts with a single disk and a partition power
of 10 can have up to 1024 (2^10) disks.
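
Because the partition power is fixed when the ring is first built, the
relationship in this example can be written down directly. The helper
below simply restates that arithmetic; the input values mirror the
example above.

.. code-block:: python

   # Partition-power arithmetic from the example above: the ring has
   # 2 ** part_power partitions, and because a partition cannot span
   # disks, the disk count can never exceed the partition count.

   def ring_limits(part_power, disks):
       partitions = 2 ** part_power
       return {
           "partitions": partitions,
           "partitions_per_disk": partitions // disks,
           "max_disks": partitions,
       }

   print(ring_limits(part_power=3, disks=2))   # 8 partitions, 4 per disk
   print(ring_limits(part_power=10, disks=1))  # up to 1024 disks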

As you add back-end storage capacity to the system, the partition maps
redistribute data amongst the storage nodes. In some cases, this
involves replication of extremely large data sets. In these cases, we
recommend using back-end replication links that do not contend with
tenants' access to data.

As more tenants begin to access data within the cluster and their data
sets grow, it is necessary to add front-end bandwidth to service data
access requests. Adding front-end bandwidth to an Object Storage cluster
requires careful planning and design of the Object Storage proxies that
tenants use to gain access to the data, along with the high availability
solutions that enable easy scaling of the proxy layer. We recommend
designing a front-end load balancing layer that tenants and consumers
use to gain access to data stored within the cluster. This load
balancing layer may be distributed across zones, regions, or even across
geographic boundaries, which may also require that the design encompass
geo-location solutions.

In some cases, you must add bandwidth and capacity to the network
resources servicing requests between proxy servers and storage nodes.
For this reason, the network architecture used for access to storage
nodes and proxy servers should make use of a scalable design.

Network
~~~~~~~

.. TODO(unassigned): consolidate and update existing network sub-chapters.


Compute
~~~~~~~

A relationship exists between the size of the compute environment and
the supporting OpenStack infrastructure controller nodes required to
support it.

Increasing the size of the compute environment increases network
traffic and messaging, adding load to the controller or networking
nodes. Effective monitoring of the environment will help with capacity
decisions on scaling.

Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
additional compute nodes to environments, ensure identical or
functionally compatible CPUs are used; otherwise, live migration
features will break. It is necessary to add rack capacity or network
switches, as scaling out compute hosts directly affects network and
data center resources.

Compute host components can also be upgraded to account for increases in
demand. This is also known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.

Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

.. note::

   Changing the CPU overcommit ratio can have a detrimental effect and
   cause a potential increase in noisy neighbor problems.

Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, adding capacity may
involve adding disk shelves to enterprise storage systems or installing
additional Block Storage nodes. It may also be necessary to upgrade
directly attached storage installed in compute hosts and to add capacity
to the shared storage that provides additional ephemeral storage to
instances.

For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.


Control plane API services and Horizon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. TODO(unassigned): consolidate existing control plane sub-chapters.