Update Erasure Coding appendix

As of 20.10 the OpenStack charms fully support use of Erasure Coded pools with applications that can support this feature.

Update the Erasure Coding appendix to detail usage and configuration. Drop the post-deployment reconfiguration steps previously documented.

Change-Id: Ibd5417333944f5afa276c3060fa2d1edb3b1e9ee
2020-09-02 11:37:53 +01:00
commit ea5f2051e2
parent a746ee4a9d
2 changed files with 317 additions and 135 deletions
--- a/deploy-guide/source/1
+++ b/deploy-guide/source/1
@@ -0,0 +1,152 @@
+Appendix M: Ceph Erasure Coding and Device Classing
+===================================================
+
+Overview
++++++++
+
+Ceph pools supporting applications within an OpenStack deployment are
+by default configured as replicated pools which means that every object
+stored is copied to multiple hosts or zones to allow the pool to survive
+the loss of an OSD.
+
+Ceph also supports Erasure Coded pools which can be used instead to save
+raw space within the storage cluster.  The following charms can be
+configured to use Erasure Coded pools:
+
+  * glance
+  * cinder-ceph
+  * nova-compute
+  * ceph-fs
+  * ceph-radosgw
+
+Configuring charms for Erasure Coding
+++++++++++++++++++++++++++++++++++++
+
+All charms that support use with Erasure Coded pools support a consistent
+set of configuration options to enable and tune the Erasure Coding profile
+used to configure the Erasure Coded pool.
+
+Erasure Coding is enabled by setting the 'pool-type' option to 'erasure-coded'.
+By default the JErasure plugin is used with K=1 and M=2.  This does not
+actually save any raw storage compared to a replicated pool with 3 replicas
+(and is designed to allow use on a three node Ceph cluster) so most deployments
+using Erasure Coded pools will need to tune the K and M values based on either
+the number of hosts deployed or the number of zones in the deployment if
+the 'customize-failure-domain' option is enabled on the ceph-osd and ceph-mon
+charms.
+
+The K value defines the number of data chunks that will be used for each object
+and the M value defines the number of coding chunks generated for each object.
+The M value also defines the number of OSDs that may be lost before the pool
+goes into a degraded state.
+
+K + M must always be less than or equal to the number of hosts or zones in the
+deployment (depending on the configuration of 'customize-failure-domain').
+
+In the example below, the Erasure Coded pool used by the glance application
+will sustain the loss of two hosts or zones while only consuming 2TB instead
+of 3TB of storage to store 1TB of data.
+
+.. code-block:: yaml
+
+    glance:
+      options:
+        pool-type: erasure-coded
+        ec-profile-k: 2
+        ec-profile-m: 2
+
+The full list of Erasure Coding configuration options is detailed below.
+
+.. list-table:: Erasure Coding charm options
+   :widths: 25 5 15 55
+   :header-rows: 1
+
+   * - Option
+     - Type
+     - Default Value
+     - Description
+   * - pool-type
+     - string
+     - replicated
+     - Ceph pool type to use for storage - valid values are 'replicated' and 'erasure-coded'.
+   * - ec-profile-name
+     - string
+     -
+     - Name for the EC profile to be created for the EC pools. If not defined a profile name will be generated based on the name of the pool used by the application.
+   * - ec-rbd-metadata-pool
+     - string
+     -
+     - Name of the metadata pool to be created (for RBD use-cases).  If not defined a metadata pool name will be generated based on the name of the data pool used by the application.  The metadata pool is always replicated (not erasure coded).
+   * - ec-profile-k
+     - int
+     - 1
+     - Number of data chunks that will be used for EC data pool. K+M factors should never be greater than the number of available AZs for balancing.
+   * - ec-profile-m
+     - int
+     - 2
+     - Number of coding chunks that will be used for EC data pool. K+M factors should never be greater than the number of available AZs for balancing.
+   * - ec-profile-locality
+     - int
+     -
+     - (lrc plugin - l) Group the coding and data chunks into sets of size l. For instance, for k=4 and m=2, when l=3 two groups of three are created. Each set can be recovered without reading chunks from another set.  Note that using the lrc plugin does incur more raw storage usage than isa or jerasure in order to reduce the cost of recovery operations.
+   * - ec-profile-crush-locality
+     - string
+     -
+     - (lrc plugin) The type of the crush bucket in which each set of chunks defined by l will be stored. For instance, if it is set to rack, each group of l chunks will be placed in a different rack. It is used to create a CRUSH rule step such as 'step choose rack'. If it is not set, no such grouping is done.
+   * - ec-profile-durability-estimator
+     - int
+     -
+     - (shec plugin - c) The number of parity chunks each of which includes each data chunk in its calculation range. The number is used as a durability estimator. For instance, if c=2, 2 OSDs can be down without losing data.
+   * - ec-profile-helper-chunks
+     - int
+     -
+     - (clay plugin - d) Number of OSDs requested to send data during recovery of a single chunk. d needs to be chosen such that k+1 <= d <= k+m-1. The larger the d, the better the savings.
+   * - ec-profile-scalar-mds
+     - string
+     -
+     - (clay plugin) Specifies the plugin that is used as a building block in the layered construction. It can be one of: jerasure, isa or shec.
+   * - ec-profile-plugin
+     - string
+     - jerasure
+     - EC plugin to use for this application's pool. These plugins are available: jerasure, lrc, isa, shec, clay.
+   * - ec-profile-technique
+     - string
+     - reed_sol_van
+     - EC profile technique used for this application's pool - will be validated based on the plugin configured via ec-profile-plugin. Supported techniques are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', 'cauchy' for isa and 'single', 'multiple' for shec.
+   * - ec-profile-device-class
+     - string
+     -
+     - Device class from CRUSH map to use for placement groups for erasure profile - valid values: ssd, hdd or nvme (or leave unset to not use a device class).
+
+
+Ceph automatic device classing
++++++++++++++++++++++++++++++
+
+Newer versions of Ceph do automatic classing of OSD devices. Each OSD
+will be placed into ‘nvme’, ‘ssd’ or ‘hdd’ device classes.  These can
+be used when creating erasure profiles or new CRUSH rules (see following
+sections).
+
+The classes can be inspected using:
+
+.. code::
+
+    sudo ceph osd crush tree
+
+    ID CLASS WEIGHT  TYPE NAME
+    -1       8.18729 root default
+    -5       2.72910     host node-laveran
+     2  nvme 0.90970         osd.2
+     5   ssd 0.90970         osd.5
+     7   ssd 0.90970         osd.7
+    -7       2.72910     host node-mees
+     1  nvme 0.90970         osd.1
+     6   ssd 0.90970         osd.6
+     8   ssd 0.90970         osd.8
+    -3       2.72910     host node-pytheas
+     0  nvme 0.90970         osd.0
+     3   ssd 0.90970         osd.3
+     4   ssd 0.90970         osd.4
+
+The device class for an Erasure Coded pool can be configured in the
+consuming charm using the 'ec-profile-device-class' configuration option.
--- a/deploy-guide/source/app-erasure-coding.rst
+++ b/deploy-guide/source/app-erasure-coding.rst
@@ -1,47 +1,173 @@
-Appendix M: Ceph Erasure Coding and Device Classing
-===================================================
+===============================
+Appendix M: Ceph Erasure Coding
+===============================

 Overview
-++++++++
+--------

-This appendix is intended as a post deployment guide to re-configuring RADOS
-gateway pools to use erasure coding rather than replication.  It also covers
-use of a specific device class (NVMe, SSD or HDD) when creating the erasure
-coding profile as well as other configuration options that need to be
-considered during deployment.
+Ceph pools supporting applications within an OpenStack deployment are
+by default configured as replicated pools which means that every stored
+object is copied to multiple hosts or zones to allow the pool to survive
+the loss of an OSD.
+
+Ceph also supports Erasure Coded pools which can be used to save
+raw space within the Ceph cluster.  The following charms can be configured
+to use Erasure Coded pools:
+
+* `ceph-fs`_
+* `ceph-radosgw`_
+* `cinder-ceph`_
+* `glance`_
+* `nova-compute`_
+
+.. warning::
+
+   Enabling the use of Erasure Coded pools will affect the IO performance
+   of the pool and will incur additional CPU and memory overheads on the
+   Ceph OSD nodes due to calculation of coding chunks during read and
+   write operations and during recovery of data chunks from failed OSDs.

 .. note::

-    Any existing data is maintained by following this process, however
-    reconfiguration should take place immediately post deployment to avoid
-    prolonged ‘copy-pool’ operations.
+   The mirroring of RBD images stored in Erasure Coded pools is not currently
+   supported by the ceph-rbd-mirror charm due to limitations in the functionality
+   of the Ceph rbd-mirror application.

-RADOS Gateway bucket weighting
-++++++++++++++++++++++++++++++
+Configuring charms for Erasure Coding
+-------------------------------------

-The weighting of the various pools in a deployment drives the number of
-placement groups (PG’s) created to support each pool.  In the ceph-radosgw
-charm this is configured for the data bucket using:
+Charms that support Erasure Coded pools have a consistent set of configuration
+options to enable and tune the Erasure Coding profile used to configure
+the Erasure Coded pools created for each application.

-.. code::
+Erasure Coding is enabled by setting the ``pool-type`` option to 'erasure-coded'.

-  	juju config ceph-radosgw rgw-buckets-pool-weight=20
+Ceph supports multiple `Erasure Code`_ plugins. A plugin may provide support for
+multiple Erasure Code techniques - for example the JErasure plugin provides
+support for Cauchy and Reed-Solomon Vandermonde (and others).
+
+For the default JErasure plugin, the K value defines the number of data chunks
+that will be used for each object and the M value defines the number of coding
+chunks generated for each object. The M value also defines the number of hosts
+or zones that may be lost before the pool goes into a degraded state.
+
+K + M must always be less than or equal to the number of hosts or zones in the
+deployment (depending on the configuration of ``customize-failure-domain``).
+
+By default the JErasure plugin is used with K=1 and M=2.  This does not
+actually save any raw storage compared to a replicated pool with 3 replicas
+(and is intended to allow use on a three node Ceph cluster) so most deployments
+using Erasure Coded pools will need to tune the K and M values based on either
+the number of hosts deployed or the number of zones in the deployment (if
+the ``customize-failure-domain`` option is enabled on the ceph-osd and ceph-mon
+charms).
+
+In the example below, the Erasure Coded pool used by the glance application
+will sustain the loss of two hosts or zones while only consuming 2TB instead
+of 3TB of storage to store 1TB of data when compared to a replicated pool. This
+configuration requires a minimum of 4 hosts (or zones).
+
+.. code-block:: yaml
+
+    glance:
+      options:
+        pool-type: erasure-coded
+        ec-profile-k: 2
+        ec-profile-m: 2
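+
+The raw storage figures quoted above follow from the usual Erasure Coding
+overhead ratio. As a rough sketch, assuming equally sized chunks and ignoring
+metadata and allocation overheads, with :math:`N` (a symbol introduced here
+purely for illustration) denoting the number of hosts or zones:
+
+.. math::
+
+   \text{raw} = \text{data} \times \frac{K + M}{K}, \qquad K + M \le N
+
+For K=2 and M=2 this gives 1TB x (2 + 2) / 2 = 2TB, while the default of
+K=1 and M=2 gives 1TB x (1 + 2) / 1 = 3TB, the same as a replicated pool
+with 3 replicas.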
+
+The full list of Erasure Coding configuration options is detailed below.
+Full descriptions of each plugin and its configuration options can also
+be found in the `Ceph Erasure Code`_ documentation for the Ceph project.
+
+.. list-table:: Erasure Coding charm options
+   :widths: 20 15 5 15 45
+   :header-rows: 1
+
+   * - Option
+     - Charm
+     - Type
+     - Default Value
+     - Description
+   * - pool-type
+     - all
+     - string
+     - replicated
+     - Ceph pool type to use for storage - valid values are 'replicated' and 'erasure-coded'.
+   * - ec-rbd-metadata-pool
+     - glance, cinder-ceph, nova-compute
+     - string
+     -
+     - Name of the metadata pool to be created (for RBD use-cases). If not defined a metadata pool name will be generated based on the name of the data pool used by the application.  The metadata pool is always replicated (not erasure coded).
+   * - metadata-pool
+     - ceph-fs
+     - string
+     -
+     - Name of the metadata pool to be created for the CephFS filesystem. If not defined a metadata pool name will be generated based on the name of the data pool used by the application.  The metadata pool is always replicated (not erasure coded).
+   * - ec-profile-name
+     - all
+     - string
+     -
+     - Name for the EC profile to be created for the EC pools. If not defined a profile name will be generated based on the name of the pool used by the application.
+   * - ec-profile-k
+     - all
+     - int
+     - 1
+     - Number of data chunks that will be used for EC data pool. K+M factors should never be greater than the number of available AZs for balancing.
+   * - ec-profile-m
+     - all
+     - int
+     - 2
+     - Number of coding chunks that will be used for EC data pool. K+M factors should never be greater than the number of available AZs for balancing.
+   * - ec-profile-locality
+     - all
+     - int
+     -
+     - (lrc plugin - l) Group the coding and data chunks into sets of size l. For instance, for k=4 and m=2, when l=3 two groups of three are created. Each set can be recovered without reading chunks from another set.  Note that using the lrc plugin does incur more raw storage usage than isa or jerasure in order to reduce the cost of recovery operations.
+   * - ec-profile-crush-locality
+     - all
+     - string
+     -
+     - (lrc plugin) The type of the crush bucket in which each set of chunks defined by l will be stored. For instance, if it is set to rack, each group of l chunks will be placed in a different rack. It is used to create a CRUSH rule step such as 'step choose rack'. If it is not set, no such grouping is done.
+   * - ec-profile-durability-estimator
+     - all
+     - int
+     -
+     - (shec plugin - c) The number of parity chunks each of which includes each data chunk in its calculation range. The number is used as a durability estimator. For instance, if c=2, 2 OSDs can be down without losing data.
+   * - ec-profile-helper-chunks
+     - all
+     - int
+     -
+     - (clay plugin - d) Number of OSDs requested to send data during recovery of a single chunk. d needs to be chosen such that k+1 <= d <= k+m-1. The larger the d, the better the savings.
+   * - ec-profile-scalar-mds
+     - all
+     - string
+     -
+     - (clay plugin) Specifies the plugin that is used as a building block in the layered construction. It can be one of: jerasure, isa or shec.
+   * - ec-profile-plugin
+     - all
+     - string
+     - jerasure
+     - EC plugin to use for this application's pool. These plugins are available: jerasure, lrc, isa, shec, clay.
+   * - ec-profile-technique
+     - all
+     - string
+     - reed_sol_van
+     - EC profile technique used for this application's pool - will be validated based on the plugin configured via ec-profile-plugin. Supported techniques are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', 'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van', 'cauchy' for isa and 'single', 'multiple' for shec.
+   * - ec-profile-device-class
+     - all
+     - string
+     -
+     - Device class from CRUSH map to use for placement groups for erasure profile - valid values: ssd, hdd or nvme (or leave unset to not use a device class).

-Note the default of 20% - if the deployment is a pure ceph-radosgw
-deployment this value should be increased to the expected % use of
-storage.  The device class also needs to be taken into account (but
-for erasure coding this needs to be specified post deployment via action
-execution).

 Ceph automatic device classing
-++++++++++++++++++++++++++++++
+------------------------------

-Newer versions of Ceph do automatic classing of OSD devices. Each OSD
+Newer versions of Ceph perform automatic classing of OSD devices. Each OSD
 will be placed into ‘nvme’, ‘ssd’ or ‘hdd’ device classes.  These can
-be used when creating erasure profiles or new CRUSH rules (see following
-sections).
+be used when enabling Erasure Coded pools.

-The classes can be inspected using:
+Device classes can be inspected using:

 .. code::

@@ -62,112 +188,16 @@ The classes can be inspected using:
     3   ssd 0.90970         osd.3
     4   ssd 0.90970         osd.4

+The device class for an Erasure Coded pool can be configured in the
+consuming charm using the ``ec-profile-device-class`` configuration option.

-Configuring erasure coding
-++++++++++++++++++++++++++
+If this option is not provided, devices of any class will be used.
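+
+As an illustrative sketch only, the fragment below uses the
+``ec-profile-device-class`` option listed in the options table to place the
+glance Erasure Coded pool on NVMe devices. It assumes that OSDs in the 'nvme'
+class exist on enough hosts (three for the default K=1 and M=2, as in the
+sample output above); the choice of glance and of 'nvme' is an assumption made
+for the example:
+
+.. code-block:: yaml
+
+    glance:
+      options:
+        pool-type: erasure-coded
+        ec-profile-device-class: nvme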

-The RADOS gateway makes use of a number of pools, but the only pool
-that should be converted to use erasure coding (EC) is the data pool:
-
-.. code::
-
-    default.rgw.buckets.data
-
-All other pools should be replicated as they are by default.
-
-To create a new EC profile and pool:
-
-.. code::
-
-    juju run-action --wait ceph-mon/0 create-erasure-profile \
-        name=nvme-ec device-class=nvme
-
-    juju run-action --wait ceph-mon/0 create-pool \
-      	name=default.rgw.buckets.data.new \
-    	pool-type=erasure \
-    	erasure-profile-name=nvme-ec \
-    	percent-data=90
-
-The percent-data option should be set based on the type of deployment
-but if the RADOS gateway is the only target for the NVMe storage class,
-then 90% is appropriate (other RADOS gateway pools are tiny and use
-between 0.10% and 3% of storage)
-
-.. note::
-
-    The create-erasure-profile action has a number of other
-    options including adjustment of the K/M values which affect the
-    computational overhead and underlying storage consumed per MB stored.
-    Sane defaults are provided but they require a minimum of five hosts
-    with block devices of the right class.
-
-To avoid any creation/mutation of stored data during migration,
-shutdown all RADOS gateway instances:
-
-.. code::
-
-    juju run --application ceph-radosgw \
-        "sudo systemctl stop ceph-radosgw.target"
-
-The existing buckets.data pool can then be copied and switched:
-
-.. code::
-
-    juju run-action --wait ceph-mon/0 rename-pool \
-    	name=default.rgw.buckets.data \
-    	new-name=default.rgw.buckets.data.old
-
-    juju run-action --wait ceph-mon/0 rename-pool \
-    	name=default.rgw.buckets.data.new \
-	    new-name=default.rgw.buckets.data
-
-At this point the RADOS gateway instances can be restarted:
-
-.. code::
-
-    juju run --application ceph-radosgw \
-        "sudo systemctl start ceph-radosgw.target"
-
-Once successful operation of the deployment has been confirmed,
-the old pool can be deleted:
-
-.. code::
-
-    juju run-action --wait ceph-mon/0 delete-pool \
-        name=default.rgw.buckets.data.old
-
-Moving other RADOS gateway pools to NVMe storage
-++++++++++++++++++++++++++++++++++++++++++++++++
-
-The buckets.data pool is the largest pool and the one that can make
-use of EC; other pools could also be migrated to the same storage
-class for consistent performance:
-
-.. code::
-
-    juju run-action --wait ceph-mon/0 create-crush-rule \
-        name=replicated_nvme device-class=nvme
-
-The CRUSH rule for the other RADOS gateway pools can then be updated:
-
-.. code::
-
-    pools=".rgw.root
-    default.rgw.control
-    default.rgw.data.root
-    default.rgw.gc
-    default.rgw.log
-    default.rgw.intent-log
-    default.rgw.meta
-    default.rgw.usage
-    default.rgw.users.keys
-    default.rgw.users.uid
-    default.rgw.buckets.extra
-    default.rgw.buckets.index
-    default.rgw.users.email
-    default.rgw.users.swift"
-
-    for pool in $pools; do
-        juju run-action --wait ceph-mon/0 pool-set \
-            name=$pool key=crush_rule value=replicated_nvme
-    done
+.. LINKS
+.. _Ceph Erasure Code: https://docs.ceph.com/docs/master/rados/operations/erasure-code/
+.. _ceph-fs: https://jaas.ai/ceph-fs
+.. _ceph-radosgw: https://jaas.ai/ceph-radosgw
+.. _cinder-ceph: https://jaas.ai/cinder-ceph
+.. _glance: https://jaas.ai/glance
+.. _nova-compute: https://jaas.ai/nova-compute
+.. _Erasure Code: https://en.wikipedia.org/wiki/Erasure_code