Docs supporting deployment groups

Adds documentation regarding the usage of deployment groups in shipyard. Change-Id: I5554c93b428cdfa4cb28a8b9d8f7d37b4596ae8c
2018-06-07 15:41:49 -05:00 · 2018-06-07 15:41:49 -05:00 · 648ce3f990
commit 648ce3f990
parent d558bd6218
6 changed files with 726 additions and 233 deletions
--- a/docs/source/API-action-commands.rst
+++ b/docs/source/API-action-commands.rst
@ -0,0 +1,83 @@
+..
+      Copyright 2017 AT&T Intellectual Property.
+      All Rights Reserved.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+.. _shipyard_action_commands:
+
+Action Commands
+===============
+
+Supported actions
+-----------------
+
+These actions are currently supported using the Action API
+
+.. _deploy_site:
+
+deploy_site
+~~~~~~~~~~~
+
+Triggers the initial deployment of a site, using the latest committed
+configuration documents. Steps, conceptually:
+
+#. Concurrency check
+    Prevents concurrent site modifications by conflicting
+    actions/workflows.
+#. Preflight checks
+    Ensures all Airship components are in a responsive state.
+#. Validate design
+    Asks each involved Airship component to validate the design. This ensures
+    that the previously committed design is valid at the present time.
+#. Drydock build
+    Orchestrates the Drydock component to configure hardware and the
+    Kubernetes environment (Drydock -> Promenade)
+#. Armada build
+    Orchestrates Armada to configure software on the nodes as designed.
+
+.. _update_site:
+
+update_site
+~~~~~~~~~~~
+
+Applies a new committed configuration to the environment. The steps of
+update_site mirror those of :ref:`deploy_site`.
+
+Actions under development
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These actions are under active development
+
+-  redeploy_server
+
+  Using parameters to indicate which server(s) triggers a redeployment of those
+  servers to the last-known-good design and secrets
+
+Future actions
+~~~~~~~~~~~~~~
+
+These actions are anticipated for development
+
+-  test region
+
+  Invoke site validation testing - perhaps a baseline is an invocation of all
+  component's exposed tests or extended health checks. This test would be used
+  as a preflight-style test to ensure all components are in a working state.
+
+-  test component
+
+  Invoke a particular platform component to test it. This test would be
+  used to interrogate a particular platform component to ensure it is in a
+  working state, and that its own downstream dependencies are also
+  operational
--- a/docs/source/API_action_commands.rst
+++ b/docs/source/API_action_commands.rst
@ -1,216 +0,0 @@
-..
-      Copyright 2017 AT&T Intellectual Property.
-      All Rights Reserved.
-
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-.. _shipyard_action_commands:
-
-Action Commands
-===============
-
-Supported actions
-----------------
-
-These actions are currently supported using the Action API
-
-deploy_site
-~~~~~~~~~~~
-
-Triggers the initial deployment of a site, using the latest committed
-configuration documents. Steps, conceptually:
-
-#. Concurrency check
-    Prevents concurrent site modifications by conflicting
-    actions/workflows.
-#. Preflight checks
-    Ensures all Airship components are in a responsive state.
-#. Validate design
-    Asks each involved Airship component to validate the design. This ensures
-    that the previously committed design is valid at the present time.
-#. Drydock build
-    Orchestrates the Drydock component to configure hardware and the
-    Kubernetes environment (Drydock -> Promenade)
-#. Armada build
-    Orchestrates Armada to configure software on the nodes as designed.
-
-update_site
-~~~~~~~~~~~
-
-Applies a new committed configuration to the environment. The steps of
-update_site mirror those of deploy_site.
-
-Actions under development
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-These actions are under active development
-
-  redeploy_server
-
-  Using parameters to indicate which server(s), triggers a redeployment of
-  server to the last known good design and secrets
-
-Future actions
-~~~~~~~~~~~~~~
-
-These actions are anticipated for development
-
-  test region
-
-  Invoke site validation testing - perhaps baseline is a invocation of all
-  components regular “component” tests. This test would be used as a
-  preflight-style test to ensure all components are in a working state.
-
-  test component
-
-  Invoke a particular platform component to test it. This test would be
-  used to interrogate a particular platform component to ensure it is in a
-  working state, and that its own downstream dependencies are also
-  operational
-
-Configuration Documents
-----------------------
-Shipyard requires some configuration documents to be loaded into the
-environment for the deploy_site and update_site as well as other workflows
-that directly deal with site deployments.
-
-Schemas
-~~~~~~~
-DeploymentConfiguration_ schema - Provides for validation of the
-deployment-configuration documents
-
-Deployment Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~
-Allows for specification of configurable options used by the site deployment
-related workflows, including the timeouts used for various steps, and the name
-of the armada manifest that will be used during the deployment/update.
-
-A `sample deployment-configuration`_ shows a completely specified example.
-
-`Default configuration values`_ are provided for most values.
-
-Supported values:
-'''''''''''''''''
-
-  physical_provisioner:
-
-  Values in the physical_provisioner section apply to the interactions with
-  Drydock in the various steps taken to deploy or update bare-metal servers
-  and networking.
-
-  -  deployment_strategy:
-
-    The name of the deployment strategy document to be used. There is a default
-    deployment strategy that is used if this field is not present.
-
-  -  deploy_interval:
-
-    The seconds delayed between checks for progress of the step that performs
-    deployment of servers.
-
-  -  deploy_timeout:
-
-    The maximum seconds allowed for the step that performs deployment of all
-    servers.
-
-  -  destroy_interval:
-
-    The seconds delayed between checks for progress of destroying hardware
-    nodes.
-
-  -  destroy_timeout:
-
-    The maximum seconds allowed for destroying hardware nodes.
-
-  -  join_wait:
-
-    The number of seconds allowed for a node to join the Kubernetes cluster.
-
-  -  prepare_node_interval:
-
-    The seconds delayed between checks for progress of preparing nodes.
-
-  -  prepare_node_timeout:
-
-    The maximum seconds allowed for preparing nodes.
-
-  -  prepare_site_interval:
-
-    The seconds delayed between checks for progress of preparing the site.
-
-  -  prepare_site_timeout:
-
-    The maximum seconds allowed for preparing the site.
-
-  -  verify_interval:
-
-    The seconds delayed between checks for progress of verification.
-
-  -  verify_timeout:
-
-    The maximum seconds allowed for verification by Drydock.
-
-  kubernetes_provisioner:
-
-  Values in the kubernetes_provisioner section apply to interactions with
-  Promenade in the various steps of redeploying servers.
-
-  -  drain_timeout:
-
-    The maximum seconds allowed for draining a node.
-
-  -  drain_grace_period:
-
-    The seconds provided to Promenade as a grace period for pods to cease.
-
-  -  clear_labels_timeout:
-
-    The maximum seconds provided to Promenade to clear labels on a node.
-
-  -  remove_etcd_timeout:
-
-    The maximum seconds provided to Promenade to allow for removing etcd from
-    a node.
-
-  -  etcd_ready_timeout:
-
-    The maximum seconds allowed for etcd to reach a healthy state after
-    a node is removed.
-
-  armada:
-
-  The armada section provides configuration for the workflow interactions with
-  Armada.
-
-  -  manifest:
-
-    The name of the Armada manifest document that the workflow will use during
-    site deployment activities. e.g.:'full-site'
-
-Deployment Strategy
-~~~~~~~~~~~~~~~~~~~
-The deployment strategy document is optionally specified in the Deployment
-Configuration and provides a way to group, sequence, and test the deployments
-of groups of hosts deployed using `Drydock`_. The `deployment strategy design`_
-provides details for the structures and usage of the deployment strategy.
-A `sample deployment-strategy`_ shows one possible strategy, in the context of
-the Shipyard unit testing.
-The `DeploymentStrategy`_ schema is a more formal definition of this document.
-
-.. _`Default configuration values`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/plugins/deployment_configuration_operator.py
-.. _DeploymentConfiguration: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/schemas/deploymentConfiguration.yaml
-.. _DeploymentStrategy: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/schemas/deploymentStrategy.yaml
-.. _`deployment strategy design`: https://airshipit.readthedocs.io/en/latest/blueprints/deployment-grouping-baremetal.html
-.. _Drydock: https://git.airshipit.org/cgit/airship-drydock
-.. _`sample deployment-configuration`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/tests/unit/yaml_samples/deploymentConfiguration_full_valid.yaml
-.. _`sample deployment-strategy`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/tests/unit/yaml_samples/deploymentStrategy_full_valid.yaml
--- a/docs/source/client-user-guide.rst
+++ b/docs/source/client-user-guide.rst
@ -174,7 +174,7 @@ response::

 Running Shipyard CLI with Docker Container
 ------------------------------------------
-It is also possible to execute Shipyard CLI using docker container
+It is also possible to execute Shipyard CLI using a docker container.

 Note that we will need to pass the relevant environment information as well
 as the Shipyard command that we wish to execute as part of the ``docker run``
@ -197,7 +197,7 @@ The output will resemble the following::

 Use Case: Ingest Site Design
 ----------------------------
-Shipyard serves as the entrypoint for a deployment of Airship. One can imagine
+Shipyard serves as the entry point for a deployment of Airship. One can imagine
 the following activities representing part of the lifecycle of a group of
 servers for which Airship would serve as the control plane:

@ -211,8 +211,8 @@ Preparation
  (Ubuntu 16.04) image. Airship is deployed; See
  :ref:`shipyard_deployment_guide`

-  At this point, Airship is ready for use. This is the when the Shipyard API
-  is available for use.
+  At this point, Airship is ready for use. This is when the Shipyard API is
+  available for use.

 Load Configuration Documents
  A user, deployment engineer, or automation -- i.e. the operator interacts
@ -258,7 +258,7 @@ designs in Deckhand.  If the validations are not successful, Shipyard will not
 mark the revision as committed.

 .. important::
-  It is not necessary to load all configuration documents in one step but each
+  It is not necessary to load all configuration documents in one step, but each
  named collection may only exist as a complete set of documents (i.e. must be
  loaded together).

--- a/docs/source/deployment-guide.rst
+++ b/docs/source/deployment-guide.rst
@ -38,8 +38,8 @@ This approach sets up an 'All-In-One' Airship environment that allows
 developers to bring up Shipyard and the rest of the Airship components on a
 single Ubuntu Virtual Machine.

-The deployment is fully automated and can take a while to complete (it can take
-30 minutes to an hour for a full deployment to complete)
+The deployment is fully automated and can take a while to complete. It can take
+30 minutes to an hour for a full deployment to complete.

 Post Deployment
 ---------------
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -21,21 +21,23 @@ Welcome to Shipyard's documentation!
 Shipyard is a directed acyclic graph controller for Kubernetes and OpenStack
 control plane life-cycle management, and is part of the `Airship`_ platform.

-User's Guide
-============
-
-Shipyard Configuration Guide
----------------------------
-
 .. toctree::
   :maxdepth: 2

   sampleconf
-   policy-enforcement
   API
-   API_action_commands
+   API-action-commands
   CLI
-   client_user_guide
-   deployment_guide
+   site-definition-documents
+   client-user-guide
+   deployment-guide
+   policy-enforcement
+
+Building this Documentation
+---------------------------
+
+Use ``make docs`` or ``tox -e docs`` to generate these docs. This will and
+build an html version of this documentation that can be viewed using a browser
+at docs/build/index.html on the local filesystem.

 .. _Airship: https://airshipit.org
--- a/docs/source/site-definition-documents.rst
+++ b/docs/source/site-definition-documents.rst
@ -0,0 +1,624 @@
+..
+      Copyright 2018 AT&T Intellectual Property.
+      All Rights Reserved.
+
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+.. _site_definition_documents:
+
+Site Definition Documents
+=========================
+Shipyard requires some documents to be loaded as part of the site definition
+for the :ref:`deploy_site` and :ref:`update_site` as well as other workflows
+that directly deal with site deployments.
+
+Schemas
+-------
+-  `DeploymentConfiguration`_ schema
+-  `DeploymentStrategy`_ schema
+
+.. _deployment_configuration:
+
+Deployment Configuration
+------------------------
+Allows for specification of configurable options used by the site deployment
+related workflows, including the timeouts used for various steps, and the name
+of the Armada manifest that will be used during the deployment/update.
+
+A `sample deployment-configuration`_ shows a completely specified example.
+
+`Default configuration values`_ are provided for most values.
+
+Supported values
+~~~~~~~~~~~~~~~~
+
+-  Section: `physical_provisioner`:
+
+  Values in the physical_provisioner section apply to the interactions with
+  Drydock in the various steps taken to deploy or update bare-metal servers
+  and networking.
+
+  deployment_strategy
+    The name of the deployment strategy document to be used. There is a default
+    deployment strategy that is used if this field is not present.
+
+  deploy_interval
+    The seconds delayed between checks for progress of the step that performs
+    deployment of servers.
+
+  deploy_timeout
+    The maximum seconds allowed for the step that performs deployment of all
+    servers.
+
+  destroy_interval
+    The seconds delayed between checks for progress of destroying hardware
+    nodes.
+
+  destroy_timeout
+    The maximum seconds allowed for destroying hardware nodes.
+
+  join_wait
+    The number of seconds allowed for a node to join the Kubernetes cluster.
+
+  prepare_node_interval
+    The seconds delayed between checks for progress of preparing nodes.
+
+  prepare_node_timeout
+    The maximum seconds allowed for preparing nodes.
+
+  prepare_site_interval
+    The seconds delayed between checks for progress of preparing the site.
+
+  prepare_site_timeout
+    The maximum seconds allowed for preparing the site.
+
+  verify_interval
+    The seconds delayed between checks for progress of verification.
+
+  verify_timeout
+    The maximum seconds allowed for verification by Drydock.
+
+-  Section: `kubernetes_provisioner`:
+
+  Values in the kubernetes_provisioner section apply to interactions with
+  Promenade in the various steps of redeploying servers.
+
+  drain_timeout
+    The maximum seconds allowed for draining a node.
+
+  drain_grace_period
+    The seconds provided to Promenade as a grace period for pods to cease.
+
+  clear_labels_timeout
+    The maximum seconds provided to Promenade to clear labels on a node.
+
+  remove_etcd_timeout
+    The maximum seconds provided to Promenade to allow for removing etcd from
+    a node.
+
+  etcd_ready_timeout
+    The maximum seconds allowed for etcd to reach a healthy state after
+    a node is removed.
+
+-  Section: `armada`:
+
+  The Armada section provides configuration for the workflow interactions with
+  Armada.
+
+  manifest
+    The name of the `Armada manifest document`_ that the workflow will use during
+    site deployment activities. e.g.:'full-site'
+
+.. _deployment_strategy:
+
+Deployment Strategy
+-------------------
+The deployment strategy document is optionally specified in the
+:ref:`deployment_configuration` and provides a way to group, sequence, and test
+the deployments of groups of hosts deployed using `Drydock`_. A `sample
+deployment-strategy`_ shows one possible strategy, in the context of the
+Shipyard unit testing.
+
+Using A Deployment Strategy
+---------------------------
+Defining a deployment strategy involves understanding the design of a site, and
+the desired criticality of the nodes that make up the site.
+
+A typical site may include a handful or many servers that participate in a
+Kubernetes cluster. Several of the servers may serve as control nodes, while
+others will handle the workload of the site. During the deployment of a site,
+it may be critically important that some servers are operational, while others
+may have a higher tolerance for misconfigured or failed nodes.
+
+The deployment strategy provides a mechanism to handle defining groups of
+nodes such that the criticality is reflected in the success criteria.
+
+The name of the DeploymentStrategy document to use is defined in the
+:ref:`deployment_configuration`, in the
+``physical_provisioner.deployment_strategy`` field. The most simple deployment
+strategy is used if one is not specified in the :ref:`deployment_configuration`
+document for the site. Example::
+
+  schema: shipyard/DeploymentStrategy/v1
+  metadata:
+    schema: metadata/Document/v1
+    name: deployment-strategy
+    layeringDefinition:
+        abstract: false
+        layer: global
+    storagePolicy: cleartext
+  data:
+    groups: [
+      - name: default
+        critical: true
+        depends_on: []
+        selectors: [
+          - node_names: []
+            node_labels: []
+            node_tags: []
+            rack_names: []
+        ]
+        success_criteria:
+          percent_successful_nodes: 100
+    ]
+
+-  This default configuration indicates that there are no selectors, meaning
+   that all nodes in the design are included.
+-  The criticality is set to ``true`` meaning that the workflow will halt if
+   the success criteria are not met.
+-  The success criteria indicates that all nodes must be succssful to consider
+   the group a success.
+
+In short, the default behavior is to deploy everything all at once, and halt
+if there are any failures.
+
+In a large deployment, this could be a problematic strategy as the chance of
+success in one try goes down as complexity rises. A deployment strategy
+provides a means to mitigate the unforeseen.
+
+To define a deployment strategy, an example may be helpful, but first
+definition of the fields follow:
+
+Groups
+~~~~~~
+Groups are named sets of nodes that will be deployed together. The fields of a
+group are:
+
+name
+  Required. The identifying name of the group.
+
+critical
+  Required. Indicates if this group is required to continue to additional
+  phases of deployment.
+
+depends_on
+  Required, may be an empty list. Group names that must be successful before
+  this group can be processed.
+
+selectors
+  Required, may be an empty list. A list of identifying information to indicate
+  the nodes that are members of this group.
+
+success_criteria
+  Optional. Criteria that must evaluate to be true before a group is considered
+  successfully complete with a phase of deployment.
+
+Criticality
+'''''''''''
+-  Field: critical
+-  Valid values: true | false
+
+Each group is required to indicate true or false for the `critical` field.
+This drives the behavior after the deployment of baremetal nodes.  If any
+groups that are marked as `critical: true` fail to meet that group's success
+criteria, the workflow will halt after the deployment of baremetal nodes. A
+group that cannot be processed due to a parent dependency failing will be
+considered failed, regardless of the success criteria.
+
+Dependencies
+''''''''''''
+-  Field: depends_on
+-  Valid values: [] or a list of group names
+
+Each group specifies a list of depends_on groups, or an empty list. All
+identified groups must complete successfully for the phase of deployment before
+the current group is allowed to be processed by the current phase.
+
+-  A failure (based on success criteria) of a group prevents any groups
+   dependent upon the failed group from being attempted.
+-  Circular dependencies will be rejected as invalid during document
+   validation.
+-  There is no guarantee of ordering among groups that have their dependencies
+   met. Any group that is ready for deployment based on declared dependencies
+   will execute, however execution of groups is serialized - two groups will
+   not deploy at the same time.
+
+Selectors
+'''''''''
+-  Field: selectors
+-  Valid values: [] or a list of selectors
+
+The list of selectors indicate the nodes that will be included in a group.
+Each selector has four available filtering values: node_names, node_tags,
+node_labels, and rack_names. Each selector is an intersection of this
+critera, while the list of selectors is a union of the individual selectors.
+
+-  Omitting a criterion from a selector, or using empty list means that
+   criterion is ignored.
+-  Having a completely empty list of selectors, or a selector that has no
+   criteria specified indicates ALL nodes.
+-  A collection of selectors that results in no nodes being identified will be
+   processed as if 100% of nodes successfully deployed (avoiding division by
+   zero), but would fail the minimum or maximum nodes criteria (still counts as
+   0 nodes)
+-  There is no validation against the same node being in multiple groups,
+   however the workflow will not resubmit nodes that have already completed or
+   failed in this deployment to Drydock twice, since it keeps track of each
+   node uniquely. The success or failure of those nodes excluded from
+   submission to Drydock will still be used for the success criteria
+   calculation.
+
+E.g.::
+
+  selectors:
+    - node_names:
+        - node01
+        - node02
+      rack_names:
+        - rack01
+      node_tags:
+        - control
+    - node_names:
+        - node04
+      node_labels:
+        - ucp_control_plane: enabled
+
+Will indicate (not really SQL, just for illustration)::
+
+    SELECT nodes
+    WHERE node_name in ('node01', 'node02')
+          AND rack_name in ('rack01')
+          AND node_tags in ('control')
+    UNION
+    SELECT nodes
+    WHERE node_name in ('node04')
+          AND node_label in ('ucp_control_plane: enabled')
+
+Success Criteria
+''''''''''''''''
+-  Field: success_criteria
+-  Valid values: for possible values, see below
+
+Each group optionally contains success criteria which is used to indicate if
+the deployment of that group is successful. The values that may be specified:
+
+percent_successful_nodes
+  The calculated success rate of nodes completing the deployment phase.
+
+  E.g.: 75 would mean that 3 of 4 nodes must complete the phase successfully.
+
+  This is useful for groups that have larger numbers of nodes, and do not
+  have critical minimums or are not sensitive to an arbitrary number of nodes
+  not working.
+
+minimum_successful_nodes
+  An integer indicating how many nodes must complete the phase to be considered
+  successful.
+
+maximum_failed_nodes
+  An integer indicating a number of nodes that are allowed to have failed the
+  deployment phase and still consider that group successful.
+
+When no criteria are specified, it means that no checks are done - processing
+continues as if nothing is wrong.
+
+When more than one criterion is specified, each is evaluated separately - if
+any fail, the group is considered failed.
+
+Example Deployment Strategy Document
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This example shows a contrived deployment strategy with 5 groups:
+control-nodes, compute-nodes-1, compute-nodes-2, monitoring-nodes,
+and ntp-node.
+
+::
+
+  ---
+  schema: shipyard/DeploymentStrategy/v1
+  metadata:
+    schema: metadata/Document/v1
+    name: deployment-strategy
+    layeringDefinition:
+        abstract: false
+        layer: global
+    storagePolicy: cleartext
+  data:
+    groups:
+      - name: control-nodes
+        critical: true
+        depends_on:
+          - ntp-node
+        selectors:
+          - node_names: []
+            node_labels: []
+            node_tags:
+              - control
+            rack_names:
+              - rack03
+        success_criteria:
+          percent_successful_nodes: 90
+          minimum_successful_nodes: 3
+          maximum_failed_nodes: 1
+      - name: compute-nodes-1
+        critical: false
+        depends_on:
+          - control-nodes
+        selectors:
+          - node_names: []
+            node_labels: []
+            rack_names:
+              - rack01
+            node_tags:
+              - compute
+        success_criteria:
+          percent_successful_nodes: 50
+      - name: compute-nodes-2
+        critical: false
+        depends_on:
+          - control-nodes
+        selectors:
+          - node_names: []
+            node_labels: []
+            rack_names:
+              - rack02
+            node_tags:
+              - compute
+        success_criteria:
+          percent_successful_nodes: 50
+      - name: monitoring-nodes
+        critical: false
+        depends_on: []
+        selectors:
+          - node_names: []
+            node_labels: []
+            node_tags:
+              - monitoring
+            rack_names:
+              - rack03
+              - rack02
+              - rack01
+      - name: ntp-node
+        critical: true
+        depends_on: []
+        selectors:
+          - node_names:
+              - ntp01
+            node_labels: []
+            node_tags: []
+            rack_names: []
+        success_criteria:
+          minimum_successful_nodes: 1
+
+The ordering of groups, as defined by the dependencies (``depends-on``
+fields)::
+
+   __________     __________________
+  | ntp-node |   | monitoring-nodes |
+   ----------     ------------------
+       |
+   ____V__________
+  | control-nodes |
+   ---------------
+       |_________________________
+           |                     |
+     ______V__________     ______V__________
+    | compute-nodes-1 |   | compute-nodes-2 |
+     -----------------     -----------------
+
+Given this, the order of execution could be any of the following:
+
+- ntp-node > monitoring-nodes > control-nodes > compute-nodes-1 > compute-nodes-2
+- ntp-node > control-nodes > compute-nodes-2 > compute-nodes-1 > monitoring-nodes
+- monitoring-nodes > ntp-node > control-nodes > compute-nodes-1 > compute-nodes-2
+- and many more ... the only guarantee is that ntp-node will run some time
+  before control-nodes, which will run sometime before both of the
+  compute-nodes. Monitoring-nodes can run at any time.
+
+Also of note are the various combinations of selectors and the varied use of
+success criteria.
+
+Example Processing
+''''''''''''''''''
+Using the defined deployment strategy in the above example, the following is
+an example of how it may process::
+
+  Start
+  |
+  | prepare ntp-node           <SUCCESS>
+  | deploy ntp-node            <SUCCESS>
+  V
+  | prepare control-nodes      <SUCCESS>
+  | deploy control-nodes       <SUCCESS>
+  V
+  | prepare monitoring-nodes   <SUCCESS>
+  | deploy monitoring-nodes    <SUCCESS>
+  V
+  | prepare compute-nodes-2    <SUCCESS>
+  | deploy compute-nodes-2     <SUCCESS>
+  V
+  | prepare compute-nodes-1    <SUCCESS>
+  | deploy compute-nodes-1     <SUCCESS>
+  |
+  Finish (success)
+
+If there were a failure in preparing the ntp-node, the following would be the
+result::
+
+  Start
+  |
+  | prepare ntp-node           <FAILED>
+  | deploy ntp-node            <FAILED, due to prepare failure>
+  V
+  | prepare control-nodes      <FAILED, due to dependency>
+  | deploy control-nodes       <FAILED, due to dependency>
+  V
+  | prepare monitoring-nodes   <SUCCESS>
+  | deploy monitoring-nodes    <SUCCESS>
+  V
+  | prepare compute-nodes-2    <FAILED, due to dependency>
+  | deploy compute-nodes-2     <FAILED, due to dependency>
+  V
+  | prepare compute-nodes-1    <FAILED, due to dependency>
+  | deploy compute-nodes-1     <FAILED, due to dependency>
+  |
+  Finish (failed due to critical group failed)
+
+If a failure occurred during the deploy of compute-nodes-2, the following would
+result::
+
+  Start
+  |
+  | prepare ntp-node           <SUCCESS>
+  | deploy ntp-node            <SUCCESS>
+  V
+  | prepare control-nodes      <SUCCESS>
+  | deploy control-nodes       <SUCCESS>
+  V
+  | prepare monitoring-nodes   <SUCCESS>
+  | deploy monitoring-nodes    <SUCCESS>
+  V
+  | prepare compute-nodes-2    <SUCCESS>
+  | deploy compute-nodes-2     <FAILED, non critical group>
+  V
+  | prepare compute-nodes-1    <SUCCESS>
+  | deploy compute-nodes-1     <SUCCESS>
+  |
+  Finish (success with some nodes/groups failed)
+
+Important Points
+~~~~~~~~~~~~~~~~
+-  By default, the deployment strategy is all-at-once, requiring total success.
+-  Critical group failures halt the deployment activity AFTER processing all
+   nodes, but before proceeding to deployment of the software using Armada.
+-  Success Criteria are evaluated at the end of processing of each of two
+   phases for each group. A failure in a parent group indicates a failure for
+   child groups - those children will not be processed.
+-  Group processing is serial.
+
+Interactions
+~~~~~~~~~~~~
+During the processing of nodes, the workflow interacts with Drydock using the
+node filter mechanism provided in the Drydock API. When formulating the nodes
+to process in a group, Shipyard will make an inquiry of Drydock's /nodefilter
+endpoint to get the list of nodes that match the selectors for the group.
+
+Shipyard will keep track of nodes that are actionable for each group using the
+response from Drydock, as well as prior group inquiries. This means
+that any nodes processed in a group will not be reprocessed in a later group,
+but will still count toward that group's success criteria.
+
+Two actions (prepare, deploy) will be invoked against Drydock during the actual
+node preparation and deployment. The workflow will monitor the tasks created by
+Drydock and keep track of the successes and failures.
+
+At the end of processing, the workflow step will report the success status for
+each group and each node. Processing will either stop or continue depending on
+the success of critical groups.
+
+Example beginning of group processing output from a workflow step::
+
+   INFO     Setting group control-nodes with None -> Stage.NOT_STARTED
+   INFO     Group control-nodes selectors have resolved to nodes: node2, node1
+   INFO     Setting group compute-nodes-1 with None -> Stage.NOT_STARTED
+   INFO     Group compute-nodes-1 selectors have resolved to nodes: node5, node4
+   INFO     Setting group compute-nodes-2 with None -> Stage.NOT_STARTED
+   INFO     Group compute-nodes-2 selectors have resolved to nodes: node7, node8
+   INFO     Setting group spare-compute-nodes with None -> Stage.NOT_STARTED
+   INFO     Group spare-compute-nodes selectors have resolved to nodes: node11, node10
+   INFO     Setting group all-compute-nodes with None -> Stage.NOT_STARTED
+   INFO     Group all-compute-nodes selectors have resolved to nodes: node11, node7, node4, node8, node10, node5
+   INFO     Setting group monitoring-nodes with None -> Stage.NOT_STARTED
+   INFO     Group monitoring-nodes selectors have resolved to nodes: node12, node6, node9
+   INFO     Setting group ntp-node with None -> Stage.NOT_STARTED
+   INFO     Group ntp-node selectors have resolved to nodes: node3
+   INFO     There are no cycles detected in the graph
+
+Of note is the resolution of groups to a list of nodes. Notice that the nodes
+in all-compute-nodes node11 overlap the nodes listed as part of other groups.
+When processing, if all the groups were to be processed before
+all-compute-nodes, there would be no remaining nodes that are actionable when
+the workflow tries to process all-compute-nodes. The all-compute-nodes groups
+would then be evaluated for success criteria immediately against those nodes
+processed prior. E.g.::
+
+   INFO     There were no actionable nodes for group all-compute-nodes. It is possible that all nodes: [node11, node7, node4, node8, node10, node5] have previously been deployed. Group will be immediately checked against its success criteria
+   INFO     Assessing success criteria for group all-compute-nodes
+   INFO     Group all-compute-nodes success criteria passed
+   INFO     Setting group all-compute-nodes with Stage.NOT_STARTED -> Stage.PREPARED
+   INFO     Group all-compute-nodes has met its success criteria and is now set to stage Stage.PREPARED
+   INFO     Assessing success criteria for group all-compute-nodes
+   INFO     Group all-compute-nodes success criteria passed
+   INFO     Setting group all-compute-nodes with Stage.PREPARED -> Stage.DEPLOYED
+   INFO     Group all-compute-nodes has met its success criteria and is successfully deployed (Stage.DEPLOYED)
+
+Example summary output from workflow step doing node processing::
+
+   INFO     =====   Group Summary   =====
+   INFO       Group monitoring-nodes ended with stage: Stage.DEPLOYED
+   INFO       Group ntp-node [Critical] ended with stage: Stage.DEPLOYED
+   INFO       Group control-nodes [Critical] ended with stage: Stage.DEPLOYED
+   INFO       Group compute-nodes-1 ended with stage: Stage.DEPLOYED
+   INFO       Group compute-nodes-2 ended with stage: Stage.DEPLOYED
+   INFO       Group spare-compute-nodes ended with stage: Stage.DEPLOYED
+   INFO       Group all-compute-nodes ended with stage: Stage.DEPLOYED
+   INFO     ===== End Group Summary =====
+   INFO     =====   Node Summary   =====
+   INFO       Nodes Stage.NOT_STARTED:
+   INFO       Nodes Stage.PREPARED:
+   INFO       Nodes Stage.DEPLOYED: node11, node7, node3, node4, node2, node1, node12, node8, node9, node6, node10, node5
+   INFO       Nodes Stage.FAILED:
+   INFO     ===== End Node Summary =====
+   INFO     All critical groups have met their success criteria
+
+Overall success or failure of workflow step processing based on critical groups
+meeting or failing their success criteria will be reflected in the same fashion
+as any other workflow step output from Shipyard.
+
+An Example of CLI `describe action` command output, with failed processing::
+
+    $ shipyard describe action/01BZZK07NF04XPC5F4SCTHNPKN
+    Name:                  deploy_site
+    Action:                action/01BZZK07NF04XPC5F4SCTHNPKN
+    Lifecycle:             Failed
+    Parameters:            {}
+    Datetime:              2017-11-27 20:34:24.610604+00:00
+    Dag Status:            failed
+    Context Marker:        71d4112e-8b6d-44e8-9617-d9587231ffba
+    User:                  shipyard
+
+    Steps                                                              Index        State
+    step/01BZZK07NF04XPC5F4SCTHNPKN/dag_concurrency_check              1            success
+    step/01BZZK07NF04XPC5F4SCTHNPKN/validate_site_design               2            success
+    step/01BZZK07NF04XPC5F4SCTHNPKN/drydock_build                      3            failed
+    step/01BZZK07NF04XPC5F4SCTHNPKN/armada_build                       4            None
+    step/01BZZK07NF04XPC5F4SCTHNPKN/drydock_prepare_site               5            success
+    step/01BZZK07NF04XPC5F4SCTHNPKN/drydock_nodes                      6            failed
+
+
+.. _`Armada manifest document`: https://airshipit.readthedocs.io/projects/armada/en/latest/operations/guide-build-armada-yaml.html?highlight=manifest
+.. _`Default configuration values`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/plugins/deployment_configuration_operator.py
+.. _DeploymentConfiguration: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/schemas/deploymentConfiguration.yaml
+.. _DeploymentStrategy: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/shipyard_airflow/schemas/deploymentStrategy.yaml
+.. _Drydock: https://git.airshipit.org/cgit/airship-drydock
+.. _`sample deployment-configuration`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/tests/unit/yaml_samples/deploymentConfiguration_full_valid.yaml
+.. _`sample deployment-strategy`: https://git.airshipit.org/cgit/airship-shipyard/tree/src/bin/shipyard_airflow/tests/unit/yaml_samples/deploymentStrategy_full_valid.yaml