charm-deployment-guide/deploy-guide/source/app-series-upgrade-openstack.rst

=====================================
Appendix F2: Series upgrade OpenStack
=====================================

Overview
--------

This document will provide specific steps for how to perform a series upgrade
across the entirety of a Charmed OpenStack cloud.

.. warning::

   This document is based upon the foundational knowledge and guidelines set
   forth in the more general `Series upgrade`_ appendix. That reference must be
   studied in-depth prior to attempting the steps outlined here. In particular,
   ensure that the :ref:`Pre-upgrade requirements <pre-upgrade_requirements>`
   are satisfied; the :ref:`Specific series upgrade procedures
   <series_specific_procedures>` have been reviewed and considered; and that
   :ref:`Workload specific preparations <workload_specific_preparations>` have
   been addressed during planning.

Downtime
--------

Although the goal is to minimise downtime the series upgrade process across a
cloud will nonetheless result in some level of downtime for the control plane.

When the machines associated with stateful applications such as percona-cluster
and rabbitmq-server undergo a series upgrade all cloud APIs will experience
downtime, in addition to the stateful application itself.

When machines associated with a single API application undergo a series upgrade
that individual API will also experience downtime. This is because it is
necessary to pause services in order to avoid race condition errors.

For those applications working in tandem with hacluster, as will be shown, some
hacluster units will need to be paused before the upgrade. One should assume
that the commencement of an outage coincides with this step (it will cause
cluster quorum heartbeats to fail and the service VIP will consequently go
offline).

Reference cloud topology
------------------------

This section describes a hyperconverged cloud topology that this document will
use for the procedural steps to follow. Hyperconvergence refers to the practice
of co-locating principle applications on the same machine.

The topology is defined in this way:

* Only compute and storage charms (and their subordinates) may be co-located.

* Third-party charms either do not exist or have been thoroughly tested
  for a series upgrade.

* The following are containerised:

  * All API applications

  * The percona-cluster application

  * The rabbitmq-server application

  * The ceph-mon application

Storage charms are charms that manage physical disks. For example, ceph-osd and
swift-storage. Example OpenStack subordinate charms are networking SDN charms
for the nova-compute charm, or monitoring charms for compute or storage charms.

.. caution::

   If your cloud differs from this topology you must adapt the procedural steps
   accordingly. In particular, look at the aspects of co-located applications
   and containerised applications. Recall that:

   * the :command:`upgrade-series` command:

     * affects all applications residing on the target machine

     * does not affect containers hosted on the target machine

   * an application's leader should be upgraded before its non-leaders

Generalised OpenStack series upgrade
------------------------------------

This section will summarise the series upgrade steps in the context of specific
OpenStack applications. It is an enhancement of the :ref:`Generic series
upgrade <generic_series_upgrade>` section in the companion document.

Applications for which this summary does **not** apply include:

* nova-compute
* ceph-mon
* ceph-osd

This is because the above applications do not require the pausing of units and
application leadership is irrelevant for them.

However, this summary does apply to all API applications (e.g. neutron-api,
keystone, nova-cloud-controller), as well as percona-cluster, and
rabbitmq-server.

.. important::

   The first machine to be upgraded is always associated with the leader of the
   principle application. Let this machine be called the "principle leader
   machine" and its unit be called the "principle leader unit".

The steps are as follows:

#. Set the default series for the principle application and ensure the same has
   been done to the model.

#. If hacluster is used, pause the hacluster units not associated with the
   principle leader machine.

#. Pause the principle non-leader units.

#. Perform a series upgrade on the principle leader machine.

   #. Perform any pre-upgrade workload maintenance tasks.

   #. Invoke the :command:`prepare` sub-command.

   #. Upgrade the operating system (APT commands).

   #. Perform any post-upgrade workload maintenance tasks.

   #. Reboot.

#. Set the value of the (application-dependent) ``openstack-origin`` or the
   ``source`` configuration option to 'distro' (new operating system).

#. Invoke the :command:`complete` sub-command on the principle leader machine.

#. Repeat steps 4 and 6 for the application non-leader machines.

#. Perform any possible cluster completed upgrade tasks once all machines have
   had their series upgraded.

   .. note::

      Here is a non-extensive list of the most common post-upgrade tasks for
      OpenStack and supporting charms:

      * percona-cluster: run action ``complete-cluster-series-upgrade`` on the
        leader unit.
      * rabbitmq-server: run action ``complete-cluster-series-upgrade`` on the
        leader unit.
      * ceilometer: run action ``ceilometer-upgrade`` on the leader unit.
      * vault: Each vault unit will need to be unsealed after its machine is
        rebooted.

Procedures
----------

The procedures are categorised based on application types. The example scenario
used throughout is a 'xenial' to 'bionic' series upgrade, within an OpenStack
release of Queens (i.e. the starting point is a cloud archive pocket of
'xenial-queens').

Stateful applications
~~~~~~~~~~~~~~~~~~~~~

This section covers the series upgrade procedure for containerised stateful
applications. These include:

* ceph-mon
* percona-cluster
* rabbitmq-server

A stateful application is one that maintains the state of various aspects of
the cloud. Clustered stateful applications, such as all the ones given above,
also require a quorum to function properly. Because of these reasons a stateful
application should not have all of its units restarted simultaneously; it must
have the series of its corresponding machines upgraded sequentially.

.. note::

   The concurrent upgrade approach is theoretically possible, although to use
   it all cloud workloads will need to be stopped in order to ensure
   consistency. This is not recommended.

The example procedure will be based on the percona-cluster application.

.. important::

   Unlike percona-cluster, the ceph-mon and rabbitmq-server applications do not
   use hacluster to achieve HA, nor do they need backups. Disregard therefore
   the hacluster and backup steps for these two applications.

   The ceph-mon charm will maintain the MON cluster during a series upgrade, so
   ceph-mon units do not need to be paused.

This scenario is represented by the following partial :command:`juju status`
command output:

.. code-block:: console

   Model    Controller       Cloud/Region    Version  SLA          Timestamp
   upgrade  maas-controller  mymaas/default  2.7.6    unsupported  18:26:57Z

   App                        Version  Status  Scale  Charm            Store       Rev  OS      Notes
   percona-cluster            5.6.37   active      3  percona-cluster  jujucharms  286  ubuntu
   percona-cluster-hacluster           active      3  hacluster        jujucharms   66  ubuntu

   Unit                            Workload  Agent  Machine  Public address  Ports     Message
   percona-cluster/0               active    idle   0/lxd/0  10.0.0.47       3306/tcp  Unit is ready
     percona-cluster-hacluster/0*  active    idle            10.0.0.47                 Unit is ready and clustered
   percona-cluster/1*              active    idle   1/lxd/0  10.0.0.48       3306/tcp  Unit is ready
     percona-cluster-hacluster/2   active    idle            10.0.0.48                 Unit is ready and clustered
   percona-cluster/2               active    idle   2/lxd/0  10.0.0.49       3306/tcp  Unit is ready
     percona-cluster-hacluster/1   active    idle            10.0.0.49                 Unit is ready and clustered

In summary, the principle leader unit is percona-cluster/1 and is deployed on
machine 1/lxd/0 (the principle leader machine).

.. warning::

   During this upgrade, there will be a MySQL service outage. The HA resources
   provided by hacluster will **not** be monitored during the series upgrade
   due to the pausing of units.

#. Perform any workload maintenance pre-upgrade steps. For percona-cluster,
   take a backup and transfer it to a secure location:

   .. code-block:: none

      juju run-action --wait percona-cluster/1 backup
      juju scp -- -r percona-cluster/1:/opt/backups/mysql /path/to/local/directory

   Permissions will need to be altered on the remote machine, and note that the
   last command transfers **all** existing backups.

#. Set the default series for both the model and the principle application:

   .. code-block:: none

      juju model-config default-series=bionic
      juju set-series percona-cluster bionic

#. Pause the hacluster units not associated with the principle leader machine:

   .. code-block:: none

      juju run-action --wait percona-cluster-hacluster/0 pause
      juju run-action --wait percona-cluster-hacluster/1 pause

#. Pause the principle non-leader units:

   .. code-block:: none

      juju run-action --wait percona-cluster/0 pause
      juju run-action --wait percona-cluster/2 pause

   For percona-cluster, leaving the principle leader unit up will ensure it
   has the latest MySQL sequence number; it will be considered the most up to
   date cluster member.

#. Perform a series upgrade on the principle leader machine:

   .. code-block:: none

      juju upgrade-series 1/lxd/0 prepare bionic
      juju run --machine=1/lxd/0 -- sudo apt update
      juju ssh 1/lxd/0 sudo apt full-upgrade
      juju ssh 1/lxd/0 sudo do-release-upgrade

   For percona-cluster, there are no post-upgrade steps; the prompt to reboot
   can be answered in the affirmative.

#. Set the value of the ``source`` configuration option to 'distro':

   .. code-block:: none

      juju config percona-cluster source=distro

#. Invoke the :command:`complete` sub-command on the principle leader machine:

   .. code-block:: none

      juju upgrade-series 1/lxd/0 complete

   At this point the :command:`juju status` output looks like this:

   .. code-block:: console

      Model    Controller       Cloud/Region    Version  SLA          Timestamp
      upgrade  maas-controller  mymaas/default  2.7.6    unsupported  19:51:52Z

      App                        Version  Status       Scale  Charm            Store       Rev  OS      Notes
      percona-cluster            5.7.20   maintenance      3  percona-cluster  jujucharms  286  ubuntu
      percona-cluster-hacluster           blocked          3  hacluster        jujucharms   66  ubuntu

      Unit                            Workload     Agent  Machine  Public address  Ports     Message
      percona-cluster/0               maintenance  idle   0/lxd/0  10.0.0.47       3306/tcp  Paused. Use 'resume' action to resume normal service.
        percona-cluster-hacluster/0*  maintenance  idle            10.0.0.47                 Paused. Use 'resume' action to resume normal service.
      percona-cluster/1*              active       idle   1/lxd/0  10.0.0.48       3306/tcp  Unit is ready
        percona-cluster-hacluster/2   blocked      idle            10.0.0.48                 Resource: res_mysql_11810cc_vip not running
      percona-cluster/2               maintenance  idle   2/lxd/0  10.0.0.49       3306/tcp  Paused. Use 'resume' action to resume normal service.
        percona-cluster-hacluster/1   maintenance  idle            10.0.0.49                 Paused. Use 'resume' action to resume normal service.

      Machine  State    DNS        Inst id              Series  AZ     Message
      0        started  10.0.0.44  node1                xenial  zone1  Deployed
      0/lxd/0  started  10.0.0.47  juju-f83fcd-0-lxd-0  xenial  zone1  Container started
      1        started  10.0.0.45  node2                xenial  zone2  Deployed
      1/lxd/0  started  10.0.0.48  juju-f83fcd-1-lxd-0  bionic  zone2  Running
      2        started  10.0.0.46  node3                xenial  zone3  Deployed
      2/lxd/0  started  10.0.0.49  juju-f83fcd-2-lxd-0  xenial  zone3  Container started

#. For percona-cluster, a sanity check should be done on the leader unit's
   databases and data.

#. Repeat steps 5 and 7 for the principle non-leader machines.

#. Perform any possible cluster completed upgrade tasks once all machines have
   had their series upgraded:

   .. code-block:: none

      juju run-action --wait percona-cluster/leader complete-cluster-series-upgrade

   For percona-cluster (and rabbitmq-server), the above action is performed on
   the leader unit. It informs each cluster node that the upgrade process is
   complete cluster-wide. This also updates MySQL configuration with all peers
   in the cluster.

API applications
~~~~~~~~~~~~~~~~

This section covers series upgrade procedures for containerised API
applications. These include, but are not limited to:

* cinder
* glance
* keystone
* neutron-api
* nova-cloud-controller

Machines hosting API applications can have their series upgraded concurrently
because those applications are stateless. This results in a dramatically
reduced downtime for the application. A sequential approach will not reduce
downtime as the HA services will still need to be brought down during the
upgrade associated with the application leader.

The following two sub-sections will show how to perform a series upgrade
concurrently for a single API application and for multiple API applications.

Upgrading a single API application concurrently
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This example procedure will be based on the keystone application.

This scenario is represented by the following partial :command:`juju status`
command output:

.. code-block:: console

   Model    Controller       Cloud/Region    Version  SLA          Timestamp
   upgrade  maas-controller  mymaas/default  2.7.6    unsupported  22:48:41Z

   App                 Version  Status  Scale  Charm            Store       Rev  OS      Notes
   keystone            13.0.2   active      3  keystone         jujucharms  312  ubuntu
   keystone-hacluster           active      3  hacluster        jujucharms   66  ubuntu

   Unit                     Workload  Agent  Machine  Public address  Ports     Message
   keystone/0*              active    idle   0/lxd/0  10.0.0.70       5000/tcp  Unit is ready
     keystone-hacluster/0*  active    idle            10.0.0.70                 Unit is ready and clustered
   keystone/1               active    idle   1/lxd/0  10.0.0.71       5000/tcp  Unit is ready
     keystone-hacluster/2   active    idle            10.0.0.71                 Unit is ready and clustered
   keystone/2               active    idle   2/lxd/0  10.0.0.72       5000/tcp  Unit is ready
     keystone-hacluster/1   active    idle            10.0.0.72                 Unit is ready and clustered

In summary, the principle leader unit is keystone/0 and is deployed on machine
0/lxd/0 (the principle leader machine).

#. Set the default series for both the model and the principle application:

   .. code-block:: none

      juju model-config default-series=bionic
      juju set-series keystone bionic

#. Pause the hacluster units not associated with the principle leader machine:

   .. code-block:: none

      juju run-action --wait keystone-hacluster/1 pause
      juju run-action --wait keystone-hacluster/2 pause

#. Pause the principle non-leader units:

   .. code-block:: none

      juju run-action --wait keystone/1 pause
      juju run-action --wait keystone/2 pause

#. Perform any workload maintenance pre-upgrade steps on all machines. There
   are no keystone-specific steps to perform.

#. Invoke the :command:`prepare` sub-command on all machines, **starting with
   the principle leader machine**:

   .. code-block:: none

      juju upgrade-series 0/lxd/0 prepare bionic
      juju upgrade-series 1/lxd/0 prepare bionic
      juju upgrade-series 2/lxd/0 prepare bionic

   At this point the :command:`juju status` output looks like this:

   .. code-block:: console

      Model    Controller       Cloud/Region    Version  SLA          Timestamp
      upgrade  maas-controller  mymaas/default  2.7.6    unsupported  23:11:01Z

      App                 Version  Status   Scale  Charm            Store       Rev  OS      Notes
      keystone            13.0.2   blocked      3  keystone         jujucharms  312  ubuntu
      keystone-hacluster           blocked      3  hacluster        jujucharms   66  ubuntu

      Unit                     Workload  Agent  Machine  Public address  Ports     Message
      keystone/0*              blocked   idle   0/lxd/0  10.0.0.70       5000/tcp  Ready for do-release-upgrade and reboot. Set complete when finished.
        keystone-hacluster/0*  blocked   idle            10.0.0.70                 Ready for do-release-upgrade. Set complete when finished
      keystone/1               blocked   idle   1/lxd/0  10.0.0.71       5000/tcp  Ready for do-release-upgrade and reboot. Set complete when finished.
        keystone-hacluster/2   blocked   idle            10.0.0.71                 Ready for do-release-upgrade. Set complete when finished
      keystone/2               blocked   idle   2/lxd/0  10.0.0.72       5000/tcp  Ready for do-release-upgrade and reboot. Set complete when finished.
        keystone-hacluster/1   blocked   idle            10.0.0.72                 Ready for do-release-upgrade. Set complete when finished

#. Upgrade the operating system on all machines. The non-interactive method is
   used here:

   .. code-block:: none

      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0 --timeout=10m \
         -- sudo apt-get update
      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0 --timeout=60m \
         -- sudo DEBIAN_FRONTEND=noninteractive apt-get --assume-yes \
         -o "Dpkg::Options::=--force-confdef" \
         -o "Dpkg::Options::=--force-confold" dist-upgrade
      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0 --timeout=120m \
         -- sudo DEBIAN_FRONTEND=noninteractive \
         do-release-upgrade -f DistUpgradeViewNonInteractive

   .. important::

      Choose values for the ``--timeout`` option that are appropriate for the
      task at hand.

#. Perform any workload maintenance post-upgrade steps on all machines. There
   are no keystone-specific steps to perform.

#. Reboot all machines:

   .. code-block:: none

      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0 -- sudo reboot

#. Set the value of the ``openstack-origin`` configuration option to 'distro':

   .. code-block:: none

      juju config keystone openstack-origin=distro

#. Invoke the :command:`complete` sub-command on all machines:

   .. code-block:: none

      juju upgrade-series 0/lxd/0 complete
      juju upgrade-series 1/lxd/0 complete
      juju upgrade-series 2/lxd/0 complete

Upgrading multiple API applications concurrently
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This example procedure will be based on the nova-cloud-controller and glance
applications.

This scenario is represented by the following partial :command:`juju status`
command output:

.. code-block:: console

   Model    Controller       Cloud/Region    Version  SLA          Timestamp
   upgrade  maas-controller  mymaas/default  2.7.6    unsupported  19:23:41Z

   App                    Version  Status  Scale  Charm                  Store       Rev  OS      Notes
   glance                 16.0.1   active      3  glance                 jujucharms  295  ubuntu
   glance-hacluster                active      3  hacluster              jujucharms   66  ubuntu
   nova-cc-hacluster               active      3  hacluster              jujucharms   66  ubuntu
   nova-cloud-controller  17.0.12  active      3  nova-cloud-controller  jujucharms  343  ubuntu

   Unit                      Workload  Agent  Machine  Public address  Ports              Message
   glance/0*                 active    idle   0/lxd/0  10.246.114.39   9292/tcp           Unit is ready
     glance-hacluster/0*     active    idle            10.246.114.39                      Unit is ready and clustered
   glance/1                  active    idle   1/lxd/0  10.246.114.40   9292/tcp           Unit is ready
     glance-hacluster/1      active    idle            10.246.114.40                      Unit is ready and clustered
   glance/2                  active    idle   2/lxd/0  10.246.114.41   9292/tcp           Unit is ready
     glance-hacluster/2      active    idle            10.246.114.41                      Unit is ready and clustered
   nova-cloud-controller/0   active    idle   3/lxd/0  10.246.114.48   8774/tcp,8778/tcp  Unit is ready
     nova-cc-hacluster/2     active    idle            10.246.114.48                      Unit is ready and clustered
   nova-cloud-controller/1*  active    idle   4/lxd/0  10.246.114.43   8774/tcp,8778/tcp  Unit is ready
     nova-cc-hacluster/0*    active    idle            10.246.114.43                      Unit is ready and clustered
   nova-cloud-controller/2   active    idle   5/lxd/0  10.246.114.47   8774/tcp,8778/tcp  Unit is ready
     nova-cc-hacluster/1     active    idle            10.246.114.47                      Unit is ready and clustered

In summary,

* The glance principle leader unit is glance/0 and is deployed on machine
  0/lxd/0 (the glance principle leader machine).
* The nova-cloud-controller principle leader unit is nova-cloud-controller/1
  and is deployed on machine 4/lxd/0 (the nova-cloud-controller principle
  leader machine).

The procedure has been expedited slightly by adding the ``--yes`` confirmation
option to the :command:`prepare` sub-command.

#. Set the default series for both the model and the principle applications:

   .. code-block:: none

      juju model-config default-series=bionic
      juju set-series glance bionic
      juju set-series nova-cloud-controller bionic

#. Pause the hacluster units not associated with their principle leader
   machines:

   .. code-block:: none

      juju run-action --wait glance-hacluster/1 pause
      juju run-action --wait glance-hacluster/2 pause
      juju run-action --wait nova-cc-hacluster/1 pause
      juju run-action --wait nova-cc-hacluster/2 pause

#. Pause the principle non-leader units:

   .. code-block:: none

      juju run-action --wait glance/1 pause
      juju run-action --wait glance/2 pause
      juju run-action --wait nova-cloud-controller/0 pause
      juju run-action --wait nova-cloud-controller/2 pause

#. Perform any workload maintenance pre-upgrade steps on all machines. There
   are no glance-specific or nova-cloud-controller-specific steps to perform.

#. Invoke the :command:`prepare` sub-command on all machines, **starting with
   the principle leader machines**:

   .. code-block:: none

      juju upgrade-series --yes 0/lxd/0 prepare bionic
      juju upgrade-series --yes 4/lxd/0 prepare bionic
      juju upgrade-series --yes 1/lxd/0 prepare bionic
      juju upgrade-series --yes 2/lxd/0 prepare bionic
      juju upgrade-series --yes 3/lxd/0 prepare bionic
      juju upgrade-series --yes 5/lxd/0 prepare bionic

#. Upgrade the operating system on all machines. The non-interactive method is
   used here:

   .. code-block:: none

      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0,3/lxd/0,4/lxd/0,5/lxd/0 \
         --timeout=20m -- sudo apt-get update
      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0,3/lxd/0,4/lxd/0,5/lxd/0 \
         --timeout=120m -- sudo DEBIAN_FRONTEND=noninteractive apt-get --assume-yes \
         -o "Dpkg::Options::=--force-confdef" \
         -o "Dpkg::Options::=--force-confold" dist-upgrade
      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0,3/lxd/0,4/lxd/0,5/lxd/0 \
         --timeout=200m -- sudo DEBIAN_FRONTEND=noninteractive \
         do-release-upgrade -f DistUpgradeViewNonInteractive

#. Perform any workload maintenance post-upgrade steps on all machines. There
   are no glance-specific or nova-cloud-controller-specific steps to perform.

#. Reboot all machines:

   .. code-block:: none

      juju run --machine=0/lxd/0,1/lxd/0,2/lxd/0,3/lxd/0,4/lxd/0,5/lxd/0 -- sudo reboot

#. Set the value of the ``openstack-origin`` configuration option to 'distro':

   .. code-block:: none

      juju config glance openstack-origin=distro
      juju config nova-cloud-controller openstack-origin=distro

#. Invoke the :command:`complete` sub-command on all machines:

   .. code-block:: none

      juju upgrade-series 0/lxd/0 complete
      juju upgrade-series 1/lxd/0 complete
      juju upgrade-series 2/lxd/0 complete
      juju upgrade-series 3/lxd/0 complete
      juju upgrade-series 4/lxd/0 complete
      juju upgrade-series 5/lxd/0 complete

Physical machines
~~~~~~~~~~~~~~~~~

This section covers series upgrade procedures for applications hosted on
physical machines in particular. These typically include:

* ceph-osd
* neutron-gateway
* nova-compute

When performing a series upgrade on a physical machine more attention should be
given to any workload maintenance pre-upgrade steps:

* For compute nodes migrate all running VMs to another hypervisor.
* For network nodes force HA routers off of the current node.
* Any storage related tasks that may be required.
* Any site specific tasks that may be required.

The following two sub-sections will show how to perform a series upgrade
for a single physical machine and for multiple physical machines concurrently.

Upgrading a single physical machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This example procedure will be based on the nova-compute and ceph-osd
applications residing on the same physical machine. Since application
leadership does not play a significant role with these two applications, and
because the hacluster application is not present, there will be no units to
pause (as there were in previous scenarios).

This scenario is represented by the following partial :command:`juju status`
command output:

.. code-block:: console

   Model    Controller       Cloud/Region    Version  SLA          Timestamp
   upgrade  maas-controller  mymaas/default  2.7.6    unsupported  15:23:21Z

   App           Version  Status  Scale  Charm         Store       Rev  OS      Notes
   ceph-osd      12.2.12  active      1  ceph-osd      jujucharms  301  ubuntu
   keystone      13.0.2   active      1  keystone      jujucharms  312  ubuntu
   nova-compute  17.0.12  active      1  nova-compute  jujucharms  314  ubuntu

   Unit             Workload  Agent  Machine  Public address  Ports     Message
   ceph-osd/0*      active    idle   0        10.0.0.235                Unit is ready (1 OSD)
   keystone/0*      active    idle   0/lxd/0  10.0.0.240      5000/tcp  Unit is ready
   nova-compute/0*  active    idle   0        10.0.0.235                Unit is ready

   Machine  State    DNS         Inst id              Series  AZ     Message
   0        started  10.0.0.235  node1                xenial  zone1  Deployed
   0/lxd/0  started  10.0.0.240  juju-88b27a-0-lxd-0  xenial  zone1  Container started

In summary, the ceph-osd and nova-compute applications are hosted on machine 0.
Recall that container 0/lxd/0 will need to have its series upgraded separately.

#. It is recommended to set the Ceph cluster OSDs to 'noout'. This is typically
   done at the application level (i.e. not at the unit or machine level):

   .. code-block:: none

      juju run-action --wait ceph-mon/leader set-noout

#. All running VMs should be migrated to another hypervisor.

#. Upgrade the series on machine 0:

   #. Invoke the :command:`prepare` sub-command:

      .. code-block:: none

         juju upgrade-series 0 prepare bionic

   #. Upgrade the operating system:

      .. code-block:: none

         juju run --machine=0 -- sudo apt update
         juju ssh 0 sudo apt full-upgrade
         juju ssh 0 sudo do-release-upgrade

   #. Reboot (if not already done):

      .. code-block:: none

         juju run --machine=0 -- sudo reboot

   #. Set the value of the ``openstack-origin`` or ``source`` configuration
      options to 'distro':

      .. code-block:: none

         juju config nova-compute openstack-origin=distro
         juju config ceph-osd source=distro

   #. Invoke the :command:`complete` sub-command on the machine:

      .. code-block:: none

         juju upgrade-series 0 complete

#. If OSDs were previously set to 'noout' then check up/in status of those
   OSDs in ceph status, then unset 'noout' for the cluster:

   .. code-block:: none

      juju run --unit ceph-mon/leader -- ceph status
      juju run-action --wait ceph-mon/leader unset-noout

Upgrading multiple physical hosts concurrently
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When physical machines have their series upgraded concurrently Availability
Zones need to be taken into account. Machines should be placed into upgrade
groups such that any API services running on them have a maximum of one unit
per group. This is to ensure API availability at the reboot stage.

This simplified bundle is used to demonstrate the general idea:

.. code-block:: yaml

   series: xenial
   machines:
     0: {}
     1: {}
     2: {}
     3: {}
     4: {}
     5: {}
   applications:
     nova-compute:
       charm: cs:nova-compute
       num_units: 3
       options:
         openstack-origin: cloud:xenial-queens
       to:
         - 0
         - 2
         - 4
     keystone:
       charm: cs:keystone
       constraints: mem=1G
       num_units: 3
       options:
         vip: 10.85.132.200
         openstack-origin: cloud:xenial-queens
       to:
         - lxd:1
         - lxd:3
         - lxd:5
     keystone-hacluster:
       charm: cs:hacluster
       options:
         cluster_count: 3

Three upgrade groups could consist of the following machines:

#. Machines 0 and 1
#. Machines 2 and 3
#. Machines 4 and 5

In this way, a less time-consuming series upgrade can be performed while still
ensuring the availability of services.

.. caution::

   For the ceph-osd application, ensure that rack-aware replication rules exist
   in the CRUSH map if machines are being rebooted together. This is to prevent
   significant interruption to running workloads from occurring if the
   same placement group is hosted on those machines. For example, if ceph-mon
   is deployed with ``customize-failure-domain`` set to 'true' and the ceph-osd
   units are hosted on machines in three or more separate Juju AZs you can
   safely reboot ceph-osd machines concurrently in the same zone. See
   :ref:`Ceph AZ <ceph_az>` in :doc:`OpenStack high availability <app-ha>` for
   details.

Automation
----------

Series upgrades across an OpenStack cloud can be time consuming, even when
using concurrent methods wherever possible. They can also be tedious and thus
susceptible to human error.

The following code examples encapsulate the processes described in this
document. They are provided solely to illustrate the methods used to develop
and test the series upgrade primitives:

* `Parallel tests`_: An example that is used as a functional verification of
  a series upgrade in the OpenStack Charms project.
* `Upgrade helpers`_: A set of helpers used in the above upgrade example.

.. caution::

   The example code should only be used for its intended use case of
   development and testing. Do not attempt to automate a series upgrade on a
   production cloud.

.. LINKS
.. _Charm upgrades: app-upgrade-openstack#charm-upgrades
.. _Series upgrade: app-series-upgrade
.. _Parallel tests: https://github.com/openstack-charmers/zaza-openstack-tests/blob/c492ecdcac3b2724833c347e978de97ea2e626d7/zaza/openstack/charm_tests/series_upgrade/parallel_tests.py#L64
.. _Upgrade helpers: https://github.com/openstack-charmers/zaza-openstack-tests/blob/9cec2efabe30fb0709bc098c48ec10bcb85cc9d4/zaza/openstack/utilities/parallel_series_upgrade.py