Alexandr Nevenchannyy 92816644f7 OpenStack reliability test plan
This document describes an abstract methodology for analysing
reliability of a high-availability OpenStack cluster and its components.

Co-Authored-By: Bogdan Dobrelia <bdobrelia@mirantis.com>
Change-Id: I5a08c1a39bab96d90c6f7a873fdc771516ffba48
2016-07-01 16:53:20 +03:00

.. _reliability_testing:

=============================
OpenStack reliability testing
=============================

:status: draft
:version: 0

:Abstract:

  This document describes an abstract methodology for OpenStack cluster
  high-availability testing and analysis. OpenStack data plane testing
  is currently out of scope but will be described in the future.

:Conventions:

  - **OpenStack cluster:** a set of server nodes running a deployed and fully
    operational OpenStack environment in a high-availability configuration.

  - **Fault-injection operation:** one of the common types of failure that can
    occur in a production environment: service-hang, service-crash,
    network-partition, network-flapping, and node-crash.

  - **Service-hang:** faults are injected into the specified OpenStack service
    by sending it the SIGSTOP and SIGCONT POSIX signals.

  - **Service-crash:** faults are injected by sending the SIGKILL signal to
    the specified OpenStack service.

  - **Node-crash:** faults are injected into an OpenStack cluster by rebooting
    or shutting down a server node.

  - **Network-partition:** faults are injected by inserting iptables rules on
    the OpenStack cluster nodes that host the service to be
    network-partitioned.

  - **Network-flapping:** faults are injected into OpenStack cluster nodes by
    inserting and deleting iptables rules on the fly, which affects the
    service under test.

  - **Factor:** a set of atomic fault-injection operations, for example
    reboot-random-controller or reboot-random-rabbitmq.

  - **Test plan:** contains two elements: a test scenario execution graph and
    fault-injection factors.

  - **SLA:** service-level agreement.

  - **Testing-cycles:** the number of test cycles run for each factor.

  - **Inf:** denotes an infinite time to auto-heal the cluster after
    fault-factor injection, that is, the cluster did not recover on its own.
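
The service-level fault types defined above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for this test plan: the function names ``inject_service_hang``, ``inject_service_crash`` and ``partition_rule`` are hypothetical, and the iptables rule shown (dropping inbound TCP traffic to one port) is only one assumed way to partition a service. Only ``os.kill`` and the ``signal`` constants are standard-library APIs.

```python
import os
import signal
import time
from typing import Callable, List

# Type of the function used to deliver a signal (os.kill by default).
Kill = Callable[[int, int], None]


def inject_service_hang(pid: int, hang_seconds: float,
                        kill: Kill = os.kill,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Service-hang: freeze the process with SIGSTOP, then resume it with SIGCONT."""
    kill(pid, signal.SIGSTOP)
    try:
        sleep(hang_seconds)
    finally:
        # Always resume the service, even if the wait is interrupted.
        kill(pid, signal.SIGCONT)


def inject_service_crash(pid: int, kill: Kill = os.kill) -> None:
    """Service-crash: terminate the process immediately with SIGKILL."""
    kill(pid, signal.SIGKILL)


def partition_rule(port: int, action: str = "-I") -> List[str]:
    """Build (but do not run) an iptables command dropping inbound TCP to a port.

    Assumed convention: "-I" inserts the rule (start of the partition) and
    "-D" deletes it again; alternating the two approximates network flapping.
    """
    return ["iptables", action, "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]
```

The ``kill`` and ``sleep`` parameters exist only so the helpers can be exercised without touching a real process; on a cluster node the defaults would be used and the command list would be passed to a process runner with root privileges.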

Test Plan
=========

Test Environment
----------------

This section should contain all information about the deployed OpenStack
environment, including an archive with all information in the ``/etc`` folder
from all nodes.

Preparation
^^^^^^^^^^^

This section should contain all steps needed to reproduce the deployment of
the OpenStack environment and the client node. For example, if the testing
environment is deployed with DevStack, this section should contain all
DevStack configuration files, the DevStack version, and all deployment steps.

Environment description
^^^^^^^^^^^^^^^^^^^^^^^

This section should contain all cluster hardware information, including
processor model and frequency, memory size, storage type and capacity,
network interfaces, and so on.

A separate client node must be used to drive the tests.

Hardware
~~~~~~~~

This section should contain a full specification of the hardware nodes.

.. table:: Description of server hardware

  +--------+----------------+-------+-------+
  |SERVER  |name            |       |       |
  |        +----------------+-------+-------+
  |        |role            |       |       |
  |        +----------------+-------+-------+
  |        |vendor,model    |       |       |
  |        +----------------+-------+-------+
  |        |operating_system|       |       |
  +--------+----------------+-------+-------+
  |CPU     |vendor,model    |       |       |
  |        +----------------+-------+-------+
  |        |processor_count |       |       |
  |        +----------------+-------+-------+
  |        |core_count      |       |       |
  |        +----------------+-------+-------+
  |        |frequency_MHz   |       |       |
  +--------+----------------+-------+-------+
  |RAM     |vendor,model    |       |       |
  |        +----------------+-------+-------+
  |        |amount_MB       |       |       |
  +--------+----------------+-------+-------+
  |NETWORK |interface_name  |       |       |
  |        +----------------+-------+-------+
  |        |vendor,model    |       |       |
  |        +----------------+-------+-------+
  |        |bandwidth       |       |       |
  +--------+----------------+-------+-------+
  |STORAGE |dev_name        |       |       |
  |        +----------------+-------+-------+
  |        |vendor,model    |       |       |
  |        +----------------+-------+-------+
  |        |SSD/HDD         |       |       |
  |        +----------------+-------+-------+
  |        |size            |       |       |
  +--------+----------------+-------+-------+

Networking
~~~~~~~~~~

This section should contain a full description of the network equipment used
in the OpenStack cluster. A network topology diagram and the network hardware
configuration files should be included in this section.

Factors description
-------------------

Describe here the factors used during the test runs. Examples are:

- **reboot-random-controller:** consists of a node-crash fault injection on a
  random OpenStack controller node.

- **reboot-random-rabbitmq:** consists of a node-crash fault injection on the
  master RabbitMQ messaging node.

- **sigstop-random-nova-api:** consists of a service-hang fault injection on a
  random nova-api service.

- **sigkill-random-mysql:** consists of a service-crash fault injection on a
  random MySQL node.

- **network-partition-random-mysql:** consists of a network-partition fault
  injection on a random MySQL node.
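
One way to picture a factor as a set of atomic fault-injection operations is the small data model below. It is a sketch under stated assumptions: the ``Operation`` and ``Factor`` classes, the ``run_factor`` helper, and the node names are all hypothetical and do not come from any specific fault-injection tool.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Operation:
    """One atomic fault-injection operation and the targets it may hit."""
    fault_type: str                 # e.g. "node-crash", "service-hang"
    candidates: Tuple[str, ...]     # nodes or service instances to choose from


@dataclass(frozen=True)
class Factor:
    """A named factor made of one or more atomic operations."""
    name: str
    operations: Tuple[Operation, ...]


def run_factor(factor: Factor,
               inject: Callable[[str, str], None],
               rng: random.Random) -> List[Tuple[str, str]]:
    """Pick a random target for each operation and hand it to `inject`."""
    chosen = []
    for op in factor.operations:
        target = rng.choice(op.candidates)
        inject(op.fault_type, target)
        chosen.append((op.fault_type, target))
    return chosen


# Hypothetical controller names, for illustration only.
reboot_random_controller = Factor(
    name="reboot-random-controller",
    operations=(Operation("node-crash", ("ctl-1", "ctl-2", "ctl-3")),),
)
```

Passing the random number generator explicitly makes a run reproducible when a seed is recorded alongside the results.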

Test Case 1: NovaServers.boot_and_delete_server
-----------------------------------------------

Description
^^^^^^^^^^^

This Rally scenario boots and deletes virtual instances through the OpenStack
Nova API while fault factors are injected.

Service-level agreement
^^^^^^^^^^^^^^^^^^^^^^^

In this section, specify SLA values. For example:

=================== ========
Parameter           Value
=================== ========
MTTR (sec)          <=240
Failure rate (%)    <=95
Auto-healing        Yes
=================== ========

Parameters
^^^^^^^^^^

In this section, specify load parameters during the test. For example:

=================== ========
Parameter           Value
=================== ========
Runner              constant
Concurrency         X
Times               Y
Injection-iteration Z
Testing-cycles      N
=================== ========
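
As a rough illustration, the runner parameters above map onto a Rally task description like the one built below. The numbers are placeholders chosen for the example, since the plan leaves X, Y, Z and N open. The ``build_rally_task`` helper and the ``test_plan`` wrapper are hypothetical, and the ``injection_iteration`` and ``testing_cycles`` keys are assumed bookkeeping fields, not Rally runner options.

```python
import json


def build_rally_task(scenario: str, times: int, concurrency: int) -> dict:
    """Return a Rally task with one workload driven by the constant runner."""
    return {
        scenario: [
            {
                # Scenario-specific arguments (flavor, image, ...) go here.
                "args": {},
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
            }
        ]
    }


# Illustrative numbers only.
task = build_rally_task("NovaServers.boot_and_delete_server",
                        times=100, concurrency=4)

# Hypothetical wrapper that records when to inject the fault and how many
# cycles to run; Rally itself does not read these two keys.
test_plan = {"task": task, "injection_iteration": 50, "testing_cycles": 5}

print(json.dumps(test_plan, indent=2))
```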

List of reliability metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== ============== ================= =================================================
Priority Value          Measurement Units Description
======== ============== ================= =================================================
1        SLA            Boolean           Service-level agreement result
2        Auto-healing   Boolean           Is cluster auto-healed after fault-injection
3        Failure rate   Percent           Test iteration failure ratio
4        MTTR (auto)    Seconds           Automatic mean time to repair
5        MTTR (manual)  Seconds           Manual mean time to repair, if Auto MTTR is Inf.
======== ============== ================= =================================================
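
The metrics listed above could be derived per test cycle roughly as follows. This is a sketch, assuming that the number of failed iterations and the automatic recovery time have already been measured for the cycle; how they are measured is outside this example. The ``CycleResult`` class and ``meets_sla`` function are hypothetical names, and an automatic MTTR of Inf is modelled as ``math.inf``.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleResult:
    iterations: int        # test iterations executed in the cycle
    failed: int            # iterations that ended in an error
    mttr_seconds: float    # math.inf when the cluster never auto-healed

    @property
    def failure_rate(self) -> float:
        """Test iteration failure ratio, in percent."""
        return 100.0 * self.failed / self.iterations

    @property
    def auto_healed(self) -> bool:
        """True when the automatic MTTR is finite."""
        return math.isfinite(self.mttr_seconds)


def meets_sla(cycle: CycleResult, max_mttr: float, max_failure_rate: float,
              require_auto_healing: bool = True) -> bool:
    """Boolean SLA result: every threshold must hold for the cycle."""
    if require_auto_healing and not cycle.auto_healed:
        return False
    return (cycle.mttr_seconds <= max_mttr
            and cycle.failure_rate <= max_failure_rate)
```

With the example SLA values from Test Case 1 (MTTR <=240, failure rate <=95, auto-healing required), a cycle with 20 failed iterations out of 100 and a 180-second recovery passes, while any cycle that never auto-heals fails regardless of its failure rate.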

Results
^^^^^^^

reboot-random-controller
~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: **Full description of cyclic execution results**

  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | Cycles             | MTTR(sec)      | Failure rate(%)     | Auto-healing     | Performance degradation     |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 1                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 2                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 3                  | X              | Y                   | No               | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 4                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 5                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+

Place a link to the Rally report file with the results of testing this factor
here.

.. table:: **Testing results summary**

  +--------------------+------------+------------------+
  | Value              | MTTR       | Failure rate     |
  +--------------------+------------+------------------+
  | Min                | X          | Y                |
  +--------------------+------------+------------------+
  | Max                | X          | Y                |
  +--------------------+------------+------------------+
  | SLA                | X          | Y                |
  +--------------------+------------+------------------+
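
The summary rows can be derived mechanically from the per-cycle table. The sketch below assumes each cycle is reduced to a pair of MTTR in seconds and failure rate in percent, and that the SLA row simply repeats the agreed thresholds; both points are assumptions, as is the ``summarize`` helper name.

```python
from typing import Dict, Sequence, Tuple

# (MTTR in seconds, failure rate in percent) for one cycle.
Cycle = Tuple[float, float]


def summarize(cycles: Sequence[Cycle], sla: Cycle) -> Dict[str, Cycle]:
    """Return the Min, Max and SLA rows of the results summary table."""
    mttrs = [mttr for mttr, _ in cycles]
    rates = [rate for _, rate in cycles]
    return {
        "Min": (min(mttrs), min(rates)),
        "Max": (max(mttrs), max(rates)),
        # Assumed meaning of the SLA row: the thresholds the run is judged against.
        "SLA": sla,
    }
```

A cycle that did not auto-heal can be passed with ``float("inf")`` as its MTTR, in which case the Max row reports an infinite MTTR.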

Detailed results description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, provide a detailed description of the test results,
including the impact of the factor.

reboot-random-rabbitmq
~~~~~~~~~~~~~~~~~~~~~~

.. table:: **Full description of cyclic execution results**

  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | Cycles             | MTTR(sec)      | Failure rate(%)     | Auto-healing     | Performance degradation     |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 1                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 2                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 3                  | X              | Y                   | No               | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 4                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 5                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+

Place a link to the Rally report file with the results of testing this factor
here.

.. table:: **Testing results summary**

  +--------------------+------------+------------------+
  | Value              | MTTR       | Failure rate     |
  +--------------------+------------+------------------+
  | Min                | X          | Y                |
  +--------------------+------------+------------------+
  | Max                | X          | Y                |
  +--------------------+------------+------------------+
  | SLA                | X          | Y                |
  +--------------------+------------+------------------+

Detailed results description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, provide a detailed description of the test results,
including the impact of the factor.

Test Case 2: GlanceImages.create_and_delete_image
-------------------------------------------------

Description
^^^^^^^^^^^

This Rally scenario creates and deletes images through the OpenStack Glance
API while fault factors are injected.

Service-level agreement
^^^^^^^^^^^^^^^^^^^^^^^

In this section, specify SLA values. For example:

=================== ========
Parameter           Value
=================== ========
MTTR (sec)          <=120
Failure rate (%)    <=95
Auto-healing        Yes
=================== ========

Parameters
^^^^^^^^^^

In this section, specify load parameters during the test. For example:

=================== ========
Parameter           Value
=================== ========
Runner              constant
Concurrency         X
Times               Y
Injection-iteration Z
Testing-cycles      N
=================== ========

List of reliability metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

======== ============== ================= =================================================
Priority Value          Measurement Units Description
======== ============== ================= =================================================
1        SLA            Boolean           Service-level agreement result
2        Auto-healing   Boolean           Is cluster auto-healed after fault-injection
3        Failure rate   Percent           Test iteration failure ratio
4        MTTR (auto)    Seconds           Automatic mean time to repair
5        MTTR (manual)  Seconds           Manual mean time to repair, if Auto MTTR is Inf.
======== ============== ================= =================================================

Results
^^^^^^^

reboot-random-controller
~~~~~~~~~~~~~~~~~~~~~~~~

.. table:: **Full description of cyclic execution results**

  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | Cycles             | MTTR(sec)      | Failure rate(%)     | Auto-healing     | Performance degradation     |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 1                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 2                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 3                  | X              | Y                   | No               | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 4                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 5                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+

Place a link to the Rally report file with the results of testing this factor
here.

.. table:: **Testing results summary**

  +--------------------+------------+------------------+
  | Value              | MTTR       | Failure rate     |
  +--------------------+------------+------------------+
  | Min                | X          | Y                |
  +--------------------+------------+------------------+
  | Max                | X          | Y                |
  +--------------------+------------+------------------+
  | SLA                | X          | Y                |
  +--------------------+------------+------------------+

Detailed results description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, provide a detailed description of the test results,
including the impact of the factor.

reboot-random-rabbitmq
~~~~~~~~~~~~~~~~~~~~~~

.. table:: **Full description of cyclic execution results**

  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | Cycles             | MTTR(sec)      | Failure rate(%)     | Auto-healing     | Performance degradation     |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 1                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 2                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 3                  | X              | Y                   | No               | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 4                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+
  | 5                  | X              | Y                   | Yes              | Yes                         |
  +--------------------+----------------+---------------------+------------------+-----------------------------+

Place a link to the Rally report file with the results of testing this factor
here.

.. table:: **Testing results summary**

  +--------------------+------------+------------------+
  | Value              | MTTR       | Failure rate     |
  +--------------------+------------+------------------+
  | Min                | X          | Y                |
  +--------------------+------------+------------------+
  | Max                | X          | Y                |
  +--------------------+------------+------------------+
  | SLA                | X          | Y                |
  +--------------------+------------+------------------+

Detailed results description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, provide a detailed description of the test results,
including the impact of the factor.