..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
VM Monitoring
==========================================

The purpose of this spec is to describe a method for monitoring the
health of OpenStack VM instances without access to the VMs' internals.

Problem description
===================

Monitoring VM health is essential for providing high availability for
the VMs.  Typically cloud operators cannot look inside VMs in order to
monitor their health, because doing so would violate the contract
between cloud operators and users that users have complete autonomy
over the contents of their VMs and all actions performed inside them.
Operators cannot assume any knowledge of the software stack inside the
VM or make any changes to it.  Therefore, VM health monitoring must be
done externally, and the external monitor must be able to detect VM
crashes, hangs (e.g. due to I/O errors) and so on.

Use Cases
---------

As a cloud operator, I would like to provide my users with highly
available VMs to meet high SLA requirements.  Therefore, I need my VMs
automatically monitored for sudden stops, crashes, I/O failures and
the like.  Any VM failure event detected needs to be passed to a VM
recovery workflow service, which takes the appropriate actions to
recover the VM.  For example:

- If a VM crashes, the recovery service will try to restart it,
  possibly on the same host at first, and then on a different host if
  it fails to restart or if it restarts successfully but then crashes
  a second time on the original host.

- If a VM receives an I/O error, the recovery service may prefer to
  immediately contact ``nova-api`` to centrally disable the
  ``nova-compute`` service on that host (so that no new VMs are
  scheduled on the host) and restart the VM on a different host.  It
  could also potentially live-migrate all other VMs off that host, in
  order to pre-empt any further I/O errors.

Proposed change
===============

VM monitoring can be done at the hypervisor level, without accessing
the VMs' internals.  In particular, |libvirt|_ provides a mechanism for
monitoring its event stream via an event loop.  We need to filter the
required events out of that stream and pass them to a recovery workflow
service.  In order to eliminate redundancy and improve extensibility,
these event filters must be configurable.

.. |libvirt| replace:: `libvirt`
.. _libvirt: https://libvirt.org/
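
For illustration only, here is a minimal sketch (not part of the
proposal itself) of how the libvirt Python bindings can deliver
lifecycle events such as crashes and unexpected stops to a callback.
The hard-coded event filter and the ``print`` call stand in for the
configurable filter and the hand-off to the recovery workflow service::

   import libvirt

   # Only forward the lifecycle events we care about; a real monitor
   # would read this filter from its configuration.
   WATCHED = {libvirt.VIR_DOMAIN_EVENT_STOPPED,
              libvirt.VIR_DOMAIN_EVENT_CRASHED}

   def lifecycle_cb(conn, dom, event, detail, _opaque):
       if event in WATCHED:
           # Stand-in for passing the event to the recovery workflow.
           print("domain %s: lifecycle event %d (detail %d)"
                 % (dom.UUIDString(), event, detail))

   libvirt.virEventRegisterDefaultImpl()          # default event loop
   conn = libvirt.openReadOnly("qemu:///system")  # read-only suffices
   conn.domainEventRegisterAny(None,              # None = all domains
                               libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                               lifecycle_cb, None)
   while True:
       libvirt.virEventRunDefaultImpl()           # dispatch callbacks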

Potential advantages:

- Catching events at their source (the hypervisor layer) means that we
  don't have to rely on ``nova`` having knowledge of those events.
  For example, ``libvirtd`` can report errors when a VM's I/O layer
  encounters problems, but ``nova`` doesn't emit corresponding events
  for this (see the sketch following this list).

- It should be relatively easy to support a configurable event filter.

- The VM instance monitor can be run on each compute node, so it should
  scale well as the number of compute nodes increases.

- The VM instance monitors could be managed by `pacemaker_remote`__ via
  a new `OCF RA (resource agent)`__.

__ http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Remote/
__ http://www.linux-ha.org/wiki/OCF_Resource_Agents
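
To illustrate the first point above, the event-loop sketch earlier in
this section could (purely as an assumed extension, reusing its
``libvirt`` module and ``conn`` connection) also register for I/O error
events, for which ``nova`` emits no corresponding notification::

   def io_error_cb(conn, dom, src_path, dev_alias, action, _opaque):
       # Fired by libvirtd when the guest's I/O layer reports an error,
       # e.g. because the backing storage has become unavailable.
       print("domain %s: I/O error on %s (%s), action %d"
             % (dom.UUIDString(), src_path, dev_alias, action))

   conn.domainEventRegisterAny(None,
                               libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR,
                               io_error_cb, None)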

Alternatives
------------

There are three alternatives to the proposed change:

1. Listen for VM status change events on the message queue.

   Potential disadvantages:

   - It might be less reliable if, for some reason, the message queue
     introduced latency or was lossy.

   - There also might be some gaps in which events are propagated to
     the queue; if so, we could submit a ``nova`` spec to plug the gaps.

   - If we listen for events from the control plane, this won't scale
     as well to large numbers of compute nodes, and it would then be
     awkward to trigger recovery via Pacemaker.

2. Write a new ``nova-libvirt`` OCF RA.

   It would compare ``nova``'s expectations of which VMs should be
   running on the compute node with the reality.  Any differences
   between the two would result in appropriate failure events being
   sent to the recovery workflow service.

   Potential disadvantages:

   - This is more complexity than is expected to run inside an RA.
     RAs are supposed to be lightweight components which simply start,
     stop, and monitor services, whereas this would require abusing
     that model by pretending there is a separate monitoring service
     when there isn't.  The ``monitor`` action would need to fail when
     any differences as mentioned above were detected, and then the
     ``stop`` or ``start`` action would need to send the failure
     events.

   - Within this "fake service" model, it's not clear how to avoid
     sending the same failure events over and over again until the
     failures were corrected.

   - Typically RAs are implemented in ``bash``.  This is not a hard
     requirement, but something of this complexity would be much
     better coded in Python, which would result in a mix of languages
     within the `openstack-resource-agents`_ repository.

3. Same as 2. above, but as part of the NovaCompute_ RA.

   - This has all the disadvantages of 2., but even more so, since the
     new functionality would have to be mixed alongside the existing
     NovaCompute_ functionality.

.. _openstack-resource-agents: https://launchpad.net/openstack-resource-agents
.. _NovaCompute: https://github.com/openstack/openstack-resource-agents/blob/master/ocf/NovaCompute

Data model impact
-----------------

None

API impact
----------

The HTTP API of the VM recovery workflow service needs to be able to
receive events in the format in which this instance monitor sends them.

Security impact
---------------

Ideally it should be possible for the instance monitor to send
instance event data securely to the recovery workflow service
(e.g. via TLS), without relying on the security of the admin network
over which the data is sent.

Other end user impact
---------------------

None

Performance Impact
------------------

There will be a small amount of extra RAM and CPU required on each
compute node for running the instance monitor.  However, it is a
relatively simple service, so this should not have a significant
impact on the node.

Other deployer impact
---------------------

Distributions need to package and deploy an extra service on each
compute node.  However, the existing `instance monitor`_ implementation
in masakari_ already provides files to simplify packaging on the Linux
distributions most commonly used for OpenStack infrastructure.

.. _masakari: https://github.com/ntt-sic/masakari
.. _`instance monitor`:
   https://github.com/ntt-sic/masakari/tree/master/masakari-instancemonitor/

Developer impact
----------------

Nothing other than the work items listed below.

Implementation
==============

``libvirtd`` uses `QMP (QEMU Machine Protocol)`__ over a UNIX domain
socket (``/var/lib/libvirt/qemu/xxxx.monitor``) to communicate with
the VM's qemu process.  ``libvirt`` catches the failure events and
passes them to the VM monitor.  The VM monitor filters the events and
passes them via HTTP to an external recovery workflow service, which
then takes the action required to recover the VM.

__ http://wiki.qemu.org/QMP

::

   +-----------------------+
   | +----------------+    |
   | | VM             |    |
   | | (qemu Process) |    |
   | +---------^------+    |
   |       |   |QMP        |
   | +-----v----------+    |
   | | libvirtd       |    |
   | +---------^------+    |
   |       |   |           |
   | +-----v----------+    |    +-----------------------+
   | | VM Monitor     +-------->+ VM recovery workflow  |
   | +----------------+    |    +-----------------------+
   |                       |
   | Compute Node          |
   +-----------------------+

We can almost certainly reuse the `instance monitor`_ provided by
masakari_.
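
Purely for illustration, the hand-off over HTTP might look something
like the sketch below.  The endpoint URL and the payload fields shown
are hypothetical placeholders, since the actual format still needs to
be defined (see the FIXME below)::

   import time

   import requests

   # Hypothetical endpoint; the real URL and payload schema must match
   # whatever the recovery workflow service defines.
   RECOVERY_URL = "http://recovery-workflow.example/events"

   def forward_event(uuid, event_type, detail):
       payload = {
           "uuid": uuid,         # instance UUID as reported by libvirt
           "type": event_type,   # e.g. "LIFECYCLE" or "IO_ERROR"
           "detail": detail,
           "time": time.time(),
       }
       # TLS and certificate handling would be needed for the secure
       # transport discussed under "Security impact".
       requests.post(RECOVERY_URL, json=payload, timeout=10)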

**FIXME**:

- Need to detail how and in which format the event data should be sent
  over HTTP.  **This should make it possible to add support for
  hypervisors not based on** ``libvirt`` **in the future.**

- Need to give details of exactly how the service can be configured.

  - How should event filtering be configurable?  (A purely illustrative
    sketch follows this list.)

  - Where should the configuration live?  With masakari_, it lives in
    ``/etc/masakari-instancemonitor.conf``.
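
As a purely illustrative straw man (not masakari_'s actual format), the
event filter and the recovery service endpoint could be expressed in an
INI-style configuration file along these lines::

   [filters]
   # hypothetical keys: which libvirt lifecycle events to forward
   lifecycle_events = STOPPED, CRASHED
   # whether to forward I/O error events
   io_errors = true

   [recovery]
   # hypothetical endpoint of the recovery workflow service
   url = https://recovery-workflow.example/events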

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

- Package `masakari`_'s `instance monitor`_ for SLES (`aspiers`)
- Add documentation to the |ha-guide|_ (`beekhof`)
- Look into libvirt-test-API_
- Write test suite

.. |ha-guide| replace:: OpenStack High Availability Guide
.. _ha-guide: http://docs.openstack.org/ha-guide/
.. _libvirt-test-API: https://libvirt.org/testapi.html

Dependencies
============

- `libvirt <https://libvirt.org/>`_
- `libvirt's Python bindings <https://libvirt.org/python.html>`_

Testing
=======

It may be possible to write a test suite using libvirt-test-API_ or
at least some of its components.

Documentation Impact
====================

The service should be documented in the |ha-guide|_.

References
==========

- `Instance HA etherpad started at Newton Design Summit in Austin
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
  <https://youtu.be/lddtWUP_IKQ>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- `Instance auto-evacuation cross project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Newton
     - Introduced