
This is from an IRC discussion, slightly reworked. Change-Id: I11d72daf04b2d72424b5b3f00507b7b0a9590583 Co-Authored-By: Jeremy Stanley <fungi@yuggoth.org> Co-Authored-By: Clark Boylan <clark.boylan@gmail.com>
:title: Test Environment Information

.. _test_env:

Test Environment
################

This document should give you a good idea of what you can count on
in the test environments managed by the Infrastructure team. This
information may be useful when creating new jobs or debugging existing
jobs.

Unprivileged Single Use VMs
===========================

All jobs currently run on these nodes. These are single use VMs
booted in OpenStack clouds. You should start here unless you know you
have a reason to use a privileged VM.

Each single use VM has these attributes, which you can count on:

* Every instance has a public IP address. This may be an IPv4 address
  or an IPv6 address, or possibly both.

* You may not get both; it is entirely valid for one instance to have
  only a public IPv6 address and for another to have only a public
  IPv4 address.

* In some cases the public IPv4 address is provided via NAT and the
  instance will only see a private IPv4 address. In some cases
  instances may have both a public and a private IPv4 address.

* It is also possible that these addresses are spread across multiple
  network interfaces.

* CPUs are all running x86-64.

* There is at least 8GB of system memory available.

* There is at least 80GB of disk available. This disk may not all be
  exposed in a single filesystem partition and so not all mounted at
  /. Any additional disk can be partitioned, formatted and mounted
  by the root user; though if you need this it is recommended to use
  devstack-gate, which takes care of it automatically and mounts the
  extra space on /opt early in its setup phase.

  To give you an idea of what this can look like, most clouds just give
  us an 80GB or bigger /. One cloud gives us a 40GB / and an 80GB /opt.
  Generally you will want to write large things to /opt to take
  advantage of the available disk.

* Swap is not guaranteed to be present. Some clouds give us swap and
  others do not. Some tests (like devstack-gate based tests) will create
  swap, either using a second disk device if available or by using a
  file otherwise. Be aware that you may need to create swap yourself if
  you need it.

* Filesystems are ext4. If you need other filesystems you can create
  them on files mounted via loop devices.

* Package mirrors for PyPI, npm, Ubuntu, Debian, and CentOS 7 (including
  EPEL) are provided and preconfigured on these instances before starting
  any jobs. We also have mirrors for Ceph and Ubuntu Cloud Archive that
  jobs must opt into using (details for these are written to disk on the
  test instances but are disabled by default).

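The address variations above can make job setup scripts fragile. As an
illustration (this is a sketch, not part of any project tooling, and the
sample addresses are made up), Python's standard ``ipaddress`` module can
classify whatever address an instance reports:

```python
import ipaddress

def classify(addr: str) -> str:
    """Describe an instance address, e.g. 'public IPv4'."""
    ip = ipaddress.ip_address(addr)
    scope = "private" if ip.is_private else "public"
    return "{} IPv{}".format(scope, ip.version)

# A NAT'd address as seen from inside an instance:
print(classify("10.209.33.5"))           # private IPv4
# Globally routable addresses:
print(classify("172.99.68.12"))          # public IPv4
print(classify("2001:4860:4860::8888"))  # public IPv6
```

A job that needs to know its routable endpoint can walk all configured
addresses and prefer the public ones, regardless of which interface
carries them.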
Because these instances are single use, we are able to give jobs full
root access to them. This means you can install system packages, modify
partition tables, and so on. Note that if you reboot a test instance
you will need to restart the zuul-console process.

If jobs need to perform privileged actions they can do so using Zuul v3's
secrets. Things like AFS access tokens or Docker Hub credentials can
be stored in Zuul secrets, then used by jobs to perform privileged
actions requiring this data. Please refer to the Zuul documentation
for more information.

Known Differences to Watch Out For
==================================

* Underlying hypervisors are not all the same. You may run into KVM
  or Xen, and possibly others, depending on the cloud in use.

* CPU count, speed, and supported processor flags differ, sometimes
  even within the same cloud region.

* Nested virt is not available in all clouds, and in clouds where it
  is enabled we have observed a higher rate of crashed test VMs when
  using it. As a result we enforce qemu when running devstack and
  may further restrict the use of nested virt.

* Some clouds give us multiple network interfaces, some only give
  us one. In the case of multiple network interfaces, some clouds
  give all of them Internet routable addresses and others do not.

* Geographic location is widely variable. We have instances all across
  North America and in Europe. This may affect network performance
  between instances and geographically distant network resources.

* Some network protocols may be blocked in some clouds. Specifically,
  we have had problems with GRE. You can rely on TCP, UDP, and ICMP
  being functional on all of our clouds.

* A network interface MTU of 1500 is not guaranteed. Some clouds give
  us smaller MTUs due to their use of overlay networking. Test jobs
  should check interface MTUs and use an appropriate value for the
  current instance when creating new interfaces or bridges.

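Following the MTU advice above, a new bridge should never use a larger
MTU than the interfaces beneath it. A minimal sketch (the helper name and
values are illustrative, not project tooling; on a real instance the MTUs
could be read from ``/sys/class/net/*/mtu``):

```python
def bridge_mtu(interface_mtus):
    """Pick an MTU for a new bridge or interface: the smallest MTU
    among the existing interfaces, so encapsulated traffic fits."""
    if not interface_mtus:
        raise ValueError("no interface MTUs supplied")
    return min(interface_mtus)

# A cloud using overlay networking may expose a reduced MTU:
print(bridge_mtu([1500, 1450]))  # 1450
```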
Why are Jobs for Changes Queued for a Long Time?
================================================

We have a finite number of resources to run jobs on. We process jobs
for changes in order based on a priority queuing system. This priority
queue assigns test resources to Zuul queues based on the number of
total changes in each queue. Changes at the heads of these queues are
assigned resources before those at the ends of the queues.

We have done this to ensure that large projects with many changes and
long running jobs do not starve small projects with few changes and short
jobs.

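A toy model of the priority scheme described above may help build
intuition (this is a sketch only, not Zuul's actual allocation code; the
queue names are illustrative):

```python
def allocation_order(queues):
    """Given {queue_name: [changes...]}, return changes in the order
    they would be offered test resources: queues with fewer total
    changes first, and within a queue, the head before the tail."""
    order = []
    for name, changes in sorted(queues.items(), key=lambda kv: len(kv[1])):
        for change in changes:
            order.append((name, change))
    return order

queues = {"nova": ["n1", "n2", "n3"], "zaqar": ["z1"]}
print(allocation_order(queues))
# [('zaqar', 'z1'), ('nova', 'n1'), ('nova', 'n2'), ('nova', 'n3')]
```

In this model the small queue's single change is served before any of the
large queue's changes, which is the starvation-avoidance property the
text describes.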
In order to make the queues run quicker there are several variables we
can change:

#. Lower demand. Fewer changes and/or jobs will result in less demand for
   resources, increasing availability for the changes that remain.

#. Reduce job resource costs. Reducing job runtime means those resources
   can be reused sooner by other jobs. Keep in mind that multinode jobs
   use a whole integer multiple more resources than single node jobs.
   You should only use multinode jobs where necessary to test specific
   interactions or to fit a complex test case into the resources we have.

#. Improve job reliability. If jobs fail because the tests or software
   under test are unreliable, then we have to run more jobs to successfully
   merge our software. This effect is compounded by our gate queues because
   any time we have a change that fails we must remove it from the queue,
   rebuild the queue without that change, then restart all jobs in the
   queue with that change evicted.

   Keep in mind that we are dogfooding OpenStack to run OpenStack's CI
   system. This means that a more reliable OpenStack is better able to
   provide resources to our CI system. Fixing OpenStack in this case
   is a win-win situation.

#. Add resources to our pools. If we have more total resources then we will
   have more to spread around.

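The cost of a gate reset mentioned in the third point can be sketched
with a toy model (a simplified, illustrative model, not Zuul's
implementation; the change names are made up):

```python
def jobs_restarted(queue, failed_change):
    """Count how many changes must restart their jobs when one change
    in a gate queue fails: every change behind the failing one is
    re-queued without it and all of its jobs start over."""
    idx = queue.index(failed_change)
    return len(queue) - idx - 1

gate = ["A", "B", "C", "D", "E"]
print(jobs_restarted(gate, "B"))  # 3: C, D, and E restart
```

Note that a failure near the head of a deep queue restarts almost
everything behind it, which is why a single flaky job can consume far
more resources than its own runtime suggests.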
In general, we would like to see our software perform the testing that the
developers feel is necessary, but we should do so responsibly. What this
means is that instead of deleting jobs or ignoring changes we should
improve our test reliability to ensure changes exit queues as quickly as
possible with minimal resource cost. This then ensures the changes behind
them are able to get resources quickly.

We are also always happy to add resources if they are available, but the
priority for the project should be to ensure we are using what we do have
responsibly.

Can my changes skip the check queue?
------------------------------------

The OpenStack project uses a "clean check" approach to keep flaky
changes out of the gate. A change always needs to pass "check"
before it enters "gate", and if it fails in "gate" it re-enters
the "check" pipeline. This has several benefits:

* If your change fails in the gate, there is an increased chance that
  it is introducing non-deterministic failure behavior, so forcing it
  to go through check again helps make that more apparent.

* It also avoids approving changes that have no hope of ever passing
  due to pep8 or other trivial errors.

* It also helps with approving changes that had been sitting around
  with a 6-month-old passing check result.

Changes in the gate pipeline are prioritized but also serialized, so
if a change fails, all tests for the changes behind that failing change
have to be restarted. If restart after restart happens, then
resources are never freed up for the check pipeline.

Therefore, having a stable gate pipeline is crucial, and the "clean
check" requirement helps keep it stable.