
Testing of New EMS Features

New features of EMS

  • Support for Resource-Limited (RL) nodes, like edge devices or small VMs
  • Support for Self-Healing monitoring topology (partially implemented)

Definitions

We distinguish between Resource-Limited (RL) nodes and Normal or Non-RL nodes.

  • Normal nodes are VMs that have enough resources; an EMS client will be installed on them, along with a JRE and Netdata.
  • RL nodes are VMs with few resources, where only Netdata will be installed.
  • Currently, EMS will classify a VM as an RL node if:
    • it has 1 or 2 cores, or
    • it has 2GB of RAM or less, or
    • it has Total Disk space 1GB or less, or
    • its architecture name starts with ARM (it will normally be x86_64).
    • These thresholds can be changed in the gr.iccs.imu.ems.baguette-client-install.properties file (a quick manual check is sketched below).
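
The following shell sketch mirrors the thresholds above, for quickly checking a candidate VM by hand. It is only an illustration: EMS performs this classification itself, the actual property names are not repeated here, and checking the root filesystem for "Total Disk space" is an assumption.

    # Rough manual check of the RL thresholds documented above.
    cores=$(nproc)
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    disk_kb=$(df -k / | awk 'NR==2 {print $2}')   # assumption: root filesystem = "Total Disk space"
    arch=$(uname -m)

    rl=no
    [ "$cores" -le 2 ] && rl=yes                        # 1 or 2 cores
    [ "$ram_kb" -le $((2 * 1024 * 1024)) ] && rl=yes    # 2GB RAM or less
    [ "$disk_kb" -le $((1 * 1024 * 1024)) ] && rl=yes   # 1GB total disk or less
    case "$arch" in [Aa][Rr][Mm]*) rl=yes ;; esac       # architecture name starts with ARM
    echo "cores=$cores, ram_kb=$ram_kb, disk_kb=$disk_kb, arch=$arch => RL node: $rl"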

We also distinguish between Monitoring Topologies:

  • 2-LEVEL Monitoring Topology: Nodes send their metrics directly to EMS server.

    • Includes an EMS server, and any number of Normal and/or RL nodes.
    • No clustering occurs in 2-LEVEL topologies, hence Aggregator role is not used.
    • CAMEL Metric Models will only use GLOBAL and PER_INSTANCE groupings or no groupings at all (GLOBAL and PER_INSTANCE are then implied).
  • 3-LEVEL Monitoring Topology: Nodes send their metrics to cluster-wide Aggregators, then Aggregators send (composite) metrics to EMS server.

    • Includes an EMS server, Aggregators (one per cluster), and Normal and/or RL nodes.
    • Nodes are grouped into clusters. Each cluster has a node with the Aggregator role.
    • Only Normal nodes can be Aggregators.
    • There must be exactly one Aggregator per cluster.
    • Each cluster must have at least one Normal node (in order to become Aggregator).
    • CAMEL Metric Model will use GLOBAL, PER_ZONE / PER_REGION / PER_CLOUD, and PER_INSTANCE groupings.

    Clustering of nodes is used for faster failure detection, as well as distribution of load:

    • Only 3-LEVEL topologies are clustered.
    • 2-LEVEL topologies are not clustered.

    Currently, nodes are clustered:

    • based on their Availability Zone, Region, or Cloud Service Provider, or
    • into a default cluster, otherwise.

A) Support for Resource-Limited nodes

Feature Quick Notes:

  • EMS server will NOT install EMS client and JRE in RL nodes.
  • EMS server will install Netdata in RL nodes.
  • EMS server or an Aggregator will periodically query the Netdata agents of RL nodes for metrics.
  • Normal nodes will periodically query their Local Netdata agent for metrics (see the curl example below).
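
A Netdata agent can also be queried by hand, using the same allmetrics endpoint that appears in the sample logs below (19999 is Netdata's default port). The node address is a placeholder.

    # Manually query a Netdata agent the same way EMS does.
    # Use an RL node's IP address, or 127.0.0.1 on a Normal node.
    curl -s 'http://<node-address>:19999/api/v1/allmetrics?format=json' | head -c 300; echo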

Test Cases

A.1) Metrics collection from RL nodes in a 2-LEVEL topology

Test Case Quick Notes:

  • EMS server MUST log when it collects metrics from RL nodes.
  • EMS server MUST NOT log or collect metrics from Normal (Non-RL) nodes.
  • Normal nodes MUST log when they collect metrics from their Local Netdata agents. (These log records are slightly different.)

You need a CAMEL model:

  • with two Requirement Sets:
    • for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
    • for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
  • with 1-2 COMPONENTS using Requirement Set #1 (Normal nodes)
  • with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
  • with no Groupings in Metric Model

After Application deployment you need to check the logs of:

  • EMS server, for log messages about collecting metrics from RL-nodes' Netdata agents. E.g.

    e.m.e.c.c.netdata.NetdataCollector       : Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.32.2, 192.168.32.4]
    e.m.e.c.c.netdata.NetdataCollector       : Collectors::Netdata:   Collecting data from url: http://192.168.32.2:19999/api/v1/allmetrics?format=json
    e.m.e.c.c.netdata.NetdataCollector       : Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    e.m.e.c.c.netdata.NetdataCollector       : Collectors::Netdata:   Collecting data from url: http://192.168.32.4:19999/api/v1/allmetrics?format=json
    e.m.e.c.c.netdata.NetdataCollector       : Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • Normal nodes, for log messages about collecting metrics from their Local Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
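
A quick way to confirm the above is to grep the logs. The EMS client log path is the one shown in the samples further below (/opt/baguette-client/logs/output.txt); the EMS server log location depends on how EMS server is deployed, so a placeholder is used.

    # On the EMS server (replace <ems-server-log> with the actual log file):
    grep 'Collecting metrics from remote nodes' <ems-server-log>

    # On a Normal node:
    grep 'Collecting metrics from local node' /opt/baguette-client/logs/output.txt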
    

A.2) Metrics collection from RL nodes in a 3-LEVEL topology

Test Case Quick Notes:

  • The Aggregator (which is a Normal node) MUST log each time it collects metrics from the RL nodes in its cluster.
  • The Aggregator MUST NOT log or collect metrics from Normal (Non-RL) nodes in its cluster.
  • Normal nodes (including the Aggregator) MUST log each time they collect metrics from their Local Netdata agents. (These log records are slightly different.)

You need a CAMEL model:

  • with two Requirement Sets:
    • for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
    • for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
  • with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
  • with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
  • with three (3) Groupings used in the Metric Model (GLOBAL, PER_ZONE, PER_INSTANCE)

After Application deployment you need to check the logs of:

  • EMS server, for NO logs related to collecting metrics from any Netdata agent

  • Aggregator node(s), for logs about collecting metrics from the Netdata agents of RL nodes in the same cluster. E.g.

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2, 192.168.96.5]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata:   Collecting data from url: http://192.168.96.5:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • Normal nodes (including Aggregator node), for logs about collecting metrics from their Local Netdata agents. E.g.

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    

B) Support for Monitoring Self-Healing

Feature Quick Notes:

  • Self-Healing refers to recovering the monitoring software running on the nodes.
  • In Normal nodes, it specifically refers to recovering the EMS client and/or the Netdata agent.
  • In RL nodes, it refers to recovering the Netdata agent only.

Design Choices

  1. Each EMS client (in a Normal node) is responsible for recovering the Local Netdata agent co-located with it.
  2. When clustering is used (i.e. in a 3-LEVEL topology), the Aggregator is responsible for recovering the other nodes in its cluster, both Normal and RL.
  3. When clustering is not used (i.e. in a 2-LEVEL topology), the EMS server is responsible for recovering nodes (both Normal and RL).

Self-Healing actions

We distinguish between monitoring topologies:

  • 2-LEVEL Monitoring topology: Only EMS server and nodes (Normal & RL) are used. No Aggregators or clustering.

    • EMS server will try to recover any Normal node that disconnects and does not reconnect within a configured period of time.

      Condition:

      • EMS client disconnects and does not reconnect within X seconds

      Recovery steps taken by EMS server:

      • SSH to node (assuming it is a VM)
      • Kill EMS client (if it is still running)
      • Launch EMS client
      • Close SSH connection
      • Wait for a configured period of time for recovered EMS client to reconnect to EMS server
      • After that period of time, the process is repeated (up to a configured number of retries, after which EMS server gives up).
    • EMS server will try to recover any RL node whose Netdata agent is inaccessible.

      Condition:

      • X consecutive connection failures to Netdata agent occur.

      Recovery steps taken by EMS server:

      • SSH to node (assuming it is a VM)
      • Kill Netdata (if it is still running)
      • Launch Netdata
      • Close SSH connection
      • Reset the consecutive failures counter.
  • 3-LEVEL Monitoring topology: EMS server, Aggregators (one per cluster), and Nodes in clusters exist. Use of clustering.

    • Aggregator will try to recover any Normal node that leaves the cluster and does not rejoin within a configured period of time.

      Condition:

      • EMS client leaves the cluster and does not rejoin within X seconds

      Recovery steps taken by Aggregators:

      • Contact EMS server to get node's credentials
      • SSH to node (assuming it is a VM)
      • Kill EMS client (if it is still running)
      • Launch EMS client
      • Close SSH connection
      • Wait for a configured period of time for EMS client to rejoin the cluster
      • After that period of time the process is repeated (up to a configured number of retries, after which the Aggregator gives up and notifies EMS server)
      • When the EMS client rejoins the cluster, or when the Aggregator gives up, the node credentials are cleared from the Aggregator's cache.
    • Aggregator will try to recover any RL node whose Netdata agent is inaccessible.

      Condition:

      • X consecutive connection failures to Netdata agent occur.

      Recovery steps taken by Aggregators:

      • Contact EMS server to get node's credentials
      • SSH to node (assuming it is a VM)
      • Kill Netdata agent (if it is still running)
      • Launch Netdata agent
      • Close SSH connection
      • Reset the consecutive failures counter
      • On successful connection to the Netdata agent, the node credentials are cleared from the Aggregator's cache.
  • 2-LEVEL or 3-LEVEL Monitoring topology

    • Any Normal node will try to recover its Local Netdata agent, if it becomes inaccessible.

      Condition:

      • X consecutive connection failures to Local Netdata agent occur.

      Recovery steps (taken by NORMAL node):

      • Kill Netdata agent (if it is still running)
      • Launch Netdata agent
      • Reset the consecutive failures counter
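
For reference, the recovery steps above can also be reproduced manually from a shell. The kill.sh/run.sh paths and the ubuntu SSH user are the ones appearing in the sample logs below; the Netdata restart command is an assumption that depends on how Netdata was installed on the node.

    # Manual equivalent of the EMS client recovery steps (Normal node):
    ssh ubuntu@<node-address> '/opt/baguette-client/bin/kill.sh; /opt/baguette-client/bin/run.sh'

    # Manual equivalent of the Netdata agent recovery steps (RL or Normal node);
    # assumption: Netdata is installed as a systemd service.
    ssh ubuntu@<node-address> 'sudo systemctl restart netdata'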

Test Cases for 2-LEVEL topology

PREREQUISITE:

You need a CAMEL model with a 2-LEVEL monitoring topology:

  • with two Requirement Sets:
    • for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
    • for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
  • with 1-2 components with Requirement Set #1 (Normal nodes)
  • with 1-2 components with Requirement Set #2 (RL nodes)
  • with no Groupings used in Metric Model.

This CAMEL model is common to the following test cases, unless another CAMEL model is specified.

CAMEL model MUST be re-deployed after each test case execution.

B.1.a) Successful recovery of an EMS client in a Normal node

Test Case Quick Notes:

  • Kill EMS client of any Normal node.
  • The EMS server will recover the killed EMS client after a configured period of time.
  • Check EMS server logs for disconnection, recovery actions and re-connection messages.

After Application deployment...

  • Connect to a Normal node and kill EMS client
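
One way to kill the EMS client is via the kill.sh script that appears in the sample logs; the node address and SSH user are placeholders to adjust to your deployment.

    # Kill the EMS client on a Normal node (script path as in the sample logs):
    ssh ubuntu@<normal-node-address> '/opt/baguette-client/bin/kill.sh'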

Next, check the logs of:

  • EMS server, for messages reporting an EMS client disconnection, the recovery attempt(s) and EMS client re-connection.

    EMS server log: An EMS client disconnected

    e.m.e.b.server.ClientShellCommand        : #00000==> Signaling client to exit
    e.m.e.b.server.ClientShellCommand        : #00000--> Thread stops
    e.m.e.b.s.coordinator.NoopCoordinator    : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00000
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: --------------------------------------------------
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: Client unregistered: #00000 @ 172.29.0.3
    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: processExitEvent(): client-id=#00000, client-address=172.29.0.3
    

    EMS server log: EMS client recovery actions

    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi
    o.a.s.c.k.AcceptAllServerKeyVerifier     : Server at /172.29.0.3:22 presented unverified EC key: SHA256:gNU4ScwysUpv050SaorPj7zlZrkiyGq4YSsOGBl+DCk
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: Session will be recorded in file: /logs/172.29.0.3-22-2022.02.16.09.33.31.121-0.txt
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Connected to remote host: task #0: host: 172.29.0.3:22
    e.m.e.b.c.install.SshClientInstaller     :
      ----------------------------------------------------------------------
      Task #0 :  Instruction Set: Restarting Baguette agent at VM node
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: Executing installation instructions set: Restarting Baguette agent at VM node
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: Executing instruction 1/2: Killing previous EMS client process
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/kill.sh
    o.a.s.c.session.ClientConnectionService  : globalRequest(ClientConnectionService[ClientSessionImpl[ubuntu@/172.29.0.3:22]])[hostkeys-00@openssh.com, want-reply=false] failed (SshException) to process: EdDSA provider not supported
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: EXEC: exit-status=0
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: Executing instruction 2/2: Starting new EMS client process
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/run.sh
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: EXEC: exit-status=0
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task #0: Installation Instructions set succeeded: Restarting Baguette agent at VM node
    e.m.e.b.c.install.SshClientInstaller     :
      -------------------------------------------------------------------------
      Task #0 :  Instruction sets processed: successful=1, failed=0, exit-result=SUCCESS
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Disconnected from remote host: task #0: host: 172.29.0.3:22
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Task completed successfully #0
    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=true, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteSe
    

    EMS server log: EMS client reconnected

    o.a.s.s.session.ServerUserAuthService    : Session user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696 authenticated
    e.m.e.b.server.ClientShellCommand        : #00001--> Got session : ServerSessionImpl[user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696]
    e.m.e.b.server.ClientShellCommand        : #00001==> Thread started
    e.m.e.b.server.ClientShellCommand        : #00001--> Client Id: VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450
    e.m.e.b.server.ClientShellCommand        : #00001--> Broker URL: ssl://172.29.0.3:61617?daemon=true&trace=false&useInactivityMonitor=false&connectionTimeout=0&keepAlive=true
    e.m.e.b.server.ClientShellCommand        : #00001--> Broker Username: user-local-Q1mnKfNgzM
    e.m.e.b.server.ClientShellCommand        : #00001--> Broker Password: xityAHGDhIiVeAxJdfax
    e.m.e.b.server.ClientShellCommand        : #00001--> Broker Cert.: -----BEGIN CERTIFICATE-----
    .........................
    -----END CERTIFICATE-----
    e.m.e.b.server.ClientShellCommand        : #00001--> Adding/Replacing client certificate in Truststore: alias=172.29.0.3
    e.m.e.b.server.ClientShellCommand        : #00001--> Added/Replaced client certificate in Truststore: alias=172.29.0.3, CN=C=GR, ST=Attika, L=Athens, O=Institute of Communication and Computer Systems (ICCS), OU=Information Management Unit (IMU), CN=172.29.0.3, certificate-na
    e.m.e.b.s.coordinator.NoopCoordinator    : TwoLevelCoordinator: register(): Method invoked. CSC: ClientShellCommand_#00001
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: --------------------------------------------------
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: Sending grouping configurations to client #00001...
    .........................
    e.m.e.b.server.ClientShellCommand        : sendGroupingConfiguration: Serialization of Grouping configuration for PER_INSTANCE: rO0ABXNyACt.........................
    e.m.e.b.server.ClientShellCommand        : #00001==> PUSH : SET-GROUPING-CONFIG rO0ABXNyACt.........................
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: Sending grouping configurations to client #00001... done
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: --------------------------------------------------
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: Setting active grouping of client #00001: PER_INSTANCE
    e.m.e.b.server.ClientShellCommand        : #00001==> PUSH : SET-ACTIVE-GROUPING PER_INSTANCE
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: --------------------------------------------------
    e.m.e.b.server.ClientShellCommand        : #00001--> Client grouping changed: null --> PER_INSTANCE
    
  • Normal node whose EMS client was killed, for EMS client logs indicating its restart.

    Normal node: EMS client restarts

    Starting baguette client...
    EMS_CONFIG_DIR=/opt/baguette-client/conf
    LOG_FILE=/opt/baguette-client/logs/output.txt
      ____                         _   _          _____ _ _            _
     |  _ \                       | | | |        / ____| (_)          | |
     | |_) | __ _  __ _ _   _  ___| |_| |_ ___  | |    | |_  ___ _ __ | |_
     |  _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | |    | | |/ _ \ '_ \| __|
     | |_) | (_| | (_| | |_| |  __/ |_| ||  __/ | |____| | |  __/ | | | |_
     |____/ \__,_|\__, |\__,_|\___|\__|\__\___|  \_____|_|_|\___|_| |_|\__|
                   __/ |
                  |___/
    Starting BaguetteClient v4.5.0-SNAPSHOT on 21845bcaf772 with PID 779 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
    No active profile set, falling back to default profiles: default
    loadCachedClientId: Used cached Client Id: null
    Password encoder class name is empty. Default instance of PasswordEncoder will be created
    .........................
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    .........................
    
  • Other Normal nodes, for NO logs indicating failure or recovery attempts.

B.1.b) Failed recovery of EMS client in a Normal node

Test Case Quick Notes:

  • Kill the VM of any Normal node.
  • The EMS server will try to connect to the affected VM but fail.
  • After a configured number of retries EMS server will give up.

After Application deployment...

  • Terminate the VM of a Normal node
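
How the VM is terminated depends on the testbed: on a cloud provider use its console or CLI, while in a container-based testbed (the container-style hostnames in the sample logs suggest one) something like the following would do.

    # Assumption: the "VM" is emulated by a Docker container in the testbed.
    docker stop <normal-node-container>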

Next, check the logs of:

  • EMS server, for messages reporting an EMS client disconnection, failed recovery attempts and giving up recovery

    EMS server log: An EMS client disconnected

    e.m.e.b.server.ClientShellCommand        : #00001==> Signaling client to exit
    e.m.e.b.server.ClientShellCommand        : #00001--> Thread stops
    e.m.e.b.s.coordinator.NoopCoordinator    : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00001
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: --------------------------------------------------
    e.m.e.b.s.c.TwoLevelCoordinator          : TwoLevelCoordinator: Client unregistered: #00001 @ 172.29.0.3
    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: processExitEvent(): client-id=#00001, client-address=172.29.0.3
    

    EMS server log: EMS client recovery actions and give up message

    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Error while connecting to remote host: task #0:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Failed executing task #0, Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    .........................
    .........................
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Retry 5/5 executing task #0
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Error while connecting to remote host: task #0:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Failed executing task #0, Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    
    e.m.e.b.c.install.SshClientInstaller     : SshClientInstaller: Giving up executing task #0 after 5 retries
    e.m.e.b.c.s.ClientRecoveryPlugin         : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=false, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteS
    
  • Normal nodes that operate, for NO logs indicating any failure or recovery attempts

B.2.a) Successful recovery of a Netdata agent in a RL node

Test Case Quick Notes:

  • Kill Netdata agent of any RL node.
  • The EMS server will recover the killed Netdata agent after a configured period of time.
  • Check EMS server log messages reporting failures to collect metrics, recovery actions, and successful metrics collection.

After Application deployment...

  • Connect to a RL node and kill Netdata agent.

    EMS server log: Failed metric collection attempts from a Netdata agent

    ......................... Not yet implemented
    

Next, check the logs of:

  • EMS server, for logs reporting connection failure to a Netdata agent, and recovery actions.

    EMS server log: Netdata agent recovery actions

    ......................... Not yet implemented
    
  • RL node with killed Netdata, check if the Netdata processes have started again.

    RL node shell: Recovered Netdata agent process

    ......................... Not yet implemented
    
  • Normal nodes (that operate), for NO Logs indicating failure or recovery attempts.

B.2.b) Failed recovery of a Netdata agent in a RL node

Test Case Quick Notes:

  • Kill the VM of any RL node.
  • The EMS server will try to connect to the affected VM but fail.
  • After a configured number of retries EMS server will give up.

After Application deployment...

  • Terminate the VM of a RL node

You need to check the logs of:

  • EMS server, for logs reporting connection failure to a Netdata agent, and then a number of failed attempts to connect to VM.

    EMS server log: Failed metric collection attempts from a Netdata agent

    ......................... Not yet implemented
    

    EMS server log: Failed Netdata agent recovery actions and give up message

    ......................... Not yet implemented
    
  • Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.

B.3) Successful recovery of a Netdata agent in a Normal node

Test Case Quick Notes:

  • Kill Netdata agent of any Normal node.
  • The EMS client of the node will recover the killed Netdata agent after a configured period of time.
  • Check EMS client's logs for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.

After Application deployment...

  • Connect to a Normal node and kill Netdata agent.
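
One way to kill the Netdata agent is shown below; the process/service name "netdata" is an assumption that depends on how Netdata was installed on the node.

    # Kill the local Netdata agent (assumption: it runs as a process/service named "netdata"):
    sudo pkill -x netdata          # or: sudo systemctl stop netdata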

Next, check the logs of:

  • EMS server, for NO log messages indicating connection failures to Netdata or recovery actions.

  • Normal node with killed Netdata, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collections, recovery actions, and successful metric collection.

    Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3
    Collectors::Netdata: Will pause metrics collection from node for 60 seconds:
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
    

    Normal node - EMS client log: Local Netdata agent recovery actions

    SelfHealingPlugin: Retry #0: Recovering node: id=null, address=
    ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
    ##############  Initial wait......
    ##############  Waiting for 5000ms after Initial wait......
    ##############  Sending Netdata agent kill command......
    ##############  Waiting for 2000ms after Sending Netdata agent kill command......
    ##############  Sending Netdata agent start command......
    ##############  Waiting for 10000ms after Sending Netdata agent start command......
    ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
     OUT> /opt/baguette-client
     ERR> -U: 1: -U: Syntax error: Unterminated quoted string
     ERR> 2022-02-16 10:23:29: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
    

    Normal node - EMS client log: Successful metrics collection from Local Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    
    Collectors::Netdata: Resumed metrics collection from node:
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.

Test Cases for 3-LEVEL topology

PREREQUISITE:

You need a CAMEL model for 3-LEVEL topology:

  • with two Requirement Sets:
    • for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
    • for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk,
  • with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
  • with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
  • with three (3) Groupings used in the Metric Model (GLOBAL, PER_ZONE, PER_INSTANCE).

This CAMEL model is common to the following test cases, unless another CAMEL model is specified.

CAMEL model MUST be re-deployed after each test case execution.

B.4.a) Successful recovery of an EMS client in a clustered Normal node

Test Case Quick Notes:

  • Kill EMS client of any Normal node except the Aggregator.
  • The Aggregator will recover the killed EMS client after a configured period of time.
  • Check Aggregator log messages for node leaving cluster, recovery actions, and node joining back.

After Application deployment...

  • Connect to a Normal node other than the Aggregator, and kill its EMS client (see the sketch below)
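
To pick a node other than the Aggregator, the current cluster roles can be read from each EMS client's log; the "broker list" output shown in the samples below marks the Aggregator.

    # Show the latest cluster roles recorded in a node's EMS client log
    # (AGGREGATOR vs CANDIDATE), then kill the EMS client on a CANDIDATE node:
    grep -E 'AGGREGATOR|CANDIDATE' /opt/baguette-client/logs/output.txt | tail -n 5
    ssh ubuntu@<candidate-node-address> '/opt/baguette-client/bin/kill.sh'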

Next, check the logs of:

  • EMS server, for Aggregator's query for node credentials.

    EMS server log: Aggregator queries for node's credentials

    e.m.e.b.server.ClientShellCommand        : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
    

    Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.

  • Aggregator, for log messages about (i) the EMS client leaving the cluster, (ii) recovery actions, and (iii) the EMS client joining back to the cluster.

    Aggregator log: An EMS client left cluster

    CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
    BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    

    Aggregator log: EMS client recovery actions

    SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
    Connecting to server...
    SSH client is ready
    VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
    ##############  Initial wait......
    ##############  Waiting for 5000ms after Initial wait......
    ##############  Sending baguette client kill command......
    ##############  Waiting for 2000ms after Sending baguette client kill command......
    ##############  Sending baguette client start command......
    ##############  Waiting for 10000ms after Sending baguette client start command......
    SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4
    New client config.: ClientConfiguration(nodesWithoutClient=[])
    VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
    VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.4, port=22, username=ubuntu
    Stopping SSH client...
    SSH client stopped
     OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
     OUT>
     OUT> pwd
     OUT> ubuntu@3866738cb0f4:~$ pwd
     OUT> /home/ubuntu
     OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/kill.sh
     OUT> Baguette client is not running
     OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/run.sh
     OUT> Starting baguette client...
     OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf
     OUT> LOG_FILE=/opt/baguette-client/logs/output.txt
     OUT> Baguette client PID:   973
    VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    

    Aggregator log: EMS client joined back to cluster

    CLM: MEMBER_ADDED: node=node_3866738cb0f4_2002
    BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    
  • Normal node whose EMS client was killed, for EMS client logs indicating its restart.

    Normal node: EMS client restarts

    Starting baguette client...
    EMS_CONFIG_DIR=/opt/baguette-client/conf
    LOG_FILE=/opt/baguette-client/logs/output.txt
      ____                         _   _          _____ _ _            _
     |  _ \                       | | | |        / ____| (_)          | |
     | |_) | __ _  __ _ _   _  ___| |_| |_ ___  | |    | |_  ___ _ __ | |_
     |  _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | |    | | |/ _ \ '_ \| __|
     | |_) | (_| | (_| | |_| |  __/ |_| ||  __/ | |____| | |  __/ | | | |_
     |____/ \__,_|\__, |\__,_|\___|\__|\__\___|  \_____|_|_|\___|_| |_|\__|
                   __/ |
                  |___/
    Starting BaguetteClient v4.5.0-SNAPSHOT on 3866738cb0f4 with PID 973 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
    No active profile set, falling back to default profiles: default
    loadCachedClientId: Used cached Client Id: null
    Password encoder class name is empty. Default instance of PasswordEncoder will be created
    PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
    PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
    BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL...
    KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate
    BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done
    BrokerConfig: Creating new Broker Service instance: url=ssl://0.0.0.0:61617
    .........................
    .........................
    CLUSTER-JOIN IMU-ZONE  GLOBAL:PER_ZONE:PER_INSTANCE  start-election=true  192.168.16.4:2002  192.168.16.3:2001
    CLUSTER-JOIN ARGS: cluster-id=IMU-ZONE, groupings=GLOBAL:PER_ZONE:PER_INSTANCE, local-node=192.168.16.4:2002, other-nodes=[192.168.16.3:2001]
    CLUSTER-JOIN ARGS: Groupings: global=GLOBAL, aggregator=PER_ZONE, node=PER_INSTANCE
    CLM: Local address used for building Atomix: 192.168.16.4:2002
    CLM: Building Atomix: Other members: [Node{id=node_3866738cb0f4_2001, address=192.168.16.3:2001}]
    .........................
    .........................
    CLUSTER-EXEC broker list
    Cluster executes command: broker list
    CLI: Node status and scores:
    CLI:    node_581d745be52c_2001  [AGGREGATOR, 0.6640625, 9e790362-704c-4d9e-aa74-77f76e297816]
    CLI:    node_3866738cb0f4_2002  [CANDIDATE, 0.6640625, 44a5afb7-890a-4090-9f80-c65f046aeddd]
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • Other Normal nodes, for logs about (i) the EMS client leaving the cluster and (ii) the EMS client joining the cluster, but NO logs about recovery actions.

B.4.b) Failed recovery of an EMS client in a clustered Normal node

Test Case Quick Notes:

  • Kill the VM of any Normal node, except Aggregator.
  • The Aggregator will try to connect to the affected VM but fail.
  • After a configured number of retries Aggregator will give up.

After Application deployment...

  • Terminate the VM of a Normal node, except the Aggregator's

Next, check the logs of:

  • EMS server, for a recovery Give up message from Aggregator

    EMS server log: Aggregator queries for node's credentials

    e.m.e.b.server.ClientShellCommand        : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
    

    EMS server log: Aggregator give up message

    e.m.e.b.server.ClientShellCommand        : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
    e.m.e.b.server.ClientShellCommand        : #00000--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
    

    Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.

  • Aggregator, for messages reporting (i) an EMS client left the cluster, (ii) a number of failed connection attempts to the VM, and (iii) a recovery give-up message.

    Aggregator log: An EMS client left cluster

    CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
    BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    

    Aggregator log: EMS client recovery actions and give up message

    SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    .........................
    .........................
    SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    
    SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
    
  • Normal nodes that operate, for logs about EMS client leaving cluster, and NO logs about recovery actions or EMS client joining back.

B.5.a) Successful recovery of EMS client of the cluster Aggregator

Test Case Quick Notes:

  • Kill EMS client of the Aggregator.
  • The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
  • The new Aggregator will recover the killed EMS client after a configured period of time.
  • Check new Aggregator log messages for node leaving cluster, being elected as Aggregator, recovery actions, and node joining back.
  • Old Aggregator will join back as a Normal node.

After Application deployment...

  • Connect to the Aggregator node, and kill EMS client.

Next, check the logs of:

  • EMS server, for message about Aggregator change.

    EMS server log: A new Aggregator initialized

    e.m.e.b.server.ClientShellCommand        : #00003--> Client status changed: CANDIDATE --> INITIALIZING
    e.m.e.b.server.ClientShellCommand        : #00003--> Client grouping changed: PER_INSTANCE --> PER_ZONE
    e.m.e.b.s.c.c.ClusteringCoordinator      : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00003 @ 192.168.16.4 (VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f)
    e.m.e.b.server.ClientShellCommand        : #00003--> Client status changed: INITIALIZING --> AGGREGATOR
    

    EMS server log: Aggregator queries for node's credentials

    e.m.e.b.server.ClientShellCommand        : #00003==> PUSH : {"random":"8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a","zone-id":"IMU-ZONE","address":"192.168.16.3","provider":"AWS",.........................
    

    Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.

  • New Aggregator, for log messages about (i) the EMS client leaving the cluster, (ii) being elected as the new Aggregator, (iii) recovery actions, and (iv) the EMS client joining the cluster.

    New Aggregator log: Old Aggregator left cluster - New Aggregator election

    CLM: MEMBER_REMOVED: node=node_581d745be52c_2001
    BRU: Brokers after cluster change: []
    
    BRU: Broker election requested: broadcasting election message...
    BRU: **** Broker message received: election
    BRU: **** BROKER: Starting Broker election:
    BRU: Member-Score: node_3866738cb0f4_2002 => 0.6640625  d4f2eb55-c355-4715-8a27-9f7c12c32924
    BRU: Broker: node_3866738cb0f4_2002
    

    New Aggregator log: Initializing to become the new Aggregator

    BRU: Node will become Broker. Initializing...
    NOTIFY-STATUS-CHANGE: INITIALIZING
    initialize(): Node starts initializing as Aggregator...
    .........................
    .........................
    Notifying Baguette Server i am the new aggregator
    .........................
    .........................
    BRU: Node is ready to act as Aggregator. Ready
    BRU: **** Broker message received: ready node_3866738cb0f4_2002 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
    BRU: **** BROKER: New Broker is ready: node_3866738cb0f4_2002, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
    BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
    

    New Aggregator log: Requesting old Aggregator node's credentials

    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.3
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
    

    New Aggregator log: Recovery actions of old Aggregator

    SelfHealingPlugin: Retry #0: Recovering node: id=node_581d745be52c_2001, address=192.168.16.3
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.3, port=22, username=ubuntu
    Connecting to server...
    SSH client is ready
    VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
    ##############  Initial wait......
    ##############  Waiting for 5000ms after Initial wait......
    ##############  Sending baguette client kill command......
    ##############  Waiting for 2000ms after Sending baguette client kill command......
    ##############  Sending baguette client start command......
    ##############  Waiting for 10000ms after Sending baguette client start command......
    SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4
    New client config.: ClientConfiguration(nodesWithoutClient=[])
    VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
    VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.3, port=22, username=ubuntu
    Stopping SSH client...
    SSH client stopped
     OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
     OUT>
     OUT> pwd
     OUT> ubuntu@581d745be52c:~$ pwd
     OUT> /home/ubuntu
     OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/kill.sh
     OUT> Baguette client is not running
     OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/run.sh
     OUT> Starting baguette client...
     OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf
     OUT> LOG_FILE=/opt/baguette-client/logs/output.txt
     OUT> Baguette client PID:  1242
    VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
    

    New Aggregator log: Old Aggregator joins back to cluster as plain node

    CLM: MEMBER_ADDED: node=node_581d745be52c_2001
    BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
    
  • Old Aggregator node whose EMS client was killed, for EMS client logs indicating its restart (as a PER_INSTANCE node).

    Normal node: Old Aggregator restarts as a plain Normal node

    Starting baguette client...
    EMS_CONFIG_DIR=/opt/baguette-client/conf
    LOG_FILE=/opt/baguette-client/logs/output.txt
      ____                         _   _          _____ _ _            _
     |  _ \                       | | | |        / ____| (_)          | |
     | |_) | __ _  __ _ _   _  ___| |_| |_ ___  | |    | |_  ___ _ __ | |_
     |  _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | |    | | |/ _ \ '_ \| __|
     | |_) | (_| | (_| | |_| |  __/ |_| ||  __/ | |____| | |  __/ | | | |_
     |____/ \__,_|\__, |\__,_|\___|\__|\__\___|  \_____|_|_|\___|_| |_|\__|
                   __/ |
                  |___/
    Starting BaguetteClient v4.5.0-SNAPSHOT on 581d745be52c with PID 1242 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
    No active profile set, falling back to default profiles: default
    loadCachedClientId: Used cached Client Id: null
    Password encoder class name is empty. Default instance of PasswordEncoder will be created
    PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
    PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
    BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL...
    KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate
    BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done
    .........................
    .........................
    CLM: Joining cluster...
    NOTIFY-STATUS-CHANGE: CANDIDATE
    .........................
    .........................
    Joined to cluster
    .........................
    .........................
    CLUSTER-EXEC broker list
    Cluster executes command: broker list
    CLI: Node status and scores:
    CLI:    node_3866738cb0f4_2002  [AGGREGATOR, 0.6640625, d4f2eb55-c355-4715-8a27-9f7c12c32924]
    CLI:    node_581d745be52c_2001  [CANDIDATE, 0.6640625, e974ebcd-e11e-4baa-b3cb-fa34242705ff]
    
  • Other Normal nodes, for log messages about (i) the EMS client leaving the cluster, (ii) the Aggregator election, and (iii) the EMS client joining the cluster, but NO logs about recovery actions.

B.5.b) Failed recovery of EMS client of the cluster Aggregator

Test Case Quick Notes:

  • Kill the VM of the Aggregator.
  • The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
  • The new Aggregator will try to connect to the affected VM but fail.
  • After a configured number of retries new Aggregator will give up.

After Application deployment...

  • Terminate the VM of the Aggregator

Next, check the logs of:

  • EMS server, for one message about Aggregator change, and one about new Aggregator giving up recovery.

    EMS server log: A new Aggregator initialized

    e.m.e.b.server.ClientShellCommand        : #00004--> Client status changed: CANDIDATE --> INITIALIZING
    e.m.e.b.server.ClientShellCommand        : #00004--> Client grouping changed: PER_INSTANCE --> PER_ZONE
    e.m.e.b.s.c.c.ClusteringCoordinator      : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00004 @ 192.168.16.3 (VM-UBUNTU-vm1-vm1-AWS-vm1-8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a)
    e.m.e.b.server.ClientShellCommand        : #00004--> Client status changed: INITIALIZING --> AGGREGATOR
    

    EMS server log: New Aggregator queries for node's credentials

    e.m.e.b.server.ClientShellCommand        : #00004==> PUSH : {"random":"4abf9ae2-b7fc-4e8c-b6d9-464623d1b05f","zone-id":"IMU-ZONE","address":"192.168.16.4",.........................
    

    EMS server log: New Aggregator give up message

    e.m.e.b.server.ClientShellCommand        : #00004--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
    e.m.e.b.server.ClientShellCommand        : #00004--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
    

    Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.

  • New Aggregator, for messages reporting (i) an EMS client left the cluster, (ii) being elected as the new Aggregator, (iii) a number of failed connection attempts to the VM, and (iv) a recovery give-up message.

    New Aggregator log: Old Aggregator left cluster - New Aggregator election

    CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
    BRU: Brokers after cluster change: []
    BRU: Broker election requested: broadcasting election message...
    BRU: **** Broker message received: election
    BRU: **** BROKER: Starting Broker election:
    BRU: Member-Score: node_581d745be52c_2001 => 0.6640625  e974ebcd-e11e-4baa-b3cb-fa34242705ff
    BRU: Broker: node_581d745be52c_2001
    

    New Aggregator log: Initializing to become the new Aggregator

    BRU: Node will become Broker. Initializing...
    2022-02-16 12:01:34.448 [INFO ] NOTIFY-STATUS-CHANGE: INITIALIZING
    initialize(): Node starts initializing as Aggregator...
    .........................
    .........................
    Notifying Baguette Server i am the new aggregator
    .........................
    .........................
    BRU: Node is ready to act as Aggregator. Ready
    BRU: **** Broker message received: ready node_581d745be52c_2001 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
    BRU: **** BROKER: New Broker is ready: node_581d745be52c_2001, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
    BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
    

    New Aggregator log: Requesting old Aggregator node's credentials

    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    

    New Aggregator log: Failing recovery actions of old Aggregator

    SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    .........................
    .........................
    SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    

    New Aggregator log: Recovery actions Give Up message

    SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
    NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
    
  • Normal nodes (those still operating), for log messages about (i) the EMS client leaving the cluster and (ii) the Aggregator election, but with NO logs about recovery actions or the EMS client rejoining the cluster.

B.6.a) Successful recovery of Netdata agent in a clustered RL node

Test Case Quick Notes:

  • Kill the Netdata agent of any RL node.
  • The Aggregator will recover the killed Netdata agent after a configured period of time.
  • Check the Aggregator's log for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.

After Application deployment...

  • Connect to an RL node and kill the Netdata agent (e.g., as sketched below).
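
    One way to kill the agent, assuming an Ubuntu-based node where Netdata runs under the netdata user (a sketch; adapt to your image):

    # Kill all processes owned by the 'netdata' user (similar to the kill command visible in the recovery logs)
    ps -U netdata -o pid --no-headers | xargs -r sudo kill -9
    # Alternatively, if Netdata is managed by systemd:
    # sudo systemctl stop netdata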

Next, check the logs of:

  • EMS server, for NO logs indicating a Netdata failure or recovery actions; only the Aggregator's query for the RL node's credentials should appear.

    EMS server log: Aggregator queries for RL node's credentials

    e.m.e.b.server.ClientShellCommand        : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
    
  • Aggregator, for logs reporting (i) connection failures to a Netdata agent, (ii) recovery actions, and (iii) successful connection to the Netdata agent and collection of metrics.

    Aggregator log: Failed metric collection attempts from an RL node's Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3
    Collectors::Netdata: Pausing collection from Node: 192.168.96.2
    

    Aggregator log: Requesting RL node's credentials

    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
    

    Aggregator log: Netdata agent recovery actions

    SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
    Connecting to server...
    SSH client is ready
    VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
    ##############  Initial wait......
    ##############  Waiting for 5000ms after Initial wait......
    ##############  Sending Netdata agent kill command......
    ##############  Waiting for 2000ms after Sending Netdata agent kill command......
    ##############  Sending Netdata agent start command......
    ##############  Waiting for 10000ms after Sending Netdata agent start command......
    VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
    VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.96.2, port=22, username=ubuntu
    Stopping SSH client...
    SSH client stopped
    Collectors::Netdata: Resuming collection from Node: 192.168.96.2
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2
     OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
     OUT>
     OUT> pwd
     OUT> ubuntu@ec17d3e87fb4:~$ pwd
     OUT> /home/ubuntu
     OUT> ubuntu@ec17d3e87fb4:~$
     OUT> < -U netdata -o "pid" --no-headers | xargs kill -9'
     OUT>
     OUT> Usage:
     OUT>  kill [options] <pid> [...]
     OUT>
     OUT> Options:
     OUT>  <pid> [...]            send signal to every <pid> listed
     OUT>  -<signal>, -s, --signal <signal>
     OUT>                         specify the <signal> to be sent
     OUT>  -l, --list=[<signal>]  list all signal names, or convert one to a name
     OUT>  -L, --table            list all signal names in a nice table
     OUT>
     OUT>  -h, --help     display this help and exit
     OUT>  -V, --version  output version information and exit
     OUT>
     OUT> For more details see kill(1).
     OUT> ubuntu@ec17d3e87fb4:~$ sudo netdata
     OUT> 2022-02-16 12:27:55: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
    VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
    

    Aggregator log: Successful metrics collection from RL node's Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • RL node with the killed Netdata agent: check whether the Netdata processes have started again (a curl-based check is also sketched after the listing below).

    RL node shell: Recovered Netdata agent process

    # ps -ef |grep netdata
    root       610    29  0 12:27 pts/0    00:00:00 grep --color=auto netd
    .........................
    .........................
    # ps -ef |grep netdata
    netdata    623     1  5 12:27 ?        00:00:51 netdata
    netdata    625   623  0 12:27 ?        00:00:02 /usr/sbin/netdata --special-spawn-server
    root       894   623  0 12:28 ?        00:00:05 /usr/libexec/netdata/plugins.d/apps.plugin 1
    netdata   1050   623  0 12:28 ?        00:00:04 /usr/libexec/netdata/plugins.d/go.d.plugin 1
    root      1105    29  0 12:45 pts/0    00:00:00 grep --color=auto netd
    
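
    Optionally (not part of the original logs), you can also confirm the recovered agent answers requests by querying the same endpoint the collectors use, either on the RL node itself or from the Aggregator with the RL node's address in place of 127.0.0.1:

    curl -s 'http://127.0.0.1:19999/api/v1/allmetrics?format=json' | head -c 200
    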
  • Normal nodes (those still operating), for NO logs indicating connection failures or recovery actions.

B.6.b) Failed recovery of Netdata agent in a clustered RL node

Test Case Quick Notes:

  • Kill the VM of any RL node.
  • The Aggregator will try to connect to the affected VM over SSH but fail.
  • After a configured number of retries the Aggregator will give up and notify the EMS server.

After Application deployment...

  • Terminate the VM of an RL node (e.g., as sketched below).
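
    How the VM is terminated depends on the test environment. For instance, if the test "VMs" are actually Docker containers (as the container-style hostnames in the logs suggest), something like the following would do; rl-node-1 is a hypothetical container name:

    docker kill rl-node-1
    # or stop/terminate the VM through your cloud provider's console or CLI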

You need to check the logs of:

  • EMS server, for NO logs indicating a Netdata failure or recovery actions, BUT for a recovery give up notification from the Aggregator.

    EMS server log: Aggregator queries for RL node's credentials

    e.m.e.b.server.ClientShellCommand        : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
    

    EMS server log: Aggregator give up message

    e.m.e.b.server.ClientShellCommand        : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP null @ 192.168.96.2
    e.m.e.b.server.ClientShellCommand        : #00000--> Client Recovery Notification: GIVE_UP: null @ 192.168.96.2
    e.m.e.baguette.server.BaguetteServer     : BaguetteServer.onMessage: Marked Node as Failed: 192.168.96.2
    
  • Aggregator, for logs reporting (i) connection failures to a Netdata agent, (ii) a number of failed attempts to connect to the VM, and (iii) a recovery give up message.

    Aggregator log: Failed metric collection attempts from an RL node's Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
    Collectors::Netdata:   Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
    Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3
    Collectors::Netdata: Pausing collection from Node: 192.168.96.2
    

    Aggregator log: Requesting RL node's credentials

    SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
    

    Aggregator log: Netdata agent (failing) recovery actions

    SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    
    Collecting metrics from local node...
      Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
        Metrics: extracted=0, published=0, failed=0
    Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
      Node is in ignore list: 192.168.96.2
    .........................
    .........................
    SelfHealingPlugin: Retry #3: Recovering node: id=null, address=192.168.96.2
    VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
    Connecting to server...
    SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception:
    java.net.NoRouteToHostException: No route to host
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
            at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
            at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
            at java.lang.Thread.run(Thread.java:748)
    

    Aggregator log: Netdata agent recovery Give Up message

    SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=null, address=192.168.96.2
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2
    Collectors::Netdata: Giving up collection from Node: 192.168.96.2
    NOTIFY-X: RECOVERY GIVE_UP null @ 192.168.96.2
    
  • Normal nodes (those still operating), for NO logs indicating connection failures or recovery actions.

B.7) Successful recovery of local Netdata agent in a clustered Normal node (including the Aggregator)

Test Case Quick Notes:

  • Kill the Netdata agent of any Normal node.
  • The EMS client of the affected node will recover the killed Netdata agent after a configured period of time.
  • Check the EMS client's log for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.

After Application deployment...

  • Connect to a Normal node and kill the Netdata agent (e.g., as sketched below).
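
    The agent can be killed in the same way as in test case B.6.a, for instance (a sketch; adapt to your image):

    ps -U netdata -o pid --no-headers | xargs -r sudo kill -9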

Next, check the logs of:

  • EMS server, for NO log messages indicating connection failures to a Netdata agent or recovery actions.
  • Aggregator, for NO log messages indicating connection failures to a Netdata agent or recovery actions.
  • Normal node with the killed Netdata agent: check whether the Netdata processes have started again. Also check the EMS client's log for messages reporting failed metric collection attempts, recovery actions, and successful metric collection.

    Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
    Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3
    Collectors::Netdata: Will pause metrics collection from node for 60 seconds:
    SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
    

    Normal node - EMS client log: Local Netdata agent recovery actions

    SelfHealingPlugin: Retry #0: Recovering node: id=null, address=
    ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
    ##############  Initial wait......
    ##############  Waiting for 5000ms after Initial wait......
    ##############  Sending Netdata agent kill command......
    ##############  Waiting for 2000ms after Sending Netdata agent kill command......
    ##############  Sending Netdata agent start command......
    ##############  Waiting for 10000ms after Sending Netdata agent start command......
    ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
     OUT> /opt/baguette-client
     ERR> -U: 1: -U: Syntax error: Unterminated quoted string
     ERR> 2022-02-16 13:21:52: netdata INFO  : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
    

    Normal node - EMS client log: Successful metrics collection from Local Netdata agent

    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Node is in ignore list:
    
    Collectors::Netdata: Resumed metrics collection from node:
    SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=
    
    Collectors::Netdata: Collecting metrics from local node...
    Collectors::Netdata:   Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
    Collectors::Netdata:     Metrics: extracted=0, published=0, failed=0
    
  • Other Normal nodes (those still operating), for NO logs indicating connection failures or recovery actions.

Limitations

  • Clustering is never used in 2-LEVEL monitoring topologies.
  • When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will collect metrics from the (orphan) RL nodes.
  • When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will recover the (orphan) RL nodes.
  • If the EMS server fails, no one will recover it.
  • Metric messages are not cached or redirected if the next node in the topology has failed.