monitoring/ems-core/README-for-TESTING.md
2023-08-11 09:31:47 +03:00

# <u>Testing of New EMS Features</u>
## New features of EMS
- Support for **Resource-Limited (RL)** nodes, like edge devices or small VMs
- Support for **Self-Healing** monitoring topology (partially implemented)
## Definitions
We distinguish between ***Resource-Limited (RL)*** nodes and ***Normal or Non-RL*** nodes.
- **Normal nodes** are VMs that have enough resources, where an EMS client will be installed, along with JRE and Netdata.
- **RL nodes** are VMs with few resources, where only Netdata will be installed.
- Currently, EMS will classify a VM as an RL node if:
  * it has 1 or 2 cores, or
  * it has 2GB of RAM or less, or
  * it has Total Disk space 1GB or less, or
  * its architecture name starts with `ARM` (it will normally be `x86_64`).
  * These thresholds can be changed in the `gr.iccs.imu.ems.baguette-client-install.properties` file.
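
The classification rule above is a logical OR over four conditions. The following is a minimal Python sketch of that rule for use in test planning, not the EMS implementation. The names `NodeSpecs`, `is_resource_limited` and the threshold constants are hypothetical, and the case-insensitive architecture check is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeSpecs:
    """Hypothetical container for the VM properties used by the rule."""
    cores: int
    ram_gb: float
    disk_gb: float
    arch: str  # e.g. "x86_64" or "ARMv8"


# Hypothetical constants mirroring the defaults listed above. The real values
# are read from the EMS properties file and may differ.
MAX_RL_CORES = 2
MAX_RL_RAM_GB = 2.0
MAX_RL_DISK_GB = 1.0
RL_ARCH_PREFIX = "ARM"


def is_resource_limited(node: NodeSpecs) -> bool:
    """Return True if ANY of the RL conditions holds (the rules are OR-ed)."""
    return (
        node.cores <= MAX_RL_CORES
        or node.ram_gb <= MAX_RL_RAM_GB
        or node.disk_gb <= MAX_RL_DISK_GB
        # Assumption: the prefix comparison ignores letter case.
        or node.arch.upper().startswith(RL_ARCH_PREFIX)
    )
```

Note that a VM with plenty of cores and RAM is still classified as RL if it meets only one condition, for example an `ARM` architecture.
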

We also distinguish between ***Monitoring Topologies***:
- **2-LEVEL Monitoring Topology**: Nodes send their metrics directly to EMS server.
  * Includes an EMS server, and any number of Normal and/or RL nodes.
  * No clustering occurs in 2-LEVEL topologies, hence Aggregator role is not used.
  * CAMEL Metric Models will only use `GLOBAL` and `PER_INSTANCE` groupings or no groupings at all (`GLOBAL` and `PER_INSTANCE` are then implied).
- **3-LEVEL Monitoring Topology**: Nodes send their metrics to cluster-wide Aggregators, then Aggregators send (composite) metrics to EMS server.
  * Includes an EMS server, Aggregators (one per cluster), and Normal and/or RL nodes.
  * Nodes are grouped into clusters. Each cluster has a node with the Aggregator role.
  * Only Normal nodes can be Aggregators.
  * There must be exactly one Aggregator per cluster.
  * Each cluster must have at least one Normal node (in order to become Aggregator).
  * CAMEL Metric Model will use `GLOBAL`, `PER_ZONE` / `PER_REGION` / `PER_CLOUD`, and `PER_INSTANCE` groupings.
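
When preparing a 3-LEVEL test deployment, the cluster rules above can be checked before deploying. The sketch below is only an illustration of those rules. The `Node` tuple and `check_three_level_topology` are hypothetical names and are not part of EMS.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical description of a planned node."""
    name: str
    cluster: str
    is_rl: bool
    is_aggregator: bool


def check_three_level_topology(nodes: List[Node]) -> List[str]:
    """Return the list of rule violations (empty if the plan is valid)."""
    clusters: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        clusters[node.cluster].append(node)

    problems: List[str] = []
    for cluster, members in sorted(clusters.items()):
        aggregators = [n for n in members if n.is_aggregator]
        # Rule: exactly one Aggregator per cluster.
        if len(aggregators) != 1:
            problems.append(
                f"{cluster}: expected exactly 1 Aggregator, found {len(aggregators)}"
            )
        # Rule: only Normal nodes can be Aggregators.
        if any(n.is_rl for n in aggregators):
            problems.append(f"{cluster}: an RL node cannot be an Aggregator")
        # Rule: each cluster needs at least one Normal node.
        if all(n.is_rl for n in members):
            problems.append(f"{cluster}: no Normal node available to become Aggregator")
    return problems
```

A cluster that contains only RL nodes is reported twice by this sketch: once for the missing Aggregator and once for the missing Normal node.
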

Clustering of nodes is used for faster failure detection, as well as distribution of load:
- Only 3-LEVEL topologies are clustered.
- 2-LEVEL topologies are not clustered.

Currently, nodes are clustered either:
- by their Availability Zone, Region, or Cloud Service Provider, or
- by assignment to a default cluster.
------
## A) <u>Support for Resource-Limited nodes</u>
> Feature Quick Notes:
> - EMS server will NOT install EMS client and JRE in RL nodes.
> - EMS server will install Netdata in RL nodes.
> - EMS server or an Aggregator will periodically query Netdata agents of RL nodes for metrics.
> - Normal nodes will periodically query their Local Netdata agent for metrics.
### <u>Test Cases</u>
**A.1) Metrics collection from RL nodes in a 2-LEVEL topology**
> Test Case Quick Notes:
> - EMS server MUST log when it collects metrics from RL nodes.
> - EMS server MUST *NOT* log or collect metrics from Normal (Non-RL) nodes.
> - Normal nodes MUST log when they collect metrics from their Local Netdata agents. (The Log records are slightly different).
**You need a CAMEL model:**
* with two Requirement Sets:
  - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
  - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
* with 1-2 COMPONENTS using Requirement Set #1 (Normal nodes)
* with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
* with no Groupings in Metric Model
**After Application deployment you need to check the logs of:**
* ***EMS server***, for log messages about collecting metrics from RL-nodes' Netdata agents. E.g.
```
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.32.2, 192.168.32.4]
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.2:19999/api/v1/allmetrics?format=json
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.4:19999/api/v1/allmetrics?format=json
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***Normal nodes***, for log messages about collecting metrics from their Local Netdata agent
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
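
To check test case A.1 without reading the log by eye, the collector lines can be parsed. The sketch below assumes the log lines look like the samples above. The function names are hypothetical, and the list of RL addresses must be supplied by the tester.

```python
import re
from typing import Iterable, Set

# Patterns copied from the sample log lines above.
REMOTE_LIST_RE = re.compile(
    r"Collecting metrics from remote nodes \(without EMS client\): \[([^\]]*)\]"
)
URL_HOST_RE = re.compile(r"Collecting data from url: http://([^:/\s]+)")


def announced_remote_nodes(log_text: str) -> Set[str]:
    """Addresses listed in 'Collecting metrics from remote nodes' lines."""
    nodes: Set[str] = set()
    for match in REMOTE_LIST_RE.finditer(log_text):
        nodes.update(ip.strip() for ip in match.group(1).split(",") if ip.strip())
    return nodes


def polled_hosts(log_text: str) -> Set[str]:
    """Hosts that appear in 'Collecting data from url' lines."""
    return set(URL_HOST_RE.findall(log_text))


def unexpected_polls(log_text: str, rl_addresses: Iterable[str]) -> Set[str]:
    """Hosts polled by the server that are not known RL nodes.

    For test case A.1 this set is expected to be empty.
    """
    return polled_hosts(log_text) - set(rl_addresses)
```

For the EMS server log, `unexpected_polls` should return an empty set when only RL nodes were queried. A Normal node address in the result would indicate a failed test.
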
**A.2) Metrics collection from RL nodes in a 3-LEVEL topology**
> Test Case Quick Notes:
> - The Aggregator (it is a Normal node) MUST log each time it collects metrics from RL nodes in its cluster.
> - The Aggregator MUST *NOT* log or collect metrics from Normal (Non-RL) nodes in its cluster.
> - Normal nodes (including Aggregator) MUST log each time they collect metrics from their Local Netdata agents. (The Log records are slightly different).
**You need a CAMEL model:**
* with two Requirement Sets:
  - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
  - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
* with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
* with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
* with three (3) Groupings used in the Metric Model (`GLOBAL`, `PER_ZONE`, `PER_INSTANCE`)
**After Application deployment you need to check the logs of:**
* ***EMS server***, for NO logs related to collecting metrics from any Netdata agent
* ***Aggregator node(s)***, for logs about collecting metrics from the Netdata agents of RL nodes, in the same cluster. E.g.
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2, 192.168.96.5]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting data from url: http://192.168.96.5:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***Normal nodes*** (including Aggregator node), for logs about collecting metrics from their Local Netdata agents. E.g.
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
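
Each collection round ends with a `Metrics: extracted=..., published=..., failed=...` line. A tester may want to total these counters over a log excerpt, for example to spot rounds with failures. The following sketch assumes the line format shown in the samples above. `tally_metrics` is a hypothetical helper name.

```python
import re
from typing import Dict

# Pattern copied from the 'Metrics:' summary lines in the sample logs.
METRICS_RE = re.compile(r"Metrics: extracted=(\d+), published=(\d+), failed=(\d+)")


def tally_metrics(log_text: str) -> Dict[str, int]:
    """Sum the per-round counters found in a log excerpt."""
    totals = {"rounds": 0, "extracted": 0, "published": 0, "failed": 0}
    for extracted, published, failed in METRICS_RE.findall(log_text):
        totals["rounds"] += 1
        totals["extracted"] += int(extracted)
        totals["published"] += int(published)
        totals["failed"] += int(failed)
    return totals
```
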
------
## B) <u>Support for Monitoring Self-Healing</u>
> Feature Quick Notes:
> - Self-Healing refers to recovering the monitoring software running at the nodes.
> - In Normal nodes, it refers specifically to recovering the EMS client and/or the Netdata agent.
> - In RL nodes, it refers to recovering the Netdata agent only.
#### Design Choices
1. Each EMS client (in a Normal node) is responsible for recovering the Local Netdata agent, collocated with it.
2. When clustering is used (i.e. in a 3-level topology), Aggregator is responsible for recovering other nodes in its cluster, both Normal and RL.
3. When clustering is not used (i.e. in a 2-level topology), EMS server is responsible for recovering nodes (both Normal and RL).
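
The three design choices can be summarized as a small lookup. This is a sketch for test planning only. `Topology` and `recovery_owner` are hypothetical names, and the case of a failing Aggregator is not covered because the design choices above do not describe it.

```python
from enum import Enum


class Topology(Enum):
    TWO_LEVEL = 2
    THREE_LEVEL = 3


def recovery_owner(topology: Topology, node_is_rl: bool, component: str) -> str:
    """Return which party is expected to recover a failed component.

    component is either "ems-client" or "netdata" (hypothetical labels).
    """
    if component not in ("ems-client", "netdata"):
        raise ValueError(f"unknown component: {component}")
    if component == "ems-client" and node_is_rl:
        # RL nodes only run Netdata, so there is no EMS client to recover.
        raise ValueError("RL nodes do not run an EMS client")
    if component == "netdata" and not node_is_rl:
        # Design choice 1: the collocated EMS client recovers local Netdata.
        return "local EMS client"
    # Design choices 2 and 3: Aggregator when clustered, else EMS server.
    return "Aggregator" if topology is Topology.THREE_LEVEL else "EMS server"
```

For example, a killed Netdata agent on an RL node in a 3-LEVEL topology should show up in the Aggregator logs, not in the EMS server logs.
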
#### Self-Healing actions
We distinguish between monitoring topologies:
* **2-LEVEL Monitoring topology:** Only EMS server and nodes (Normal & RL) are used. No Aggregators or clustering.
  * EMS server will try to recover any <u>*Normal node*</u> that disconnects and does not reconnect within a configured period of time.

    ***Condition:***
    * EMS client disconnects and does not reconnect within X seconds

    ***Recovery steps taken by EMS server:***
    * SSH to node (assuming it is a VM)
    * Kill EMS client (if it is still running)
    * Launch EMS client
    * Close SSH connection
    * Wait for a configured period of time for the recovered EMS client to reconnect to EMS server
    * After that period of time, the process is repeated (up to a configured number of retries, after which EMS server gives up).
  * EMS server will try to recover any <u>*RL node*</u> with an inaccessible Netdata agent.

    ***Condition:***
    * X consecutive connection failures to Netdata agent occur.

    ***Recovery steps taken by EMS server:***
    * SSH to node (assuming it is a VM)
    * Kill Netdata (if it is still running)
    * Launch Netdata
    * Close SSH connection
    * Reset the consecutive failures counter.
* **3-LEVEL Monitoring topology:** EMS server, Aggregators (one per cluster), and Nodes in clusters exist. Use of clustering.
  * <u>Aggregator</u> will try to recover any <u>*Normal node*</u> that leaves the cluster and does not join back within a configured period of time.

    ***Condition:***
    * EMS client leaves cluster and does not join back within X seconds

    ***Recovery steps taken by Aggregators:***
    * Contact EMS server to get node's credentials
    * SSH to node (assuming it is a VM)
    * Kill EMS client (if it is still running)
    * Launch EMS client
    * Close SSH connection
    * Wait for a configured period of time for EMS client to join back to the cluster
    * After that period of time, the process is repeated (up to a configured number of retries, after which the Aggregator gives up and notifies EMS server)
    * When EMS client joins the cluster, or when the Aggregator gives up, the node credentials are cleared from the Aggregator's cache.
  * <u>Aggregator</u> will try to recover any <u>*RL node*</u> with an inaccessible Netdata agent.

    ***Condition:***
    * X consecutive connection failures to Netdata agent occur.

    ***Recovery steps taken by Aggregators:***
    * Contact EMS server to get node's credentials
    * SSH to node (assuming it is a VM)
    * Kill Netdata agent (if it is still running)
    * Launch Netdata agent
    * Close SSH connection
    * Reset the consecutive failures counter
    * On successful connection to Netdata agent, the node credentials are cleared from the Aggregator's cache.
* **2-LEVEL or 3-LEVEL Monitoring topology**
  * Any Normal node will try to recover its Local Netdata agent, if it becomes inaccessible.

    ***Condition:***
    * X consecutive connection failures to Local Netdata agent occur.

    ***Recovery steps (taken by NORMAL node):***
    * Kill Netdata agent (if it is still running)
    * Launch Netdata agent
    * Reset the consecutive failures counter
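
The "X consecutive connection failures" condition can be pictured as a per-node counter that is cleared by any successful collection and reset after a recovery attempt. The sketch below is a simplified model of that behaviour, not the EMS code. `NetdataFailureTracker` and the `recover` callback are hypothetical, and resetting the counter on success is an assumption implied by the word "consecutive".

```python
from typing import Callable, Dict


class NetdataFailureTracker:
    """Counts consecutive Netdata connection failures per node address."""

    def __init__(self, max_failures: int, recover: Callable[[str], None]):
        self.max_failures = max_failures  # the "X" in the condition above
        self.recover = recover            # hypothetical recovery callback
        self.failures: Dict[str, int] = {}

    def record(self, address: str, success: bool) -> None:
        """Record the outcome of one collection attempt."""
        if success:
            # Assumption: a successful attempt breaks the failure streak.
            self.failures[address] = 0
            return
        count = self.failures.get(address, 0) + 1
        if count >= self.max_failures:
            # Kill and relaunch Netdata, then reset the counter.
            self.recover(address)
            count = 0
        self.failures[address] = count
```

With `max_failures=3`, two failures followed by a success do not trigger recovery. Three failures in a row do.
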
### <u>Test Cases for 2-LEVEL topology</u>
> ***PREREQUISITE:***
>
> You need a CAMEL model with a 2-LEVEL monitoring topology:
>
> * with two Requirement Sets:
>   - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
>   - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
> * with 1-2 components with Requirement Set #1 (Normal nodes)
> * with 1-2 components with Requirement Set #2 (RL nodes)
> * with no Groupings used in Metric Model.
>
> This CAMEL model is ***common*** to the following test cases, unless another CAMEL model is specified.
>
> CAMEL model MUST be re-deployed after each test case execution.
**B.1.a) Successful recovery of an EMS client in a Normal node**
> Test Case Quick Notes:
> - Kill EMS client of any Normal node.
> - The EMS server will recover the killed EMS client after a configured period of time.
> - Check EMS server logs for disconnection, recovery actions and re-connection messages.
**After Application deployment...**
* Connect to a Normal node and ***kill*** EMS client
**Next, check the logs of:**
* ***EMS server***, for messages reporting an EMS client disconnection, the recovery attempt(s) and EMS client re-connection.
*<p align="center">EMS server log: An EMS client disconnected</p>*
```
e.m.e.b.server.ClientShellCommand : #00000==> Signaling client to exit
e.m.e.b.server.ClientShellCommand : #00000--> Thread stops
e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00000
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: --------------------------------------------------
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00000 @ 172.29.0.3
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00000, client-address=172.29.0.3
```
*<p align="center">EMS server log: EMS client recovery actions</p>*
```
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi
o.a.s.c.k.AcceptAllServerKeyVerifier : Server at /172.29.0.3:22 presented unverified EC key: SHA256:gNU4ScwysUpv050SaorPj7zlZrkiyGq4YSsOGBl+DCk
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Session will be recorded in file: /logs/172.29.0.3-22-2022.02.16.09.33.31.121-0.txt
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Connected to remote host: task #0: host: 172.29.0.3:22
e.m.e.b.c.install.SshClientInstaller :
----------------------------------------------------------------------
Task #0 : Instruction Set: Restarting Baguette agent at VM node
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing installation instructions set: Restarting Baguette agent at VM node
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing instruction 1/2: Killing previous EMS client process
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/kill.sh
o.a.s.c.session.ClientConnectionService : globalRequest(ClientConnectionService[ClientSessionImpl[ubuntu@/172.29.0.3:22]])[hostkeys-00@openssh.com, want-reply=false] failed (SshException) to process: EdDSA provider not supported
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: exit-status=0
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing instruction 2/2: Starting new EMS client process
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/run.sh
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: exit-status=0
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Installation Instructions set succeeded: Restarting Baguette agent at VM node
e.m.e.b.c.install.SshClientInstaller :
-------------------------------------------------------------------------
Task #0 : Instruction sets processed: successful=1, failed=0, exit-result=SUCCESS
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Disconnected from remote host: task #0: host: 172.29.0.3:22
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task completed successfully #0
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=true, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteSe
```
*<p align="center">EMS server log: EMS client reconnected</p>*
```
o.a.s.s.session.ServerUserAuthService : Session user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696 authenticated
e.m.e.b.server.ClientShellCommand : #00001--> Got session : ServerSessionImpl[user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696]
e.m.e.b.server.ClientShellCommand : #00001==> Thread started
e.m.e.b.server.ClientShellCommand : #00001--> Client Id: VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450
e.m.e.b.server.ClientShellCommand : #00001--> Broker URL: ssl://172.29.0.3:61617?daemon=true&trace=false&useInactivityMonitor=false&connectionTimeout=0&keepAlive=true
e.m.e.b.server.ClientShellCommand : #00001--> Broker Username: user-local-Q1mnKfNgzM
e.m.e.b.server.ClientShellCommand : #00001--> Broker Password: xityAHGDhIiVeAxJdfax
e.m.e.b.server.ClientShellCommand : #00001--> Broker Cert.: -----BEGIN CERTIFICATE-----
.........................
-----END CERTIFICATE-----
e.m.e.b.server.ClientShellCommand : #00001--> Adding/Replacing client certificate in Truststore: alias=172.29.0.3
e.m.e.b.server.ClientShellCommand : #00001--> Added/Replaced client certificate in Truststore: alias=172.29.0.3, CN=C=GR, ST=Attika, L=Athens, O=Institute of Communication and Computer Systems (ICCS), OU=Information Management Unit (IMU), CN=172.29.0.3, certificate-na
e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: register(): Method invoked. CSC: ClientShellCommand_#00001
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: --------------------------------------------------
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001...
.........................
e.m.e.b.server.ClientShellCommand : sendGroupingConfiguration: Serialization of Grouping configuration for PER_INSTANCE: rO0ABXNyACt.........................
e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-GROUPING-CONFIG rO0ABXNyACt.........................
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001... done
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: --------------------------------------------------
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Setting active grouping of client #00001: PER_INSTANCE
e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-ACTIVE-GROUPING PER_INSTANCE
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: --------------------------------------------------
e.m.e.b.server.ClientShellCommand : #00001--> Client grouping changed: null --> PER_INSTANCE
```
* ***Normal node where EMS client was killed***, for EMS client's logs indicating its restart.
*<p align="center">Normal node: EMS client restarts</p>*
```
Starting baguette client...
EMS_CONFIG_DIR=/opt/baguette-client/conf
LOG_FILE=/opt/baguette-client/logs/output.txt
____ _ _ _____ _ _ _
| _ \ | | | | / ____| (_) | |
| |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_
| _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __|
| |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_
|____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__|
__/ |
|___/
Starting BaguetteClient v4.5.0-SNAPSHOT on 21845bcaf772 with PID 779 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
No active profile set, falling back to default profiles: default
loadCachedClientId: Used cached Client Id: null
Password encoder class name is empty. Default instance of PasswordEncoder will be created
.........................
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
.........................
```
* ***Other Normal nodes***, for NO logs indicating failure or recovery attempts.
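
For test case B.1.a the EMS server log should contain disconnection, recovery and reconnection messages, in that order. The sketch below checks such an ordering using substrings taken from the sample logs above. `RECOVERY_MARKERS` and `first_missing_marker` are hypothetical names, and the exact message texts may differ between EMS versions.

```python
from typing import Optional, Sequence

# Substrings taken from the sample EMS server log excerpts above.
RECOVERY_MARKERS = (
    "Client unregistered:",
    "runClientRecovery(): Starting client recovery",
    "runClientRecovery(): Client recovery completed: result=true",
    "Client grouping changed:",
)


def first_missing_marker(
    log_lines: Sequence[str],
    markers: Sequence[str] = RECOVERY_MARKERS,
) -> Optional[str]:
    """Return the first marker not found in order, or None if all appear."""
    position = 0
    for marker in markers:
        for index in range(position, len(log_lines)):
            if marker in log_lines[index]:
                # Continue the search after the line that matched.
                position = index + 1
                break
        else:
            return marker
    return None
```

A return value of `None` means the whole sequence was found. Any other value names the first step that is missing or out of order.
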
**B.1.b) Failed recovery of EMS client in a Normal node**
> Test Case Quick Notes:
> - Kill the VM of any Normal node.
> - The EMS server will try to connect to the affected VM but fail.
> - After a configured number of retries EMS server will give up.
**After Application deployment...**
* Terminate the VM of a Normal node
**Next, check the logs of:**
* ***EMS server***, for messages reporting an EMS client disconnection, failed recovery attempts and giving up recovery
*<p align="center">EMS server log: An EMS client disconnected</p>*
```
e.m.e.b.server.ClientShellCommand : #00001==> Signaling client to exit
e.m.e.b.server.ClientShellCommand : #00001--> Thread stops
e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00001
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: --------------------------------------------------
e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00001 @ 172.29.0.3
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00001, client-address=172.29.0.3
```
*<p align="center">EMS server log: EMS client recovery actions and give up message</p>*
```
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
.........................
.........................
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Retry 5/5 executing task #0
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Giving up executing task #0 after 5 retries
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=false, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteS
```
* ***Normal nodes that operate***, for NO logs indicating any failure or recovery attempts
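
For test case B.1.b the two lines of interest are the "Giving up" message and the final `result=false`. The following sketch extracts both from a log excerpt, assuming the message formats shown above. `summarize_recovery` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Patterns copied from the sample EMS server log lines above.
GIVE_UP_RE = re.compile(r"Giving up executing task #(\d+) after (\d+) retries")
RESULT_RE = re.compile(r"Client recovery completed: result=(true|false)")


def summarize_recovery(log_text: str) -> Dict[str, Optional[object]]:
    """Summarize the outcome of a client recovery attempt from the log."""
    give_up = GIVE_UP_RE.search(log_text)
    result = RESULT_RE.search(log_text)
    return {
        "gave_up": give_up is not None,
        "retries": int(give_up.group(2)) if give_up else None,
        "succeeded": (result.group(1) == "true") if result else None,
    }
```

A failed recovery is expected to yield `gave_up=True` and `succeeded=False`. The reported retry count should match the configured number of retries.
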
**B.2.a) Successful recovery of a Netdata agent in an RL node**
> Test Case Quick Notes:
> - Kill Netdata agent of any RL node.
> - The EMS server will recover the killed Netdata agent after a configured period of time.
> - Check EMS server log messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
**After Application deployment...**
* Connect to an RL node and kill Netdata agent.
*<p align="center">EMS server log: Failed metric collection attempts from a Netdata agent</p>*
```
......................... Not yet implemented
```
**Next, check the logs of:**
* ***EMS server***, for logs reporting connection failure to a Netdata agent, and recovery actions.
*<p align="center">EMS server log: Netdata agent recovery actions</p>*
```
......................... Not yet implemented
```
* ***RL node with killed Netdata***, check if the Netdata processes have started again.
*<p align="center">RL node shell: Recovered Netdata agent process</p>*
```
......................... Not yet implemented
```
* ***Normal nodes (that operate)***, for NO logs indicating failure or recovery attempts.
**B.2.b) Failed recovery of a Netdata agent in an RL node**
> Test Case Quick Notes:
> - Kill the VM of any RL node.
> - The EMS server will try to connect to the affected VM but fail.
> - After a configured number of retries EMS server will give up.
**After Application deployment...**
* Terminate the VM of an RL node
**You need to check the logs of:**
* ***EMS server***, for logs reporting connection failure to a Netdata agent, and then a number of failed attempts to connect to VM.
*<p align="center">EMS server log: Failed metric collection attempts from a Netdata agent</p>*
```
......................... Not yet implemented
```
*<p align="center">EMS server log: Failed Netdata agent recovery actions and give up message</p>*
```
......................... Not yet implemented
```
* ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions.
**B.3) Successful recovery of a Netdata agent in a Normal node**
> Test Case Quick Notes:
> - Kill Netdata agent of any Normal node.
> - The EMS client of the node will recover the killed Netdata agent after a configured period of time.
> - Check EMS client's logs for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
**After Application deployment...**
* Connect to a Normal node and kill Netdata agent.
**Next, check the logs of:**
* ***EMS server***, for NO log messages indicating connection failures to Netdata, or recovery actions.
* ***Normal node with killed Netdata***, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collections, recovery actions, and successful metric collection.
*<p align="center">Normal node - EMS client log: Failed attempts to collect metrics from <u><b>Local</b></u> Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3
Collectors::Netdata: Will pause metrics collection from node for 60 seconds:
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
```
*<p align="center">Normal node - EMS client log: <u><b>Local</b></u> Netdata agent recovery actions</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=
ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
############## Initial wait......
############## Waiting for 5000ms after Initial wait......
############## Sending Netdata agent kill command......
############## Waiting for 2000ms after Sending Netdata agent kill command......
############## Sending Netdata agent start command......
############## Waiting for 10000ms after Sending Netdata agent start command......
ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
OUT> /opt/baguette-client
ERR> -U: 1: -U: Syntax error: Unterminated quoted string
ERR> 2022-02-16 10:23:29: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
```
*<p align="center">Normal node - EMS client log: Successful metrics collection from <u><b>Local</b></u> Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Resumed metrics collection from node:
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions.
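
The recovery log lists a wait after each recovery command. Adding these waits gives a rough lower bound on how long a recovery round takes, which helps decide how long to watch the log before expecting the "Resumed metrics collection" message. The sketch below assumes the `Waiting for ...ms after ...` format shown in the sample above. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Pattern based on the '############## Waiting for ...' sample lines above.
WAIT_RE = re.compile(r"Waiting for (\d+)ms after (.+?)\.+$", re.MULTILINE)


def recovery_waits(log_text: str) -> List[Tuple[str, int]]:
    """Return (step description, wait in ms) pairs in log order."""
    return [(step, int(ms)) for ms, step in WAIT_RE.findall(log_text)]


def total_wait_ms(log_text: str) -> int:
    """Sum of all waits found in the log excerpt, in milliseconds."""
    return sum(ms for _, ms in recovery_waits(log_text))
```

For the sample excerpt above the waits are 5000, 2000 and 10000 ms, for a total of 17000 ms. The collector pause shown in the same sample (60 seconds) is separate from these waits.
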
### <u>Test Cases for 3-LEVEL topology</u>
> ***PREREQUISITE:***
>
> You need a CAMEL model for 3-LEVEL topology:
>
> * with two Requirement Sets:
>   - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
>   - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk,
> * with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
> * with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
> * with three (3) Groupings used in the Metric Model (`GLOBAL`, `PER_ZONE`, `PER_INSTANCE`).
>
> This CAMEL model is ***common*** to the following test cases, unless another CAMEL model is specified.
>
> CAMEL model MUST be re-deployed after each test case execution.
**B.4.a) Successful recovery of an EMS client in a clustered Normal node**
> Test Case Quick Notes:
> - Kill EMS client of any Normal node except the Aggregator.
> - The Aggregator will recover the killed EMS client after a configured period of time.
> - Check Aggregator log messages for node leaving cluster, recovery actions, and node joining back.
**After Application deployment...**
* Connect to a Normal node, except Aggregator, and ***kill*** EMS client
**Next, check the logs of:**
* ***EMS server***, for Aggregator's query for node credentials.
*<p align="center">EMS server log: Aggregator queries for node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
```
Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
* ***Aggregator***, for log messages about (i) the EMS client leaving the cluster, (ii) recovery actions, and (iii) the EMS client joining back to the cluster.
*<p align="center">Aggregator log: An EMS client left cluster</p>*
```
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
```
*<p align="center">Aggregator log: EMS client recovery actions</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
Connecting to server...
SSH client is ready
VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
############## Initial wait......
############## Waiting for 5000ms after Initial wait......
############## Sending baguette client kill command......
############## Waiting for 2000ms after Sending baguette client kill command......
############## Sending baguette client start command......
############## Waiting for 10000ms after Sending baguette client start command......
SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4
New client config.: ClientConfiguration(nodesWithoutClient=[])
VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.4, port=22, username=ubuntu
Stopping SSH client...
SSH client stopped
OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
OUT>
OUT> pwd
OUT> ubuntu@3866738cb0f4:~$ pwd
OUT> /home/ubuntu
OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/kill.sh
OUT> Baguette client is not running
OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/run.sh
OUT> Starting baguette client...
OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf
OUT> LOG_FILE=/opt/baguette-client/logs/output.txt
OUT> Baguette client PID: 973
VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
*<p align="center">Aggregator log: EMS client joined back to cluster</p>*
```
CLM: MEMBER_ADDED: node=node_3866738cb0f4_2002
BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
```
* ***Normal node whose EMS client was killed***, for EMS client logs indicating its restart.
*<p align="center">Normal node: EMS client restarts</p>*
```
Starting baguette client...
EMS_CONFIG_DIR=/opt/baguette-client/conf
LOG_FILE=/opt/baguette-client/logs/output.txt
____ _ _ _____ _ _ _
| _ \ | | | | / ____| (_) | |
| |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_
| _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __|
| |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_
|____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__|
__/ |
|___/
Starting BaguetteClient v4.5.0-SNAPSHOT on 3866738cb0f4 with PID 973 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
No active profile set, falling back to default profiles: default
loadCachedClientId: Used cached Client Id: null
Password encoder class name is empty. Default instance of PasswordEncoder will be created
PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL...
KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate
BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done
BrokerConfig: Creating new Broker Service instance: url=ssl://0.0.0.0:61617
.........................
.........................
CLUSTER-JOIN IMU-ZONE GLOBAL:PER_ZONE:PER_INSTANCE start-election=true 192.168.16.4:2002 192.168.16.3:2001
CLUSTER-JOIN ARGS: cluster-id=IMU-ZONE, groupings=GLOBAL:PER_ZONE:PER_INSTANCE, local-node=192.168.16.4:2002, other-nodes=[192.168.16.3:2001]
CLUSTER-JOIN ARGS: Groupings: global=GLOBAL, aggregator=PER_ZONE, node=PER_INSTANCE
CLM: Local address used for building Atomix: 192.168.16.4:2002
CLM: Building Atomix: Other members: [Node{id=node_3866738cb0f4_2001, address=192.168.16.3:2001}]
.........................
.........................
CLUSTER-EXEC broker list
Cluster executes command: broker list
CLI: Node status and scores:
CLI: node_581d745be52c_2001 [AGGREGATOR, 0.6640625, 9e790362-704c-4d9e-aa74-77f76e297816]
CLI: node_3866738cb0f4_2002 [CANDIDATE, 0.6640625, 44a5afb7-890a-4090-9f80-c65f046aeddd]
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***Other Normal nodes***, for logs about (i) the EMS client leaving the cluster and (ii) the EMS client joining back to the cluster, but NO logs about recovery actions.
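The Aggregator-side checks for this test case can be partly automated by looking for the expected log markers in order. The following is a minimal Python sketch under stated assumptions: the marker patterns are copied from the sample Aggregator logs above, while the names `EXPECTED_RECOVERY_SEQUENCE` and `first_missing_marker` are hypothetical and not part of EMS. Real logs may carry timestamps or interleave other messages, which this sketch tolerates because it only searches for substrings in order.

```python
import re
from typing import Iterable, List, Optional

# Markers taken from the sample Aggregator logs, in the order they are expected.
EXPECTED_RECOVERY_SEQUENCE = [
    r"CLM: MEMBER_REMOVED: node=",
    r"SelfHealingPlugin: createRecoveryTask\(\)",
    r"SelfHealingPlugin: Retry #\d+: Recovering node",
    r"VmNodeRecoveryTask: runNodeRecovery\(\): Executed \d+ recovery commands",
    r"CLM: MEMBER_ADDED: node=",
    r"SelfHealingPlugin: cancelRecoveryTask\(\)",
]


def first_missing_marker(lines: Iterable[str], patterns: List[str]) -> Optional[str]:
    """Return the first pattern that was not found in order, or None if all were found."""
    remaining = list(patterns)
    for line in lines:
        # Only the next expected marker is checked, so ordering is enforced.
        if remaining and re.search(remaining[0], line):
            remaining.pop(0)
    return remaining[0] if remaining else None
```

For example, `first_missing_marker(open(path), EXPECTED_RECOVERY_SEQUENCE)` returns `None` when the whole sequence is present, where `path` is the Aggregator's log file (the sample output above shows `LOG_FILE=/opt/baguette-client/logs/output.txt`, which may differ in your deployment).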
**B.4.b) Failed recovery of an EMS client in a clustered Normal node**
> Test Case Quick Notes:
> - Kill the VM of any Normal node, except Aggregator.
> - The Aggregator will try to connect to the affected VM but fail.
> - After a configured number of retries, the Aggregator will give up.
**After Application deployment...**
* Terminate the VM of a Normal node other than the Aggregator.
**Next, check the logs of:**
* ***EMS server***, for the Aggregator's query for node credentials, and a recovery give-up message from the Aggregator.
*<p align="center">EMS server log: Aggregator queries for node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
```
*<p align="center">EMS server log: Aggregator give up message</p>*
```
e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
```
Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
* ***Aggregator***, for messages reporting (i) an EMS client leaving the cluster, (ii) a number of failed connection attempts to the VM, and (iii) a recovery give-up message.
*<p align="center">Aggregator log: An EMS client left cluster</p>*
```
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
```
*<p align="center">Aggregator log: EMS client recovery actions and give up message</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
.........................
.........................
SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
```
```
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
```
* ***Normal nodes that operate***, for logs about EMS client leaving cluster, and NO logs about recovery actions or EMS client joining back.
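To confirm the give-up path without reading every stack trace, the Aggregator log can be condensed into a short summary. The sketch below is illustrative Python, assuming the message formats shown in the sample logs above; `RecoverySummary` and `summarize_failed_recovery` are hypothetical names, not part of EMS.

```python
import re
from typing import Iterable, NamedTuple, Optional


class RecoverySummary(NamedTuple):
    retries_seen: int             # number of distinct "Retry #N" indices logged
    max_retries_reached: bool     # a "Max retries reached" message was logged
    gave_up_node: Optional[str]   # node id from the GIVE_UP notification, if any


RETRY_RE = re.compile(r"SelfHealingPlugin: Retry #(\d+): Recovering node")
GIVE_UP_RE = re.compile(r"NOTIFY-X: RECOVERY GIVE_UP (\S+) @ (\S+)")


def summarize_failed_recovery(lines: Iterable[str]) -> RecoverySummary:
    """Collect retry attempts and the give-up notification from Aggregator log lines."""
    retries = set()
    max_reached = False
    gave_up = None
    for line in lines:
        retry = RETRY_RE.search(line)
        if retry:
            retries.add(int(retry.group(1)))
        if "SelfHealingPlugin: Max retries reached" in line:
            max_reached = True
        give_up = GIVE_UP_RE.search(line)
        if give_up:
            gave_up = give_up.group(1)
    return RecoverySummary(len(retries), max_reached, gave_up)
```

For this test case to pass, `max_retries_reached` should be `True` and `gave_up_node` should name the node whose VM was terminated. The expected value of `retries_seen` depends on the configured number of retries.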
**B.5.a) Successful recovery of EMS client of the cluster Aggregator**
> Test Case Quick Notes:
> - Kill EMS client of the Aggregator.
> - The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
> - The new Aggregator will recover the killed EMS client after a configured period of time.
> - Check new Aggregator log messages for node leaving cluster, being elected as Aggregator, recovery actions, and node joining back.
> - Old Aggregator will join back as a Normal node.
**After Application deployment...**
* Connect to the Aggregator node, and ***kill*** EMS client.
**Next, check the logs of:**
* ***EMS server***, for message about Aggregator change.
*<p align="center">EMS server log: A new Aggregator initialized</p>*
```
e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: CANDIDATE --> INITIALIZING
e.m.e.b.server.ClientShellCommand : #00003--> Client grouping changed: PER_INSTANCE --> PER_ZONE
e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00003 @ 192.168.16.4 (VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f)
e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: INITIALIZING --> AGGREGATOR
```
*<p align="center">EMS server log: Aggregator queries for node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00003==> PUSH : {"random":"8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a","zone-id":"IMU-ZONE","address":"192.168.16.3","provider":"AWS",.........................
```
Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
* ***New Aggregator***, for log messages about (i) the old Aggregator's EMS client leaving the cluster, (ii) being elected as Aggregator, (iii) recovery actions, and (iv) the EMS client joining back to the cluster.
*<p align="center">New Aggregator log: Old Aggregator left cluster - New Aggregator election</p>*
```
CLM: MEMBER_REMOVED: node=node_581d745be52c_2001
BRU: Brokers after cluster change: []
BRU: Broker election requested: broadcasting election message...
BRU: **** Broker message received: election
BRU: **** BROKER: Starting Broker election:
BRU: Member-Score: node_3866738cb0f4_2002 => 0.6640625 d4f2eb55-c355-4715-8a27-9f7c12c32924
BRU: Broker: node_3866738cb0f4_2002
```
*<p align="center">New Aggregator log: Initializing to become the new Aggregator</p>*
```
BRU: Node will become Broker. Initializing...
NOTIFY-STATUS-CHANGE: INITIALIZING
initialize(): Node starts initializing as Aggregator...
.........................
.........................
Notifying Baguette Server i am the new aggregator
.........................
.........................
BRU: Node is ready to act as Aggregator. Ready
BRU: **** Broker message received: ready node_3866738cb0f4_2002 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
BRU: **** BROKER: New Broker is ready: node_3866738cb0f4_2002, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
```
*<p align="center">New Aggregator log: Requesting old Aggregator node's credentials</p>*
```
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.3
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
```
*<p align="center">New Aggregator log: Recovery actions of old Aggregator</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=node_581d745be52c_2001, address=192.168.16.3
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.3, port=22, username=ubuntu
Connecting to server...
SSH client is ready
VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
############## Initial wait......
############## Waiting for 5000ms after Initial wait......
############## Sending baguette client kill command......
############## Waiting for 2000ms after Sending baguette client kill command......
############## Sending baguette client start command......
############## Waiting for 10000ms after Sending baguette client start command......
SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4
New client config.: ClientConfiguration(nodesWithoutClient=[])
VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.3, port=22, username=ubuntu
Stopping SSH client...
SSH client stopped
OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
OUT>
OUT> pwd
OUT> ubuntu@581d745be52c:~$ pwd
OUT> /home/ubuntu
OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/kill.sh
OUT> Baguette client is not running
OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/run.sh
OUT> Starting baguette client...
OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf
OUT> LOG_FILE=/opt/baguette-client/logs/output.txt
OUT> Baguette client PID: 1242
VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
```
*<p align="center">New Aggregator log: Old Aggregator joins back to cluster as plain node</p>*
```
CLM: MEMBER_ADDED: node=node_581d745be52c_2001
BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I.........................
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
```
* ***Old Aggregator node whose EMS client was killed***, for EMS client logs indicating its restart (as a `PER_INSTANCE` node).
*<p align="center">Normal node: Old Aggregator restarts as a plain Normal node</p>*
```
Starting baguette client...
EMS_CONFIG_DIR=/opt/baguette-client/conf
LOG_FILE=/opt/baguette-client/logs/output.txt
____ _ _ _____ _ _ _
| _ \ | | | | / ____| (_) | |
| |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_
| _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __|
| |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_
|____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__|
__/ |
|___/
Starting BaguetteClient v4.5.0-SNAPSHOT on 581d745be52c with PID 1242 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client)
No active profile set, falling back to default profiles: default
loadCachedClientId: Used cached Client Id: null
Password encoder class name is empty. Default instance of PasswordEncoder will be created
PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder
BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL...
KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate
BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done
.........................
.........................
CLM: Joining cluster...
NOTIFY-STATUS-CHANGE: CANDIDATE
.........................
.........................
Joined to cluster
.........................
.........................
CLUSTER-EXEC broker list
Cluster executes command: broker list
CLI: Node status and scores:
CLI: node_3866738cb0f4_2002 [AGGREGATOR, 0.6640625, d4f2eb55-c355-4715-8a27-9f7c12c32924]
CLI: node_581d745be52c_2001 [CANDIDATE, 0.6640625, e974ebcd-e11e-4baa-b3cb-fa34242705ff]
```
* ***Other Normal nodes***, for log messages about (i) the EMS client leaving the cluster, (ii) the Aggregator election, and (iii) the EMS client joining back to the cluster, but NO logs about recovery actions.
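The role swap can be verified from the `broker list` output that appears in the client logs before and after the test. A minimal Python sketch follows, assuming the `CLI:` line format shown in the samples above; `parse_broker_list` and `current_aggregator` are hypothetical helper names, not EMS commands.

```python
import re
from typing import Dict, Iterable, Optional

# Matches lines such as: "CLI:   node_581d745be52c_2001 [AGGREGATOR, 0.6640625, <uuid>]"
NODE_STATUS_RE = re.compile(r"CLI:\s+(\S+)\s+\[(\w+),")


def parse_broker_list(lines: Iterable[str]) -> Dict[str, str]:
    """Map each node id to its status (e.g. AGGREGATOR, CANDIDATE)."""
    statuses = {}
    for line in lines:
        match = NODE_STATUS_RE.search(line)
        if match:
            statuses[match.group(1)] = match.group(2)
    return statuses


def current_aggregator(statuses: Dict[str, str]) -> Optional[str]:
    """Return the Aggregator's node id, or None unless exactly one node has that status."""
    aggregators = [node for node, status in statuses.items() if status == "AGGREGATOR"]
    return aggregators[0] if len(aggregators) == 1 else None
```

Comparing `current_aggregator(...)` on the output captured before the kill with the output captured after recovery should show a different node id, with the old Aggregator listed as `CANDIDATE`.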
**B.5.b) Failed recovery of EMS client of the cluster Aggregator**
> Test Case Quick Notes:
> - Kill the VM of the Aggregator.
> - The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
> - The new Aggregator will try to connect to the affected VM but fail.
> - After a configured number of retries new Aggregator will give up.
**After Application deployment...**
* Terminate the VM of the Aggregator
**Next, check the logs of:**
* ***EMS server***, for one message about Aggregator change, and one about new Aggregator giving up recovery.
*<p align="center">EMS server log: A new Aggregator initialized</p>*
```
e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: CANDIDATE --> INITIALIZING
e.m.e.b.server.ClientShellCommand : #00004--> Client grouping changed: PER_INSTANCE --> PER_ZONE
e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00004 @ 192.168.16.3 (VM-UBUNTU-vm1-vm1-AWS-vm1-8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a)
e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: INITIALIZING --> AGGREGATOR
```
*<p align="center">EMS server log: New Aggregator queries for node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00004==> PUSH : {"random":"4abf9ae2-b7fc-4e8c-b6d9-464623d1b05f","zone-id":"IMU-ZONE","address":"192.168.16.4",.........................
```
*<p align="center">EMS server log: New Aggregator give up message</p>*
```
e.m.e.b.server.ClientShellCommand : #00004--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
e.m.e.b.server.ClientShellCommand : #00004--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
```
Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
* ***New Aggregator***, for messages reporting (i) the old Aggregator's EMS client leaving the cluster, (ii) being elected as Aggregator, (iii) a number of failed connection attempts to the VM, and (iv) a recovery give-up message.
*<p align="center">New Aggregator log: Old Aggregator left cluster - New Aggregator election</p>*
```
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002
BRU: Brokers after cluster change: []
BRU: Broker election requested: broadcasting election message...
BRU: **** Broker message received: election
BRU: **** BROKER: Starting Broker election:
BRU: Member-Score: node_581d745be52c_2001 => 0.6640625 e974ebcd-e11e-4baa-b3cb-fa34242705ff
BRU: Broker: node_581d745be52c_2001
```
*<p align="center">New Aggregator log: Initializing to become the new Aggregator</p>*
```
BRU: Node will become Broker. Initializing...
NOTIFY-STATUS-CHANGE: INITIALIZING
initialize(): Node starts initializing as Aggregator...
.........................
.........................
Notifying Baguette Server i am the new aggregator
.........................
.........................
BRU: Node is ready to act as Aggregator. Ready
BRU: **** Broker message received: ready node_581d745be52c_2001 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
BRU: **** BROKER: New Broker is ready: node_581d745be52c_2001, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
```
*<p align="center">New Aggregator log: Requesting old Aggregator node's credentials</p>*
```
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
```
*<p align="center">New Aggregator log: Failing recovery actions of old Aggregator</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
.........................
.........................
SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
```
*<p align="center">New Aggregator log: Recovery actions Give Up message</p>*
```
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
```
* ***Normal nodes that operate***, for log messages about (i) the EMS client leaving the cluster and (ii) the Aggregator election, but NO logs about recovery actions or the EMS client joining back to the cluster.
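On the EMS server side, the promotion of the new Aggregator shows up as a chain of `Client status changed` messages. The following Python sketch extracts that chain for one client, assuming the message format in the sample EMS server logs above; `status_path` is a hypothetical helper name, not part of EMS.

```python
import re
from typing import Iterable, List

# Matches: "#00004--> Client status changed: CANDIDATE --> INITIALIZING"
STATUS_RE = re.compile(r"#(\d+)--> Client status changed: (\w+) --> (\w+)")


def status_path(lines: Iterable[str], client: str) -> List[str]:
    """Return the sequence of statuses logged for one client number, e.g. '00004'."""
    path: List[str] = []
    for line in lines:
        match = STATUS_RE.search(line)
        if not match or match.group(1) != client:
            continue
        old_status, new_status = match.group(2), match.group(3)
        if not path:
            path.append(old_status)   # starting status, taken from the first transition
        path.append(new_status)       # every transition contributes its new status
    return path
```

For the client that takes over, the expected path in this test case is `CANDIDATE`, `INITIALIZING`, `AGGREGATOR`.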
**B.6.a) Successful recovery of Netdata agent in a clustered RL node**
> Test Case Quick Notes:
> - Kill Netdata agent of any RL node.
> - The Aggregator will recover the killed Netdata agent after a configured period of time.
> - Check Aggregator log messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
**After Application deployment...**
* Connect to an RL node and ***kill*** its Netdata agent.
**Next, check the logs of:**
* ***EMS server***, for NO logs indicating a Netdata failure or recovery action; only the Aggregator's query for the RL node's credentials is expected.
*<p align="center">EMS server log: Aggregator queries for RL node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
```
* ***Aggregator***, for logs reporting (i) connection failures to a Netdata agent, (ii) recovery actions, and (iii) successful connection to the Netdata agent and collection of metrics.
*<p align="center">Aggregator log: Failed metric collection attempts from an RL node's Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3
Collectors::Netdata: Pausing collection from Node: 192.168.96.2
```
*<p align="center">Aggregator log: Requesting RL node's credentials</p>*
```
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
```
*<p align="center">Aggregator log: Netdata agent recovery actions</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
Connecting to server...
SSH client is ready
VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
############## Initial wait......
############## Waiting for 5000ms after Initial wait......
############## Sending Netdata agent kill command......
############## Waiting for 2000ms after Sending Netdata agent kill command......
############## Sending Netdata agent start command......
############## Waiting for 10000ms after Sending Netdata agent start command......
VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.96.2, port=22, username=ubuntu
Stopping SSH client...
SSH client stopped
Collectors::Netdata: Resuming collection from Node: 192.168.96.2
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2
OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4
OUT>
OUT> pwd
OUT> ubuntu@ec17d3e87fb4:~$ pwd
OUT> /home/ubuntu
OUT> ubuntu@ec17d3e87fb4:~$
OUT> < -U netdata -o "pid" --no-headers | xargs kill -9'
OUT>
OUT> Usage:
OUT> kill [options] <pid> [...]
OUT>
OUT> Options:
OUT> <pid> [...] send signal to every <pid> listed
OUT> -<signal>, -s, --signal <signal>
OUT> specify the <signal> to be sent
OUT> -l, --list=[<signal>] list all signal names, or convert one to a name
OUT> -L, --table list all signal names in a nice table
OUT>
OUT> -h, --help display this help and exit
OUT> -V, --version output version information and exit
OUT>
OUT> For more details see kill(1).
OUT> ubuntu@ec17d3e87fb4:~$ sudo netdata
OUT> 2022-02-16 12:27:55: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
```
*<p align="center">Aggregator log: Successful metrics collection from RL node's Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***RL node with killed Netdata***, check if the Netdata processes have started again.
*<p align="center">RL node shell: Recovered Netdata agent process</p>*
```sh
# ps -ef |grep netdata
root 610 29 0 12:27 pts/0 00:00:00 grep --color=auto netd
.........................
.........................
# ps -ef |grep netdata
netdata 623 1 5 12:27 ? 00:00:51 netdata
netdata 625 623 0 12:27 ? 00:00:02 /usr/sbin/netdata --special-spawn-server
root 894 623 0 12:28 ? 00:00:05 /usr/libexec/netdata/plugins.d/apps.plugin 1
netdata 1050 623 0 12:28 ? 00:00:04 /usr/libexec/netdata/plugins.d/go.d.plugin 1
root 1105 29 0 12:45 pts/0 00:00:00 grep --color=auto netd
```
* ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions.
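
The manual `ps -ef | grep netdata` check shown above for the RL node can also be scripted. The following is only a minimal sketch and is not part of EMS: the function name `netdata_is_running` is made up for illustration, and it assumes the `ps -ef` column layout of the listing above (user in column 1, command in column 8). Matching on these two columns avoids counting the `grep` process itself.

```sh
# Hypothetical helper (not shipped with EMS): reads `ps -ef` output on stdin and
# succeeds only if a process owned by user "netdata" runs the netdata daemon.
# Assumes the column layout shown above: user in field 1, command in field 8.
netdata_is_running() {
    awk '$1 == "netdata" && $8 ~ /(^|\/)netdata$/ { found = 1 }
         END { exit !found }'
}

# Example use on the RL node:
#   ps -ef | netdata_is_running && echo "Netdata agent is up"
```
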
**B.6.b) Failed recovery of Netdata agent in a clustered RL node**
> Test Case Quick Notes:
> - Kill the VM of any RL node.
> - The Aggregator of the cluster will try to connect to the affected VM but fail.
> - After a configured number of retries, the Aggregator will give up and notify the EMS server.
**After Application deployment...**
* Terminate the VM of an RL node.
**You need to check the logs of:**
* ***EMS server***, for NO logs indicating a Netdata failure or recovery action, BUT for logs reporting a recovery give-up from the Aggregator.
*<p align="center">EMS server log: Aggregator queries for RL node's credentials</p>*
```
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
```
*<p align="center">EMS server log: Aggregator give up message</p>*
```
e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP null @ 192.168.96.2
e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: null @ 192.168.96.2
e.m.e.baguette.server.BaguetteServer : BaguetteServer.onMessage: Marked Node as Failed: 192.168.96.2
```
* ***Aggregator***, for logs reporting (i) connection failures to a Netdata agent, (ii) a number of failed attempts to connect to VM, and (iii) a recovery give up message.
*<p align="center">Aggregator log: Failed metric collection attempts from an RL node's Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out
Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3
Collectors::Netdata: Pausing collection from Node: 192.168.96.2
```
*<p align="center">Aggregator log: Requesting RL node's credentials</p>*
```
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
```
*<p align="center">Aggregator log: Netdata agent (failing) recovery actions</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
Collecting metrics from local node...
Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Metrics: extracted=0, published=0, failed=0
Collecting metrics from remote nodes (without EMS client): [192.168.96.2]
Node is in ignore list: 192.168.96.2
.........................
.........................
SelfHealingPlugin: Retry #3: Recovering node: id=null, address=192.168.96.2
VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu
Connecting to server...
SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception:
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198)
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)
at java.lang.Thread.run(Thread.java:748)
```
*<p align="center">Aggregator log: Netdata agent recovery Give Up message</p>*
```
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=null, address=192.168.96.2
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2
Collectors::Netdata: Giving up collection from Node: 192.168.96.2
NOTIFY-X: RECOVERY GIVE_UP null @ 192.168.96.2
```
* ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions.
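
The three Aggregator-side checks and the EMS server's failure marking listed above can be verified with `grep` instead of by eye. This is a sketch under assumptions: the function names and the log file locations are hypothetical, and the match strings are copied from the sample logs above, so they may differ in other EMS versions.

```sh
# Hypothetical helpers (not shipped with EMS). Match strings are copied from the
# sample logs above. The node address is matched as plain text, so 192.168.96.2
# would also match 192.168.96.20.

# Usage: aggregator_gave_up AGGREGATOR_LOG NODE_ADDRESS
aggregator_gave_up() {
    log_file=$1
    node=$2
    # (i) failed metric collection attempts
    grep -Fq "Exception while collecting metrics from node: $node" "$log_file" || return 1
    # (ii) failed attempts to connect to the VM
    grep -Fq "EXCEPTION while recovering node: node-address=$node" "$log_file" || return 1
    # (iii) recovery give-up message
    grep -Fq "Giving up collection from Node: $node" "$log_file" || return 1
}

# Usage: server_marked_failed EMS_SERVER_LOG NODE_ADDRESS
server_marked_failed() {
    grep -Fq "Marked Node as Failed: $2" "$1"
}

# Example (log file paths are placeholders):
#   aggregator_gave_up /path/to/aggregator.log 192.168.96.2 && echo "give-up sequence found"
#   server_marked_failed /path/to/ems-server.log 192.168.96.2 && echo "node marked as failed"
```
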
**B.7) Successful recovery of local Netdata agent, in a clustered Normal node (including Aggregator)**
> Test Case Quick Notes:
> - Kill Netdata agent of any Normal node.
> - The EMS client of the affected node will recover the killed Netdata agent after a configured period of time.
> - Check EMS client's log for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
**After Application deployment...**
* Connect to a Normal node and ***kill*** Netdata agent.
**Next, check the logs of:**
* ***EMS server***, for NO log messages indicating connection failures to a Netdata agent or recovery actions.
* ***Aggregator***, for NO log messages indicating connection failures to a Netdata agent or recovery actions.
* ***Normal node with killed Netdata***, check if the Netdata processes have started again. Also check the EMS client's log for messages reporting failed metric collection attempts, recovery actions, and successful metric collection.
*<p align="center">Normal node - EMS client log: Failed attempts to collect metrics from <u><b>Local</b></u> Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused)
Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3
Collectors::Netdata: Will pause metrics collection from node for 60 seconds:
SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
```
*<p align="center">Normal node - EMS client log: <u><b>Local</b></u> Netdata agent recovery actions</p>*
```
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=
ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands
############## Initial wait......
############## Waiting for 5000ms after Initial wait......
############## Sending Netdata agent kill command......
############## Waiting for 2000ms after Sending Netdata agent kill command......
############## Sending Netdata agent start command......
############## Waiting for 10000ms after Sending Netdata agent start command......
ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
OUT> /opt/baguette-client
ERR> -U: 1: -U: Syntax error: Unterminated quoted string
ERR> 2022-02-16 13:21:52: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
```
*<p align="center">Normal node - EMS client log: Successful metrics collection from <u><b>Local</b></u> Netdata agent</p>*
```
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Node is in ignore list:
Collectors::Netdata: Resumed metrics collection from node:
SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=
Collectors::Netdata: Collecting metrics from local node...
Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json
Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
```
* ***Other Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions.
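
In the EMS client log of the affected Normal node, the order of events matters: collection must first fail repeatedly, then be resumed, and only then succeed again. A minimal sketch of that ordering check follows. The function name is hypothetical, and the match strings are taken from the sample logs above, so they may need adjusting for other EMS versions.

```sh
# Hypothetical helper (not shipped with EMS): succeeds only if the log shows, in
# this order, (1) too many consecutive collection errors, (2) collection being
# resumed, and (3) a subsequent metrics summary line.
# Reads the file(s) given as arguments, or stdin if none is given.
local_recovery_succeeded() {
    awk '/Too many consecutive errors/          { failed = 1 }
         failed && /Resumed metrics collection/ { resumed = 1 }
         resumed && /Metrics: extracted=/       { ok = 1 }
         END { exit !ok }' "$@"
}

# Example (log file path is a placeholder):
#   local_recovery_succeeded /path/to/ems-client.log && echo "local Netdata agent recovered"
```
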
------
## Limitations
* Clustering is never used for 2-level monitoring topologies.
* When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will collect metrics from the (orphan) RL nodes.
* When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will recover the (orphan) RL nodes.
* If the EMS server fails, no one will recover it.
* Metric messages are not cached or redirected if the next node has failed.