# Testing of New EMS Features ## New features of EMS - Support for **Resource-Limited (RL)** nodes, like edge devices or small VMs - Support for **Self-Healing** monitoring topology (partially implemented) ## Definitions We distinguish between ***Resource-Limited (RL)*** nodes and ***Normal or Non-RL*** nodes. - **Normal nodes** are VMs have enough resources, where an EMS client will be installed, along with JRE and Netdata. - **RL nodes** are VMs with few resources, where only Netdata will be installed. - Currently, EMS will classify a VM as an RL node if: * it has 1 or 2 cores, or * it has 2GB of RAM or less, or * it has Total Disk space 1GB or less, or * its architecture name starts with `ARM` (it will normally be `x86_64`). * Thresholds can be changed in `gr.iccs.imu.ems.baguette-client-install.properties` file. We also distinguish between ***Monitoring Topologies***: - **2-LEVEL Monitoring Topology**: Nodes send their metrics directly to EMS server. * Includes an EMS server, and any number of Normal and/or RL nodes. * No clustering occurs in 2-LEVEL topologies, hence Aggregator role is not used. * CAMEL Metric Models will only use `GLOBAL` and `PER_INSTANCE` groupings or no groupings at all (`GLOBAL` and `PER_INSTANCE` are then implied). - **3-LEVEL Monitoring Topology**: Nodes send their metrics to cluster-wide Aggregators, then Aggregators send (composite) metrics to EMS server. * Includes an EMS server, Aggregators (one per cluster), and Normal and/or RL nodes. * Nodes are groupped into clusters. Each cluster has a node with the Aggregator role. * Only Normal nodes can be Aggregators. * There must be exactly one Aggregator per cluster. * Each cluster must have at least one Normal node (in order to become Aggregator). * CAMEL Metric Model will use `GLOBAL`, `PER_ZONE` / `PER_REGION` / `PER_CLOUD`, and `PER_INSTANCE` groupings. Clustering of nodes is used for faster failure detection, as well as distribution of load: - Only 3-LEVEL topologies are clustered. - 2-LEVEL topologies are not clustered. Currently, nodes are clustered based on their: - Availability Zone or Region or Cloud Service Provider, or - assigned to a default cluster. ------ ## A) Support for Resource-Limited nodes > Feature Quick Notes: > - EMS server will NOT install EMS client and JRE in RL nodes. > - EMS server will install Netdata in RL nodes. > - EMS server or an Aggregator will periodically query Netdata agents of RL nodes for metrics. > - Normal nodes will periodically query their Local Netdata agent for metrics. ### Test Cases **A.1) Metrics collection from RL nodes in a 2-LEVEL topology** > Test Case Quick Notes: > - EMS server MUST log when it collects metrics from RL nodes. > - EMS server MUST *NOT* log or collect metrics from Normal (Non-RL) nodes. > - Normal nodes MUST log when they collect metrics from their Local Netdata agents. (The Log records are slightly different). **You need a CAMEL model:** * with two Requirement Sets: - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk * with 1-2 COMPONENTS using Requirement Set #1 (Normal nodes) * with 1-2 COMPONENTS with Requirement Set #2 (RL nodes) * with no Groupings in Metric Model **After Application deployment you need to check the logs of:** * ***EMS server***, for log messages about collecting metrics from RL-nodes' Netdata agents. E.g. ``` e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.32.2, 192.168.32.4] e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.2:19999/api/v1/allmetrics?format=json e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.4:19999/api/v1/allmetrics?format=json e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` * ***Normal nodes***, for log messages about collecting metrics from their Local Netdata agent ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` **A.2) Metrics collection from RL nodes in a 3-LEVEL topology** > Test Case Quick Notes: > - The Aggregator (it is a Normal node) MUST log each time it collects metrics from RL nodes in its cluster. > - The Aggregator MUST *NOT* log or collect metrics from Normal (Non-RL) nodes in its cluster. > - Normal nodes (including Aggregator) MUST log each time they collect metrics from their Local Netdata agents. (The Log records are slightly different). **You need a CAMEL model:** * with two Requirement Sets: - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk * with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes) * with 1-2 COMPONENTS with Requirement Set #2 (RL nodes) * with three (3) Groupings used in the Metric Model (`GLOBAL`, `PER_ZONE`, `PER_INSTANCE`) **After Application deployment you need to check the logs of:** * ***EMS server***, for NO logs related collecting metrics from any Netdata agent * ***Aggregator node(s)***, for logs about collecting metrics from the Netdata agents of RL nodes, in the same cluster. E.g. ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2, 192.168.96.5] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting data from url: http://192.168.96.5:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` * ***Normal nodes*** (including Aggregator node), for logs about collecting metrics from their Local Netdata agents. E.g. ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` ------ ## B) Support for Monitoring Self-Healing > Feature Quick Notes: > - Self-Healing refers to recovering the monitoring software running at the nodes. > - In Normal nodes, specifically refers to recovering of EMS client and/or Netdata agent. > - In RL nodes, refers to recovering Netdata agent only. #### Design Choices 1. Each EMS client (in a Normal node) is responsible for recovering the Local Netdata agent, collocated with it. 2. When clustering is used (i.e. in a 3-level topology), Aggregator is responsible for recovering other nodes in its cluster, both Normal and RL. 3. When clustering is not used (i.e. in a 2-level topology), EMS server is responsible for recovering nodes (both Normal and RL). #### Self-Healing actions We distinguish between monitoring topologies: * **2-LEVEL Monitoring topology:** Only EMS server and nodes (Normal & RL) are used. No Aggregators or clustering. * EMS server will try to recover any *Normal node* that disconnects and not reconnects after a configured period of time. ***Condition:*** * EMS client disconnects and not re-connects after X seconds ***Recovery steps taken by EMS server:*** * SSH to node (assuming it is a VM) * Kill EMS client (if it is still running) * Launch EMS client * Close SSH connection * Wait for a configured period of time for recovered EMS client to reconnect to EMS server * After that period of time, the process is repeated (up to a configured number of retries, and then gives up). * EMS server will try to recovery any *RL node* with inaccessible Netdata agent. ***Condition:*** * X consecutive connection failures to Netdata agent occur. ***Recovery steps taken by EMS server:*** * SSH to node (assuming it is a VM) * Kill Netdata (if it is still running) * Launch Netdata * Close SSH connection * Reset the consecutive failures counter. * **3-LEVEL Monitoring topology:** EMS server, Aggregators (one per cluster), and Nodes in clusters exist. Use of clustering. * Aggregator will try to recover any *Normal node* that leaves the cluster and not joins back in a configured period of time. ***Condition:*** * EMS client leaves cluster and not joins back after X seconds ***Recovery steps taken by Aggregators:*** * Contact EMS server to get node's credentials * SSH to node (assuming it is a VM) * Kill EMS client (if it is still running) * Launch EMS client * Close SSH connection * Wait for a configured period of time for EMS client to join back to cluster * After that period of time the process is repeated (up to a configured number of retries, and then it gives up and notifies EMS server) * When EMS client joins to cluster or in case of giving up, the node credentials are cleared from Aggregator's cache. * Aggregator will try to recover any *RL node* with inaccessible Netdata agent. ***Condition:*** * X consecutive connection failures to Netdata agent occur. ***Recovery steps taken by Aggregators:*** * Contact EMS server to get node's credentials * SSH to node (assuming it is a VM) * Kill Netdata agent (if it is still running) * Launch Netdata agent * Close SSH connection * Reset the consecutive failures counter * On successful connection to Netdata agent the node credentials are cleared from Aggregator cache. * **2-LEVEL or 3-LEVEL Monitoring topology** * Any Normal node will try to recover its Local Netdata agent, if it becomes inaccessible. ***Condition:*** * X consecutive connection failures to Local Netdata agent occur. ***Recovery steps (taken by NORMAL node):*** * Kill Netdata agent (if it is still running) * Launch Netdata agent * Reset the consecutive failures counter ### Test Cases for 2-LEVEL topology > ***PREREQUISITE:*** > > You need a CAMEL model with a 2-LEVEL monitoring topology: > > * with two Requirement Sets: > - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and > - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk > * with 1-2 components with Requirement Set #1 (Normal nodes) > * with 1-2 components with Requirement Set #2 (RL nodes) > * with no Groupings used in Metric Model. > > This CAMEL model is ***common*** to the following test cases, unless another CAMEL model is specified. > > CAMEL model MUST be re-deployed after each test case execution. **B.1.a) Successful recovery of an EMS client in a Normal node** > Test Case Quick Notes: > - Kill EMS client of any Normal node. > - The EMS server will recover the killed EMS client after a configured period of time. > - Check EMS server logs for disconnection, recovery actions and re-connection messages. **After Application deployment...** * Connect to a Normal node and ***kill*** EMS client **Next, check the logs of:** * ***EMS server***, for messages reporting an EMS client disconnection, the recovery attempt(s) and EMS client re-connection. *

EMS server log: An EMS client disconnected

* ``` e.m.e.b.server.ClientShellCommand : #00000==> Signaling client to exit e.m.e.b.server.ClientShellCommand : #00000--> Thread stops e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00000 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00000 @ 172.29.0.3 e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00000, client-address=172.29.0.3 ``` *

EMS server log: EMS client recovery actions

EMS server log: EMS client reconnected

* ``` o.a.s.s.session.ServerUserAuthService : Session user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696 authenticated e.m.e.b.server.ClientShellCommand : #00001--> Got session : ServerSessionImpl[user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696] e.m.e.b.server.ClientShellCommand : #00001==> Thread started e.m.e.b.server.ClientShellCommand : #00001--> Client Id: VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450 e.m.e.b.server.ClientShellCommand : #00001--> Broker URL: ssl://172.29.0.3:61617?daemon=true&trace=false&useInactivityMonitor=false&connectionTimeout=0&keepAlive=true e.m.e.b.server.ClientShellCommand : #00001--> Broker Username: user-local-Q1mnKfNgzM e.m.e.b.server.ClientShellCommand : #00001--> Broker Password: xityAHGDhIiVeAxJdfax e.m.e.b.server.ClientShellCommand : #00001--> Broker Cert.: -----BEGIN CERTIFICATE----- ......................... -----END CERTIFICATE----- e.m.e.b.server.ClientShellCommand : #00001--> Adding/Replacing client certificate in Truststore: alias=172.29.0.3 e.m.e.b.server.ClientShellCommand : #00001--> Added/Replaced client certificate in Truststore: alias=172.29.0.3, CN=C=GR, ST=Attika, L=Athens, O=Institute of Communication and Computer Systems (ICCS), OU=Information Management Unit (IMU), CN=172.29.0.3, certificate-na e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: register(): Method invoked. CSC: ClientShellCommand_#00001 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001... ......................... e.m.e.b.server.ClientShellCommand : sendGroupingConfiguration: Serialization of Grouping configuration for PER_INSTANCE: rO0ABXNyACt......................... e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-GROUPING-CONFIG rO0ABXNyACt......................... e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001... done e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Setting active grouping of client #00001: PER_INSTANCE e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-ACTIVE-GROUPING PER_INSTANCE e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.server.ClientShellCommand : #00001--> Client grouping changed: null --> PER_INSTANCE ``` * ***Normal node where EMS client killed***, for EMS client's logs indicating its restart. *

Normal node: EMS client restarts

EMS server log: An EMS client disconnected

* ``` e.m.e.b.server.ClientShellCommand : #00001==> Signaling client to exit e.m.e.b.server.ClientShellCommand : #00001--> Thread stops e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00001 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00001 @ 172.29.0.3 e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00001, client-address=172.29.0.3 ``` *

EMS server log: EMS client recovery actions and give up message

* ``` e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ......................... ......................... e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Retry 5/5 executing task #0 e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Giving up executing task #0 after 5 retries e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=false, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteS ``` * ***Normal nodes that operate***, for NO logs indicating any failure or recovery attempts **B.2.a) Successful recovery of a Netdata agent in a RL node** > Test Case Quick Notes: > - Kill Netdata agent of any RL node. > - The EMS server will recover the killed Netdata agent after a configured period of time. > - Check EMS server log messages reporting failures to collect metrics, recovery actions, and successful metrics collection. **After Application deployment...** * Connect to a RL node and kill Netdata agent. *

EMS server log: Failed metric collection attempts from a Netdata agent

* ``` ......................... Not yet implemented ``` **Next, check the logs of:** * ***EMS server***, for logs reporting connection failure to a Netdata agent, and recovery actions. *

EMS server log: Netdata agent recovery actions

* ``` ......................... Not yet implemented ``` * ***RL node with killed Netdata***, check if the Netdata processes have started again. *

RL node shell: Recovered Netdata agent process

* ``` ......................... Not yet implemented ``` * ***Normal nodes (that operate)***, for NO Logs indicating failure or recovery attempts. **B.2.b) Failed recovery of a Netdata agent in a RL node** > Test Case Quick Notes: > - Kill the VM of any RL node. > - The EMS server will try to connect to the affected VM but fail. > - After a configured number of retries EMS server will give up. **After Application deployment...** * Terminate the VM of a RL node **You need to check the logs of:** * ***EMS server***, for logs reporting connection failure to a Netdata agent, and then a number of failed attempts to connect to VM. *

EMS server log: Failed metric collection attempts from a Netdata agent

* ``` ......................... Not yet implemented ``` *

EMS server log: Failed Netdata agent recovery actions and give up message

* ``` ......................... Not yet implemented ``` * ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions. **B.3) Successful recovery of a Netdata agent in a Normal node** > Test Case Quick Notes: > - Kill Netdata agent of any Normal node. > - The EMS client of the node will recover the killed Netdata agent after a configured period of time. > - Check EMS client's logs for messages reporting failures to collect metrics, recovery actions, and successful metrics collection. **After Application deployment...** * Connect to a Normal node and kill Netdata agent. **Next, check the logs of:** * ***EMS server***, for No log messages indicating connection failures to Netdata, or recovery actions. * ***Normal node with killed Netdata***, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collections, recovery actions, and successful metric collection. *

Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent

* ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3 Collectors::Netdata: Will pause metrics collection from node for 60 seconds: SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address= ``` *

Normal node - EMS client log: Local Netdata agent recovery actions

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=null, address= ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: OUT> /opt/baguette-client ERR> -U: 1: -U: Syntax error: Unterminated quoted string ERR> 2022-02-16 10:23:29: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults. ``` *

Normal node - EMS client log: Successful metrics collection from Local Netdata agent

* ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Resumed metrics collection from node: SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address= Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` * ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions. ### Test Cases for 3-LEVEL topology > ***PREREQUISITE:*** > > You need a CAMEL model for 3-LEVEL topology: > > * with two Requirement Sets: > - for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and > - for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk, > * with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes) > * with 1-2 COMPONENTS with Requirement Set #2 (RL nodes) > * with three (3) Groupings used in the Metric Model (`GLOBAL`, `PER_ZONE`, `PER_INSTANCE`). > > This CAMEL model is ***common*** to the following test cases, unless another CAMEL model is specified. > > CAMEL model MUST be re-deployed after each test case execution. **B.4.a) Successful recovery of an EMS client in a clustered Normal node** > Test Case Quick Notes: > - Kill EMS client of any Normal node except the Aggregator. > - The Aggregator will recover the killed EMS client after a configured period of time. > - Check Aggregator log messages for node leaving cluster, recovery actions, and node joining back. **After Application deployment...** * Connect to a Normal node, except Aggregator, and ***kill*** EMS client **Next, check the logs of:** * ***EMS server***, for Aggregator's query for node credentials. *

EMS server log: Aggregator queries for node's credentials

* ``` e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",......................... ``` Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server. * ***Aggregator***, for log messages about, (i) EMS client leaving cluster, (ii) recovery actions, and (iii) EMS client joining back to the cluster. *

Aggregator log: An EMS client left cluster

* ``` CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 ``` *

Aggregator log: EMS client recovery actions

Aggregator log: EMS client joined back to cluster

* ``` CLM: MEMBER_ADDED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 ``` * ***Normal node whose EMS client killed***, for EMS client's logs indicating its restart. *

Normal node: EMS client restarts

* ``` Starting baguette client... EMS_CONFIG_DIR=/opt/baguette-client/conf LOG_FILE=/opt/baguette-client/logs/output.txt ____ _ _ _____ _ _ _ | _ \ | | | | / ____| (_) | | | |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_ | _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __| | |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_ |____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__| __/ | |___/ Starting BaguetteClient v4.5.0-SNAPSHOT on 3866738cb0f4 with PID 973 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client) No active profile set, falling back to default profiles: default loadCachedClientId: Used cached Client Id: null Password encoder class name is empty. Default instance of PasswordEncoder will be created PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done BrokerConfig: Creating new Broker Service instance: url=ssl://0.0.0.0:61617 ......................... ......................... CLUSTER-JOIN IMU-ZONE GLOBAL:PER_ZONE:PER_INSTANCE start-election=true 192.168.16.4:2002 192.168.16.3:2001 CLUSTER-JOIN ARGS: cluster-id=IMU-ZONE, groupings=GLOBAL:PER_ZONE:PER_INSTANCE, local-node=192.168.16.4:2002, other-nodes=[192.168.16.3:2001] CLUSTER-JOIN ARGS: Groupings: global=GLOBAL, aggregator=PER_ZONE, node=PER_INSTANCE CLM: Local address used for building Atomix: 192.168.16.4:2002 CLM: Building Atomix: Other members: [Node{id=node_3866738cb0f4_2001, address=192.168.16.3:2001}] ......................... ......................... CLUSTER-EXEC broker list Cluster executes command: broker list CLI: Node status and scores: CLI: node_581d745be52c_2001 [AGGREGATOR, 0.6640625, 9e790362-704c-4d9e-aa74-77f76e297816] CLI: node_3866738cb0f4_2002 [CANDIDATE, 0.6640625, 44a5afb7-890a-4090-9f80-c65f046aeddd] Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 ``` * ***Other Normal nodes***, for logs about, (i) EMS client leaving cluster, (ii) EMS client joining to cluster, but NO logs about recovery actions. **B.4.b) Failed recovery of an EMS client in a clustered Normal node** > Test Case Quick Notes: > - Kill the VM of any Normal node, except Aggregator. > - The Aggregator will try to connect to the affected VM but fail. > - After a configured number of retries Aggregator will give up. **After Application deployment...** * Terminate the VM of a Normal node, except the Aggregator's **Next, check the logs of:** * ***EMS server***, for a recovery Give up message from Aggregator *

EMS server log: Aggregator queries for node's credentials

EMS server log: Aggregator give up message

* ``` e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4 ``` Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server. * ***Aggregator***, for messages reporting, (i) an EMS client left cluster, (ii) a number of failed connection attempts to the VM, and (iii) a recovery give up message. *

Aggregator log: An EMS client left cluster

Aggregator log: EMS client recovery actions and give up message

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ......................... ......................... SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ``` ``` SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 ``` * ***Normal nodes that operate***, for logs about EMS client leaving cluster, and NO logs about recovery actions or EMS client joining back. **B.5.a) Successful recovery of EMS client of the cluster Aggregator** > Test Case Quick Notes: > - Kill EMS client of the Aggregator. > - The cluster nodes will elect a new Aggregator. Check logs of any cluster node. > - The new Aggregator will recover the killed EMS client after a configured period of time. > - Check new Aggregator log messages for node leaving cluster, being elected as Aggregator, recovery actions, and node joining back. > - Old Aggregator will join back as a Normal node. **After Application deployment...** * Connect to the Aggregator node, and ***kill*** EMS client. **Next, check the logs of:** * ***EMS server***, for message about Aggregator change. *

EMS server log: A new Aggregator initialized

* ``` e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: CANDIDATE --> INITIALIZING e.m.e.b.server.ClientShellCommand : #00003--> Client grouping changed: PER_INSTANCE --> PER_ZONE e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00003 @ 192.168.16.4 (VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f) e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: INITIALIZING --> AGGREGATOR ``` *

EMS server log: Aggregator queries for node's credentials

* ``` e.m.e.b.server.ClientShellCommand : #00003==> PUSH : {"random":"8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a","zone-id":"IMU-ZONE","address":"192.168.16.3","provider":"AWS",......................... ``` Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server. * ***New Aggregator***, for log messages about, (i) EMS client leaving cluster, (ii) being elected as Aggregator, (iii) recovery actions, and (iv) EMS client joining to cluster. *

New Aggregator log: Old Aggregator left cluster - New Aggregator election

* ``` CLM: MEMBER_REMOVED: node=node_581d745be52c_2001 BRU: Brokers after cluster change: [] BRU: Broker election requested: broadcasting election message... BRU: **** Broker message received: election BRU: **** BROKER: Starting Broker election: BRU: Member-Score: node_3866738cb0f4_2002 => 0.6640625 d4f2eb55-c355-4715-8a27-9f7c12c32924 BRU: Broker: node_3866738cb0f4_2002 ``` *

New Aggregator log: Initializing to become the new Aggregator

* ``` BRU: Node will become Broker. Initializing... NOTIFY-STATUS-CHANGE: INITIALIZING initialize(): Node starts initializing as Aggregator... ......................... ......................... Notifying Baguette Server i am the new aggregator ......................... ......................... BRU: Node is ready to act as Aggregator. Ready BRU: **** Broker message received: ready node_3866738cb0f4_2002 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn......................... BRU: **** BROKER: New Broker is ready: node_3866738cb0f4_2002, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn......................... BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn......................... ``` *

New Aggregator log: Requesting old Aggregator node's credentials

* ``` SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.3 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3 ``` *

New Aggregator log: Recovery actions of old Aggregator

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=node_581d745be52c_2001, address=192.168.16.3 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.3, port=22, username=ubuntu Connecting to server... SSH client is ready VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending baguette client kill command...... ############## Waiting for 2000ms after Sending baguette client kill command...... ############## Sending baguette client start command...... ############## Waiting for 10000ms after Sending baguette client start command...... SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4 New client config.: ClientConfiguration(nodesWithoutClient=[]) VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.3, port=22, username=ubuntu Stopping SSH client... SSH client stopped OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4 OUT> OUT> pwd OUT> ubuntu@581d745be52c:~$ pwd OUT> /home/ubuntu OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/kill.sh OUT> Baguette client is not running OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/run.sh OUT> Starting baguette client... OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf OUT> LOG_FILE=/opt/baguette-client/logs/output.txt OUT> Baguette client PID: 1242 VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT ``` *

New Aggregator log: Old Aggregator joins back to cluster as plain node

* ``` CLM: MEMBER_ADDED: node=node_581d745be52c_2001 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3 ``` * ***Old Aggregator node whose EMS client killed***, for EMS client's logs indicating its restart (as a `PER_INSTANCE` node). *

Normal node: Old Aggregator restarts as a plain Normal node

* ``` Starting baguette client... EMS_CONFIG_DIR=/opt/baguette-client/conf LOG_FILE=/opt/baguette-client/logs/output.txt ____ _ _ _____ _ _ _ | _ \ | | | | / ____| (_) | | | |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_ | _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __| | |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_ |____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__| __/ | |___/ Starting BaguetteClient v4.5.0-SNAPSHOT on 581d745be52c with PID 1242 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client) No active profile set, falling back to default profiles: default loadCachedClientId: Used cached Client Id: null Password encoder class name is empty. Default instance of PasswordEncoder will be created PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done ......................... ......................... CLM: Joining cluster... NOTIFY-STATUS-CHANGE: CANDIDATE ......................... ......................... Joined to cluster ......................... ......................... CLUSTER-EXEC broker list Cluster executes command: broker list CLI: Node status and scores: CLI: node_3866738cb0f4_2002 [AGGREGATOR, 0.6640625, d4f2eb55-c355-4715-8a27-9f7c12c32924] CLI: node_581d745be52c_2001 [CANDIDATE, 0.6640625, e974ebcd-e11e-4baa-b3cb-fa34242705ff] ``` * ***Other Normal nodes***, for log messages about, (i) EMS client leaving cluster, (ii) Aggregator election, (iii) EMS client joining to cluster, but NO logs about recovery actions. **B.5.b) Failed recovery of EMS client of the cluster Aggregator** > Test Case Quick Notes: > - Kill the VM of the Aggregator. > - The cluster nodes will elect a new Aggregator. Check logs of any cluster node. > - The new Aggregator will try to connect to the affected VM but fail. > - After a configured number of retries new Aggregator will give up. **After Application deployment...** * Terminate the VM of the Aggregator **Next, check the logs of:** * ***EMS server***, for one message about Aggregator change, and one about new Aggregator giving up recovery. *

EMS server log: A new Aggregator initialized

* ``` e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: CANDIDATE --> INITIALIZING e.m.e.b.server.ClientShellCommand : #00004--> Client grouping changed: PER_INSTANCE --> PER_ZONE e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00004 @ 192.168.16.3 (VM-UBUNTU-vm1-vm1-AWS-vm1-8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a) e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: INITIALIZING --> AGGREGATOR ``` *

EMS server log: New Aggregator queries for node's credentials

* ``` e.m.e.b.server.ClientShellCommand : #00004==> PUSH : {"random":"4abf9ae2-b7fc-4e8c-b6d9-464623d1b05f","zone-id":"IMU-ZONE","address":"192.168.16.4",......................... ``` *

EMS server log: New Aggregator give up message

* ``` e.m.e.b.server.ClientShellCommand : #00004--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 e.m.e.b.server.ClientShellCommand : #00004--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4 ``` Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server. * ***New Aggregator***, for messages reporting, (i) an EMS client left cluster, (ii) being elected as Aggregator, (iii) a number of failed connection attempts to the VM, and (iv) a recovery give up message. *

New Aggregator log: Old Aggregator left cluster - New Aggregator election

New Aggregator log: Initializing to become the new Aggregator

* ``` CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [] BRU: Broker election requested: broadcasting election message... BRU: **** Broker message received: election BRU: **** BROKER: Starting Broker election: BRU: Member-Score: node_581d745be52c_2001 => 0.6640625 e974ebcd-e11e-4baa-b3cb-fa34242705ff BRU: Broker: node_581d745be52c_2001 BRU: Node will become Broker. Initializing... 2022-02-16 12:01:34.448 [INFO ] NOTIFY-STATUS-CHANGE: INITIALIZING initialize(): Node starts initializing as Aggregator... ......................... ......................... Notifying Baguette Server i am the new aggregator ......................... ......................... BRU: Node is ready to act as Aggregator. Ready BRU: **** Broker message received: ready node_581d745be52c_2001 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn......................... BRU: **** BROKER: New Broker is ready: node_581d745be52c_2001, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn......................... BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn......................... ``` *

New Aggregator log: Requesting old Aggregator node's credentials

* ``` SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 ``` *

New Aggregator log: Failing recovery actions of old Aggregator

New Aggregator log: Recovery actions Give Up message

* ``` SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 ``` * ***Normal nodes that operate***, for log messages about, (i) EMS client leaving cluster, (ii) Aggregator election, but NO logs about recovery actions, or EMS client joining back to cluster. **B.6.a) Successful recovery of Netdata agent in a clustered RL node** > Test Case Quick Notes: > - Kill Netdata agent of any RL node. > - The Aggregator will recover the killed Netdata agent after a configured period of time. > - Check Aggregator log messages reporting failures to collect metrics, recovery actions, and successful metrics collection. **After Application deployment...** * Connect to a RL node and ***kill*** Netdata agent. **Next, check the logs of:** * ***EMS server***, for NO logs indicating a Netdata failure and recovery. *

EMS server log: Aggregator queries for RL node's credentials

* ``` e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",......................... ``` * ***Aggregator***, for logs reporting, (i) connection failures to a Netdata agent, (ii) recovery actions, and (iii) successful connection to Netdata agent and collection of metrics. *

Aggregator log: Failed metric collection attempts from a RL node's Netdata agent

* ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3 Collectors::Netdata: Pausing collection from Node: 192.168.96.2 ``` *

Aggregator log: Requesting RL node's credentials

* ``` SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2 ``` *

Aggregator log: Netdata agent recovery actions

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SSH client is ready VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.96.2, port=22, username=ubuntu Stopping SSH client... SSH client stopped Collectors::Netdata: Resuming collection from Node: 192.168.96.2 Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2 OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4 OUT> OUT> pwd OUT> ubuntu@ec17d3e87fb4:~$ pwd OUT> /home/ubuntu OUT> ubuntu@ec17d3e87fb4:~$ OUT> < -U netdata -o "pid" --no-headers | xargs kill -9' OUT> OUT> Usage: OUT> kill [options] [...] OUT> OUT> Options: OUT> [...] send signal to every listed OUT> -, -s, --signal OUT> specify the to be sent OUT> -l, --list=[] list all signal names, or convert one to a name OUT> -L, --table list all signal names in a nice table OUT> OUT> -h, --help display this help and exit OUT> -V, --version output version information and exit OUT> OUT> For more details see kill(1). OUT> ubuntu@ec17d3e87fb4:~$ sudo netdata OUT> 2022-02-16 12:27:55: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults. VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT ``` *

Aggregator log: Successful metrics collection from RL node's Netdata agent

RL node shell: Recovered Netdata agent process

* ```sh # ps -ef |grep netdata root 610 29 0 12:27 pts/0 00:00:00 grep --color=auto netd ......................... ......................... # ps -ef |grep netdata netdata 623 1 5 12:27 ? 00:00:51 netdata netdata 625 623 0 12:27 ? 00:00:02 /usr/sbin/netdata --special-spawn-server root 894 623 0 12:28 ? 00:00:05 /usr/libexec/netdata/plugins.d/apps.plugin 1 netdata 1050 623 0 12:28 ? 00:00:04 /usr/libexec/netdata/plugins.d/go.d.plugin 1 root 1105 29 0 12:45 pts/0 00:00:00 grep --color=auto netd ``` * ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery action. **B.6.b) Failed recovery of Netdata agent in a clustered RL node** > Test Case Quick Notes: > - Kill the VM of any RL node. > - The EMS server will try to connect to the affected VM but fail. > - After a configured number of retries EMS server will give up. **After Application deployment...** * Terminate the VM of a RL node **You need to check the logs of:** * ***EMS server***, for NO logs indicating a Netdata failure and recovery, BUT reporting a recovery give up from Aggregator. *

EMS server log: Aggregator queries for RL node's credentials

EMS server log: Aggregator give up message

* ``` e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP null @ 192.168.96.2 e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: null @ 192.168.96.2 e.m.e.baguette.server.BaguetteServer : BaguetteServer.onMessage: Marked Node as Failed: 192.168.96.2 ``` * ***Aggregator***, for logs reporting (i) connection failures to a Netdata agent, (ii) a number of failed attempts to connect to VM, and (iii) a recovery give up message. *

Aggregator log: Failed metric collection attempts from a RL node's Netdata agent

* ``` Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3 Collectors::Netdata: Pausing collection from Node: 192.168.96.2 ``` *

Aggregator log: Requesting RL node's credentials

* ``` SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2 ``` *

Aggregator log: Netdata agent (failing) recovery actions

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) Collecting metrics from local node... Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Metrics: extracted=0, published=0, failed=0 Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Node is in ignore list: 192.168.96.2 ......................... ......................... SelfHealingPlugin: Retry #3: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ``` *

Aggregator log: Netdata agent recovery Give Up message

* ``` SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=null, address=192.168.96.2 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2 Collectors::Netdata: Giving up collection from Node: 192.168.96.2 NOTIFY-X: RECOVERY GIVE_UP null @ 192.168.96.2 ``` * ***Normal nodes (that operate)***, for NO logs indicating connection failures or recovery actions. **B.7) Successful recovery of local Netdata agent, in a clustered Normal node (including Aggregator)** > Test Case Quick Notes: > - Kill Netdata agent of any Normal node. > - The EMS client of the affected node will recover the killed Netdata agent after a configured period of time. > - Check EMS client's log for messages reporting failures to collect metrics, recovery actions, and successful metrics collection. **After Application deployment...** * Connect to a Normal node and ***kill*** Netdata agent. **Next, check the logs of:** * ***EMS server***, for No log messages indicating connection failures to a Netdata agent or recovery actions. * ***Aggregator***, for No log messages indicating connection failures to a Netdata agent or recovery actions. * ***Normal node with killed Netdata***, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collection attempts, recovery actions, and successful metric collection. *

Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent

Normal node - EMS client log: Local Netdata agent recovery actions

* ``` SelfHealingPlugin: Retry #0: Recovering node: id=null, address= ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: OUT> /opt/baguette-client ERR> -U: 1: -U: Syntax error: Unterminated quoted string ERR> 2022-02-16 13:21:52: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults. ``` *

Normal node - EMS client log: Successful metrics collection from Local Netdata agent