81 KiB
Testing of New EMS Features
New features of EMS
- Support for Resource-Limited (RL) nodes, like edge devices or small VMs
- Support for Self-Healing monitoring topology (partially implemented)
Definitions
We distinguish between Resource-Limited (RL) nodes and Normal or Non-RL nodes.
- Normal nodes are VMs have enough resources, where an EMS client will be installed, along with JRE and Netdata.
- RL nodes are VMs with few resources, where only Netdata will be installed.
- Currently, EMS will classify a VM as an RL node if:
- it has 1 or 2 cores, or
- it has 2GB of RAM or less, or
- it has Total Disk space 1GB or less, or
- its architecture name starts with
ARM
(it will normally bex86_64
). - Thresholds can be changed in
gr.iccs.imu.ems.baguette-client-install.properties
file.
We also distinguish between Monitoring Topologies:
-
2-LEVEL Monitoring Topology: Nodes send their metrics directly to EMS server.
- Includes an EMS server, and any number of Normal and/or RL nodes.
- No clustering occurs in 2-LEVEL topologies, hence Aggregator role is not used.
- CAMEL Metric Models will only use
GLOBAL
andPER_INSTANCE
groupings or no groupings at all (GLOBAL
andPER_INSTANCE
are then implied).
-
3-LEVEL Monitoring Topology: Nodes send their metrics to cluster-wide Aggregators, then Aggregators send (composite) metrics to EMS server.
- Includes an EMS server, Aggregators (one per cluster), and Normal and/or RL nodes.
- Nodes are groupped into clusters. Each cluster has a node with the Aggregator role.
- Only Normal nodes can be Aggregators.
- There must be exactly one Aggregator per cluster.
- Each cluster must have at least one Normal node (in order to become Aggregator).
- CAMEL Metric Model will use
GLOBAL
,PER_ZONE
/PER_REGION
/PER_CLOUD
, andPER_INSTANCE
groupings.
Clustering of nodes is used for faster failure detection, as well as distribution of load:
- Only 3-LEVEL topologies are clustered.
- 2-LEVEL topologies are not clustered.
Currently, nodes are clustered based on their:
- Availability Zone or Region or Cloud Service Provider, or
- assigned to a default cluster.
A) Support for Resource-Limited nodes
Feature Quick Notes:
- EMS server will NOT install EMS client and JRE in RL nodes.
- EMS server will install Netdata in RL nodes.
- EMS server or an Aggregator will periodically query Netdata agents of RL nodes for metrics.
- Normal nodes will periodically query their Local Netdata agent for metrics.
Test Cases
A.1) Metrics collection from RL nodes in a 2-LEVEL topology
Test Case Quick Notes:
- EMS server MUST log when it collects metrics from RL nodes.
- EMS server MUST NOT log or collect metrics from Normal (Non-RL) nodes.
- Normal nodes MUST log when they collect metrics from their Local Netdata agents. (The Log records are slightly different).
You need a CAMEL model:
- with two Requirement Sets:
- for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
- for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
- with 1-2 COMPONENTS using Requirement Set #1 (Normal nodes)
- with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
- with no Groupings in Metric Model
After Application deployment you need to check the logs of:
-
EMS server, for log messages about collecting metrics from RL-nodes' Netdata agents. E.g.
e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.32.2, 192.168.32.4] e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.2:19999/api/v1/allmetrics?format=json e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Collecting data from url: http://192.168.32.4:19999/api/v1/allmetrics?format=json e.m.e.c.c.netdata.NetdataCollector : Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
-
Normal nodes, for log messages about collecting metrics from their Local Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
A.2) Metrics collection from RL nodes in a 3-LEVEL topology
Test Case Quick Notes:
- The Aggregator (it is a Normal node) MUST log each time it collects metrics from RL nodes in its cluster.
- The Aggregator MUST NOT log or collect metrics from Normal (Non-RL) nodes in its cluster.
- Normal nodes (including Aggregator) MUST log each time they collect metrics from their Local Netdata agents. (The Log records are slightly different).
You need a CAMEL model:
- with two Requirement Sets:
- for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
- for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
- with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
- with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
- with three (3) Groupings used in the Metric Model (
GLOBAL
,PER_ZONE
,PER_INSTANCE
)
After Application deployment you need to check the logs of:
-
EMS server, for NO logs related collecting metrics from any Netdata agent
-
Aggregator node(s), for logs about collecting metrics from the Netdata agents of RL nodes, in the same cluster. E.g.
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2, 192.168.96.5] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting data from url: http://192.168.96.5:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
-
Normal nodes (including Aggregator node), for logs about collecting metrics from their Local Netdata agents. E.g.
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
B) Support for Monitoring Self-Healing
Feature Quick Notes:
- Self-Healing refers to recovering the monitoring software running at the nodes.
- In Normal nodes, specifically refers to recovering of EMS client and/or Netdata agent.
- In RL nodes, refers to recovering Netdata agent only.
Design Choices
- Each EMS client (in a Normal node) is responsible for recovering the Local Netdata agent, collocated with it.
- When clustering is used (i.e. in a 3-level topology), Aggregator is responsible for recovering other nodes in its cluster, both Normal and RL.
- When clustering is not used (i.e. in a 2-level topology), EMS server is responsible for recovering nodes (both Normal and RL).
Self-Healing actions
We distinguish between monitoring topologies:
-
2-LEVEL Monitoring topology: Only EMS server and nodes (Normal & RL) are used. No Aggregators or clustering.
-
EMS server will try to recover any Normal node that disconnects and not reconnects after a configured period of time.
Condition:
- EMS client disconnects and not re-connects after X seconds
Recovery steps taken by EMS server:
- SSH to node (assuming it is a VM)
- Kill EMS client (if it is still running)
- Launch EMS client
- Close SSH connection
- Wait for a configured period of time for recovered EMS client to reconnect to EMS server
- After that period of time, the process is repeated (up to a configured number of retries, and then gives up).
-
EMS server will try to recovery any RL node with inaccessible Netdata agent.
Condition:
- X consecutive connection failures to Netdata agent occur.
Recovery steps taken by EMS server:
- SSH to node (assuming it is a VM)
- Kill Netdata (if it is still running)
- Launch Netdata
- Close SSH connection
- Reset the consecutive failures counter.
-
-
3-LEVEL Monitoring topology: EMS server, Aggregators (one per cluster), and Nodes in clusters exist. Use of clustering.
-
Aggregator will try to recover any Normal node that leaves the cluster and not joins back in a configured period of time.
Condition:
- EMS client leaves cluster and not joins back after X seconds
Recovery steps taken by Aggregators:
- Contact EMS server to get node's credentials
- SSH to node (assuming it is a VM)
- Kill EMS client (if it is still running)
- Launch EMS client
- Close SSH connection
- Wait for a configured period of time for EMS client to join back to cluster
- After that period of time the process is repeated (up to a configured number of retries, and then it gives up and notifies EMS server)
- When EMS client joins to cluster or in case of giving up, the node credentials are cleared from Aggregator's cache.
-
Aggregator will try to recover any RL node with inaccessible Netdata agent.
Condition:
- X consecutive connection failures to Netdata agent occur.
Recovery steps taken by Aggregators:
- Contact EMS server to get node's credentials
- SSH to node (assuming it is a VM)
- Kill Netdata agent (if it is still running)
- Launch Netdata agent
- Close SSH connection
- Reset the consecutive failures counter
- On successful connection to Netdata agent the node credentials are cleared from Aggregator cache.
-
-
2-LEVEL or 3-LEVEL Monitoring topology
-
Any Normal node will try to recover its Local Netdata agent, if it becomes inaccessible.
Condition:
- X consecutive connection failures to Local Netdata agent occur.
Recovery steps (taken by NORMAL node):
- Kill Netdata agent (if it is still running)
- Launch Netdata agent
- Reset the consecutive failures counter
-
Test Cases for 2-LEVEL topology
PREREQUISITE:
You need a CAMEL model with a 2-LEVEL monitoring topology:
- with two Requirement Sets:
- for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
- for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk
- with 1-2 components with Requirement Set #1 (Normal nodes)
- with 1-2 components with Requirement Set #2 (RL nodes)
- with no Groupings used in Metric Model.
This CAMEL model is common to the following test cases, unless another CAMEL model is specified.
CAMEL model MUST be re-deployed after each test case execution.
B.1.a) Successful recovery of an EMS client in a Normal node
Test Case Quick Notes:
- Kill EMS client of any Normal node.
- The EMS server will recover the killed EMS client after a configured period of time.
- Check EMS server logs for disconnection, recovery actions and re-connection messages.
After Application deployment...
- Connect to a Normal node and kill EMS client
Next, check the logs of:
-
EMS server, for messages reporting an EMS client disconnection, the recovery attempt(s) and EMS client re-connection.
EMS server log: An EMS client disconnected
e.m.e.b.server.ClientShellCommand : #00000==> Signaling client to exit e.m.e.b.server.ClientShellCommand : #00000--> Thread stops e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00000 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00000 @ 172.29.0.3 e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00000, client-address=172.29.0.3
EMS server log: EMS client recovery actions
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi o.a.s.c.k.AcceptAllServerKeyVerifier : Server at /172.29.0.3:22 presented unverified EC key: SHA256:gNU4ScwysUpv050SaorPj7zlZrkiyGq4YSsOGBl+DCk e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Session will be recorded in file: /logs/172.29.0.3-22-2022.02.16.09.33.31.121-0.txt e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Connected to remote host: task #0: host: 172.29.0.3:22 e.m.e.b.c.install.SshClientInstaller : ---------------------------------------------------------------------- Task #0 : Instruction Set: Restarting Baguette agent at VM node e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing installation instructions set: Restarting Baguette agent at VM node e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing instruction 1/2: Killing previous EMS client process e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/kill.sh o.a.s.c.session.ClientConnectionService : globalRequest(ClientConnectionService[ClientSessionImpl[ubuntu@/172.29.0.3:22]])[hostkeys-00@openssh.com, want-reply=false] failed (SshException) to process: EdDSA provider not supported e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: exit-status=0 e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Executing instruction 2/2: Starting new EMS client process e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: /opt/baguette-client/bin/run.sh e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: EXEC: exit-status=0 e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task #0: Installation Instructions set succeeded: Restarting Baguette agent at VM node e.m.e.b.c.install.SshClientInstaller : ------------------------------------------------------------------------- Task #0 : Instruction sets processed: successful=1, failed=0, exit-result=SUCCESS e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Disconnected from remote host: task #0: host: 172.29.0.3:22 e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Task completed successfully #0 e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=true, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteSe
EMS server log: EMS client reconnected
o.a.s.s.session.ServerUserAuthService : Session user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696 authenticated e.m.e.b.server.ClientShellCommand : #00001--> Got session : ServerSessionImpl[user-bbb5b809-3296-485c-a605-cc8bae646bbb@/172.29.0.3:39696] e.m.e.b.server.ClientShellCommand : #00001==> Thread started e.m.e.b.server.ClientShellCommand : #00001--> Client Id: VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450 e.m.e.b.server.ClientShellCommand : #00001--> Broker URL: ssl://172.29.0.3:61617?daemon=true&trace=false&useInactivityMonitor=false&connectionTimeout=0&keepAlive=true e.m.e.b.server.ClientShellCommand : #00001--> Broker Username: user-local-Q1mnKfNgzM e.m.e.b.server.ClientShellCommand : #00001--> Broker Password: xityAHGDhIiVeAxJdfax e.m.e.b.server.ClientShellCommand : #00001--> Broker Cert.: -----BEGIN CERTIFICATE----- ......................... -----END CERTIFICATE----- e.m.e.b.server.ClientShellCommand : #00001--> Adding/Replacing client certificate in Truststore: alias=172.29.0.3 e.m.e.b.server.ClientShellCommand : #00001--> Added/Replaced client certificate in Truststore: alias=172.29.0.3, CN=C=GR, ST=Attika, L=Athens, O=Institute of Communication and Computer Systems (ICCS), OU=Information Management Unit (IMU), CN=172.29.0.3, certificate-na e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: register(): Method invoked. CSC: ClientShellCommand_#00001 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001... ......................... e.m.e.b.server.ClientShellCommand : sendGroupingConfiguration: Serialization of Grouping configuration for PER_INSTANCE: rO0ABXNyACt......................... e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-GROUPING-CONFIG rO0ABXNyACt......................... e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Sending grouping configurations to client #00001... done e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Setting active grouping of client #00001: PER_INSTANCE e.m.e.b.server.ClientShellCommand : #00001==> PUSH : SET-ACTIVE-GROUPING PER_INSTANCE e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.server.ClientShellCommand : #00001--> Client grouping changed: null --> PER_INSTANCE
-
Normal node where EMS client killed, for EMS client's logs indicating its restart.
Normal node: EMS client restarts
Starting baguette client... EMS_CONFIG_DIR=/opt/baguette-client/conf LOG_FILE=/opt/baguette-client/logs/output.txt ____ _ _ _____ _ _ _ | _ \ | | | | / ____| (_) | | | |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_ | _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __| | |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_ |____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__| __/ | |___/ Starting BaguetteClient v4.5.0-SNAPSHOT on 21845bcaf772 with PID 779 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client) No active profile set, falling back to default profiles: default loadCachedClientId: Used cached Client Id: null Password encoder class name is empty. Default instance of PasswordEncoder will be created ......................... Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 .........................
-
Other Normal nodes, for NO logs indicating failure or recovery attempts.
B.1.b) Failed recovery of EMS client in a Normal node
Test Case Quick Notes:
- Kill the VM of any Normal node.
- The EMS server will try to connect to the affected VM but fail.
- After a configured number of retries EMS server will give up.
After Application deployment...
- Terminate the VM of a Normal node
Next, check the logs of:
-
EMS server, for messages reporting an EMS client disconnection, failed recovery attempts and giving up recovery
EMS server log: An EMS client disconnected
e.m.e.b.server.ClientShellCommand : #00001==> Signaling client to exit e.m.e.b.server.ClientShellCommand : #00001--> Thread stops e.m.e.b.s.coordinator.NoopCoordinator : TwoLevelCoordinator: unregister(): Method invoked. CSC: ClientShellCommand_#00001 e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: -------------------------------------------------- e.m.e.b.s.c.TwoLevelCoordinator : TwoLevelCoordinator: Client unregistered: #00001 @ 172.29.0.3 e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: processExitEvent(): client-id=#00001, client-address=172.29.0.3
EMS server log: EMS client recovery actions and give up message
e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Starting client recovery: node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteServer=eu.melodi e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ......................... ......................... e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Retry 5/5 executing task #0 e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Error while connecting to remote host: task #0: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Failed executing task #0, Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) e.m.e.b.c.install.SshClientInstaller : SshClientInstaller: Giving up executing task #0 after 5 retries e.m.e.b.c.s.ClientRecoveryPlugin : ClientRecoveryPlugin: runClientRecovery(): Client recovery completed: result=false, node-info=NodeRegistryEntry(ipAddress=172.29.0.3, clientId=VM-UBUNTU-vm1-vm1-AWS-vm1-85499eeb-14bc-481d-9c42-eac879845450, baguetteS
-
Normal nodes that operate, for NO logs indicating any failure or recovery attempts
B.2.a) Successful recovery of a Netdata agent in a RL node
Test Case Quick Notes:
- Kill Netdata agent of any RL node.
- The EMS server will recover the killed Netdata agent after a configured period of time.
- Check EMS server log messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
After Application deployment...
-
Connect to a RL node and kill Netdata agent.
EMS server log: Failed metric collection attempts from a Netdata agent
......................... Not yet implemented
Next, check the logs of:
-
EMS server, for logs reporting connection failure to a Netdata agent, and recovery actions.
EMS server log: Netdata agent recovery actions
......................... Not yet implemented
-
RL node with killed Netdata, check if the Netdata processes have started again.
RL node shell: Recovered Netdata agent process
......................... Not yet implemented
-
Normal nodes (that operate), for NO Logs indicating failure or recovery attempts.
B.2.b) Failed recovery of a Netdata agent in a RL node
Test Case Quick Notes:
- Kill the VM of any RL node.
- The EMS server will try to connect to the affected VM but fail.
- After a configured number of retries EMS server will give up.
After Application deployment...
- Terminate the VM of a RL node
You need to check the logs of:
-
EMS server, for logs reporting connection failure to a Netdata agent, and then a number of failed attempts to connect to VM.
EMS server log: Failed metric collection attempts from a Netdata agent
......................... Not yet implemented
EMS server log: Failed Netdata agent recovery actions and give up message
......................... Not yet implemented
-
Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.
B.3) Successful recovery of a Netdata agent in a Normal node
Test Case Quick Notes:
- Kill Netdata agent of any Normal node.
- The EMS client of the node will recover the killed Netdata agent after a configured period of time.
- Check EMS client's logs for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
After Application deployment...
- Connect to a Normal node and kill Netdata agent.
Next, check the logs of:
-
EMS server, for No log messages indicating connection failures to Netdata, or recovery actions.
-
Normal node with killed Netdata, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collections, recovery actions, and successful metric collection.
Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3 Collectors::Netdata: Will pause metrics collection from node for 60 seconds: SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
Normal node - EMS client log: Local Netdata agent recovery actions
SelfHealingPlugin: Retry #0: Recovering node: id=null, address= ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: OUT> /opt/baguette-client ERR> -U: 1: -U: Syntax error: Unterminated quoted string ERR> 2022-02-16 10:23:29: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Normal node - EMS client log: Successful metrics collection from Local Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Resumed metrics collection from node: SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address= Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
-
Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.
Test Cases for 3-LEVEL topology
PREREQUISITE:
You need a CAMEL model for 3-LEVEL topology:
- with two Requirement Sets:
- for Normal nodes: 4 cores, 4GB RAM, >1 GB Disk, and
- for RL nodes: 1-2 cores, or <2GB RAM, or <1GB Disk,
- with 1-2 COMPONENTS with Requirement Set #1 (Normal nodes)
- with 1-2 COMPONENTS with Requirement Set #2 (RL nodes)
- with three (3) Groupings used in the Metric Model (
GLOBAL
,PER_ZONE
,PER_INSTANCE
).This CAMEL model is common to the following test cases, unless another CAMEL model is specified.
CAMEL model MUST be re-deployed after each test case execution.
B.4.a) Successful recovery of an EMS client in a clustered Normal node
Test Case Quick Notes:
- Kill EMS client of any Normal node except the Aggregator.
- The Aggregator will recover the killed EMS client after a configured period of time.
- Check Aggregator log messages for node leaving cluster, recovery actions, and node joining back.
After Application deployment...
- Connect to a Normal node, except Aggregator, and kill EMS client
Next, check the logs of:
-
EMS server, for Aggregator's query for node credentials.
EMS server log: Aggregator queries for node's credentials
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
-
Aggregator, for log messages about, (i) EMS client leaving cluster, (ii) recovery actions, and (iii) EMS client joining back to the cluster.
Aggregator log: An EMS client left cluster
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
Aggregator log: EMS client recovery actions
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SSH client is ready VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending baguette client kill command...... ############## Waiting for 2000ms after Sending baguette client kill command...... ############## Sending baguette client start command...... ############## Waiting for 10000ms after Sending baguette client start command...... SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4 New client config.: ClientConfiguration(nodesWithoutClient=[]) VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.4, port=22, username=ubuntu Stopping SSH client... SSH client stopped OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4 OUT> OUT> pwd OUT> ubuntu@3866738cb0f4:~$ pwd OUT> /home/ubuntu OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/kill.sh OUT> Baguette client is not running OUT> ubuntu@3866738cb0f4:~$ /opt/baguette-client/bin/run.sh OUT> Starting baguette client... OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf OUT> LOG_FILE=/opt/baguette-client/logs/output.txt OUT> Baguette client PID: 973 VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
Aggregator log: EMS client joined back to cluster
CLM: MEMBER_ADDED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
-
Normal node whose EMS client killed, for EMS client's logs indicating its restart.
Normal node: EMS client restarts
Starting baguette client... EMS_CONFIG_DIR=/opt/baguette-client/conf LOG_FILE=/opt/baguette-client/logs/output.txt ____ _ _ _____ _ _ _ | _ \ | | | | / ____| (_) | | | |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_ | _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __| | |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_ |____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__| __/ | |___/ Starting BaguetteClient v4.5.0-SNAPSHOT on 3866738cb0f4 with PID 973 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client) No active profile set, falling back to default profiles: default loadCachedClientId: Used cached Client Id: null Password encoder class name is empty. Default instance of PasswordEncoder will be created PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done BrokerConfig: Creating new Broker Service instance: url=ssl://0.0.0.0:61617 ......................... ......................... CLUSTER-JOIN IMU-ZONE GLOBAL:PER_ZONE:PER_INSTANCE start-election=true 192.168.16.4:2002 192.168.16.3:2001 CLUSTER-JOIN ARGS: cluster-id=IMU-ZONE, groupings=GLOBAL:PER_ZONE:PER_INSTANCE, local-node=192.168.16.4:2002, other-nodes=[192.168.16.3:2001] CLUSTER-JOIN ARGS: Groupings: global=GLOBAL, aggregator=PER_ZONE, node=PER_INSTANCE CLM: Local address used for building Atomix: 192.168.16.4:2002 CLM: Building Atomix: Other members: [Node{id=node_3866738cb0f4_2001, address=192.168.16.3:2001}] ......................... ......................... CLUSTER-EXEC broker list Cluster executes command: broker list CLI: Node status and scores: CLI: node_581d745be52c_2001 [AGGREGATOR, 0.6640625, 9e790362-704c-4d9e-aa74-77f76e297816] CLI: node_3866738cb0f4_2002 [CANDIDATE, 0.6640625, 44a5afb7-890a-4090-9f80-c65f046aeddd] Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
-
Other Normal nodes, for logs about, (i) EMS client leaving cluster, (ii) EMS client joining to cluster, but NO logs about recovery actions.
B.4.b) Failed recovery of an EMS client in a clustered Normal node
Test Case Quick Notes:
- Kill the VM of any Normal node, except Aggregator.
- The Aggregator will try to connect to the affected VM but fail.
- After a configured number of retries Aggregator will give up.
After Application deployment...
- Terminate the VM of a Normal node, except the Aggregator's
Next, check the logs of:
-
EMS server, for a recovery Give up message from Aggregator
EMS server log: Aggregator queries for node's credentials
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"cecab3d4-4c09-43b1-b6fa-3534d37bbc8f","zone-id":"IMU-ZONE","address":"192.168.16.4","provider":"AWS","name":"vm2","ssh.port":"22","ssh.username":"ubuntu","ssh.password":"ubuntu","id":"vm2","type":"VM","operatingSystem":"UBUNTU","CLIENT_ID":"VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f",.........................
EMS server log: Aggregator give up message
e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
Note: EMS client disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
-
Aggregator, for messages reporting, (i) an EMS client left cluster, (ii) a number of failed connection attempts to the VM, and (iii) a recovery give up message.
Aggregator log: An EMS client left cluster
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
Aggregator log: EMS client recovery actions and give up message
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ......................... ......................... SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748)
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
-
Normal nodes that operate, for logs about EMS client leaving cluster, and NO logs about recovery actions or EMS client joining back.
B.5.a) Successful recovery of EMS client of the cluster Aggregator
Test Case Quick Notes:
- Kill EMS client of the Aggregator.
- The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
- The new Aggregator will recover the killed EMS client after a configured period of time.
- Check new Aggregator log messages for node leaving cluster, being elected as Aggregator, recovery actions, and node joining back.
- Old Aggregator will join back as a Normal node.
After Application deployment...
- Connect to the Aggregator node, and kill EMS client.
Next, check the logs of:
-
EMS server, for message about Aggregator change.
EMS server log: A new Aggregator initialized
e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: CANDIDATE --> INITIALIZING e.m.e.b.server.ClientShellCommand : #00003--> Client grouping changed: PER_INSTANCE --> PER_ZONE e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00003 @ 192.168.16.4 (VM-UBUNTU-vm2-vm2-AWS-vm2-cecab3d4-4c09-43b1-b6fa-3534d37bbc8f) e.m.e.b.server.ClientShellCommand : #00003--> Client status changed: INITIALIZING --> AGGREGATOR
EMS server log: Aggregator queries for node's credentials
e.m.e.b.server.ClientShellCommand : #00003==> PUSH : {"random":"8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a","zone-id":"IMU-ZONE","address":"192.168.16.3","provider":"AWS",.........................
Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
-
New Aggregator, for log messages about, (i) EMS client leaving cluster, (ii) being elected as Aggregator, (iii) recovery actions, and (iv) EMS client joining to cluster.
New Aggregator log: Old Aggregator left cluster - New Aggregator election
CLM: MEMBER_REMOVED: node=node_581d745be52c_2001 BRU: Brokers after cluster change: [] BRU: Broker election requested: broadcasting election message... BRU: **** Broker message received: election BRU: **** BROKER: Starting Broker election: BRU: Member-Score: node_3866738cb0f4_2002 => 0.6640625 d4f2eb55-c355-4715-8a27-9f7c12c32924 BRU: Broker: node_3866738cb0f4_2002
New Aggregator log: Initializing to become the new Aggregator
BRU: Node will become Broker. Initializing... NOTIFY-STATUS-CHANGE: INITIALIZING initialize(): Node starts initializing as Aggregator... ......................... ......................... Notifying Baguette Server i am the new aggregator ......................... ......................... BRU: Node is ready to act as Aggregator. Ready BRU: **** Broker message received: ready node_3866738cb0f4_2002 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn......................... BRU: **** BROKER: New Broker is ready: node_3866738cb0f4_2002, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn......................... BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi40OjYxNjE3P2RhZW1vbj10cn.........................
New Aggregator log: Requesting old Aggregator node's credentials
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.3 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
New Aggregator log: Recovery actions of old Aggregator
SelfHealingPlugin: Retry #0: Recovering node: id=node_581d745be52c_2001, address=192.168.16.3 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.3, port=22, username=ubuntu Connecting to server... SSH client is ready VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending baguette client kill command...... ############## Waiting for 2000ms after Sending baguette client kill command...... ############## Sending baguette client start command...... ############## Waiting for 10000ms after Sending baguette client start command...... SET-CLIENT-CONFIG rO0ABXNyAClldS5tZWxvZGljLmV2ZW50LnV0aWwuQ2xpZW50Q29uZmlndXJhdGlvbiAe4raCjfZzAgABTAASbm9kZXNXaXRob3V0Q2xpZW50dAAPTGphdmEvdXRpbC9TZXQ7eHBzcgARamF2YS51dGlsLkhhc2hTZXS6RIWVlri3NAMAAHhwdwwAAAAQP0AAAAAAAAB4 New client config.: ClientConfiguration(nodesWithoutClient=[]) VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.16.3, port=22, username=ubuntu Stopping SSH client... SSH client stopped OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4 OUT> OUT> pwd OUT> ubuntu@581d745be52c:~$ pwd OUT> /home/ubuntu OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/kill.sh OUT> Baguette client is not running OUT> ubuntu@581d745be52c:~$ /opt/baguette-client/bin/run.sh OUT> Starting baguette client... OUT> EMS_CONFIG_DIR=/opt/baguette-client/conf OUT> LOG_FILE=/opt/baguette-client/logs/output.txt OUT> Baguette client PID: 1242 VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
New Aggregator log: Old Aggregator joins back to cluster as plain node
CLM: MEMBER_ADDED: node=node_581d745be52c_2001 BRU: Brokers after cluster change: [Member{id=node_581d745be52c_2001, address=192.168.16.3:2001, properties={aggregator-connection-configuration=eyJncm91cGluZyI6I......................... SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_581d745be52c_2001, address=192.168.16.3
-
Old Aggregator node whose EMS client killed, for EMS client's logs indicating its restart (as a
PER_INSTANCE
node).Normal node: Old Aggregator restarts as a plain Normal node
Starting baguette client... EMS_CONFIG_DIR=/opt/baguette-client/conf LOG_FILE=/opt/baguette-client/logs/output.txt ____ _ _ _____ _ _ _ | _ \ | | | | / ____| (_) | | | |_) | __ _ __ _ _ _ ___| |_| |_ ___ | | | |_ ___ _ __ | |_ | _ < / _` |/ _` | | | |/ _ \ __| __/ _ \ | | | | |/ _ \ '_ \| __| | |_) | (_| | (_| | |_| | __/ |_| || __/ | |____| | | __/ | | | |_ |____/ \__,_|\__, |\__,_|\___|\__|\__\___| \_____|_|_|\___|_| |_|\__| __/ | |___/ Starting BaguetteClient v4.5.0-SNAPSHOT on 581d745be52c with PID 1242 (/opt/baguette-client/jars/baguette-client-4.5.0-SNAPSHOT.jar started by ubuntu in /opt/baguette-client) No active profile set, falling back to default profiles: default loadCachedClientId: Used cached Client Id: null Password encoder class name is empty. Default instance of PasswordEncoder will be created PasswordUtil.setPasswordEncoder(): PasswordEncoder set to: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder PasswordUtil: Initialized default Password Encoder: password.gr.iccs.imu.ems.util.AsterisksPasswordEncoder BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... KeystoreUtil.initializeKeystoresAndCertificate(): Initializing keystores and certificate BrokerConfig.initializeKeyAndCert(): Initializing keystore, truststore and certificate for Broker-SSL... done ......................... ......................... CLM: Joining cluster... NOTIFY-STATUS-CHANGE: CANDIDATE ......................... ......................... Joined to cluster ......................... ......................... CLUSTER-EXEC broker list Cluster executes command: broker list CLI: Node status and scores: CLI: node_3866738cb0f4_2002 [AGGREGATOR, 0.6640625, d4f2eb55-c355-4715-8a27-9f7c12c32924] CLI: node_581d745be52c_2001 [CANDIDATE, 0.6640625, e974ebcd-e11e-4baa-b3cb-fa34242705ff]
-
Other Normal nodes, for log messages about, (i) EMS client leaving cluster, (ii) Aggregator election, (iii) EMS client joining to cluster, but NO logs about recovery actions.
B.5.b) Failed recovery of EMS client of the cluster Aggregator
Test Case Quick Notes:
- Kill the VM of the Aggregator.
- The cluster nodes will elect a new Aggregator. Check logs of any cluster node.
- The new Aggregator will try to connect to the affected VM but fail.
- After a configured number of retries new Aggregator will give up.
After Application deployment...
- Terminate the VM of the Aggregator
Next, check the logs of:
-
EMS server, for one message about Aggregator change, and one about new Aggregator giving up recovery.
EMS server log: A new Aggregator initialized
e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: CANDIDATE --> INITIALIZING e.m.e.b.server.ClientShellCommand : #00004--> Client grouping changed: PER_INSTANCE --> PER_ZONE e.m.e.b.s.c.c.ClusteringCoordinator : Updated aggregator of zone: IMU-ZONE -- New aggregator: #00004 @ 192.168.16.3 (VM-UBUNTU-vm1-vm1-AWS-vm1-8a20f11c-eaf2-4b6e-b827-d8a25a57cb0a) e.m.e.b.server.ClientShellCommand : #00004--> Client status changed: INITIALIZING --> AGGREGATOR
EMS server log: New Aggregator queries for node's credentials
e.m.e.b.server.ClientShellCommand : #00004==> PUSH : {"random":"4abf9ae2-b7fc-4e8c-b6d9-464623d1b05f","zone-id":"IMU-ZONE","address":"192.168.16.4",.........................
EMS server log: New Aggregator give up message
e.m.e.b.server.ClientShellCommand : #00004--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4 e.m.e.b.server.ClientShellCommand : #00004--> Client Recovery Notification: GIVE_UP: node_3866738cb0f4_2002 @ 192.168.16.4
Note: Aggregator disconnection from EMS server will also be logged in EMS server logs, but no recovery action will be taken by EMS server.
-
New Aggregator, for messages reporting, (i) an EMS client left cluster, (ii) being elected as Aggregator, (iii) a number of failed connection attempts to the VM, and (iv) a recovery give up message.
New Aggregator log: Old Aggregator left cluster - New Aggregator election
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [] BRU: Broker election requested: broadcasting election message... BRU: **** Broker message received: election BRU: **** BROKER: Starting Broker election: BRU: Member-Score: node_581d745be52c_2001 => 0.6640625 e974ebcd-e11e-4baa-b3cb-fa34242705ff BRU: Broker: node_581d745be52c_2001
New Aggregator log: Initializing to become the new Aggregator
CLM: MEMBER_REMOVED: node=node_3866738cb0f4_2002 BRU: Brokers after cluster change: [] BRU: Broker election requested: broadcasting election message... BRU: **** Broker message received: election BRU: **** BROKER: Starting Broker election: BRU: Member-Score: node_581d745be52c_2001 => 0.6640625 e974ebcd-e11e-4baa-b3cb-fa34242705ff BRU: Broker: node_581d745be52c_2001 BRU: Node will become Broker. Initializing... 2022-02-16 12:01:34.448 [INFO ] NOTIFY-STATUS-CHANGE: INITIALIZING initialize(): Node starts initializing as Aggregator... ......................... ......................... Notifying Baguette Server i am the new aggregator ......................... ......................... BRU: Node is ready to act as Aggregator. Ready BRU: **** Broker message received: ready node_581d745be52c_2001 New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn......................... BRU: **** BROKER: New Broker is ready: node_581d745be52c_2001, New config: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn......................... BRU: Node configuration updated: eyJncm91cGluZyI6IlBFUl9aT05FIiwidXJsIjoic3NsOi8vMTkyLjE2OC4xNi4zOjYxNjE3P2RhZW1vbj10cn.........................
New Aggregator log: Requesting old Aggregator node's credentials
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.16.4 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4
New Aggregator log: Failing recovery actions of old Aggregator
SelfHealingPlugin: Retry #0: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) ......................... ......................... SelfHealingPlugin: Retry #3: Recovering node: id=node_3866738cb0f4_2002, address=192.168.16.4 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.16.4, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.16.4 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748)
New Aggregator log: Recovery actions Give Up message
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=node_3866738cb0f4_2002, address=192.168.16.4 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=node_3866738cb0f4_2002, address=192.168.16.4 NOTIFY-X: RECOVERY GIVE_UP node_3866738cb0f4_2002 @ 192.168.16.4
-
Normal nodes that operate, for log messages about, (i) EMS client leaving cluster, (ii) Aggregator election, but NO logs about recovery actions, or EMS client joining back to cluster.
B.6.a) Successful recovery of Netdata agent in a clustered RL node
Test Case Quick Notes:
- Kill Netdata agent of any RL node.
- The Aggregator will recover the killed Netdata agent after a configured period of time.
- Check Aggregator log messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
After Application deployment...
- Connect to a RL node and kill Netdata agent.
Next, check the logs of:
- EMS server, for NO logs indicating a Netdata failure and recovery.
EMS server log: Aggregator queries for RL node's credentials
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
- Aggregator, for logs reporting, (i) connection failures to a Netdata agent, (ii) recovery actions, and (iii) successful connection to Netdata agent and collection of metrics.
Aggregator log: Failed metric collection attempts from a RL node's Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3 Collectors::Netdata: Pausing collection from Node: 192.168.96.2
Aggregator log: Requesting RL node's credentials
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
Aggregator log: Netdata agent recovery actions
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SSH client is ready VmNodeRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... VmNodeRecoveryTask: runNodeRecovery(): Executed 3 recovery commands VmNodeRecoveryTask: disconnectFromNode(): Disconnecting from node: address=192.168.96.2, port=22, username=ubuntu Stopping SSH client... SSH client stopped Collectors::Netdata: Resuming collection from Node: 192.168.96.2 Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2 OUT> Last login: Sat Feb 12 10:40:09 2022 from 172.29.0.4 OUT> OUT> pwd OUT> ubuntu@ec17d3e87fb4:~$ pwd OUT> /home/ubuntu OUT> ubuntu@ec17d3e87fb4:~$ OUT> < -U netdata -o "pid" --no-headers | xargs kill -9' OUT> OUT> Usage: OUT> kill [options] <pid> [...] OUT> OUT> Options: OUT> <pid> [...] send signal to every <pid> listed OUT> -<signal>, -s, --signal <signal> OUT> specify the <signal> to be sent OUT> -l, --list=[<signal>] list all signal names, or convert one to a name OUT> -L, --table list all signal names in a nice table OUT> OUT> -h, --help display this help and exit OUT> -V, --version output version information and exit OUT> OUT> For more details see kill(1). OUT> ubuntu@ec17d3e87fb4:~$ sudo netdata OUT> 2022-02-16 12:27:55: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults. VmNodeRecoveryTask: redirectSshOutput(): Connection closed: id=OUT
Aggregator log: Successful metrics collection from RL node's Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
- RL node with killed Netdata, check if the Netdata processes have started again.
RL node shell: Recovered Netdata agent process
# ps -ef |grep netdata root 610 29 0 12:27 pts/0 00:00:00 grep --color=auto netd ......................... ......................... # ps -ef |grep netdata netdata 623 1 5 12:27 ? 00:00:51 netdata netdata 625 623 0 12:27 ? 00:00:02 /usr/sbin/netdata --special-spawn-server root 894 623 0 12:28 ? 00:00:05 /usr/libexec/netdata/plugins.d/apps.plugin 1 netdata 1050 623 0 12:28 ? 00:00:04 /usr/libexec/netdata/plugins.d/go.d.plugin 1 root 1105 29 0 12:45 pts/0 00:00:00 grep --color=auto netd
- Normal nodes (that operate), for NO logs indicating connection failures or recovery action.
B.6.b) Failed recovery of Netdata agent in a clustered RL node
Test Case Quick Notes:
- Kill the VM of any RL node.
- The EMS server will try to connect to the affected VM but fail.
- After a configured number of retries EMS server will give up.
After Application deployment...
- Terminate the VM of a RL node
You need to check the logs of:
- EMS server, for NO logs indicating a Netdata failure and recovery, BUT reporting a recovery give up from Aggregator.
EMS server log: Aggregator queries for RL node's credentials
e.m.e.b.server.ClientShellCommand : #00000==> PUSH : {"random":"4b676a58-e00e-4ddf-a21e-b1c0d1382cd6","zone-id":"IMU-ZONE","address":"192.168.96.2","provider":"AWS",.........................
EMS server log: Aggregator give up message
e.m.e.b.server.ClientShellCommand : #00000--> Client notification: CMD=RECOVERY, ARGS=GIVE_UP null @ 192.168.96.2 e.m.e.b.server.ClientShellCommand : #00000--> Client Recovery Notification: GIVE_UP: null @ 192.168.96.2 e.m.e.baguette.server.BaguetteServer : BaguetteServer.onMessage: Marked Node as Failed: 192.168.96.2
- Aggregator, for logs reporting (i) connection failures to a Netdata agent, (ii) a number of failed attempts to connect to VM, and (iii) a recovery give up message.
Aggregator log: Failed metric collection attempts from a RL node's Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0 Collectors::Netdata: Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Collectors::Netdata: Collecting data from url: http://192.168.96.2:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: 192.168.96.2, #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://192.168.96.2:19999/api/v1/allmetrics": connect timed out; nested exception is java.net.SocketTimeoutException: connect timed out -> java.net.SocketTimeoutException: connect timed out Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: 192.168.96.2, num-of-errors=3 Collectors::Netdata: Pausing collection from Node: 192.168.96.2
Aggregator log: Requesting RL node's credentials
SEND: SERVER-GET-NODE-SSH-CREDENTIALS 192.168.96.2 SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=192.168.96.2
Aggregator log: Netdata agent (failing) recovery actions
SelfHealingPlugin: Retry #0: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748) Collecting metrics from local node... Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Metrics: extracted=0, published=0, failed=0 Collecting metrics from remote nodes (without EMS client): [192.168.96.2] Node is in ignore list: 192.168.96.2 ......................... ......................... SelfHealingPlugin: Retry #3: Recovering node: id=null, address=192.168.96.2 VmNodeRecoveryTask: connectToNode(): Connecting to node using SSH: address=192.168.96.2, port=22, username=ubuntu Connecting to server... SelfHealingPlugin: EXCEPTION while recovering node: node-address=192.168.96.2 -- Exception: java.net.NoRouteToHostException: No route to host at sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(UnixAsynchronousSocketChannelImpl.java:252) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:198) at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213) at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293) at java.lang.Thread.run(Thread.java:748)
Aggregator log: Netdata agent recovery Give Up message
SelfHealingPlugin: Max retries reached. No more recovery retries for node: id=null, address=192.168.96.2 SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address=192.168.96.2 Collectors::Netdata: Giving up collection from Node: 192.168.96.2 NOTIFY-X: RECOVERY GIVE_UP null @ 192.168.96.2
- Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.
B.7) Successful recovery of local Netdata agent, in a clustered Normal node (including Aggregator)
Test Case Quick Notes:
- Kill Netdata agent of any Normal node.
- The EMS client of the affected node will recover the killed Netdata agent after a configured period of time.
- Check EMS client's log for messages reporting failures to collect metrics, recovery actions, and successful metrics collection.
After Application deployment...
- Connect to a Normal node and kill Netdata agent.
Next, check the logs of:
- EMS server, for No log messages indicating connection failures to a Netdata agent or recovery actions.
- Aggregator, for No log messages indicating connection failures to a Netdata agent or recovery actions.
- Normal node with killed Netdata, check if the Netdata processes have started again. Also check EMS client's log messages reporting failed metric collection attempts, recovery actions, and successful metric collection.
Normal node - EMS client log: Failed attempts to collect metrics from Local Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=1, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=2, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Exception while collecting metrics from node: , #errors=3, exception: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://127.0.0.1:19999/api/v1/allmetrics": Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused) -> java.net.ConnectException: Connection refused (Connection refused) Collectors::Netdata: Too many consecutive errors occurred while attempting to collect metrics from node: , num-of-errors=3 Collectors::Netdata: Will pause metrics collection from node for 60 seconds: SelfHealingPlugin: createRecoveryTask(): Created recovery task for Node: id=null, address=
Normal node - EMS client log: Local Netdata agent recovery actions
SelfHealingPlugin: Retry #0: Recovering node: id=null, address= ShellRecoveryTask: runNodeRecovery(): Executing 3 recovery commands ############## Initial wait...... ############## Waiting for 5000ms after Initial wait...... ############## Sending Netdata agent kill command...... ############## Waiting for 2000ms after Sending Netdata agent kill command...... ############## Sending Netdata agent start command...... ############## Waiting for 10000ms after Sending Netdata agent start command...... ShellRecoveryTask: runNodeRecovery(): Executed 3 recovery commands Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: OUT> /opt/baguette-client ERR> -U: 1: -U: Syntax error: Unterminated quoted string ERR> 2022-02-16 13:21:52: netdata INFO : MAIN : CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Normal node - EMS client log: Successful metrics collection from Local Netdata agent
Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Node is in ignore list: Collectors::Netdata: Resumed metrics collection from node: SelfHealingPlugin: cancelRecoveryTask(): Cancelled recovery task for Node: id=null, address= Collectors::Netdata: Collecting metrics from local node... Collectors::Netdata: Collecting data from url: http://127.0.0.1:19999/api/v1/allmetrics?format=json Collectors::Netdata: Metrics: extracted=0, published=0, failed=0
- Other Normal nodes (that operate), for NO logs indicating connection failures or recovery actions.
Limitations
- Clustering is never used for 2-level monitoring topologies.
- When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will collect metrics from the (orphan) RL nodes.
- When no Normal nodes (and hence no Aggregator) exist in a cluster, no one will recover the (orphan) RL nodes.
- If EMS server fails no one will recover it.
- Metric messages are not cached/redirected, if the next node has failed.