287 Commits

Author SHA1 Message Date
Craig Bryant
44851750e6 Changes to match the new style guidelines 2014-07-07 16:19:53 -06:00
Craig Bryant
797c60f567 Removed dependency because mon-kafka is now on same version of kafka 2014-06-25 15:12:44 -06:00
Craig Bryant
e77566f4f3 Make it configurable how long the KafkaSpout sleeps if there is no message ready.
This required changes in the config classes.

Also, replace the fixed sleep with a wait that is notified if a message arrives.
2014-06-25 12:02:36 -06:00
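The notifiable wait described in this commit can be sketched in Java, the project's language. This is a minimal illustration under stated assumptions, not the actual KafkaSpout code: the class `NotifiableBuffer`, its `offer`/`poll` methods and the `maxWaitMillis` parameter are hypothetical. A reader thread hands messages over with `offer()`, and `nextTuple()` would call `poll()`, which waits at most the configured time but returns as soon as `offer()` calls `notifyAll()`.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical hand-off buffer between a background Kafka reader thread and a
 * spout's nextTuple(). Names are illustrative only.
 */
final class NotifiableBuffer<T> {
    private final Queue<T> messages = new ArrayDeque<>();
    // Assumed configurable upper bound on how long poll() waits when nothing is ready.
    private final long maxWaitMillis;

    NotifiableBuffer(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Called by the reader thread: enqueue a message and wake any thread waiting in poll(). */
    synchronized void offer(T message) {
        messages.add(message);
        notifyAll();
    }

    /**
     * Called from nextTuple(): returns the next message, or null if none arrives
     * within maxWaitMillis. Returns early when offer() notifies.
     */
    synchronized T poll() {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        final long deadline = System.nanoTime() + remainingNanos;
        try {
            // The loop guards against spurious wakeups; a wait of 0 skips waiting entirely.
            while (messages.isEmpty() && remainingNanos > 0) {
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
                remainingNanos = deadline - System.nanoTime();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
        }
        return messages.poll();
    }
}
```

Bounding the wait keeps the calling thread from stalling indefinitely while still reacting as soon as a message arrives; setting the wait to 0 turns `poll()` into a plain non-blocking check.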
Tim Kuhlman
b59b2f3508 Merge pull request #1 from hpcloud-mon/feature/require_storm
Move to requiring storm to run
2014-06-25 11:47:33 -06:00
Tim Kuhlman
1aaac2e643 Removed maven-dependency-plugin 2014-06-23 09:42:31 -06:00
Craig Bryant
a2c3d837f2 Add removal of debs on clean 2014-06-18 13:17:54 -06:00
Tim Kuhlman
ab11ba5859 shell script fixes 2014-06-13 12:03:48 -06:00
Tim Kuhlman
d7c47931fb Make the init script executable 2014-06-13 11:46:03 -06:00
Tim Kuhlman
03a6e0095c Moved to an init script assuming storm is setup, dropped storm jar from deb 2014-06-13 10:46:16 -06:00
Craig Bryant
89eaeec0cc JAH-179 Change KafkaSpout to follow Storm design recommendations by not blocking
Added a reader thread so that the thread calling nextTuple does not block. Blocking caused other problems: threads are reused between bolts and spouts, so the MetricAggregationBolts weren't always getting their tuples and evaluation wasn't happening.

This is coded to work with the shutdown sequence used when a bolt is moved, but that sequence doesn't seem to run on a plain shutdown.

Also fix a problem where the windows were not slid while the metrics were lagging. If metrics are lagging for several minutes, the windows could get so far out of date that valid metrics are discarded.
2014-06-10 16:25:09 -06:00
Craig Bryant
97d024daa1 JAH-1891 Threshold Engine will not update State properly if user has changed state via the API
Use the new oldState and unchangedExpressions fields in the AlarmCreatedEvent to force the resend of unchanged sub alarms so the alarm will be re-evaluated.
2014-06-01 08:34:15 -06:00
Craig Bryant
4d6311ab68 Changes so it will build without access to any HP resources
Update README.md with dependency on mon-common
2014-05-30 10:24:03 -06:00
Deklan Dieterly
6d360092a0 Update README.md 2014-05-09 13:11:37 -06:00
Jonathan Halterman
9017a73f2b Some README formatting
Change-Id: I1b5421474a14cfdf443d5af8a8a85ffa273f9d4e
2014-05-09 11:12:33 -07:00
Craig Bryant
d3719437f3 Update README.md 2014-05-08 14:50:22 -06:00
Craig Bryant
872e81b4ce Update README.md 2014-05-08 14:49:12 -06:00
hochmuth
aa94ee643c Update README.md 2014-05-08 14:31:08 -06:00
Craig Bryant
f11b66f1ba Make the Kafka groupIds match those used by the real thresh so it doesn't get lots of old data when debugging 2014-05-08 08:28:08 -06:00
Craig Bryant
57a2012792 The timestamp on MetricEnvelope was changed to be in seconds rather than milliseconds, to be more consistent with the timestamps used by other components. Changed the code and tests to adapt to this change.
Update to build 51 of mon-common, which has this change
2014-05-06 16:36:38 -06:00
Craig Bryant
74bde59399 Added test for problem with dimensions on MetricDefinition. Updated to mon-common build 50 2014-05-06 10:38:59 -06:00
Craig Bryant
bad27cf318 Improved log message about metrics too old to be used 2014-05-05 22:36:26 -06:00
Craig Bryant
506a765889 Rev to version 49 of mon-common to fix issue where Alarm Sub Expressions referencing a MetricDefinition with no dimensions didn't work 2014-05-05 21:17:48 -06:00
Craig Bryant
6776df2319 Handle failure to retrieve the alarm by Id by returning null instead of throwing a NullPointerException 2014-05-05 21:15:18 -06:00
Craig Bryant
7c6394b1f8 Use the new timestamp in MetricEnvelope, which is the time the API creates the MetricEnvelope. Because the API immediately hands the MetricEnvelope to Kafka, the Threshold Engine can use it to determine its progress in emptying the queue.
Change the code that checks for "lagging" metrics to use this timestamp instead of the Metric timestamp, since the lagging code deals with emptying the Kafka queue and this timestamp is a much better measure of how backed up the Kafka queue is. The metric timestamp is set by the agent, so it is much more likely to be off.

Switch to mon-common build 48, which has the new timestamp. The MetricFilteringBolt will still work correctly if the API is using an older version of MetricEnvelope without the timestamp; the lagging code just won't be invoked.

Change the tests to work with the new timestamp.

Had to drop back to an older version of Scala; otherwise the Threshold Engine would not start, failing with java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
2014-05-05 17:53:37 -06:00
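The lag check described in this commit (and in the JAH-11 commit further down) can be sketched as follows. This is a hypothetical illustration, not the MetricFilteringBolt code: the class `LagDetector`, the `maxLagSeconds` threshold and the start/stop signal protocol are all assumptions. It compares the envelope creation time (in seconds, as this log describes) with the current time and skips the check entirely when an older API sends no timestamp.

```java
/**
 * Hypothetical lag detector; the class name, threshold and two-signal protocol
 * are assumptions for illustration only.
 */
final class LagDetector {
    enum Signal { NONE, START_LAGGING, STOP_LAGGING }

    private final long maxLagSeconds;
    private boolean lagging = false;

    LagDetector(long maxLagSeconds) {
        this.maxLagSeconds = maxLagSeconds;
    }

    /**
     * @param envelopeTimestampSeconds time the API created the MetricEnvelope, in seconds,
     *        or null if an older API sent an envelope without it
     * @param nowSeconds current time in seconds
     */
    Signal observe(Long envelopeTimestampSeconds, long nowSeconds) {
        if (envelopeTimestampSeconds == null) {
            return Signal.NONE; // no timestamp: lag handling is simply not invoked
        }
        long lagSeconds = nowSeconds - envelopeTimestampSeconds;
        if (!lagging && lagSeconds > maxLagSeconds) {
            lagging = true;
            return Signal.START_LAGGING; // e.g. tell aggregation to hold off evaluating
        }
        if (lagging && lagSeconds <= maxLagSeconds) {
            lagging = false;
            return Signal.STOP_LAGGING; // caught up: evaluation could resume
        }
        return Signal.NONE;
    }
}
```

Emitting a signal only on transitions, rather than for every metric, avoids flooding downstream bolts while the queue drains.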
Craig Bryant
b570c6c9dd Fix problem where MetricFilteringBolt was outputting the unneeded input tuple. 2014-05-02 13:23:20 -06:00
Craig Bryant
4ff4bfb978 Add in the LICENSE file
Add in the HP copyright notice on all java files

Remove the @author tags
2014-05-01 16:05:25 -06:00
Craig Bryant
10c5bebe48 Update to the newest version of mon-common. Trying to keep current. 2014-05-01 14:22:43 -06:00
Craig Bryant
5835e250cc Change to set lastMinLagMessageSent to the time the first lagging metric is received. The problem is that Storm can take a long time to send the first metric after prepare() is called, and we want to give the Bolt time to clear the metrics before sending a lagging message. 2014-05-01 14:13:58 -06:00
Craig Bryant
e216fdc574 Merge branch 'master' of git.hpcloud.net:mon/mon-thresh 2014-05-01 13:58:17 -06:00
Craig Bryant
9050956a3b Each KafkaSpout instance needs a different consumer.id, so use the Storm taskId. Otherwise, ZooKeeper complains about a conflicted ephemeral node when more than one spout reads from a topic
Have two metric and event spouts to make sure the messages stop coming out. More like the real configuration.
2014-05-01 13:44:43 -06:00
hochmuth
0cae888aa2 Update README.md 2014-05-01 10:11:02 -06:00
Roland Hochmuth
f2ee9f878f Added architecture diagram. 2014-05-01 10:09:20 -06:00
hochmuth
c6799e940f Create README.md 2014-05-01 09:57:23 -06:00
Craig Bryant
a3429536d2 JIRA JAH-17 Modify the Maven build for mon-thresh to create a jar that can be used for a production build.
Needed to strip out the storm jar from the consolidated jar.

Needed to remove storm.yaml because Storm complains. Moved the registration of the Serializer to ThresholdingEngine.

Added the storm-core jar to the deb so it can be used for local mode in mini-mon

Added logback.xml to the deb since storm-core.jar also has a logback.xml and that confuses logback. Ensure our logback.xml is used.

Added the storm-core.jar and logback.xml to the start of thresh in the deb

Had to rework how the AlarmEventForwarder was injected into the AlarmThresholdingBolt. The old way didn't work in a Storm cluster because the TopologyModule wasn't loaded on the worker when prepare was called.
2014-04-30 22:28:12 -06:00
Craig Bryant
63290a82ab JIRA JAH-10 On start-up threshold engine should wait for metrics for all periods prior to transitioning to an alarmed state
Change so that all view slots must have metrics before the SubAlarm transitions to ALARM. Also, change it so SubAlarmStats doesn't transition it to UNDETERMINED on startup until there have been emptyWindowObservationThreshold calls to evaluate(). Previously only one call was required for new sub alarms and for sub alarms on restart. The goal is for the Threshold Engine to behave the same on restart as it does after it has been running for a long time.
2014-04-28 16:38:35 -06:00
Craig Bryant
8cfaaca7a6 JIRA JAH-11 Keep Alarms from switching to UNDETERMINED if the Threshold Engine has been down long enough that Kafka has buffered enough Metrics to take the Threshold Engine minutes to catch up.
Added code to MetricFilteringBolt to check the lag between the current time and the timestamp in the metric. If the lag is too large, a message is sent to the MetricAggregationBolt to hold off on evaluating alarms.

Added unit tests.
2014-04-23 17:12:50 -06:00
Craig Bryant
6deb113700 Check that everything coming in on the EventSpout is Serializable before sending it out so the Spout doesn't die. 2014-04-18 16:02:58 -06:00
Craig Bryant
0aad1652d1 Wasn't setting SubAlarm noState to true when a SubAlarm was received. The code was checking the value, but it was never changed. 2014-04-17 16:35:40 -06:00
Craig Bryant
242eb47723 Reduce the number of EventSpouts running to just 1. This should get rid of the "conflicted ephemeral node" message from ZooKeeper 2014-04-17 10:57:48 -06:00
Craig Bryant
8eeda856fd Set actionsEnabled for AlarmStateTransitionedEvent. Bumped up to mon-common 35 to get new field and fix issue where Metric wasn't Serializable 2014-04-17 10:55:54 -06:00
Craig Bryant
541330ab46 Change MetricDefinitionAndTenantIdMatcher.match to not take the matches array as a parameter as that is awkward and contrary to standard usage. Simplified MetricDefinitionAndTenantIdMatcherTest tests. 2014-04-17 09:27:42 -06:00
Craig Bryant
e02b6f8483 Changes to handle AlarmUpdatedEvent properly. Now reuses measurements if possible (when only the operator or threshold has changed). Added a test that is not run by the normal unit tests.
Move to mon-common version 30, which adds alarmDescription to AlarmUpdatedEvent and the new name alarmActionsEnabled
2014-04-17 08:51:33 -06:00
Craig Bryant
feb5433432 Change Alarms.enabled to Alarms.actionsEnabled and remove handling that was appropriate for the alarm being disabled rather than just notifications. Will need some more work to handle updated alarms but will do that separately. 2014-04-15 16:10:51 -06:00
Craig Bryant
6b26b3191c Add some more tests to MetricAggregationBoltTest to make sure the UNDETERMINED state is set properly.
Minor code formatting fixes.
2014-04-15 15:13:09 -06:00
Craig Bryant
b8872f0632 Add handling of the enabled flags for Alarms. This commit just handles Alarms that are disabled on startup. An additional field needs to be added to AlarmUpdatedEvent to handle events being disabled or enabled while it is running.
Get MetricDefinitionDAOImplTest to run, although it doesn't run automatically because it depends on the mini-mon MySQL database
2014-04-15 12:28:04 -06:00
Craig Bryant
3b1deee576 Changed to use version 30 of mon-common to get the new MetricEnvelopes that don't munge the JSON before parsing it. This allows handling of JSON such as:
{"metric":{"name":"mon_http_status","dimensions":{"detail":"\"{\\\"deadlocks\\\"","hostname":"persister","url":"http"},"timestamp":1397493558,"value":0.0},"meta":{"tenantId":"82510970543135"}}
2014-04-14 13:48:17 -06:00
Craig Bryant
ec0148e242 Improved the efficiency of MetricDefinitionAndTenantIdMatcher when matching SubAlarms that define a subset of the Dimensions. Added another Map instead of searching a List of MetricDefinitionAndTenantIds. Now it just needs to calculate the possible sets of Dimensions that could be matched. 2014-04-14 08:00:43 -06:00
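The map-based approach in this commit can be sketched in Java (16+, for records). This is a hypothetical illustration, not the MetricDefinitionAndTenantIdMatcher code: the class `DimensionSubsetMatcher`, its `Key` record and the use of plain string ids are assumptions. Sub alarm definitions are stored under a key of tenant, metric name and their (possibly partial) dimensions; for an incoming metric, every subset of its dimensions is generated and looked up directly instead of scanning a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical subset-of-dimensions matcher; names are illustrative only. */
final class DimensionSubsetMatcher {
    // Lookup key: tenant, metric name and the dimensions a sub alarm's definition specifies.
    private record Key(String tenantId, String name, Map<String, String> dimensions) {}

    private final Map<Key, List<String>> subAlarmIdsByKey = new HashMap<>();

    /** Register a sub alarm whose metric definition names only some dimensions. */
    void add(String tenantId, String name, Map<String, String> dimensions, String subAlarmId) {
        subAlarmIdsByKey
            .computeIfAbsent(new Key(tenantId, name, Map.copyOf(dimensions)), k -> new ArrayList<>())
            .add(subAlarmId);
    }

    /** Return every sub alarm whose dimensions are a subset of the incoming metric's. */
    Set<String> match(String tenantId, String name, Map<String, String> metricDimensions) {
        List<Map.Entry<String, String>> entries = new ArrayList<>(metricDimensions.entrySet());
        int n = entries.size();
        Set<String> matches = new TreeSet<>();
        // Enumerate all 2^n subsets of the metric's dimensions; each is one hash lookup.
        for (int mask = 0; mask < (1 << n); mask++) {
            Map<String, String> subset = new HashMap<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    subset.put(entries.get(i).getKey(), entries.get(i).getValue());
                }
            }
            List<String> ids = subAlarmIdsByKey.get(new Key(tenantId, name, subset));
            if (ids != null) {
                matches.addAll(ids);
            }
        }
        return matches;
    }
}
```

The lookup cost grows as 2^n in the number of dimensions on the incoming metric, so this trade-off assumes metrics carry only a handful of dimensions; in exchange, it no longer depends on how many definitions are registered.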
Craig Bryant
87e660eee2 Remove the dropwizard version property now that dropwizard is gone 2014-04-09 16:14:57 -06:00
Craig Bryant
76a2138fbf Changes to allow MetricDefinitions in SubAlarms to match against MetricDefinitions that have at least the same set of Dimensions.
This is to support the use cases of aggregation of Metrics across systems and to handle Metrics where an extra Dimension is added but we still want the old MetricDefinition to be matched.

Added new MetricDefinitionAndTenantIdMatcher class to do the matching. Added its associated test.

Added tests for these cases.
2014-04-09 16:14:06 -06:00
Craig Bryant
cbca836aa5 Remove dropwizard jars from pom.xml. Wasn't using them anymore and it saved a lot of dependencies. 2014-04-09 10:00:59 -06:00