AWS CloudWatch is a web service that lets you collect, view, and analyze metrics about your AWS resources and applications. CloudWatch alarms send notifications or trigger autoscale actions based on rules you define. For example, you can get an email from a CloudWatch alarm if the average latency of your web application stays over 2 ms for 3 consecutive 5-minute periods. The variables that CloudWatch lets you set are the metric (average), the threshold (2 ms), and the duration (3 periods).
I became interested in replicating this capability after a recent discussion and proposal about auto scaling on the CloudStack mailing list. It was clearly some kind of Complex Event Processing (CEP), and it appeared that there were a number of open source tools out there to do this. Among others (Storm, Esper), Riemann stood out as being purpose-built for monitoring distributed systems, and it offered the quickest way to try something out.
As the blurb says, “Riemann aggregates events from your servers and applications with a powerful stream processing language.” You use any of a number of client libraries to send events to a Riemann server. You write rules in the stream processing language (actually Clojure), and the Riemann server executes them on your event stream. The result of processing a rule can be another event or an email notification (or any number of other actions, such as sending to PagerDuty, Logstash, Graphite, etc.).
Installing Riemann is a breeze; the tricky part is writing the rules. If you are a Clojure noob like me, then it helps to browse through a Clojure guide first to get a feel for the syntax. You append your rules to the etc/riemann.config file. Let’s say we want to send an email whenever the average web application latency exceeds 6 ms over 3 periods of 3 seconds. Here’s one solution:
The keywords fixed-time-window, combine, where, email, etc. are Clojure functions provided out of the box by Riemann. We can write our own function tc to make the above general purpose:
We can make the function even more general purpose by letting the user pass in the summarizing function, and the comparison function as in:
;; left as an exercise
(tc 3 3 6.0 folds/std-dev <
(email "itguy@onemorecoolapp.net"))
Read that as: if the standard deviation of the metric falls below 6.0 for 3 consecutive windows of 3 seconds, send an email.
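To make the idea of passing in the summarizing and comparison functions concrete, here is a minimal sketch in plain Ruby (the language of the client script below), not Riemann code. The names threshold_crossed?, MEAN and STD_DEV are hypothetical, the windows are assumed to be already collected as arrays of metric values, and the standard deviation is the population form, which may differ from what folds/std-dev computes.

```ruby
# Hypothetical Ruby analogue of the generalized rule; this is NOT Riemann code.
# Summary functions over one window of metric values.
MEAN = ->(xs) { xs.sum / xs.size.to_f }

# Assumption: population standard deviation (divide by N).
STD_DEV = lambda do |xs|
  mean = MEAN.call(xs)
  Math.sqrt(xs.sum { |x| (x - mean)**2 } / xs.size)
end

# True when `compare` holds between the window summary and the threshold
# for each of the last `n` windows; false if fewer than `n` windows exist.
def threshold_crossed?(windows, n, threshold, summarize, compare)
  return false if windows.size < n

  windows.last(n).all? { |w| compare.call(summarize.call(w), threshold) }
end

windows = [[5.0, 5.0, 5.0], [4.0, 6.0, 5.0], [8.0, 9.0, 7.0]]

# Standard deviation below 6.0 in each of the last 3 windows?
puts threshold_crossed?(windows, 3, 6.0, STD_DEV, :<.to_proc) # prints true

# Mean above 6.0 in each of the last 3 windows?
puts threshold_crossed?(windows, 3, 6.0, MEAN, :>.to_proc)    # prints false
```

Passing the summary and the comparison as callables keeps the windowing logic in one place, which is the same design choice the extended tc signature makes.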
To test this functionality, here’s the client-side script I used:
[~/riemann-0.2.4 ]$ irb -r riemann/client
1.9.3-p374 :001 > r = Riemann::Client.new
1.9.3-p374 :002 > t = [5.0, 5.0, 5.0, 4.0, 6.0, 5.0, 8.0, 9.0, 7.0, 7.0, 7.0, 7.0, 8.0, 6.0, 7.0, 5.0, 7.0, 6.0, 9.0, 3.0, 6.0]
1.9.3-p374 :003 > for i in (0..20)
1.9.3-p374 :004?> r << {host: "www1", service: "http req", metric: t[i], state: "ok", description: "request", tags:["http"]}
1.9.3-p374 :005?> sleep 1
1.9.3-p374 :006?> end
This produced the following event:
{:service "http req", :state "threshold crossed", :description "service crossed the value of 6.0 over 3 windows of 3 seconds", :metric 7.0, :tags ["http"], :time 1388992717, :ttl nil}
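One way to see where a metric of 7.0 could come from is to replay the test data offline. The sketch below is plain Ruby, not Riemann, and rests on two assumptions: each 3-second window contains exactly three consecutive events (the script sends one per second, but real windows need not align this neatly), and the summary is the mean. The helper names window_means and first_sustained_run are hypothetical.

```ruby
# Metrics sent by the client script above, one event per second.
METRICS = [5.0, 5.0, 5.0, 4.0, 6.0, 5.0, 8.0, 9.0, 7.0, 7.0, 7.0,
           7.0, 8.0, 6.0, 7.0, 5.0, 7.0, 6.0, 9.0, 3.0, 6.0].freeze

# Assumption: every window holds exactly `size` consecutive events.
def window_means(metrics, size)
  metrics.each_slice(size).map { |w| w.sum / w.size }
end

# First run of `n` consecutive window means that all exceed `threshold`,
# or nil if there is no such run.
def first_sustained_run(means, n, threshold)
  means.each_cons(n).find { |run| run.all? { |m| m > threshold } }
end

means = window_means(METRICS, 3)
run = first_sustained_run(means, 3, 6.0)

puts means.inspect # prints [5.0, 5.0, 8.0, 7.0, 7.0, 6.0, 6.0]
puts run.inspect   # prints [8.0, 7.0, 7.0]
puts run.min       # prints 7.0
```

Under these assumptions the first three consecutive windows above 6.0 have means 8.0, 7.0 and 7.0, so the reported metric of 7.0 is consistent with the event carrying either the smallest or the most recent window mean.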
While this is more awesome than CloudWatch alarms (you can define your own summary, and durations can have a granularity of seconds rather than minutes), it lacks the precise semantics of CloudWatch alarms:
- The event doesn’t contain the actual measured summary in the periods the threshold was crossed.
- There needs to be a state for the alarm (in CloudWatch it is INSUFFICIENT_DATA, ALARM, OK).
This is indeed fairly easy to do in Riemann; hopefully I can get to work on this and update this space.
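As a rough illustration of the two missing pieces, the following Ruby sketch (again not Riemann code) derives a CloudWatch-style state from the most recent window summaries and attaches those summaries to the resulting event. The function names and the :summaries field are hypothetical, and the state logic is deliberately simplified compared with what CloudWatch actually does, for example around missing data.

```ruby
# Hypothetical helper: map recent window summaries to a CloudWatch-style state.
# Simplification: fewer than `n` summaries means INSUFFICIENT_DATA.
def alarm_state(summaries, n, threshold)
  return "INSUFFICIENT_DATA" if summaries.size < n

  summaries.last(n).all? { |s| s > threshold } ? "ALARM" : "OK"
end

# Hypothetical event shape: the state plus the measured summaries
# for the periods that were evaluated.
def alarm_event(service, summaries, n, threshold)
  {
    service: service,
    state: alarm_state(summaries, n, threshold),
    summaries: summaries.last(n)
  }
end

# Window means of the test data, assuming three events per 3-second window.
means = [5.0, 5.0, 8.0, 7.0, 7.0, 6.0, 6.0]

# Print the state after each new window summary arrives.
means.each_index do |i|
  puts alarm_state(means[0..i], 3, 6.0)
end

puts alarm_event("http req", means[0..4], 3, 6.0).inspect
```

With these inputs the state moves from INSUFFICIENT_DATA (first two windows) to OK, reaches ALARM after the fifth window, and returns to OK once a window mean of 6.0 no longer exceeds the threshold.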
This does not constitute a CloudWatch alarm service though: there is no web services API to insert metrics or to create, read, update, and delete alarm definitions, and there is no multi-tenancy. Perhaps one could use a Compojure-based API server and embed Riemann, but it is not clear to me at this point how to do that. Multi-tenancy, sharding, load balancing, high availability, etc. are topics for a different post.
Completing the autoscale solution for CloudStack would also require Riemann to send notifications to the CloudStack management server about threshold crossings. Using CloStack, this shouldn’t be too hard. More about autoscale in a separate post.
