Add a temporary disable (ON HOLD) option to checks for accepted incidents/planned maintenance.


We have a lot of checks covering a myriad of different sites and scripts that constantly change, directly or indirectly. As expected, things eventually stop working in one way or another, be it something benign such as a URL to a tag-solution not working, planned ongoing maintenance making a site unreachable, a critical issue making it unresponsive, or something in between.

The solution (to the check going off) may also vary, from something simple like adding a blocking / severity mapping to the check so it ignores unimportant URLs, to something a bit more complex like a complete site deploy.

In any case, we soon acknowledge the check's response and start to deal with it, but getting back to normal may take some time, either because the fix is cumbersome or because there are a lot of checks to update. During this time it would be helpful, as is the case in other classic monitoring tools, to temporarily change the state of the check to something that signals that there is a problem, that we know about it, and that we are already working on it: put the check ON HOLD.

The state would mean that the check continues to run and collect data, but instead of showing the color corresponding to the latest result (green, yellow, orange, red), it would be shown in a very distinct color like blue (ideally sorted behind yellow in the Operational View). I attached a screenshot just to give you a hint on how to visualize it. This would give everyone working with this a breather. Today, while things are alerting, the expectation is that something is wrong that we need to deal with, but it is unclear whether it is already being dealt with, both within tech teams and for business looking in from the outside. With this you would be able to reflect the current status better and "cry wolf" less, since benign alerts would no longer keep firing until the checks are updated.
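To make the idea a bit more concrete, here is a minimal sketch of how I imagine the state behaving. Everything in it is an assumption on my side for illustration: the `Check` class, the `on_hold` flag, and the worst-first sort order are hypothetical and not taken from the product.

```python
from dataclasses import dataclass

# Assumed worst-first ordering for the Operational View; "blue" (ON HOLD)
# is placed directly behind "yellow", as suggested in the request.
SORT_ORDER = ["red", "orange", "yellow", "blue", "green"]


@dataclass
class Check:
    name: str
    last_result: str       # color of the latest run: green/yellow/orange/red
    on_hold: bool = False  # hypothetical flag, not an existing product field

    def display_color(self) -> str:
        # The check keeps running and last_result keeps being updated;
        # only the color shown to viewers is overridden while ON HOLD.
        return "blue" if self.on_hold else self.last_result


def operational_view(checks):
    # Sort by the displayed color, so ON HOLD checks end up behind yellow.
    return sorted(checks, key=lambda c: SORT_ORDER.index(c.display_color()))
```

With this, a red check that is put ON HOLD would still carry its red result in the collected data, but would show up as blue and be sorted behind the yellow checks.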

I'd also suggest that this option be easily reachable from the same places as the normal disable, and accessible via the API for easy integration with ITSM tools managing ITIL processes, so you could make use of planned maintenance schedules et cetera.

  • Jul 3, 2015

    Admin Response

    We have plans for an extended maintenance mode in which we will be able to use a combination of tags, alerting and events in results to achieve this. This is only in the planning stage for now. I'll get back to you with more information when available.

