(Disclaimer: This manuscript is my personal view and is not affiliated with
any groups or organizations)
Monitoring is an integral part of the application development
lifecycle. However, extensive monitoring is often left as the last item, and
it often stays a TODO forever or for a very long time.
Organized and systematic layers of monitoring, along with alerting
configuration tasks, end up being ignored. Monitoring in the cloud is not a
whole lot different from monitoring on-prem applications.
Real-time monitoring is essential to meet some of the
non-functional QoS traits such as high availability.
Mining monitoring data should be a regular team activity, so that
the health and capacity of the application's various components can be
predicted accurately.
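As one illustration of mining monitoring data for capacity prediction, here is a minimal sketch that fits a straight line to disk-usage samples and estimates how many days remain before a threshold is crossed. The function names, the sample data, and the 90% threshold are all assumptions made for the example, and a straight-line trend is only one of many possible models.

```python
from statistics import mean

def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar

def days_until_threshold(days, used_pct, threshold_pct=90.0):
    """Days after the last sample until the trend reaches threshold_pct.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(days, used_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - days[-1])

# Hypothetical daily samples: disk usage grows 2 points per day.
days = [0, 1, 2, 3, 4]
used = [50.0, 52.0, 54.0, 56.0, 58.0]
remaining = days_until_threshold(days, used)
print(f"days until 90% full: {remaining}")
```

In practice the samples would come from whatever the monitoring tool exports, and the estimate is only as good as the assumption that growth stays roughly linear.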
There are plenty of tools to help meet monitoring and alerting
needs, and in general more than one tool is used.
Tools such as Nagios, New Relic, MRTG, Ganglia, Cacti, Keynote, Omniture, Azure
AppInsights, Azure OpInsights, AWS CloudWatch, Microsoft System Center, HP SiteScope, etc. are
available. However, the tools need to be properly configured with
application-specific scenarios. The GIGO (garbage in, garbage out) principle
applies to these tools, so it is imperative that they are configured correctly.
It is imperative that DevOps take measures to implement a
complete monitoring solution. Such a solution comes in really handy when
troubleshooting at the hour of need, better prepares the team for future
capacity, alerts the team to potential bottlenecks in the application, provides
the ability to monitor the integrated systems, and, bottom line, gives peace
of mind :)
Troubleshooting activities follow a funnel approach, where the
problem is first analyzed broadly and then narrowed down by diving into each
less abstract, more specific component.
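The funnel idea can be sketched roughly as a tree walk: start at the broadest check and descend only into the parts that are failing. The `Component` class, the `drill_down` function, and the example component names below are hypothetical, and the lambdas stand in for real health checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Component:
    name: str
    is_healthy: Callable[[], bool]
    parts: List["Component"] = field(default_factory=list)

def drill_down(component, path=()):
    """Return paths to the most specific unhealthy components.

    Healthy components are not explored any further.
    """
    if component.is_healthy():
        return []
    here = path + (component.name,)
    deeper = [p for part in component.parts for p in drill_down(part, here)]
    # If no sub-part is failing, this component is the narrowest suspect.
    return deeper or [here]

# Hypothetical application: the end-to-end check fails, the web tier is fine.
app = Component("end-to-end", lambda: False, [
    Component("web", lambda: True),
    Component("app", lambda: False, [Component("app infra", lambda: True)]),
    Component("backend", lambda: False,
              [Component("backend infra", lambda: False)]),
])

for suspect in drill_down(app):
    print(" > ".join(suspect))
```

Here the walk stops at "app" because its infrastructure is healthy, but continues past "backend" down to "backend infra", which is the narrower suspect.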
Usage of the monitoring systems:
- Only item #2 (mentioned below) needs to be monitored 24x7. The rest of the tools can be invoked as and when necessary, depending on each team's and individual's roles and responsibilities.
- Usually, over the morning cup of coffee, DevOps glance through all the tools to get the health status of the application along with all its integrated systems.
- Solution/application architects use these tools for capacity planning and to spot potential bottlenecks, potential improvements to various components, sections needing immediate attention, etc.
- Service/product managers can derive SLA metrics from these tools.
- The team can watch for anomalies after a new release, abrupt spikes in traffic, and unanticipated traffic patterns, which could be genuine or could indicate trouble such as a DDoS or misuse of the system.
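To make the last point a little more concrete, here is a minimal sketch of flagging an abrupt traffic spike by comparing each sample with a rolling baseline. The function name, the window size, the threshold, and the request counts are assumptions for illustration, and a simple z-score is just one possible heuristic.

```python
from statistics import mean, pstdev

def find_spikes(counts, window=5, z_threshold=3.0):
    """Return indexes whose value sits more than z_threshold standard
    deviations above the mean of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor sigma so a perfectly flat baseline does not divide by zero.
        sigma = max(pstdev(baseline), 1.0)
        if (counts[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes

# Hypothetical requests-per-minute series with one sudden jump.
requests_per_minute = [100, 102, 98, 101, 99, 100, 400, 103]
print(find_spikes(requests_per_minute))
```

A flagged index only says that traffic looks unusual. Deciding whether it is a genuine surge, a DDoS, or misuse still needs a person or further checks.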
Different Categories and Layers of Monitoring

| # | Category | Example tool | Comment |
|---|---|---|---|
| 1 | End-to-end UI health check (web analytics tools) | Clicky | Web analytics tools have the ability to monitor and alert based on a threshold. |
| 2 | End-to-end UI health check (web monitoring tools) | Keynote | Web monitoring tools have the ability to monitor from different parts of the world and alert based on a certain threshold. |
| 3 | CDN stats | Akamai | Stats on failures and successes (500s, 400s, 300s, 200s) and on the performance of application sections. |
| 4 | Web component | Nagios | Pick critical application functionalities, automate the test case, monitor the health, and alert based on a threshold. |
| 5 | App component | Nagios | Similar to #4, but tests the app layer and the backend layer together. |
| 6 | Backend component | Nagios | Similar to #4, but tests only the availability and responsiveness of the application logic, such as the execution of an SP. |
| 7 | Web infra | Nagios | Checks that the process is running and responsive, e.g. checks for HTTP 200 instead of only whether the TCP connection is alive. |
| 8 | App infra | Nagios | Similar to #7, but for application server infrastructure. |
| 9 | Backend infra | Nagios | Similar to #7, but for backend server infrastructure. |
| 10 | Integrated systems health check | Nagios | Checks the availability of the integrated systems. The level of detail depends on the integrated system's flexibility. Tests such as a ping or a round-trip HTTP request/response. |
| 11 | Individual box health checks | Nagios | Checks CPU, memory, disk, etc. |
| 12 | Azure health check | Nagios(*) | Checks the status of Azure components for troubleshooting. (*) This Nagios instance has to be outside Azure :) |
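Item #7 above asks for a check on HTTP 200 rather than only on a live TCP connection. A minimal sketch of such a check, written in the style of a Nagios plugin (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN), could look like the following. The function names, the script name in the usage message, and the 1-second and 5-second thresholds are assumptions, and this is not a replacement for the stock Nagios plugins.

```python
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes

def classify(status, elapsed_s, warn_s=1.0, crit_s=5.0):
    """Map an HTTP status code and response time to a Nagios state."""
    if status != 200:        # includes None, i.e. no HTTP response at all
        return CRITICAL
    if elapsed_s >= crit_s:
        return CRITICAL
    if elapsed_s >= warn_s:
        return WARNING
    return OK

def check_http(url, timeout_s=10.0):
    """Fetch url and return (status, elapsed seconds).

    status is None when no HTTP response was received.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code    # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None        # DNS failure, refused connection, timeout, ...
    return status, time.monotonic() - start

def main(argv):
    """Run the check for argv[1] and return the Nagios exit code."""
    if len(argv) != 2:
        print("usage: check_http.py URL")
        return UNKNOWN
    status, elapsed = check_http(argv[1])
    print(f"HTTP status={status} time={elapsed:.3f}s")
    return classify(status, elapsed)

# Offline demonstration of the classification rules.
print(classify(200, 0.2), classify(200, 2.0), classify(503, 0.1))
```

To use it as a plugin, a script would end with `sys.exit(main(sys.argv))` so that Nagios can read the state from the exit code. A process that accepts TCP connections but returns a 500 would pass a port check and still be reported as CRITICAL here.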