Thursday, January 1, 2015

(Phase One – Monitoring) Strategy to migrate Software Applications to Cloud



Series:
 (Disclaimer: This post reflects my personal views and is not affiliated with any group or organization.)



Monitoring is an integral part of the application development lifecycle. However, extensive monitoring is often left as the last item, and it frequently remains a TODO for a very long time, or forever.

Organized, systematic layers of monitoring, along with the accompanying alerting configuration, are often neglected. Monitoring in the cloud is not very different from monitoring on-prem applications.
Real-time monitoring is essential for meeting non-functional QoS traits such as high availability.

Mining the monitoring data should be a regular team activity, so that the health and capacity of the application's components can be predicted accurately.
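As one illustration of mining monitoring data for capacity prediction, here is a minimal sketch that fits a straight line to daily usage samples (for example, disk usage in percent) and extrapolates when the limit would be reached. The function name `days_until_full` and the sample values are hypothetical, and a linear trend is an assumption; real usage patterns may need a different model.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_pct: Sequence[float],
                    limit_pct: float = 100.0) -> Optional[float]:
    """Extrapolate a least-squares line through daily usage samples.

    Returns the number of days after the last sample at which the fitted
    line reaches limit_pct, or None if usage is flat or shrinking.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")

    # Sample i was taken on day i (0, 1, 2, ...).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_usage_pct))

    slope = sxy / sxx  # growth in percentage points per day
    if slope <= 0:
        return None  # no growth trend, so no projected exhaustion

    intercept = mean_y - slope * mean_x
    day_at_limit = (limit_pct - intercept) / slope
    return max(0.0, day_at_limit - (n - 1))
```

Feeding it the last few weeks of samples exported from whichever monitoring tool is in use gives a rough early warning long before a threshold alert would fire.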

There are plenty of tools that help with monitoring and alerting, and in general more than one tool is used. Examples include Nagios, New Relic, MRTG, Ganglia, Cacti, KeyNote, Omniture, Azure AppInsights, Azure OpInsights, AWS CloudWatch, MSFT System Center, HP SiteScope, etc. However, the tools need to be properly configured with application-specific scenarios. The GIGO (garbage in, garbage out) principle applies here: a tool is only as useful as its configuration.

It is imperative that DevOps implement a complete monitoring solution. Such a solution comes in really handy for troubleshooting in the hour of need, better prepares the team for future capacity, alerts the team to potential bottlenecks in the application, provides the ability to monitor the integrated systems, and, bottom line, gives peace of mind :)
Troubleshooting follows a funnel approach: the problem is first analyzed broadly, and the team then dives into progressively less abstract, more specific components.
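The funnel approach can be sketched as an ordered list of probes running from the broadest check to the most specific one. This is only an illustration: the function `narrow_down`, the layer names, and the idea of reporting the deepest failing probe are assumptions, not a description of any particular tool.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, probe) pair; probe() returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def narrow_down(checks: List[Check]) -> Optional[str]:
    """Walk checks ordered from broadest to most specific.

    If the broadest check passes, nothing is wrong and None is returned.
    Otherwise the name of the most specific failing check is returned,
    which is the best place to start digging.
    """
    if not checks:
        return None

    broadest_name, broadest_probe = checks[0]
    if broadest_probe():
        return None  # end-to-end view is healthy, no need to dive deeper

    culprit = broadest_name
    for name, probe in checks[1:]:
        if not probe():
            culprit = name  # a narrower layer also fails, so keep narrowing
    return culprit
```

For example, with probes for "end-to-end UI", "web component", "app component" and "backend component" where only the web probe passes, the function points at "backend component".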

 

Usage of the monitoring systems:
  • Only item #2 (listed below) needs to be watched 24x7. The rest of the tools can be invoked as and when necessary, depending on the roles and responsibilities of the different teams and individuals.
  • For DevOps, glancing through all the tools is usually the morning-cup-of-coffee routine for checking the health of the application and all its integrated systems.
  • Solution/application architects use these tools for capacity planning and to identify potential bottlenecks, possible improvements to various components, areas that need immediate attention, etc.
  • Service/product managers can derive SLA metrics from these tools.
  • The team can watch for anomalies after a new release, abrupt traffic spikes, or unanticipated traffic patterns, which could be genuine or could signal trouble such as a DDoS attack or misuse of the system.
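Two of the uses above, deriving an SLA metric and spotting an abrupt traffic spike, can be sketched in a few lines. Both helper names (`availability_pct`, `is_spike`) and the three-standard-deviation threshold are assumptions chosen for illustration; a real setup would take its data from the monitoring tool and tune the threshold.

```python
from statistics import mean, pstdev
from typing import Sequence


def availability_pct(check_results: Sequence[bool]) -> float:
    """Percentage of health checks that succeeded over a reporting period."""
    if not check_results:
        raise ValueError("no check results")
    return 100.0 * sum(check_results) / len(check_results)


def is_spike(history: Sequence[float], latest: float,
             threshold: float = 3.0) -> bool:
    """Flag latest if it is more than threshold standard deviations
    away from the mean of the recent history (a simple z-score test)."""
    if not history:
        raise ValueError("no history")
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any change stands out
    return abs(latest - mu) / sigma > threshold
```

A z-score test cannot tell a genuine traffic surge from a DDoS; it only says that the latest value deserves a human look.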

Different Categories and Layers of Monitoring



| # | Category | Example tool | Comment |
|---|---|---|---|
| 1 | End-to-end UI health check (web analytics tools) | Clicky | Web analytics tools have the ability to monitor and alert based on a threshold. |
| 2 | End-to-end UI health check (web monitoring tools) | KeyNote | Web monitoring tools have the ability to monitor from different parts of the world and alert based on a threshold. |
| 3 | CDN stats | Akamai | Stats on potential failures and successes (500s, 400s, 300s, 200s) and on the performance of application sections. |
| 4 | Web component | Nagios | Pick critical application functionality, automate the test case, monitor its health, and alert based on a threshold. |
| 5 | App component | Nagios | Similar to #4, but tests the app layer and the backend layer together. |
| 6 | Backend component | Nagios | Similar to #4, but tests only the availability and responsiveness of the application logic, such as the execution of a stored procedure (SP). |
| 7 | Web infra | Nagios | Checks that the process is running and responsive, e.g. check for an HTTP 200 rather than only whether the TCP connection is alive. |
| 8 | App infra | Nagios | Similar to #7, but for the application server infrastructure. |
| 9 | Backend infra | Nagios | Similar to #7, but for the backend server infrastructure. |
| 10 | Integrated systems health check | Nagios | Check the availability of the integrated systems. The level of detail depends on the flexibility of the integrated system. Tests such as a ping or a round-trip HTTP request/response. |
| 11 | Individual box health checks | Nagios | Check CPU, memory, disk, etc. |
| 12 | Azure health check | Nagios (*) | Check the status of Azure components for troubleshooting. This Nagios instance has to run outside Azure :) |
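The point in item #7, checking for an HTTP 200 rather than only a live TCP connection, can be sketched as a small check that returns the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names, the latency thresholds, and the choice to treat every non-200 status as critical are assumptions for illustration, not a recommended policy.

```python
import time
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status: int, elapsed_s: float,
             warn_s: float = 1.0, crit_s: float = 5.0) -> int:
    """Map an HTTP status and response time to a Nagios-style state."""
    if status != 200 or elapsed_s >= crit_s:
        return CRITICAL
    if elapsed_s >= warn_s:
        return WARNING
    return OK


def check_http(url: str, timeout_s: float = 10.0) -> int:
    """Fetch url and report whether the application actually answered 200.

    An open TCP port is not enough: connection errors, timeouts and
    non-200 responses all count as CRITICAL.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return CRITICAL  # could not connect or timed out
    return classify(status, time.monotonic() - start)
```

Wrapped in a script that passes the return value to `sys.exit`, a function like this could be registered as a custom check command; the same pattern extends to the app and backend infra checks in #8 and #9.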
