Monday, February 9, 2015

Application Architecture for Cloud


Why is architecting an application for Cloud different from on-prem?

QoS


If one starts with the various QoS (quality of service) attributes of an application, the traits of a Cloud app do not seem different from those of an on-prem app. Here is the list of QoS attributes:
Ø  Availability, Resiliency, and Fault Tolerance
Ø  Testability and Manageability
Ø  Performance
Ø  Scalability
Ø  Security (user data and app logs; infra and app)
Ø  Flexibility and Extensibility
Ø  Maintainability and Readability
Ø  Usability and Accessibility
Ø  Functionality and Correctness

But certain characteristics of QoS (marked in italics) need special attention.
Fault Tolerance: Needs to be applied wherever application logic in one layer connects to an external system or to another layer, e.g. the app tier connecting to the DB tier. This is also called transient fault handling. The connection might fail on the first try but work on the next. Even though the app tier pulls a connection from the pool, there might be a window of time in which a specific DB connection, one that routes traffic to a specific node of the DB cluster, has not been revalidated before being supplied to the app. When the app invokes the DB call, it might fail because that node is being rebooted*. The DB cluster survives and is fully functional, but that specific call might fail.
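A minimal sketch of handling such a transient fault is to retry the call with a growing delay. The names TransientError and call_with_retry are hypothetical, as are the attempt count and delays; a real app would map its DB driver's connection errors onto the retried exception.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a dropped DB connection)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Run operation(); on TransientError, wait and try again.

    The delay doubles after each failed attempt, with some random jitter
    so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors known to be transient are retried here; anything else propagates immediately, so that genuine bugs are not hidden behind retries.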

Resiliency: Each layer of an application must be designed for resiliency. For instance, each process that needs a compute instance has to run on at least two VMs (for IaaS). Please note that PaaS already provides resiliency in its SLA, which is another reason why apps should gravitate towards PaaS. Thus an on-prem application that didn’t need to be clustered (e.g. a utility process running on only one node) now needs to be. This brings a whole slew of challenges. What if the process has to be a singleton? What if the process is legacy and (to make the situation worse) there is no source code available for it? The application has to make the singleton process cluster-aware. For instance, wrapping the legacy process in a script or program that performs a health check and externalizes some data points could provide a solution. Each layer has to have its own health check.
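The wrapper idea can be sketched roughly as follows. ProcessWrapper and its methods are hypothetical names, and the sketch covers only the health check, the restart, and one externalized data point (the restart count). Deciding which node of the cluster is allowed to run the singleton, for example through a shared lease, is not shown.

```python
import subprocess
import sys


class ProcessWrapper:
    """Hypothetical wrapper around a legacy command we cannot modify."""

    def __init__(self, command):
        self.command = command
        self.process = None
        self.restarts = 0  # externalized data point a monitor could read

    def start(self):
        self.process = subprocess.Popen(self.command)

    def is_healthy(self):
        # poll() returns None while the child process is still running.
        return self.process is not None and self.process.poll() is None

    def ensure_running(self):
        """Health check plus self-heal: (re)start the process if it is not running."""
        if not self.is_healthy():
            if self.process is not None:
                self.restarts += 1  # it ran before and died
            self.start()

    def stop(self):
        if self.is_healthy():
            self.process.terminate()
            self.process.wait()
```

A scheduler or loop would call ensure_running() periodically. Here "healthy" only means "the process is alive"; a real health check would probably also probe whether the process is doing useful work.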

High Availability: The application must handle BCDR (business continuity and disaster recovery). The application has to be datacenter (or region) agnostic or, even better, Cloud provider agnostic. Thus a responsive copy of the application has to be deployed in at least one other region. There are many elements that need to be thought through, such as RPO (recovery point objective) and RTO (recovery time objective). The key is data synchronization: user data, application data (such as state and session), code, etc. A strategy also has to be determined: Active/Active, Active/Passive, Active/Passive (only a maintenance page), Active/Active (running at reduced capacity and autoscaling when traffic flows to it), etc. Notably, if the application replicates data synchronously between geo regions, performance is going to be impacted.
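To see why synchronous geo-replication costs performance, here is a deliberately simplified model of the latency of a single write. The function name, its parameters, and the millisecond figures in the usage note are illustrative assumptions, not measurements from any provider.

```python
def write_latency_ms(local_commit_ms, inter_region_rtt_ms, remote_commit_ms, synchronous):
    """Latency the caller sees for one write, in milliseconds.

    Synchronous: the write is acknowledged only after the remote region
    has committed it, so the round trip and the remote commit are added.
    Asynchronous: the write is acknowledged after the local commit; the
    remote copy lags behind, which implies a non-zero RPO.
    """
    if synchronous:
        return local_commit_ms + inter_region_rtt_ms + remote_commit_ms
    return local_commit_ms
```

With an assumed 5 ms local commit, 70 ms round trip between regions, and 5 ms remote commit, the synchronous write takes 80 ms against 5 ms for the asynchronous one. The trade-off is that the asynchronous variant can lose the most recent writes if the primary region fails.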

Testability: Out of the various forms of testing (detailed in http://theitjourney.blogspot.com/2015/01/phase-one-testing-strategy-to-migrate.html), the infrastructure endurance test stands out.

Manageability: The DevOps team’s infrastructure as code, in particular, needs to factor in transient faults, especially around connectivity.

Performance: Repeated testing and tuning should help determine the correct capacity of each layer of the application (which includes the size of the VMs).

Scalability: The application has to utilize the autoscale features of the Cloud provider. You can scale up or down (vertical) or in or out (horizontal), but in Cloud, scaling in/out works best.
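As a rough illustration of a scale in/out decision, the sketch below computes an instance count that moves average CPU toward a target. The function name, the 60% target, and the limits are assumptions; actual Cloud autoscale rules are configured in the provider's own terms.

```python
import math


def desired_instances(current, cpu_percent, target_percent=60, minimum=2, maximum=10):
    """Return how many instances to run so that average CPU moves toward the target."""
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Clamp: never drop below the resiliency floor or exceed the cost ceiling.
    return max(minimum, min(maximum, wanted))
```

For example, 4 instances at 90% CPU scale out to 6, while 4 instances at 15% CPU scale in only to the floor of 2, which keeps the two-VM minimum mentioned under Resiliency.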

Security: Security is a shared responsibility between the Cloud provider and the customer. There are several layers of security (http://theitjourney.blogspot.com/2015/01/phase-one-security-strategy-to-migrate.html). Access to the Cloud environment must be scrutinized, and appropriate policies and governance have to be institutionalized.

* There are “planned” and “unplanned” updates occurring at the VM level (specific to Azure). Please note that this problem is already handled by the managed PaaS services.

Architecture for Cloud

Ø  Not same as on-prem
Ø  SLA encompasses many individual SLAs
Ø  Shared security
Ø  Consider unplanned downtime of software components
Ø  Scale Units
Ø  Entails a different thinking!!!
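The point about one SLA encompassing many individual SLAs can be made concrete with a little arithmetic. The function names are hypothetical and the percentages in the usage note are illustrative, not any provider's published SLA; the redundancy formula also assumes the copies fail independently.

```python
import math


def composite_sla(*slas):
    """Availability of components in series: all must be up, so multiply."""
    return math.prod(slas)


def redundant_sla(sla, copies):
    """Availability of independent redundant copies: down only if all are down."""
    return 1 - (1 - sla) ** copies
```

An app tier at 99.95% in front of a DB tier at 99.99% gives a composite of about 99.94%, lower than either part. Going the other way, two redundant instances at 99% each give about 99.99%.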

Scale Unit: The application capacity should be expressed in terms of scale units. As an illustration, “DB Unit” is a scale unit used for Azure SQL DB, “Stream units” for Azure Media Services, etc.
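Once capacity is expressed in scale units, sizing becomes a simple calculation. This is a sketch with a hypothetical function name; the load and per-unit capacity would come from performance testing, and the 20% headroom is an assumption.

```python
import math


def scale_units_needed(peak_load, capacity_per_unit, headroom=0.2):
    """Units required to serve peak_load with spare headroom (20% by default)."""
    return math.ceil(peak_load * (1 + headroom) / capacity_per_unit)
```

For instance, a peak of 1000 requests per second against units that each handle 250 needs 4 units with no headroom and 5 with the default headroom.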

Why different thinking?


Moving the focus from maximizing mean time between failures (MTBF) to minimizing mean time to recover (MTTR) makes it possible to standardize on hardware, so the economies of scale of mass production can be used. Also, self-healing applications mean that failed hardware can simply be replaced rather than repaired.
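The standard steady-state availability formula, MTBF / (MTBF + MTTR), shows why fast recovery can matter more than rare failure. The function name and the hour figures in the usage note are illustrative assumptions.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

Hardware that fails every 1000 hours but recovers in 1 hour comes out at roughly 99.9% available, while hardware that fails ten times less often (every 10000 hours) but takes 100 hours to repair comes out at roughly 99.0%.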
