Part 1 of this post can be found here…
So, what do we need to know?
All the items above add to the complexity of predicting which services may start to crash or underperform, whether under normal operation or after a minor system change or upgrade. The only way to know where a problem is, or whether one is developing, is to know the following pieces of information about every service in your part of the architecture (assuming a federated management architecture approach has been adopted); a minimal sketch of such a health record follows the list. A more advanced approach is to model these pieces of information against a service chain or a higher-level business process, built by correlating the lower-level service performance into a manageable higher-level abstraction.
- Is the service up or down?
- How is the service performing against the Service Level Agreements (SLAs) between itself and its customers? This covers response time, messages per minute and so on, and implies that the service's customers have set up SLAs in the first place.
- Is the service throwing any exceptions or other errors?
- What is the service's cache hit/miss ratio?
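As a concrete (and purely illustrative) example, the sketch below shows what such a per-service health record might look like in Python; the `ServiceHealth` name and its fields are illustrative rather than taken from any particular management toolset:

```python
from dataclasses import dataclass

@dataclass
class ServiceHealth:
    """Point-in-time health snapshot for a single service."""
    service: str             # logical service name
    up: bool                 # is the service responding at all?
    avg_response_ms: float   # mean response time over the sample window
    msgs_per_minute: float   # observed throughput over the sample window
    exception_count: int     # exceptions logged over the sample window
    cache_hit_ratio: float   # hits / (hits + misses), between 0.0 and 1.0
```

Collecting a record like this for every service is what makes the correlation into service chains or business processes possible: the higher-level abstraction is only as good as the per-service data beneath it.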
Even when a service is up, it may be breaking its SLAs while throwing multiple exceptions; all the user or customer system notices is a slow service, or consequent poor performance, each time the host system restarts it. For example, a service may be suffering a recurring stack overflow and resulting memory corruption error that crashes the web server process. That failure will only manifest itself in system log entries or in exceptions thrown to /dev/null or the console; from the user's point of view, the service is slow and not fit for purpose, even though it occasionally works.
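To make the "up but not healthy" case concrete, here is a minimal sketch of a check that separates liveness from SLA compliance, reusing the `ServiceHealth` record sketched above; the `Sla` type and its thresholds are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Sla:
    """Illustrative SLA thresholds agreed with a service's customers."""
    max_response_ms: float
    min_msgs_per_minute: float

def sla_breaches(h: ServiceHealth, sla: Sla) -> list[str]:
    """Return human-readable breach descriptions; an empty list means healthy."""
    if not h.up:
        return ["service is down"]
    breaches = []
    # The service answers, but may still be failing its customers.
    if h.avg_response_ms > sla.max_response_ms:
        breaches.append(f"slow: {h.avg_response_ms:.0f}ms > {sla.max_response_ms:.0f}ms limit")
    if h.msgs_per_minute < sla.min_msgs_per_minute:
        breaches.append("throughput below the agreed rate")
    if h.exception_count > 0:
        breaches.append(f"{h.exception_count} exceptions in the sample window")
    return breaches

# A service that is "up" can still fail every other check:
snap = ServiceHealth("order-service", up=True, avg_response_ms=2400.0,
                     msgs_per_minute=12.0, exception_count=37, cache_hit_ratio=0.4)
print(sla_breaches(snap, Sla(max_response_ms=500.0, min_msgs_per_minute=60.0)))
```

A naive up/down probe would report this service as green; only the correlated metrics reveal that it is not fit for purpose.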
So, what needs to be in place?
In summary, even though this post has posed a number of questions and not answered them all, the case for managed services is clear. To detect, diagnose and fix a service problem, the system administrator needs to know how the system is performing; in a service-based architecture this is achieved by collecting the required metrics and communicating them to a system management toolset (or at least collecting them into something a system management toolset can access).
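As a sketch of that collection step, and assuming the toolset exposes an HTTP collection endpoint (an assumption, not a given), each service or its proxy could periodically serialize its snapshot and push it upstream; the `ship_snapshot` function and collector URL below are illustrative only:

```python
import json
import urllib.request
from dataclasses import asdict

def ship_snapshot(h: ServiceHealth, collector_url: str) -> None:
    """Push one health snapshot to a (hypothetical) management toolset collector."""
    body = json.dumps(asdict(h)).encode("utf-8")
    req = urllib.request.Request(
        collector_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()  # expect an empty 2xx acknowledgement from the collector

# e.g. ship_snapshot(snap, "https://mgmt.example.com/collect")  # URL is hypothetical
```

The exact transport matters less than the principle: the metrics must end up somewhere a system management toolset can see them.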
One way to approach this is to add a transparent proxy or embedded plugin to every service; indeed, this is the model adopted by the leading vendors in this particular marketplace. To support this kind of implementation, the following system design aspects should be considered before embarking on a loosely coupled, service-based infrastructure project:
- A system management and monitoring toolset is required to monitor service status, to manage SLAs and policies, and to support system and failure diagnostics (among other things).
- Every service has to have a transparent proxy, or logical equivalent, to allow failover and to support greater scalability (a minimal sketch of such a proxy follows this list).
- Every service has to have metrics collected against it, preferably correlated against an SLA with identified customers.
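To illustrate the proxy idea from the second point, here is a minimal sketch of a transparent proxy at the application layer; real products intercept at the network level, and the class and names here are assumptions for illustration, but the responsibilities (collecting metrics, failing over) are the same:

```python
import time

class TransparentProxy:
    """Wraps calls to a service: records metrics, fails over between instances.

    `endpoints` is an ordered list of callables standing in for the primary
    service instance and its backups.
    """

    def __init__(self, service: str, endpoints):
        self.service = service
        self.endpoints = endpoints
        self.exception_count = 0       # metric: exceptions seen by the proxy
        self.response_times_ms = []    # metric: per-call response times

    def call(self, request):
        for endpoint in self.endpoints:              # primary first, then backups
            start = time.monotonic()
            try:
                response = endpoint(request)
                self.response_times_ms.append((time.monotonic() - start) * 1000)
                return response
            except Exception:
                self.exception_count += 1            # visible to the toolset
                continue                             # fail over to the next instance
        raise RuntimeError(f"{self.service}: all endpoints failed")
```

Because every call flows through the proxy, the metrics needed for the third point fall out as a side effect, without modifying the service itself.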
While these design ideas don't solve the whole problem, they do offer a first step up from an ad-hoc, unmanaged service capability to a basic level of management: a fighting chance.