"Evil" next comes from where you are blind


My long-term experience with monitoring, performance modelling and prediction for large-scale distributed systems is that "Evil next comes from where you are blind". That is, the next major performance or scalability problem will always pop up where you have the least visibility into your system (i.e. where you are blind). In that sense it is entirely predictable! If part of your system is a black box in terms of monitoring and visibility, then inevitably that's where the next problem will occur. In Middle Earth, the Evil comes from Mordor.

Gaps in monitoring can be tier related (e.g. client/browser, networks, web, application, database, 3rd party services, cloud managed services), due to incomplete monitoring (e.g. some systems cannot have agents installed and so are "invisible"), due to an inability to correlate monitoring data across heterogeneous systems (e.g. no transactional monitoring), or due to a lack of time breakdowns inside a system (e.g. with only network monitoring you can't see CPU, IO, sync, wait and suspension times inside the system).
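One way to make these gaps explicit is to keep a simple inventory of what you can actually see per tier. The sketch below is only an illustration: the `Tier` class, the `blind_spots` function, the three signal names and the example coverage are all hypothetical, chosen to mirror the kinds of gaps listed above, and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field

# Hypothetical signal categories, mirroring the kinds of visibility discussed above.
REQUIRED_SIGNALS = {"agent", "transaction_tracing", "internal_breakdown"}


@dataclass
class Tier:
    name: str
    signals: set = field(default_factory=set)  # signals we actually collect


def blind_spots(tiers):
    """Return {tier name: sorted missing signals}, most-blind tiers first."""
    gaps = {t.name: sorted(REQUIRED_SIGNALS - t.signals) for t in tiers}
    gaps = {name: missing for name, missing in gaps.items() if missing}
    # Sort by number of missing signals (descending), then by name for stability.
    return dict(sorted(gaps.items(), key=lambda kv: (-len(kv[1]), kv[0])))


# Illustrative coverage only, not data from a real system.
tiers = [
    Tier("web", {"agent", "transaction_tracing", "internal_breakdown"}),
    Tier("application", {"agent", "transaction_tracing"}),
    Tier("database", {"agent"}),
    Tier("3rd party services"),
]

for name, missing in blind_spots(tiers).items():
    print(f"{name}: missing {', '.join(missing)}")
```

The tier at the top of the list is the one with the least visibility, which by the argument above is where I would expect the next problem to appear.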

I spent 2 years trying to build an end-to-end/top-to-bottom performance model for a Telco-like client a while back. They didn't have complete application-level monitoring. We tried using logs, Splunk, and an APM PoC (AppDynamics) to get sufficient data. In the end we managed to exclude every part of the system as the source of the problems except the part where we had limited monitoring visibility. That was "obviously" where the problem was at the start, and at the end the conclusion was the same, except that we could now 100% exclude all the other parts of the system.
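The process of elimination can be sketched as simple arithmetic: subtract the time you can account for in the monitored parts from the end-to-end time, and whatever is left must be hiding in the parts you can't see. This is a minimal sketch with made-up numbers, not measurements from that project. The function name is hypothetical, and it assumes the monitored components run one after another, so their times can simply be added.

```python
def unexplained_time(end_to_end_ms, measured_ms):
    """Time not accounted for by any monitored component.

    Assumes the components execute sequentially (no overlap or parallelism).
    """
    return end_to_end_ms - sum(measured_ms.values())


# Illustrative numbers only.
end_to_end_ms = 400.0
measured_ms = {"web": 30.0, "application": 50.0, "database": 20.0}

residual_ms = unexplained_time(end_to_end_ms, measured_ms)
share = residual_ms / end_to_end_ms
print(f"unexplained: {residual_ms:.0f} ms ({share:.0%} of end-to-end)")
```

A large residual doesn't tell you what is wrong in the blind part, only that everything else has been ruled out.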

I recently read this interesting blog (by Netsil) on distributed/cloud monitoring approaches, and they pick up on some of the issues. The problem is that any approach based only on network monitoring will not have sufficient visibility inside the applications to diagnose problems or to model and predict future issues.

I think AWS X-Ray has possibilities: it can give visibility inside the managed AWS services that other agent-based solutions can't, and it can give a breakdown of times including "throttling" (but probably not CPU, IO, etc). However, it may not be able to monitor client/browser and network times. What would a hybrid (e.g. Dynatrace + X-Ray) solution look like?

