A leading Cloud Architect tells you 5 things you should monitor and alert on to keep your systems healthy.

What Cloud Metrics, KPIs and Service Metrics Should You Monitor?

Gus Segura
4 min read · Apr 27, 2021

Most Fortune 100 clients know how to keep a cloud system up and running. After many years of consulting and building these architectures, I will share some of the key metrics I monitor and alert on to help keep my clients’ cloud services running smoothly.

Key Metrics for Monitoring

So, you have met the timeline and implemented the latest amazing cloud architecture. It’s world class and you’re proud of the work you and your team have accomplished. Except, the first time there is a “Critical Incident” or “Major Incident”, you’re left trying to debug the root cause for days, only to find out that it’s something simple and obvious.

The following are my top 5 metrics to monitor in your cloud architecture so you can avoid situations like the one above.

1. Disk Health, Low Level: Don’t take Amazon’s, Google’s or any cloud provider’s word for it. Check yourself.

Yes, I know: everything you have is in memory or a memory cache these days. But is it really? A lot of these systems persist to disk, or they would not be called enterprise grade.

So, if your service is data centric, it’s likely to have disk somewhere. Oracle Database, Kafka Confluent Cluster, Redis, Elastic: they all persist to disk in some way. Don’t rely on your provider to alert you on a partial failure. You may never see the error until it’s too late. In other words, run lower-level disk checks yourself.
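As a minimal sketch of what “check yourself” could look like, the Python below reports how full a filesystem is and runs a small write, sync and read-back probe. The function names, the 1 MiB default probe size and the returned fields are my own assumptions for illustration. The read-back will often be served from the OS page cache, so treat this as a smoke test that sits alongside proper SMART or vendor tooling, not as a replacement for it.

```python
import os
import shutil
import tempfile
import time


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # some pseudo filesystems report no capacity
    return 100.0 * usage.used / usage.total


def write_read_probe(directory: str, size_bytes: int = 1024 * 1024) -> dict:
    """Write random bytes, force them to disk, read them back, and time it.

    A data mismatch, an OSError, or a sudden jump in `seconds` is worth
    alerting on. The probe file is always removed afterwards.
    """
    payload = os.urandom(size_bytes)
    start = time.monotonic()
    fd, probe_path = tempfile.mkstemp(dir=directory, prefix="disk_probe_")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to flush to the device
        with open(probe_path, "rb") as handle:
            data_matches = handle.read() == payload
    finally:
        os.remove(probe_path)
    return {"ok": data_matches, "seconds": time.monotonic() - start}
```

Run the probe on a schedule against each data volume and ship both values to your monitoring system. The trend in `seconds` is usually more telling than any single reading.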

2. Network — Bytes In, Out and Transfer : latency, latency, latency.

Most Fortune 100 clients monitor the network, a lot. But when latency hits, everyone seems to complain, with no solutions or places to consider looking. You get a lot of finger pointing and not a lot of mitigation.

Do yourself a favor. Make sure you have tested your system to understand how the data flows from, as an example, the Kubernetes client that can create hundreds of pods to the ingestion points in your data service. If you don’t understand the latency caused by scaling without health checks, or how transfer can affect consumption, it may cost you many hours of troubleshooting and many dollars wasted.
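One way to put numbers on “latency, latency, latency” is to time TCP connections from where your clients run to your ingestion endpoint and look at percentiles instead of averages. This is a rough sketch under my own assumptions: the function names are hypothetical, connect time is only one slice of end-to-end latency, and the nearest-rank percentile is one of several common definitions.

```python
import math
import socket
import time
from typing import List, Sequence


def tcp_connect_seconds(host: str, port: int, timeout: float = 2.0) -> float:
    """Time how long it takes to open (and immediately close) a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering `pct` percent."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sample_latency(host: str, port: int, count: int = 20) -> List[float]:
    """Collect `count` connect timings; failures raise so they are not hidden."""
    return [tcp_connect_seconds(host, port) for _ in range(count)]
```

Comparing the 50th and 95th percentiles before and after a scale-out event gives you something concrete to point at when the finger pointing starts.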

3. CPU — Actual vs Expected : Don’t be afraid to push it to the limit.

Trying to optimize CPU in the cloud can be very tricky. Personally, I have a theory about why providers want you to leave plenty of headroom. I’ve architected bare metal and cloud. I’ve replaced a few motherboards, and tons (literally) of disk drives and memory boards.

Push it. Size for load as “area under the curve”. In other words, if your system is truly CPU bound, you are paying for that CPU, so you may as well use it. Again, scale up or out as needed, but you should be monitoring CPU per node and in aggregate. Find an attribution model that works for you and use it!
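Here is one possible reading of “area under the curve”, sketched in Python: integrate each node’s CPU percentage over time with the trapezoid rule, then divide by the elapsed time to get the utilization you actually used per node and across the fleet. The data shape (lists of `(seconds, cpu_percent)` pairs keyed by node name) and the function names are assumptions of mine, not a standard.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, CPU percent)


def area_under_curve(samples: List[Sample]) -> float:
    """Trapezoid-rule integral of CPU percent over time (percent-seconds)."""
    total = 0.0
    for (t0, cpu0), (t1, cpu1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (cpu0 + cpu1) / 2.0
    return total


def duration(samples: List[Sample]) -> float:
    """Elapsed seconds between the first and last sample."""
    return samples[-1][0] - samples[0][0]


def average_utilization(samples: List[Sample]) -> float:
    """Time-weighted average CPU percent for one node."""
    elapsed = duration(samples)
    if elapsed <= 0:
        raise ValueError("need samples spanning a positive time range")
    return area_under_curve(samples) / elapsed


def fleet_utilization(nodes: Dict[str, List[Sample]]) -> float:
    """Aggregate utilization: total area divided by total node-seconds."""
    total_area = sum(area_under_curve(s) for s in nodes.values())
    total_time = sum(duration(s) for s in nodes.values())
    if total_time <= 0:
        raise ValueError("need samples spanning a positive time range")
    return total_area / total_time
```

A node that averages 40% while the fleet averages 60% tells a different story from the same node viewed alone, which is why both views matter.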

4. Memory Usage — Again, You’re paying for it — use it!

Did I mention that I worked in HP Enterprise’s emerging technology lab? I architected and worked on some of the first Moonshot SMC solutions on Elastic, Kubernetes and this streaming service called Kafka.

We pushed all types of memory metrics into HPE OpenView. Cloud providers try to aggregate this for you into a few metrics they think you need. Again, don’t take their word for it. Look at Java Virtual Machine (JVM) memory, app memory, available system memory and much more.
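For the system side of that list, a Linux host already exposes far more than the provider dashboards show. The sketch below, with function names of my own choosing, parses `/proc/meminfo` and computes the share of memory the kernel reports as available. It assumes Linux with the `MemTotal` and `MemAvailable` fields present. JVM and application memory need their own collectors and are not covered here.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo style text into {field name: integer value}.

    Most values are in kB; a few fields (such as HugePages_Total) are counts.
    """
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator:
            continue  # skip anything that is not "Name: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_percent(meminfo: Dict[str, int]) -> float:
    """Percentage of total memory the kernel estimates is available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_system_memory(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the live memory counters on a Linux host."""
    with open(path, encoding="ascii") as handle:
        return parse_meminfo(handle.read())
```

Shipping the full parsed dictionary, not just one percentage, lets you decide later which fields matter for your workload.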

5. Service Specific Metrics — Depends on the service

As mentioned, Kafka Confluent, Oracle Database, Elasticsearch and others all have specific metrics that are recommended by the vendor. In later articles I will give you my top 5 or 10 per service. We’ll have some fun with Oracle and Kafka for sure. To start, just go to something like the JMX metrics page for your service, start reading and jot down anything that resonates. Seriously, don’t shortcut this process, because your application is different from everyone else’s. What is important to you may differ from what is important to me. There will be overlap for sure, but you would be surprised what some people miss.
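Once you have jotted that list down, it helps to keep it as data instead of tribal knowledge. This is a minimal sketch of a rule table checked against a snapshot of metric values. The `AlertRule` class, the `evaluate` function and any metric names you feed it are hypothetical, so substitute the names and thresholds your vendor’s documentation actually recommends.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Supported comparisons between a metric value and its threshold.
_COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlertRule:
    metric: str        # name as it appears in your metrics snapshot
    comparison: str    # ">" or "<"
    threshold: float
    note: str          # why this metric matters to *your* application


def evaluate(rules: List[AlertRule], snapshot: Dict[str, float]) -> List[str]:
    """Return one message per rule that fires or whose metric is missing."""
    alerts: List[str] = []
    for rule in rules:
        if rule.metric not in snapshot:
            # A metric that silently stops reporting is itself a finding.
            alerts.append(f"{rule.metric}: missing from snapshot")
            continue
        if _COMPARATORS[rule.comparison](snapshot[rule.metric], rule.threshold):
            alerts.append(
                f"{rule.metric} {rule.comparison} {rule.threshold}: {rule.note}"
            )
    return alerts
```

Treating a missing metric as an alert is a deliberate choice here: the gaps are often where the surprises hide.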

Cloud Architect Insider (Former Google, Amazon, HPE) writes about products, strategies and tips to help you make smart choices about your cloud providers and partners. I work primarily with Advantis Data Services now. However, my reporting, recommendations and opinions are my own and are reasonably objective. Face it: I’m a Data Scientist with a focus on language analytics. There is always some bias, but I do try :) Peace.

If anything from this article resonates, or you would like more information, please contact Advantis Data.

Gus Segura

Principal Cloud Architect. Former Google, Amazon, HP Enterprise. Now: Advantis Data Services. Futurist, Yoga practitioner, Data Science Engineer, Optimist.