Have you ever gotten frustrated waiting for a web page to load? A slow system can drive users away. That's why understanding how well a system performs is critical. In this short post, we'll explore how to measure system performance using key indicators, so we can keep the system smooth and responsive for everyone.

𝗞𝗲𝘆 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗜𝗻𝗱𝗶𝗰𝗮𝘁𝗼𝗿𝘀 (𝗞𝗣𝗜𝘀)
Just like a doctor uses vitals to check our health, we can use KPIs to assess a system's health. These are basically measurements that tell us how well the system is working. Here are four important ones to keep an eye on:
𝗟𝗮𝘁𝗲𝗻𝗰𝘆: This refers to how long it takes the system to respond to a request. The faster the response (lower latency), the better the experience for users. Imagine a webpage loading in a few seconds instead of minutes.
𝗧𝗵𝗿𝗼𝘂𝗴𝗵𝗽𝘂𝘁: This tells us how many requests the system can process in a given amount of time. Think of it like the number of lanes open on a busy highway. Higher throughput means the system can handle more users without slowing down.
𝗘𝗿𝗿𝗼𝗿𝘀: This metric counts how often the system fails to handle a request correctly. The fewer the errors, the better the user experience. We especially want to minimise errors that users actually notice, like a page that won't load correctly.
𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲 𝗨𝘀𝗮𝗴𝗲: This tells us how heavily the system is using resources like CPU, memory, disk, or network. If those resources are saturated, it's like trying to run too many programs at once: everything slows down. By monitoring resource usage, we can spot when the system needs an upgrade or adjustment before users feel the impact.
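To make the first three KPIs concrete, here is a minimal sketch of how they could be computed from a batch of request records collected over a monitoring window. The `RequestRecord` type and the `summarise` function are hypothetical names invented for this example; resource usage is left out because it usually comes from OS-level metrics rather than request logs.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    latency_ms: float   # how long the request took to serve
    succeeded: bool     # False if the request resulted in an error

def summarise(records: list[RequestRecord], window_seconds: float) -> dict:
    """Compute latency, throughput, and error-rate KPIs for one window."""
    total = len(records)
    errors = sum(1 for r in records if not r.succeeded)
    return {
        "avg_latency_ms": sum(r.latency_ms for r in records) / total,
        "throughput_rps": total / window_seconds,  # requests per second
        "error_rate": errors / total,              # fraction of failed requests
    }
```

For example, four requests served in a 2-second window, one of which failed, would give a throughput of 2 requests per second and an error rate of 0.25.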
𝗛𝗶𝗱𝗱𝗲𝗻 𝗦𝗹𝗼𝘄𝗱𝗼𝘄𝗻𝘀 (𝗧𝗮𝗶𝗹 𝗹𝗮𝘁𝗲𝗻𝗰𝘆)
We've talked about average speed, but there's another important factor to consider: tail latency. This refers to the slowest requests, the ones that take much longer than the average. Imagine a line at the store where most people get served quickly, but a few get stuck waiting much longer. Tail latency can be a real problem, especially as the system gets busier, and it often points to bottlenecks or queues inside the system. By measuring tail latency (for example the 99th percentile, the time under which 99% of requests complete), we can spot these issues before they become widespread and keep things running smoothly.
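A percentile like the p99 can be sketched with a simple nearest-rank calculation over a window of latency samples; the function name here is just an illustration, and production systems typically use histogram-based estimates instead of sorting raw samples.

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Return the p-th percentile latency using the nearest-rank method:
    the smallest sample such that at least p% of samples are <= it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

This is also why averages can hide problems: a handful of very slow requests barely move the mean, but they show up clearly in the p99.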
𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 𝗳𝗼𝗿 𝗮 𝗛𝗮𝗽𝗽𝘆 𝗦𝘆𝘀𝘁𝗲𝗺 𝗮𝗻𝗱 𝗛𝗮𝗽𝗽𝘆 𝗨𝘀𝗲𝗿𝘀
By keeping an eye on these key metrics - latency, throughput, errors, resource usage, and especially tail latency - we can make sure the systems we run stay healthy.

What KPIs do you measure for your system health checks? Let me know in the comments below.