The shared resources on virtualized infrastructure with the most significant impact on application performance are CPU and disk. The remaining infrastructure resources, memory and network, are important as well, but CPU or disk performance was typically the final root cause in the performance troubleshooting I have done over several years. In VMware vSphere we can typically identify CPU contention by the CPU %RDY metric and storage contention by the disk response time of normalized I/Os. We can identify such issues during troubleshooting, when infrastructure consumers complain about application performance. We call this the reactive approach. A more mature approach is to identify potential performance issues before the application is affected. We call this the proactive approach, and that's where performance SLAs and threshold monitoring come into play.
An infrastructure performance SLA can look like this:
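As a side note on the CPU %RDY metric: esxtop reports it directly as a percentage, while vCenter performance charts expose CPU ready as a summation in milliseconds per sampling interval. Below is a minimal sketch of the commonly used conversion, assuming the 20-second real-time chart interval. The function name is illustrative only, and whether the value covers a single vCPU or the whole VM is something to confirm for your own data source.

```python
def ready_ms_to_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (milliseconds) into a percentage.

    Assumption: ready_ms was accumulated over one sampling interval of
    interval_s seconds (20 s for real-time charts).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The interval is converted to milliseconds so both values share a unit.
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    # 400 ms of ready time within a 20 s interval
    print(ready_ms_to_percent(400.0))
```

With longer roll-up intervals (for example, the past-day chart), pass the matching interval length in seconds instead of the default.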
- CPU RDY is below 3% (notification threshold 2%)
- If the number of vDisk IOPS is below 1000, then vDisk response time is below 10 ms (notification threshold 7 ms)
Simple, right? The two bullets above should be clearly articulated, explained and agreed between the infrastructure service provider and the infrastructure consumer who builds and provides application services on top of the infrastructure.
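To make the two SLA rules concrete, here is a minimal sketch of how a monitoring script might classify a single sample against them. The `Sample` class, the field names and the status labels are hypothetical and do not come from vSphere or vROps. The sketch also assumes that "below" means a value equal to the threshold already counts as reached.

```python
from dataclasses import dataclass

# Thresholds taken from the SLA bullets above.
CPU_RDY_SLA_PCT = 3.0
CPU_RDY_NOTIFY_PCT = 2.0
DISK_IOPS_LIMIT = 1000.0
DISK_LATENCY_SLA_MS = 10.0
DISK_LATENCY_NOTIFY_MS = 7.0


@dataclass
class Sample:
    """One measurement for a VM (hypothetical structure)."""
    cpu_ready_pct: float
    vdisk_iops: float
    vdisk_latency_ms: float


def classify(value: float, notify: float, sla: float) -> str:
    """Return 'breach', 'notify' or 'ok' for a single metric value."""
    if value >= sla:
        return "breach"
    if value >= notify:
        return "notify"
    return "ok"


def evaluate(sample: Sample) -> dict:
    """Evaluate both SLA rules for one sample."""
    cpu = classify(sample.cpu_ready_pct, CPU_RDY_NOTIFY_PCT, CPU_RDY_SLA_PCT)
    if sample.vdisk_iops < DISK_IOPS_LIMIT:
        disk = classify(sample.vdisk_latency_ms,
                        DISK_LATENCY_NOTIFY_MS, DISK_LATENCY_SLA_MS)
    else:
        # The latency rule is only agreed for workloads under the IOPS limit.
        disk = "not_applicable"
    return {"cpu": cpu, "disk": disk}


if __name__ == "__main__":
    print(evaluate(Sample(cpu_ready_pct=2.5, vdisk_iops=500, vdisk_latency_ms=8.0)))
```

Keeping the notification threshold separate from the SLA value is what makes the approach proactive: a "notify" status gives the provider time to react before the agreed limit is reached.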
So how do we monitor these performance metrics? I have just found a blog post by Sunny Dua and Iwan Rahabok covering this topic, with a step-by-step solution using vRealize Operations 6.x. Sunny and Iwan prepared and shared with the community customized vROps supermetrics, views and dashboards for performance capacity planning. To be honest, I do not have much experience with vROps so far, but it looks like a very helpful tool for anybody using vRealize Operations as a monitoring platform.
Let's try to build and provide mature IT with clearly articulated SLAs and agreed expectations between service providers and service consumers.