FinOps: IT Infrastructure Optimization
145→114 VMs; utilization norms: CPU 70–90% (P98), RAM 70–80% (P95), HDD 65–80% (P100); 5 savings categories
What doesn't work
The IT infrastructure of 900+ servers had grown organically: each product requested dedicated resources with a safety margin, so utilization stayed low (CPU below 30%, RAM below 40%) while costs grew linearly. There were no utilization norms, and many of the 145 virtual machines analyzed ran at only 10–20% load.
Architectural approach
Implemented a FinOps approach with explicit utilization norms: CPU 70–90% (measured at P98), RAM 70–80% (P95), HDD 65–80% (P100). Consolidated VMs from 145 to 114. Defined five savings categories: VM optimization, server consolidation, license optimization, decommissioning, and open-source migration. Each product's systems were analyzed individually.
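To illustrate how a norm can translate into a sizing decision, here is a minimal sketch. It assumes (the project description does not state this) that load is independent of allocated capacity, so the capacity that would put an observed peak at a target utilization is current × observed / target. The function name `rightsize`, the sample VM figures, and the targets of 80% and 75% chosen inside the norm bands are all hypothetical.

```python
import math

def rightsize(current_capacity, observed_pct, target_pct):
    """Capacity at which the observed peak would sit at target_pct.

    Assumption: the workload's absolute demand does not change when the
    allocation changes, so utilization scales inversely with capacity.
    """
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    # Round up so the resized VM never lands above the target.
    return math.ceil(current_capacity * observed_pct / target_pct)

# Hypothetical VM: 8 vCPU with CPU P98 = 20%, 32 GB RAM with RAM P95 = 30%.
new_vcpu = rightsize(8, observed_pct=20, target_pct=80)    # target inside the 70-90% CPU band
new_ram_gb = rightsize(32, observed_pct=30, target_pct=75)  # target inside the 70-80% RAM band
print(new_vcpu, new_ram_gb)
```

A real plan would also need to respect minimum sizes, licensing units, and growth forecasts, none of which this sketch models.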
What made it hard
System owners inflated resource requests 'just in case', because everyone feared performance degradation during consolidation. Averages could not settle that argument, so the analysis switched to percentiles (P95/P98/P100) to prove actual peak loads and safe headroom. The per-system analysis of 145 VMs was manual work: no automated utilization reporting existed, so the data had to be collected from Zabbix and Prometheus by hand.
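The manual collection step could in principle be scripted. The sketch below only builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) using the PromQL function `quantile_over_time` over a subquery; it does not send any request. It assumes CPU metrics come from node_exporter (`node_cpu_seconds_total`), which the text does not confirm. The host name, the 30-day window, and the 5-minute step are placeholders as well.

```python
from urllib.parse import urlencode

# Assumed metric source: node_exporter. CPU busy % per instance, averaged across cores.
CPU_EXPR = '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'

def percentile_query(expr, percentile, window="30d", step="5m"):
    """Wrap a PromQL expression so it returns the given percentile over a time window."""
    q = percentile / 100  # quantile_over_time expects a value between 0 and 1
    return f"quantile_over_time({q}, ({expr})[{window}:{step}])"

def query_url(base_url, promql):
    """Build an instant-query URL; the caller decides how to fetch it."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

promql = percentile_query(CPU_EXPR, 98)
# Hypothetical Prometheus address.
url = query_url("http://prometheus.example.internal:9090", promql)
print(promql)
print(url)
```

The same wrapper would serve RAM at P95 or disk at P100 by swapping the inner expression and the percentile.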
My role & contribution
CTO / Technical Director
Developed the percentile-based utilization norms methodology (P95/P98/P100 instead of averages). Personally analyzed all 145 VMs. Defined the 5 savings categories and the 145→114 VM consolidation plan. Introduced monthly FinOps reporting with budget review.
How it works
Inventory of 900+ servers → CPU/RAM/HDD utilization measurement → percentile-based norms (P95/P98/P100) → per-system optimization plan. Consolidated 145→114 VMs. Integration with Zabbix/Prometheus/Grafana. Monthly FinOps report: current utilization vs norms, deviations, action plan. Budget review for each new request.
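The "current utilization vs norms, deviations" part of the monthly report can be sketched as follows. The norm bands are taken from the text; everything else is an assumption for illustration: the nearest-rank percentile definition, the function names, the VM name, and the sample values.

```python
import math

# Norms from the text: metric -> (percentile, lower bound %, upper bound %).
NORMS = {
    "cpu": (98, 70.0, 90.0),
    "ram": (95, 70.0, 80.0),
    "hdd": (100, 65.0, 80.0),
}

def percentile(samples, p):
    """Nearest-rank percentile (an assumed definition; monitoring tools may interpolate)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def report_row(vm_name, samples_by_metric):
    """Compare each metric's percentile with its norm band and record the deviation."""
    row = {"vm": vm_name}
    for metric, (p, low, high) in NORMS.items():
        value = percentile(samples_by_metric[metric], p)
        if value < low:
            status, deviation = "under", value - low
        elif value > high:
            status, deviation = "over", value - high
        else:
            status, deviation = "ok", 0.0
        row[metric] = (value, status, deviation)
    return row

# Hypothetical utilization samples (%) for one VM.
samples = {
    "cpu": [12.0, 15.0, 18.0, 22.0, 19.0],
    "ram": [40.0, 42.0, 45.0, 41.0, 44.0],
    "hdd": [66.0, 67.0, 68.0, 70.0, 71.0],
}
row = report_row("vm-example-01", samples)
for metric in NORMS:
    value, status, deviation = row[metric]
    print(f"{row['vm']} {metric}: {value:.0f}% {status} ({deviation:+.0f} pts)")
```

A VM flagged "under" on CPU and RAM, as in this made-up example, would be a candidate for downsizing or consolidation; "over" would argue against it.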
Why this way
Percentile norms instead of average values
Alternative considered: average utilization as the metric (e.g., average CPU below 50% means the server is underloaded).
Average utilization hides peaks: a server averaging 30% may have P98 = 95%. Percentile norms (P95/P98/P100) show the real peak load and the threshold up to which consolidation is safe.
Result: 145→114 VMs without degradation. Each system was analyzed on 3 metrics (CPU, RAM, HDD) × 3 percentiles (P95, P98, P100).
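The claim that an average can hide peaks is easy to check on synthetic data. The series below is invented for illustration (it is not project data), and the nearest-rank percentile is one assumed definition among several in common use.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic CPU utilization (%): 95 quiet samples and 5 short peaks.
cpu = [27.0] * 95 + [95.0] * 5

average = sum(cpu) / len(cpu)
p98 = percentile(cpu, 98)
print(f"avg={average:.1f}%  P98={p98:.1f}%")
```

By the average alone this server looks like an obvious consolidation candidate; by P98 it is already above the 70–90% CPU band.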
Results
- 01. VMs: 145 → 114 (−21%)
- 02. Norms: CPU 70–90% (P98), RAM 70–80% (P95), HDD 65–80% (P100)
- 03. 5 savings categories (VMs, servers, licenses, decommissioning, open-source)
- 04. ₽172M infrastructure savings (within Digital Platform)
- 05. Per-system analysis of each product
Impact on business
21% VM reduction (145→114) without performance degradation. Percentile-based utilization norms (P95/P98/P100) replaced averages, so sizing decisions rest on measured peak load rather than on estimates. The 5 savings categories cover the full stack from hardware to licenses. Freed resources were reused for new products without additional purchases.
Technologies
- Prometheus
- Grafana
- Zabbix
- VMware
- ELK
- FinOps Framework