FinOps: IT Infrastructure Optimization

145→114 VMs; norms: CPU 70–90% (P98), RAM 70–80% (P95), HDD 65–80% (P100); 5 savings categories

Problem

What doesn't work

The IT infrastructure of 900+ servers had grown organically: each product requested dedicated resources with a safety margin, so utilization stayed low (CPU below 30%, RAM below 40%) while costs grew linearly. There were no utilization norms: across 145 virtual machines, many ran at 10–20% load.

Solution

Architectural approach

Implemented a FinOps approach with explicit utilization norms: CPU 70–90% (P98), RAM 70–80% (P95), HDD 65–80% (P100). Consolidated VMs from 145 to 114. Defined five savings categories: VM optimization, server consolidation, license optimization, decommissioning, and open-source migration. Each product was analyzed system by system.
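To make the norms concrete, here is a minimal sketch of how a single VM's measured percentiles could be checked against the three bands above. The band values come from the text; the `classify` helper, the status labels, and the sample VM are hypothetical and only illustrate the idea.

```python
# Norm bands from the text: each resource is judged at its own percentile.
NORMS = {
    "cpu": {"percentile": 98, "low": 70.0, "high": 90.0},
    "ram": {"percentile": 95, "low": 70.0, "high": 80.0},
    "hdd": {"percentile": 100, "low": 65.0, "high": 80.0},
}


def classify(resource: str, value_pct: float) -> str:
    """Compare a percentile utilization value (in %) with its norm band."""
    norm = NORMS[resource]
    if value_pct < norm["low"]:
        return "underused"    # candidate for downsizing or consolidation
    if value_pct > norm["high"]:
        return "overloaded"   # needs more capacity
    return "within norm"


# Hypothetical VM: P98 CPU, P95 RAM and P100 HDD utilization, in percent.
vm = {"cpu": 22.0, "ram": 74.5, "hdd": 91.0}
report = {resource: classify(resource, value) for resource, value in vm.items()}
print(report)
```

The band edges are treated as inclusive here, which is an assumption; the source does not say how boundary values were handled.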

Challenges

What made it hard

System owners inflated resource needs "just in case": everyone feared performance degradation during consolidation. Proving actual peak loads and safe headroom required switching from averages to percentiles (P95/P98/P100). Per-system analysis of the 145 VMs was manual work: no automated utilization metrics existed, so data had to be collected from Zabbix and Prometheus by hand.
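A measured peak turns the headroom argument into simple arithmetic. The sketch below is one possible way to do it, not necessarily the calculation used in the project: it scales the current allocation so that the observed percentile peak lands at a chosen target utilization. The function name `right_size` and the example numbers are hypothetical.

```python
import math


def right_size(allocated: float, peak_pct: float, target_pct: float) -> int:
    """Smallest whole number of units (vCPUs, GB) that keeps the measured
    percentile peak at or below target_pct of the new allocation."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    if peak_pct < 0:
        raise ValueError("peak_pct must be non-negative")
    # Peak demand in units is allocated * peak_pct / 100; dividing by
    # target_pct / 100 gives the allocation that puts the peak at the target.
    return max(1, math.ceil(allocated * peak_pct / target_pct))


# Hypothetical VM with 8 vCPUs whose P98 CPU utilization is 20%:
# the peak uses 1.6 vCPUs, so 2 vCPUs keep it at the 80% target.
print(right_size(allocated=8, peak_pct=20, target_pct=80))   # 2
# Hypothetical VM with 16 GB RAM whose P95 RAM utilization is 40%.
print(right_size(allocated=16, peak_pct=40, target_pct=80))  # 8
```

This ignores growth forecasts and bursts outside the measurement window, so the result is a lower bound to discuss with the system owner rather than a final size.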

Role

My role & contribution

CTO / Technical Director

Developed a percentile-based methodology for utilization norms (P95/P98/P100 instead of averages). Personally analyzed all 145 VMs. Defined the 5 savings categories and the 145→114 VM consolidation plan. Introduced monthly FinOps reporting with budget review.

Implementation

How it works

The process runs in four steps: inventory of 900+ servers → CPU/RAM/HDD utilization measurement → percentile-based norms (P95/P98/P100) → per-system optimization plan. The outcome was a consolidation from 145 to 114 VMs. The process is integrated with Zabbix, Prometheus, and Grafana. A monthly FinOps report compares current utilization with the norms and lists deviations and an action plan. Every new resource request goes through a budget review.
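For the measurement step, percentiles can be computed on the Prometheus side with the PromQL function `quantile_over_time` applied to a subquery. The sketch below only builds the query string and the instant-query URL; it does not contact a server. The node_exporter metric name, the 30-day window, the 5-minute step, and the host `prometheus.example` are all assumptions, since the source does not describe its metrics or query setup.

```python
from urllib.parse import urlencode

# ASSUMPTION: node_exporter metrics; the source does not name its metrics.
CPU_BUSY_EXPR = (
    "100 * (1 - avg by (instance) "
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
)


def percentile_query(expr: str, percentile: int,
                     window: str = "30d", step: str = "5m") -> str:
    """Wrap an instant-vector expression in quantile_over_time via a subquery."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    return f"quantile_over_time({percentile / 100}, ({expr})[{window}:{step}])"


def query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


promql = percentile_query(CPU_BUSY_EXPR, 98)  # P98 CPU, as in the CPU norm
url = query_url("http://prometheus.example:9090", promql)
print(promql)
print(url)
```

The same helper covers the RAM norm with `percentile=95`; for P100 a plain `max_over_time` would be the more direct PromQL choice.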

Architecture Decision

Why this way

Percentile norms instead of average values

Alternative

Average utilization as the metric (e.g. average CPU < 50% means underloaded)

Why it didn't fit

Average utilization hides peaks: a server averaging 30% CPU may still reach 95% at P98. Percentile norms (P95/P98/P100) expose the real load and give a safe consolidation threshold.
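The effect is easy to reproduce on synthetic data. The sketch below uses a made-up day of 5-minute CPU samples (not measurements from the project) and a nearest-rank percentile, which is one of several common percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic day of 5-minute samples (288 points): the server idles at 25%
# for 22.5 hours and runs at 95% during a 90-minute busy window.
samples = [25.0] * 270 + [95.0] * 18

average = sum(samples) / len(samples)
p98 = percentile(samples, 98)
print(f"avg={average:.1f}%  p98={p98:.1f}%")  # avg=29.4%  p98=95.0%
```

Judged by the average, this server looks like an easy consolidation target; judged by P98, it is already above the 70–90% CPU band during its busy window.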

Result

145→114 VMs without performance degradation. Each system was analyzed across 3 metrics × 3 percentiles.

Metrics

Results

01. VMs: 145 → 114 (−21%)
02. Norms: CPU 70–90% P98, RAM 70–80% P95, HDD 65–80% P100
03. 5 savings categories (VMs, servers, licenses, decommission, open-source)
04. ₽172M infrastructure savings (within Digital Platform)
05. Per-system analysis of each product

Business Impact

Impact on business

A 21% VM reduction (145→114) without performance degradation. Percentile-based utilization norms (P95/P98/P100) replaced averages with a measurable basis for sizing. The 5 savings categories cover the full stack from hardware to licenses. Freed resources were reused for new products without additional purchases.

Methods

Algorithms & patterns

  • FinOps (Cloud Economics)
  • Capacity Planning
  • Percentile-based Utilization Norms
  • Infrastructure Audit
  • Per-system Cost Analysis

Stack

Technologies

  • Prometheus
  • Grafana
  • Zabbix
  • VMware
  • ELK
  • FinOps Framework

Serbia-based · CET/CEST timezone · EU-aligned working hours · International contracts experience