AI Operations

Operational management of AI systems including monitoring, performance metrics, SLAs, model retraining, capacity planning, and business continuity.

AI Operations Monitoring

Continuous monitoring of AI systems in production is essential to detect performance degradation, security incidents, and operational anomalies. The AI Operations Monitoring framework establishes standards for observability, alerting, and response.

Monitoring Dimensions

Dimension	Metrics	Alert Threshold	Response
Performance	Latency, throughput, error rate	P95 latency > 500ms	Scale infrastructure or optimise model
Accuracy	Precision, recall, F1, accuracy drift	Accuracy drop > 5% from baseline	Investigate data drift, retrain model
Fairness	Demographic parity, equalised odds	Fairness metric breach	Halt inference, escalate to Ethics Board
Security	Failed auth, anomalous inputs, prompt injection attempts	> 10 suspicious requests / hour	Activate incident response, block IPs
Cost	Inference cost per request, total compute spend	Cost > 120% of budget	Optimise model or negotiate pricing
Compliance	Audit log completeness, data retention	Missing audit events	Investigate logging pipeline
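
The numeric thresholds in the table can be expressed as a simple check over one monitoring window. The following is a minimal sketch, not a prescribed implementation: the `Snapshot` container and its field names are hypothetical, the P95 uses the nearest-rank method, and the accuracy rule assumes that "5%" means five percentage points below baseline (the table does not say whether the drop is absolute or relative). Fairness and Compliance are left out because the table gives no numeric threshold for them.

```python
from dataclasses import dataclass
from typing import List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


@dataclass
class Snapshot:
    # Hypothetical container for one monitoring window (names are illustrative).
    latencies_ms: List[float]
    accuracy: float
    baseline_accuracy: float
    suspicious_requests_last_hour: int
    spend: float
    budget: float


def breached_dimensions(s: Snapshot) -> List[str]:
    """Return the monitoring dimensions whose alert threshold is exceeded."""
    breaches = []
    if p95(s.latencies_ms) > 500:            # P95 latency > 500 ms
        breaches.append("Performance")
    # Assumption: a drop of more than 5 percentage points from baseline.
    if s.baseline_accuracy - s.accuracy > 0.05:
        breaches.append("Accuracy")
    if s.suspicious_requests_last_hour > 10:  # > 10 suspicious requests / hour
        breaches.append("Security")
    if s.spend > 1.20 * s.budget:             # cost > 120% of budget
        breaches.append("Cost")
    return breaches


window = Snapshot(
    latencies_ms=[100] * 18 + [600, 700],
    accuracy=0.85,
    baseline_accuracy=0.92,
    suspicious_requests_last_hour=3,
    spend=1300.0,
    budget=1000.0,
)
print(breached_dimensions(window))
```

In a real deployment these checks would normally live in the alerting rules of the monitoring platform; the sketch only makes the comparison logic explicit.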

Alert Severity Levels

Critical — Immediate human response required (data breach, system compromise, severe bias incident). Page on-call engineer and AI Operations Manager.
High — Response within 1 hour (performance degradation, security anomaly, compliance violation). Notify team lead and log incident.
Medium — Response within 4 hours (cost overrun, minor accuracy drift, non-critical failures). Create ticket and assign owner.
Low — Response within 24 hours (informational alerts, trend warnings). Include in daily operations report.
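
The severity levels map naturally onto a lookup table of response windows and actions. The sketch below is illustrative only: the `Severity` enum and the `response_deadline` helper are hypothetical names, and treating "immediate" as a zero-length window is an assumption made for the example.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Response windows from the severity definitions above.
# Assumption: "immediate" is modelled as a zero-length window.
RESPONSE_WINDOW = {
    Severity.CRITICAL: timedelta(0),
    Severity.HIGH: timedelta(hours=1),
    Severity.MEDIUM: timedelta(hours=4),
    Severity.LOW: timedelta(hours=24),
}

# Notification or follow-up action for each level.
ACTION = {
    Severity.CRITICAL: "Page on-call engineer and AI Operations Manager",
    Severity.HIGH: "Notify team lead and log incident",
    Severity.MEDIUM: "Create ticket and assign owner",
    Severity.LOW: "Include in daily operations report",
}


def response_deadline(severity: Severity, raised_at: datetime) -> datetime:
    """Latest time by which a response is due for an alert raised at `raised_at`."""
    return raised_at + RESPONSE_WINDOW[severity]


raised = datetime(2024, 1, 1, 9, 0)
print(response_deadline(Severity.MEDIUM, raised), "-", ACTION[Severity.MEDIUM])
```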

Service Level Agreement Management

SLAs define the expected service levels for AI systems and establish accountability between AI Operations, business stakeholders, and external customers. All production AI services must have documented SLAs.

Standard SLA Tiers

Tier	Availability	Latency (P95)	Support Hours	Use Case
Tier 1 — Critical	99.99%	< 200ms	24x7	Real-time fraud detection, safety-critical systems
Tier 2 — Standard	99.9%	< 500ms	Business hours + on-call	Customer service chatbots, recommendation engines
Tier 3 — Development	99.0%	< 2000ms	Business hours	Internal tools, prototyping, research
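
An availability target is easier to reason about when converted into a downtime budget. The sketch below does that arithmetic, assuming a 30-day measurement period (the tiers above do not state the period, so this is an assumption) and using a hypothetical `downtime_budget_minutes` helper. `Decimal` is used so the percentages are handled exactly.

```python
from decimal import Decimal

# Availability targets from the tier table, as percentages.
AVAILABILITY_PCT = {
    "Tier 1": Decimal("99.99"),
    "Tier 2": Decimal("99.9"),
    "Tier 3": Decimal("99.0"),
}


def downtime_budget_minutes(availability_pct: Decimal, period_days: int = 30) -> Decimal:
    """Minutes of downtime permitted in the period at the given availability."""
    total_minutes = Decimal(period_days * 24 * 60)
    return total_minutes * (Decimal(100) - availability_pct) / Decimal(100)


for tier, pct in AVAILABILITY_PCT.items():
    print(tier, downtime_budget_minutes(pct), "minutes per 30 days")
```

Under the 30-day assumption this gives 4.32 minutes for Tier 1, 43.2 minutes for Tier 2, and 432 minutes (7.2 hours) for Tier 3.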

SLA Governance

SLAs are proposed by AI Operations and approved by the AI Steering Committee.
SLA breaches trigger automatic incident creation and root cause analysis.
Quarterly SLA review meetings assess performance trends and adjust commitments.
Customer-facing SLAs require Legal review and Executive approval.

Business Continuity & Disaster Recovery

AI systems supporting critical business functions must have robust business continuity and disaster recovery plans. These plans ensure resilience against infrastructure failures, model degradation, and catastrophic events.

Recovery Objectives

System Criticality	RTO	RPO	Failover Method
Critical (Tier 1)	15 minutes	5 minutes	Active-active with automatic failover
Standard (Tier 2)	1 hour	15 minutes	Warm standby with scripted failover
Development (Tier 3)	4 hours	1 hour	Cold standby with manual recovery
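
Recovery objectives are most useful when drill results are checked against them. The following sketch shows one way to do that. The `drill_meets_objectives` function and the tier keys are hypothetical, and the comparison treats the RTO and RPO as inclusive upper bounds, which is an assumption.

```python
from datetime import timedelta

# (RTO, RPO) per criticality level, taken from the table above.
OBJECTIVES = {
    "Tier 1": (timedelta(minutes=15), timedelta(minutes=5)),
    "Tier 2": (timedelta(hours=1), timedelta(minutes=15)),
    "Tier 3": (timedelta(hours=4), timedelta(hours=1)),
}


def drill_meets_objectives(tier: str, recovery_time: timedelta,
                           data_loss_window: timedelta) -> bool:
    """True if the measured recovery time and data loss are within RTO and RPO."""
    rto, rpo = OBJECTIVES[tier]
    return recovery_time <= rto and data_loss_window <= rpo


# Example: a Tier 2 drill that restored service in 40 minutes,
# losing the last 10 minutes of data.
print(drill_meets_objectives("Tier 2", timedelta(minutes=40), timedelta(minutes=10)))
```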

Backup Requirements

Model artifacts are versioned and stored in immutable object storage with cross-region replication.
Training datasets are backed up daily and retained for at least 30 days.
Configuration and prompt registries are backed up with every change.
Audit logs are forwarded to the central SIEM and retained for 7 years.
Disaster recovery drills are conducted quarterly, and the results are documented.
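
The daily-backup and 30-day-retention rule for training datasets can be audited with two small checks. This is a sketch under stated assumptions: the function names are hypothetical, backups are identified only by calendar date, and a backup is considered eligible for pruning only once it is strictly older than 30 days.

```python
from datetime import date, timedelta
from typing import Iterable, List

MIN_RETENTION = timedelta(days=30)  # minimum retention for training datasets


def missing_backup_days(backup_dates: Iterable[date], start: date, end: date) -> List[date]:
    """Days in the inclusive range [start, end] that have no backup."""
    have = set(backup_dates)
    span = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(span))
    return [day for day in all_days if day not in have]


def may_prune(backup_date: date, today: date) -> bool:
    """True only if the backup is older than the minimum retention period."""
    return today - backup_date > MIN_RETENTION


backups = [date(2024, 3, 1), date(2024, 3, 3)]
print(missing_backup_days(backups, date(2024, 3, 1), date(2024, 3, 3)))
print(may_prune(date(2024, 1, 1), today=date(2024, 3, 1)))
```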

Model Version Rollback

All production AI systems must maintain the ability to roll back to the previous model version within 10 minutes. Model versions must be validated and approved before promotion to production.
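
The two rules above (approval before promotion, and rollback to the previous version) can be illustrated with a small in-memory registry. `ModelRegistry` and its methods are hypothetical and stand in for whatever model registry is actually in use. The sketch shows only the state transitions and does not enforce the 10-minute limit, which depends on the deployment tooling.

```python
from typing import List, Optional, Set


class ModelRegistry:
    """Hypothetical in-memory registry that tracks approvals and production history."""

    def __init__(self) -> None:
        self._approved: Set[str] = set()
        self._history: List[str] = []  # production versions, oldest first

    def approve(self, version: str) -> None:
        """Record that a version has been validated and approved."""
        self._approved.add(version)

    def promote(self, version: str) -> None:
        """Promote an approved version to production."""
        if version not in self._approved:
            raise ValueError(f"{version} is not approved for production")
        self._history.append(version)

    @property
    def current(self) -> Optional[str]:
        """Version currently serving production, if any."""
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """Revert production to the previous version and return it."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]


registry = ModelRegistry()
registry.approve("v1")
registry.approve("v2")
registry.promote("v1")
registry.promote("v2")
print(registry.rollback(), registry.current)
```

Keeping the previous version's artifacts deployed or pre-loaded is one common way to make a rollback fast enough to meet a tight time limit, though the right approach depends on the serving platform.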