AI Operations
Operational management of AI systems including monitoring, performance metrics, SLAs, model retraining, capacity planning, and business continuity.
AI Operations Monitoring
Continuous monitoring of AI systems in production is essential to detect performance degradation, security incidents, and operational anomalies. The AI Operations Monitoring framework establishes standards for observability, alerting, and response.
Monitoring Dimensions
| Dimension | Metrics | Alert Threshold | Response |
|---|---|---|---|
| Performance | Latency, throughput, error rate | P95 latency > 500ms | Scale infrastructure or optimise model |
| Accuracy | Precision, recall, F1, accuracy drift | Accuracy drop > 5% from baseline | Investigate data drift, retrain model |
| Fairness | Demographic parity, equalised odds | Fairness metric breach | Halt inference, escalate to Ethics Board |
| Security | Failed authentication attempts, anomalous inputs, prompt injection attempts | > 10 suspicious requests per hour | Activate incident response, block offending IP addresses |
| Cost | Inference cost per request, total compute spend | Cost > 120% of budget | Optimise model or negotiate pricing |
| Compliance | Audit log completeness, data retention | Missing audit events | Investigate logging pipeline |
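The numeric thresholds in the table can be checked mechanically. The following is a minimal sketch, not a reference implementation: the names `Alert`, `p95`, and `evaluate_window` are hypothetical, the percentile uses the nearest-rank method, and the "5%" accuracy threshold is assumed to mean 5 percentage points of absolute accuracy (the table does not say whether it is absolute or relative).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Alert:
    dimension: str
    detail: str


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def evaluate_window(
    latencies_ms: Sequence[float],
    baseline_accuracy: float,
    current_accuracy: float,
    spend: float,
    budget: float,
    suspicious_requests_per_hour: int,
) -> List[Alert]:
    """Return one alert per breached threshold from the monitoring table."""
    alerts: List[Alert] = []

    latency = p95(latencies_ms)
    if latency > 500:
        alerts.append(Alert("Performance", f"P95 latency {latency:.0f}ms > 500ms"))

    # Assumption: "5%" is read as 5 percentage points of absolute accuracy.
    drop = baseline_accuracy - current_accuracy
    if drop > 0.05:
        alerts.append(Alert("Accuracy", f"accuracy dropped {drop:.1%} from baseline"))

    if suspicious_requests_per_hour > 10:
        alerts.append(Alert("Security", "more than 10 suspicious requests in an hour"))

    if spend > 1.20 * budget:
        alerts.append(Alert("Cost", "spend exceeds 120% of budget"))

    return alerts
```

Fairness and Compliance are left out of the sketch because the table gives no numeric threshold for them.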
Alert Severity Levels
- Critical — Immediate human response required (data breach, system compromise, severe bias incident). Page on-call engineer and AI Operations Manager.
- High — Response within 1 hour (performance degradation, security anomaly, compliance violation). Notify team lead and log incident.
- Medium — Response within 4 hours (cost overrun, minor accuracy drift, non-critical failures). Create ticket and assign owner.
- Low — Response within 24 hours (informational alerts, trend warnings). Include in daily operations report.
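The severity levels above map naturally onto a response window and a routing action. The sketch below is one way to encode that mapping; `Severity`, `RESPONSE_WINDOW`, `ACTION`, `response_deadline`, and `is_overdue` are hypothetical names, and "immediate" is modelled as a zero-length window, which is an assumption.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Response windows from the severity list; "immediate" is modelled as zero delay.
RESPONSE_WINDOW = {
    Severity.CRITICAL: timedelta(0),
    Severity.HIGH: timedelta(hours=1),
    Severity.MEDIUM: timedelta(hours=4),
    Severity.LOW: timedelta(hours=24),
}

# Routing actions, paraphrased from the severity list.
ACTION = {
    Severity.CRITICAL: "page on-call engineer and AI Operations Manager",
    Severity.HIGH: "notify team lead and log incident",
    Severity.MEDIUM: "create ticket and assign owner",
    Severity.LOW: "include in daily operations report",
}


def response_deadline(severity: Severity, raised_at: datetime) -> datetime:
    """Latest time by which a response is expected."""
    return raised_at + RESPONSE_WINDOW[severity]


def is_overdue(severity: Severity, raised_at: datetime, now: datetime) -> bool:
    """True once the response window has elapsed without being met."""
    return now > response_deadline(severity, raised_at)
```

For example, a Medium alert raised at 09:00 would have a response deadline of 13:00 the same day.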
Service Level Agreement Management
SLAs define the expected service levels for AI systems and establish accountability between AI Operations, business stakeholders, and external customers. All production AI services must have documented SLAs.
Standard SLA Tiers
| Tier | Availability | Latency (P95) | Support Hours | Use Case |
|---|---|---|---|---|
| Tier 1 — Critical | 99.99% | < 200ms | 24x7 | Real-time fraud detection, safety-critical systems |
| Tier 2 — Standard | 99.9% | < 500ms | Business hours + on-call | Customer service chatbots, recommendation engines |
| Tier 3 — Development | 99.0% | < 2000ms | Business hours | Internal tools, prototyping, research |
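An availability percentage is easier to reason about as a downtime budget. The sketch below converts each tier's availability target into allowed downtime, assuming a 30-day measurement window (the window length is an assumption; the table does not state one). `downtime_budget_minutes` and `AVAILABILITY_PCT` are hypothetical names, and `Decimal` is used to avoid floating-point rounding.

```python
from decimal import Decimal

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes; the window is an assumption

# Availability targets from the SLA tier table.
AVAILABILITY_PCT = {
    "Tier 1": Decimal("99.99"),
    "Tier 2": Decimal("99.9"),
    "Tier 3": Decimal("99.0"),
}


def downtime_budget_minutes(
    availability_pct: Decimal, period_minutes: int = MINUTES_PER_30_DAYS
) -> Decimal:
    """Minutes of downtime permitted in the period at the given availability."""
    return (Decimal(100) - availability_pct) / Decimal(100) * period_minutes


if __name__ == "__main__":
    for tier, pct in AVAILABILITY_PCT.items():
        print(tier, downtime_budget_minutes(pct), "minutes per 30 days")
```

Under this assumption the budgets are 4.32 minutes for Tier 1, 43.2 minutes for Tier 2, and 432 minutes for Tier 3 per 30 days.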
SLA Governance
- SLAs are proposed by AI Operations and approved by the AI Steering Committee.
- SLA breaches trigger automatic incident creation and root cause analysis.
- Quarterly SLA review meetings assess performance trends and adjust commitments.
- Customer-facing SLAs require Legal review and Executive approval.
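Automatic incident creation on an SLA breach could look roughly like the following. This is a sketch under stated assumptions: `SlaTarget`, `Incident`, `TARGETS`, and `check_sla` are hypothetical names, the targets are copied from the Standard SLA Tiers table, and a measured P95 equal to the limit is treated as a breach because the table states the latency target as a strict "less than".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SlaTarget:
    min_availability_pct: float
    p95_limit_ms: float


# Targets copied from the Standard SLA Tiers table.
TARGETS = {
    "tier1": SlaTarget(99.99, 200),
    "tier2": SlaTarget(99.9, 500),
    "tier3": SlaTarget(99.0, 2000),
}


@dataclass
class Incident:
    service: str
    tier: str
    breaches: List[str]
    needs_root_cause_analysis: bool = True  # every breach triggers an RCA


def check_sla(
    service: str, tier: str, availability_pct: float, p95_ms: float
) -> Optional[Incident]:
    """Return an Incident if any SLA target is missed, otherwise None."""
    target = TARGETS[tier]
    breaches: List[str] = []
    if availability_pct < target.min_availability_pct:
        breaches.append("availability")
    if p95_ms >= target.p95_limit_ms:  # the SLA requires strictly less than the limit
        breaches.append("latency")
    return Incident(service, tier, breaches) if breaches else None
```

A service measured at 99.95% availability would pass Tier 2 but open an incident under Tier 1.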
Business Continuity & Disaster Recovery
AI systems supporting critical business functions must have robust business continuity and disaster recovery plans. These plans ensure resilience against infrastructure failures, model degradation, and catastrophic events.
Recovery Objectives
| System Criticality | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) | Failover Method |
|---|---|---|---|
| Critical (Tier 1) | 15 minutes | 5 minutes | Active-active with automatic failover |
| Standard (Tier 2) | 1 hour | 15 minutes | Warm standby with scripted failover |
| Development (Tier 3) | 4 hours | 1 hour | Cold standby with manual recovery |
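Recovery objectives are only meaningful if drills are measured against them. The sketch below compares a drill's measured recovery time and data-loss window with the targets in the table; `Objectives`, `OBJECTIVES`, and `evaluate_drill` are hypothetical names, and how the two durations are measured is left to the drill procedure.

```python
from datetime import timedelta
from typing import Dict, NamedTuple


class Objectives(NamedTuple):
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


# Targets copied from the Recovery Objectives table.
OBJECTIVES = {
    "critical": Objectives(timedelta(minutes=15), timedelta(minutes=5)),
    "standard": Objectives(timedelta(hours=1), timedelta(minutes=15)),
    "development": Objectives(timedelta(hours=4), timedelta(hours=1)),
}


def evaluate_drill(
    criticality: str, recovery_time: timedelta, data_loss_window: timedelta
) -> Dict[str, bool]:
    """Report whether a recovery drill met the RTO and RPO for its criticality."""
    target = OBJECTIVES[criticality]
    return {
        "rto_met": recovery_time <= target.rto,
        "rpo_met": data_loss_window <= target.rpo,
    }
```

For instance, a Tier 1 drill that restores service in 12 minutes but loses 7 minutes of data meets the RTO and misses the RPO.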
Backup Requirements
Model Version Rollback
All production AI systems must maintain the ability to roll back to the previous model version within 10 minutes. Model versions must be validated and approved before promotion to production.
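The two rules in this requirement, approval before promotion and rollback to the previous version, can be illustrated with a small in-memory registry. This is a sketch only: `ModelRegistry` and its methods are hypothetical and do not represent any particular model registry product, and the 10-minute limit would still need to be verified by timing real rollbacks.

```python
from typing import List, Set


class ModelRegistry:
    """Tracks approved versions and the order in which they reached production."""

    def __init__(self) -> None:
        self._approved: Set[str] = set()
        self._production_history: List[str] = []  # oldest first, current last

    def approve(self, version: str) -> None:
        """Record that a version has passed validation and approval."""
        self._approved.add(version)

    def promote(self, version: str) -> None:
        """Promote a version to production; unapproved versions are rejected."""
        if version not in self._approved:
            raise PermissionError(f"{version} has not been validated and approved")
        self._production_history.append(version)

    @property
    def current(self) -> str:
        if not self._production_history:
            raise LookupError("no model version is in production")
        return self._production_history[-1]

    def rollback(self) -> str:
        """Revert to the previous production version and return it."""
        if len(self._production_history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        self._production_history.pop()
        return self.current
```

Keeping the previous version in the history, rather than re-deploying it from scratch, is what makes a fast rollback plausible in this design.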