TL;DR
- Managed Operations is Yobitel's 24/7 NOC for AI infrastructure — covering Yobitel NeoCloud, customer-built AWS EKS / Azure AKS / GCP GKE clusters, on-premise Kubernetes, bare-metal GPU clusters, and hybrid estates that span more than one of the above.
- Three SLA tiers — Standard, Premium, and Mission-critical — with RTO and RPO commitments scaling per tier, P1 response in under 15 minutes on Mission-critical, and named Technical Account Managers from Premium upward.
- Onboarding is a structured engagement (discovery -> baseline -> watch -> incident response) with a typical four-to-eight-week ramp before the SLA goes live on the customer's estate.
- Yobitel reads the customer's existing Prometheus and Grafana, integrates incident webhooks into the customer's PagerDuty or Opsgenie, and follows ITIL-aligned change management. The customer keeps ownership of every system; Yobitel runs the operations rota.
- Pricing is per-node or per-cluster per month in USD; the FOCUS 1.1 billing export carries the same column shape as the rest of the Yobitel stack so Managed Operations spend lands in the customer's FinOps pipeline alongside compute.
Overview
Production AI infrastructure rarely fails dramatically. It drifts. Certificates expire on a Sunday. A GPU node's NVLink throughput collapses to half. A Kubernetes upgrade silently breaks the DCGM exporter. A billing job hangs and a workspace overruns its budget by 4x in the time it takes anyone to notice. None of these failures are interesting individually; each of them is the kind of thing that a 24/7 operations rota with the right runbook catches in minutes and that an under-staffed in-house team catches in days. Managed Operations is Yobitel's service for owning that rota on the customer's behalf, against the customer's own estate.
The service is genuinely cross-estate. A customer might run Yobibyte on NeoCloud for production inference, EKS in `eu-west-2` for application services, an on-premise Kubernetes cluster for regulated training, and a bare-metal GPU pod for research. Managed Operations covers all four under one contract, one incident webhook, one monthly review, and one named TAM. The watch is structured around incident severity (P1 through P4), SLA tier, and a runbook library that grows with the customer.
Yobitel runs the rota; the customer keeps every system. The customer's Kubernetes clusters, cloud accounts, on-premise hardware, and data remain customer-owned and customer-controlled. Yobitel reads the customer's existing Prometheus, Grafana, OpenTelemetry, and log-aggregation surfaces; integrates incident webhooks into the customer's PagerDuty, Opsgenie, or ServiceNow; and follows the customer's change-approval workflow alongside Yobitel's ITIL-aligned process. There is no Yobitel-installed agent that has read-write access the customer cannot revoke.
Yobitel Communications, the UK-headquartered AI infrastructure company that delivers Managed Operations, sells the service as a per-node or per-cluster monthly subscription in USD. UK NCSC OFFICIAL alignment is the default posture for the operations team; ISO 27001 and SOC 2 Type II cover the wider control set. Premium and Mission-critical tiers include a named TAM as the single point of contact for both operational matters and capacity planning.
Quick start — onboarding a customer estate
Onboarding follows a structured four-stage flow that takes most customers four to eight weeks from contract sign to SLA going live on the watched estate. The stages run in sequence and the customer's TAM and operations lead are the joint owners; the customer's existing platform and SRE teams stay in the loop throughout.
Stage one — discovery. The TAM and an operations engineer walk through the customer's estate: which clusters, which clouds, which on-premise hardware, which observability stack, which incident-management tooling, which change-approval workflow, which compliance pin. The output is an estate map and a baseline runbook list pulled from Yobitel's existing library, annotated with what the customer already has covered and what the watch will need.
Stage two — baseline. Yobitel reads the customer's Prometheus and Grafana, integrates the incident webhook into the customer's PagerDuty or Opsgenie, registers a Yobitel pager rotation, and walks the joint team through the standard P1-P4 severity definitions. The baseline stage produces the first month of run-of-rota data — actual on-call load, the noisiest alerts, the most-fired runbooks — without yet pulling the trigger on the SLA.
Stage three — watch. The SLA goes live. Yobitel's NOC takes the on-call rotation; the customer's team stays available for warm hand-off during the first month of live watch. Monthly service reviews start at this stage and continue for the life of the engagement.
Stage four — incident response. The first real incident under SLA is handled jointly; the post-mortem is co-owned. By the end of the second month under SLA, the customer's team is typically able to step out of warm hand-off and rely on the NOC for the watched envelope.
Onboarding lands faster when the customer's observability surface is already centralised — even if the dashboards are imperfect. Yobitel can extend an existing Grafana setup in days; rebuilding observability from scratch adds weeks to the baseline stage.
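
The run-of-rota data from the baseline stage lends itself to a simple ranking. The sketch below assumes alert-fire events have already been exported from the customer's alerting stack as a flat list of alert names. The alert names and the `noisiest_alerts` helper are hypothetical and are not part of any Yobitel tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisiest_alerts(fired: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Rank alert names by how often they fired during the baseline window."""
    return Counter(fired).most_common(top)


# Hypothetical alert names, for illustration only.
baseline_events = [
    "DCGMExporterDown", "CertExpiringSoon", "DCGMExporterDown",
    "NodeNotReady", "DCGMExporterDown", "CertExpiringSoon",
]

for name, count in noisiest_alerts(baseline_events):
    print(f"{name}: fired {count} times")
```

The same counting approach would apply to the most-fired runbooks, given a list of runbook identifiers instead of alert names.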
Concepts
Managed Operations exposes a small set of concepts that match how an operations leader thinks about a 24/7 watch. The mental model is incident severity at the centre, with watch tier, runbook library, and on-call schedule around it.
- Watch tier — the SLA tier the engagement is signed at. Standard (business-hours watch with on-call escalation), Premium (24/7 watch with named TAM), or Mission-critical (24/7 watch with redundant on-call rotation, sub-15-minute P1 response, and quarterly tabletop exercises).
- Incident severity — the P1-P4 classification applied at the moment an alert fires. P1 is customer-impacting and time-critical (e.g. production inference down, billing pipeline halted). P2 is degraded but contained (e.g. one node out of a cluster). P3 is non-customer-impacting (e.g. certificate expiry warning). P4 is informational.
- Runbook — the documented response to a specific alert or incident class. Yobitel maintains a runbook library that the customer's specific runbooks extend; runbooks are versioned in both the customer's Git and Yobitel's, and are reviewed and approved through the standard change-approval workflow.
- On-call schedule — the rota and escalation path. Yobitel's NOC primary, customer's secondary (optional), customer's escalation owner (required), TAM (Premium and Mission-critical only). The schedule is published in the customer's incident-management tool.
- Capacity planning cycle — the monthly review that surfaces utilisation, drift, and capacity recommendations. Drives reservation, scale-out, and right-sizing decisions; for Yobibyte and NeoCloud customers, the cycle integrates with the same FOCUS export the customer sees.
- Scope envelope — the explicit list of clusters, clouds, on-premise hardware, and software surfaces under SLA. Anything outside the envelope is best-effort or excluded. The envelope changes through the standard change-management process, not in-flight.
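
The on-call schedule described in the list above can be modelled as a small data structure. This is a minimal sketch, not Yobitel's implementation: the `Engagement` class and `escalation_path` function are hypothetical names, and the escalation order is an assumption taken from the order in which the roles are listed.

```python
from dataclasses import dataclass
from typing import List, Optional

TAM_TIERS = {"Premium", "Mission-critical"}  # tiers that include a named TAM


@dataclass(frozen=True)
class Engagement:
    tier: str                                  # "Standard", "Premium", or "Mission-critical"
    escalation_owner: str                      # customer-side contact, required
    customer_secondary: Optional[str] = None   # customer-side contact, optional


def escalation_path(e: Engagement) -> List[str]:
    """Build the ordered escalation path (order is an assumption)."""
    if not e.escalation_owner:
        raise ValueError("a customer escalation owner is required")
    path = ["Yobitel NOC primary"]
    if e.customer_secondary:
        path.append(e.customer_secondary)
    path.append(e.escalation_owner)
    if e.tier in TAM_TIERS:
        path.append("TAM")
    return path


print(escalation_path(Engagement("Premium", "platform-lead", "sre-on-call")))
```

Keeping the required and optional roles explicit in the type makes a missing escalation owner fail at schedule-build time rather than during an incident.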
SLA tiers and commitments
Three SLA tiers cover the bulk of customer needs. The tier defines the watch window, the response commitments, the recovery commitments, and the included surfaces; the customer's actual scope envelope sits inside the tier.
| Tier | Watch window | P1 response | P2 response | RTO target | RPO target | Included surfaces |
|---|---|---|---|---|---|---|
| Standard | Business hours plus on-call escalation | < 60 minutes | < 4 hours | 4 hours | 1 hour | Up to 50 nodes or 4 clusters; single-region; primary observability stack. |
| Premium | 24/7 | < 30 minutes | < 2 hours | 1 hour | 15 minutes | Up to 500 nodes or 20 clusters; multi-region; named TAM; quarterly business review. |
| Mission-critical | 24/7 with redundant rotation | < 15 minutes | < 1 hour | 15 minutes | 5 minutes | Up to 5,000 nodes or 100 clusters; multi-region with DR; named TAM; quarterly tabletop exercises; executive escalation path. |
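
The response and recovery columns of the table can be encoded as data and checked mechanically. The sketch below copies the figures from the table; the `TierSla` class and `response_met` function are hypothetical names, and a real SLA report would also need the contract's definition of when the response clock starts.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierSla:
    p1_response: timedelta
    p2_response: timedelta
    rto: timedelta
    rpo: timedelta


# Values copied from the SLA tier table above.
SLA = {
    "Standard": TierSla(timedelta(minutes=60), timedelta(hours=4),
                        timedelta(hours=4), timedelta(hours=1)),
    "Premium": TierSla(timedelta(minutes=30), timedelta(hours=2),
                       timedelta(hours=1), timedelta(minutes=15)),
    "Mission-critical": TierSla(timedelta(minutes=15), timedelta(hours=1),
                                timedelta(minutes=15), timedelta(minutes=5)),
}


def response_met(tier: str, severity: str, elapsed: timedelta) -> bool:
    """Return True if the first response landed inside the tier's commitment."""
    sla = SLA[tier]
    limit = {"P1": sla.p1_response, "P2": sla.p2_response}[severity]
    return elapsed < limit  # the table states strict "<" limits


print(response_met("Mission-critical", "P1", timedelta(minutes=12)))
```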
Supported infrastructure
Managed Operations covers a broad envelope of customer-built and Yobitel-built infrastructure. The supported list below is the envelope under SLA today; anything outside the envelope is handled as Professional Services rather than ongoing operations.
| Category | Supported | Notes |
|---|---|---|
| Yobitel-managed | Yobibyte workspaces, NeoCloud reservations, AI Applications (MediQuery and the wider suite) | Bundled at a reduced rate when Yobibyte or NeoCloud is the primary contract. |
| Hyperscaler Kubernetes | AWS EKS, Azure AKS, GCP GKE, Oracle OKE | GPU node pools supported across all four hyperscalers. |
| On-premise Kubernetes | Vanilla Kubernetes, Rancher RKE2, Red Hat OpenShift, SUSE Rancher | Includes air-gapped clusters with no internet egress. |
| Bare-metal GPU | NVIDIA DGX SuperPOD, HGX systems, AMD MI300X reference designs | Includes facility coordination via NeoCloud Operations on partner-build estates. |
| Observability | Prometheus, Grafana, OpenTelemetry, Loki, Tempo, customer-managed Mimir or Cortex | Yobitel reads the customer's existing stack rather than installing a parallel one. |
| Incident management | PagerDuty, Opsgenie, ServiceNow, Atlassian Jira Service Management | Webhook integration; no read-write agent installed. |
| Change approval | ITIL-aligned process integrating with ServiceNow, Jira, or customer-built tooling | Yobitel's change workflow runs alongside the customer's. |
| Identity | OIDC (Okta, Microsoft Entra ID, Auth0, Keycloak, Google Workspace) + SCIM 2.0 | RBAC for the customer's runbook access and the TAM's read scope. |
Engagement size by infrastructure scale
The sizing below comes from production engagements at small, mid-market, and enterprise scale. Node counts are aggregate across the watched estate; the recommended tier is driven primarily by criticality, not node count.
| Estate scale | Watched nodes / clusters | Recommended tier | Yobitel engagement size | Indicative price band |
|---|---|---|---|---|
| Small (single product) | 10 - 50 nodes / 1 - 4 clusters | Standard | Shared NOC pool; on-call escalation | $5K - $15K / month |
| Mid-market (multi-product) | 50 - 250 nodes / 4 - 12 clusters | Premium | Dedicated TAM; quarterly business review | $25K - $80K / month |
| Enterprise (platform) | 250 - 1,000 nodes / 12 - 50 clusters | Premium or Mission-critical | Dedicated TAM; redundant on-call rotation | $80K - $250K / month |
| Mission-critical (regulated) | 1,000 - 5,000 nodes / 50 - 100 clusters | Mission-critical | Dedicated TAM; quarterly tabletop; executive escalation | $250K - $1M+ / month |
| Multi-tenant operator | 5,000+ nodes / 100+ clusters | Mission-critical with custom envelope | Multiple TAMs by region; integrated NOC | Custom |
Scope envelope — in and out
Managed Operations is a service contract, so the relevant question is not 'what is the limit?' but 'what is in scope?'. The table below is the standard envelope; customer-specific extensions and exclusions are documented in the signed scope statement and reviewed quarterly.
| Area | In scope (default) | Out of scope (default) | Notes |
|---|---|---|---|
| Infrastructure availability | Cluster, node, and pod-level availability | Application-level availability of customer code | Customer code is the customer's; Yobitel monitors the platform it runs on. |
| Incident response | Infrastructure incidents at P1-P4 | Customer-code incidents | Yobitel can extend scope to application code under Premium with custom runbooks. |
| Patching | Cluster, OS, and platform component patching on change-approval cadence | Customer application image patching | Customer-application patching covered under Professional Services. |
| Security | Continuous vulnerability scanning, patch coordination, identity drift detection | Application-layer pen testing | Pen testing covered under Professional Services. |
| Capacity planning | Monthly capacity review with reservation and right-sizing recommendations | Procurement execution | Yobitel makes recommendations; customer executes procurement (often via Omniscient Compute). |
| Cost optimisation | FOCUS-export analysis with quarterly findings | Implementation of cost-optimisation changes | Changes implemented through standard change-management. |
| Disaster recovery | DR runbook and testing on Premium and Mission-critical; quarterly exercises on Mission-critical | DR strategy design | Strategy design covered under Professional Services. |
| Compliance audit | Evidence collection for SOC 2, ISO 27001, NCSC, HIPAA | Audit certification itself | Yobitel provides evidence; the customer's auditor certifies. |
Pricing
Managed Operations is priced per-node or per-cluster per month in USD on a tiered base rate. The base rate covers the SLA tier; per-incident surcharges apply only to engagements outside the standard scope envelope. Pricing is delivered as a FOCUS 1.1 line item alongside the rest of the Yobitel stack so spend rolls up into the customer's FinOps pipeline without manual reconciliation.
| Tier | Per-node $/month | Per-cluster $/month base | Included incidents per quarter | Notes |
|---|---|---|---|---|
| Standard | $45 - $80 | $1,500 - $3,000 | Up to 8 P1/P2 | Single-region, business-hours watch. |
| Premium | $95 - $150 | $5,000 - $10,000 | Up to 24 P1/P2 | 24/7 watch, named TAM, quarterly review. |
| Mission-critical | $180 - $320 | $15,000 - $35,000 | Unlimited | 24/7 redundant watch, sub-15-minute P1, tabletop exercises. |
| Yobibyte/NeoCloud bundled discount | 20 - 30% off the per-node rate | — | Per tier | Applied when Yobibyte or NeoCloud is the primary contract. |
| On-premise bare-metal surcharge | $10 - $30 per node | — | Per tier | Covers facility coordination and on-site hand-off. |
The per-node rate scales down with watched fleet size; large enterprise contracts (1,000+ nodes) typically settle in the lower band per node, while small contracts (under 50 nodes) settle higher per node to cover fixed TAM and rota overhead.
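
A worked estimate helps show how the per-node components combine. This is a rough sketch under stated assumptions: the bundled discount is applied to the per-node rate only, the bare-metal surcharge is added afterwards without discount, and the specific rate chosen inside each band is illustrative. The `monthly_estimate` function is a hypothetical helper, not a Yobitel pricing API.

```python
# Per-node USD/month bands copied from the pricing table above.
PER_NODE_BAND = {
    "Standard": (45, 80),
    "Premium": (95, 150),
    "Mission-critical": (180, 320),
}


def monthly_estimate(tier: str, nodes: int, rate: float,
                     bundled_discount: float = 0.0,
                     bare_metal_nodes: int = 0,
                     bare_metal_surcharge: float = 0.0) -> float:
    """Estimate monthly USD spend for a per-node contract (assumed formula)."""
    low, high = PER_NODE_BAND[tier]
    if not low <= rate <= high:
        raise ValueError(f"rate {rate} is outside the {tier} band {low}-{high}")
    if bundled_discount and not 0.20 <= bundled_discount <= 0.30:
        raise ValueError("bundled discount is quoted as 20-30%")
    if bare_metal_nodes > nodes:
        raise ValueError("bare-metal nodes cannot exceed watched nodes")
    if bare_metal_nodes and not 10 <= bare_metal_surcharge <= 30:
        raise ValueError("bare-metal surcharge is quoted as $10-$30 per node")
    base = nodes * rate * (1 - bundled_discount)           # discount on per-node rate
    return base + bare_metal_nodes * bare_metal_surcharge  # surcharge added after


# 250 Premium nodes at an assumed $120 per node, no discount.
print(monthly_estimate("Premium", 250, 120.0))
# Same fleet with a 25% bundled discount and 40 bare-metal nodes at $20 each.
print(monthly_estimate("Premium", 250, 120.0, 0.25, 40, 20.0))
```

Under these assumptions the first case comes to $30,000 per month and the second to $23,300 per month. Actual quotes depend on the signed scope statement.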
Security and compliance
Managed Operations is delivered by a UK-headquartered team operating under NCSC Cloud Security Principles. UK NCSC OFFICIAL alignment is the default posture for the watch; EU and US engagements layer GDPR, EU AI Act high-risk-system obligations (where the watched estate is in scope), HIPAA, and SOC 2 Type II on top. The operations team's own controls (background checks, BYOD restrictions, separation of duties, immutable audit logging) sit under ISO 27001 and SOC 2 Type II.
Access to the customer's estate is read-only by default. Any write access required for incident response is requested through the customer's change-approval workflow with the relevant runbook attached; the customer can pre-approve named runbooks for in-incident write access, or require live approval for every write. Every action Yobitel takes on the customer's estate is logged to the customer's audit surface, not a Yobitel-only one.
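
The access model in the paragraph above reduces to a small decision rule. The following is a minimal sketch, assuming each request is either a read or a write and that writes always reference a runbook. The `authorise` function, the `Decision` enum, and the runbook names are hypothetical, and the in-memory list stands in for the customer's audit surface.

```python
from enum import Enum
from typing import List, Optional, Set, Tuple

AuditEntry = Tuple[str, Optional[str], str]  # (action, runbook, decision)


class Decision(Enum):
    ALLOW = "allow"
    LIVE_APPROVAL_REQUIRED = "live_approval_required"


def authorise(action: str, runbook: Optional[str],
              pre_approved: Set[str], audit_log: List[AuditEntry]) -> Decision:
    """Decide whether an action may proceed and record the decision."""
    if action not in {"read", "write"}:
        raise ValueError(f"unknown action: {action}")
    if action == "read":
        decision = Decision.ALLOW                   # read-only by default
    elif runbook is None:
        raise ValueError("write requests must have a runbook attached")
    elif runbook in pre_approved:
        decision = Decision.ALLOW                   # customer pre-approved this runbook
    else:
        decision = Decision.LIVE_APPROVAL_REQUIRED  # goes through change approval
    audit_log.append((action, runbook, decision.value))
    return decision


log: List[AuditEntry] = []
approved = {"rotate-tls-cert"}  # hypothetical runbook name
print(authorise("write", "rotate-tls-cert", approved, log))
print(authorise("write", "resize-node-pool", approved, log))
```

A customer that requires live approval for every write would simply keep the pre-approved set empty.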
- NCSC Cloud Security Principles — default posture for the operations team and the customer-watched envelope.
- G-Cloud — listed under Cloud Support (Lot 3); orderable through the Crown Commercial Service framework.
- Cyber Essentials Plus — current certificate for the operations team.
- ISO 27001:2022 — current certificate covering the operations team and its tooling.
- SOC 2 Type II — annual third-party audit covering security, availability, confidentiality.
- ITIL-aligned change management — integrating with the customer's existing approval workflow.
- GDPR / UK DPA 2018 — DPA, sub-processor list, EU SCCs available.
- EU AI Act — for customers running high-risk AI systems, Managed Operations provides the operational-resilience evidence layer.
- HIPAA — BAA available for healthcare-customer engagements.
- Read-only-by-default — write access requested through the customer's change-approval workflow; pre-approved runbooks available.
Alternatives
Managed Operations is one option for running a 24/7 watch on AI infrastructure. The honest read: an in-house SRE team gives full control but takes 6-18 months to staff and burns continuously on rota cost; a hyperscaler-managed service covers the cloud-native primitives well but loses depth on GPU, fabric, inference engines, and ML pipelines, and is single-cloud by definition; a general MSP can cover availability but rarely has AI-infrastructure depth or AI-specific runbook libraries. Managed Operations sits in the middle as the contract that covers AI-infrastructure breadth across cloud, on-premise, and hybrid estates, with UK NCSC OFFICIAL as the default sovereignty posture and Yobibyte/NeoCloud bundled-discount economics where the customer already runs on Yobitel surfaces.
| Concern | Yobitel Managed Operations | In-house SRE | AWS / Azure / GCP managed services | General MSP |
|---|---|---|---|---|
| AI infrastructure depth | GPU, fabric, inference engines, ML pipelines covered natively | Whatever you hire | Generic cloud-service depth | Limited |
| Cross-estate (cloud + on-prem + hybrid) | Yes | Yes if you build it | Single-cloud focus | Yes |
| Sovereignty posture | UK NCSC OFFICIAL default, EU and US tiers | Whatever you build | Cloud's posture | Variable |
| Integration with customer Prometheus + PagerDuty | Yes, read-only by default | Customer-owned | Cloud-native tools | Variable |
| FOCUS-aligned billing export | Yes | DIY | Cloud-native billing | Limited |
| Yobibyte / NeoCloud bundled discount | Yes | N/A | N/A | N/A |
| P1 response SLA | Under 15 minutes (Mission-critical) | Whatever you staff | Variable | Variable |
| Named TAM from Premium tier | Yes | N/A | Enterprise support only | Variable |
| Read-only-by-default + runbook-approved write | Yes | Customer-owned | Cloud-managed | Variable |
| Knowledge transfer at exit | Documented runbook library handed back | N/A | Limited | Variable |
Common incident classes
Managed Operations is a service contract rather than a product, so 'troubleshooting' is reframed as the most common incident classes the watch handles. The classes below cover the bulk of paged incidents on a typical customer estate; the runbook library covers them with documented response, fix, and post-mortem templates.
| Incident class | Typical cause | Runbook response |
|---|---|---|
| Capacity exhaustion | Workload growth exceeded reservation; on-demand and spot exhausted in the region. | Engage customer's TAM for emergency reservation expansion; route burst traffic to sibling region if sovereignty allows; page customer's escalation owner on the third occurrence in a quarter. |
| Drift | Cluster configuration has drifted from the documented baseline (e.g. node pool resized, CNI plugin upgraded outside change management). | Open a P3 ticket against the customer's change-approval workflow; restore baseline through the standard change process; update runbook if drift is recurring. |
| Certificate expiry | TLS certificate, OIDC signing key, or Kubernetes control-plane cert approaching expiry without rotation. | P2 ticket 14 days before expiry, P1 in the 24 hours before. Yobitel coordinates rotation through the standard change workflow. |
| Billing overrun | Workspace, reservation, or workload exceeds the configured USD spend cap or trends to exceed it. | P2 page to customer's billing owner with FOCUS export analysis attached; suggest mitigation (reservation, right-sizing, throttle). |
| Fabric degradation | InfiniBand or RoCEv2 link error rate elevated on a training cluster. | P1 page; coordinate with NeoCloud NOC if the fabric is Yobitel-operated; pause distributed training to avoid NCCL collective corruption. |
| Identity federation drift | OIDC IdP rotated keys or changed audience; workspace cannot validate tokens. | P1 page; coordinate with customer's identity team for re-federation; runbook covers Okta, Entra ID, Auth0, Keycloak. |
| Inference cold-start cascade | Scale-to-zero endpoints saw a correlated traffic surge; cold-start time exceeded SLO. | P2 page; raise warm replicas; update the runbook if the cascade pattern recurs. |
| Patch gap | Critical CVE published for a component on the watched estate. | P1 or P2 depending on CVSS; coordinate patch through customer's change-approval workflow. |
| Quota tripped during incident | Hyperscaler API quota or NeoCloud reservation quota tripped while scaling out under load. | P1 page; engage customer's TAM and the hyperscaler for emergency quota increase; runbook covers AWS, Azure, GCP, NeoCloud. |
| Audit-trail gap | Audit export pipeline halted (bucket policy drift, KMS key rotation). | P3 ticket; resolve through change workflow; post-mortem covers detection-to-resolution gap. |
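
The certificate-expiry row in the table above states explicit thresholds, which makes it easy to express as code. This is an illustrative sketch: the `cert_expiry_severity` function is a hypothetical name, and treating an already-expired certificate as P1 is an assumption, since the table covers only the approach to expiry.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cert_expiry_severity(not_after: datetime, now: datetime) -> Optional[str]:
    """Map time-to-expiry to the severity stated in the incident table."""
    remaining = not_after - now
    if remaining <= timedelta(hours=24):
        return "P1"   # final 24 hours (assumption: also covers already expired)
    if remaining <= timedelta(days=14):
        return "P2"   # ticket raised from 14 days before expiry
    return None       # outside the alerting window


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days in (30, 10, 0.5):
    expiry = now + timedelta(days=days)
    print(days, cert_expiry_severity(expiry, now))
```

Passing `now` explicitly instead of reading the clock inside the function keeps the rule deterministic and easy to test against a runbook's stated thresholds.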
Where Managed Operations fits in the Yobitel stack
Managed Operations is the day-two operations layer that wraps the rest of the Yobitel stack. Professional Services delivers the day-one build; NeoCloud Operations delivers the partner-build sovereign facility; Yobibyte is the managed inference surface; NeoCloud is the sovereign capacity layer; Customer Excellency owns the strategic relationship. Managed Operations is the contract that keeps the result running.
Most customers buy Managed Operations alongside one or more of the other Yobitel surfaces. A customer running Yobibyte on NeoCloud often adds Managed Operations to cover the customer's own application services and any non-Yobitel infrastructure in the same envelope. A customer running a partner-built NeoCloud often contracts Managed Operations as the day-two operate phase. A customer running entirely on hyperscaler Kubernetes can contract Managed Operations without ever adopting another part of the Yobitel stack — the watch covers their estate as-is.
The boundary with Professional Services is deliberate. Professional Services is the engineering engagement for net-new builds, migrations, and bespoke implementations; Managed Operations is the ongoing watch on what is in production. Most engagements that start as Professional Services convert at least partially into Managed Operations at production hand-off. The engineers who deliver the consulting engagement are not the engineers who run the rota, but the runbook library carries through from one to the other.
References
- Managed Operations service page · Yobitel
- Yobitel NeoCloud · Yobitel
- Yobibyte platform · Yobitel
- Professional Services · Yobitel
- NCSC Cloud Security Principles · NCSC
- ITIL framework · AXELOS