We Let AI Run a Mid-Market ISP's NOC for 90 Days. Here's What Broke.
Most "AI in production" case studies are published the day everything works. This one is being published after ninety days of running AI agents inside a real mid-market ISP's network operations centre — including the weeks where things didn't work, the day we had to roll back a release at 02:14 in the morning, and the customer call that taught us a lesson about confidence calibration we're still paying for.
The ISP in question runs a network serving roughly 90,000 premises across two regions. It is a real customer. The numbers below are real. We've anonymised the network and stripped a few details that would identify them, but everything else stands.
If you run a NOC and you're being told by every vendor on the planet that AI will transform your operations, this is what that actually looks like at week one, week six, and week twelve. None of it is a smooth curve.
What we deployed, and where
The scope was deliberately narrow. We didn't replace the NOC. We embedded AI agents into three specific workflows where the data was clean enough and the consequences of failure were bounded enough to learn fast without setting fire to anything customer-facing.
The three workflows were L1 ticket triage on inbound customer-reported faults, automated correlation of network alarms against known maintenance windows, and a voice agent fielding the 03:00–07:00 graveyard shift on the support line. Everything else — change management, escalations, anything touching the core — stayed human.
The good — what actually worked
The voice agent on the overnight shift was the unambiguous win. The ISP had been paying for an offshore answering service that did little more than take a name and a callback number, then forward an email to the morning team. By week three, the AI agent was doing real diagnostic work — pulling line stats, checking the OLT, identifying whether the issue was the customer's equipment or the network — and resolving roughly half the calls without a human ever touching them.
The ticket triage agent took longer to dial in but ended up materially changing throughput. By week ten, it was correctly classifying intent, severity, and routing on the first pass for the vast majority of inbound tickets, freeing up the L1 team to spend their time on the work that actually requires judgement.
The alarm correlation agent was the quiet success. NOCs are drowning in alarms. Most are noise. The agent learned to suppress alarms that correlated with scheduled maintenance, with sympathetic events on adjacent equipment, or with known weather patterns affecting last-mile equipment. The ticket count dropped meaningfully and the engineers stopped getting paged for things that didn't need them.
The embarrassing — what we got wrong
Three things broke in ways we didn't predict. None of them were model failures. All of them were us.
Confidence calibration on the voice agent
In week four, the agent told a customer that her connection issue was caused by a fault on her side and that she should reset her router. It then ended the call. The actual issue was a node-level outage affecting roughly 600 premises that we hadn't yet detected through the monitoring stack. The agent's confidence in its diagnosis was 0.94. It was wrong.
This wasn't a hallucination. The agent reasoned correctly given the data it could see. The problem was that it had no signal that its data might be incomplete. We were treating high confidence as permission to act decisively; we should have treated it that way only when the diagnosis was corroborated by network-wide context.
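To make that concrete, here's a minimal sketch of the kind of corroboration gate we're describing. The thresholds, the `Diagnosis` fields, and the `may_close_call` helper are illustrative assumptions for this post, not our production code:

```python
from dataclasses import dataclass

# Illustrative thresholds only -- tune against your own incident history.
CONFIDENCE_FLOOR = 0.90        # below this, always hand off to a human
CLUSTER_FAULT_CEILING = 3      # open faults on the same node that suggest a wider event

@dataclass
class Diagnosis:
    premise_id: str
    node_id: str
    cause: str                 # e.g. "customer_equipment" or "network"
    confidence: float

def may_close_call(diag: Diagnosis, open_faults_on_node: int) -> bool:
    """Return True only when the agent may act on its own diagnosis.

    High confidence alone is not enough: if several other premises on the
    same node already have open faults, the 'customer-side' explanation is
    suspect and the call goes to a human regardless of the score.
    """
    if diag.confidence < CONFIDENCE_FLOOR:
        return False
    if diag.cause == "customer_equipment" and open_faults_on_node >= CLUSTER_FAULT_CEILING:
        # Looks like a node-level event the agent cannot see in its own data.
        return False
    return True
```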
The 02:14 rollback
Week six. We pushed an update to the alarm correlation logic that was meant to suppress duplicate alarms within a sixty-second window. The implementation was subtly wrong and ended up suppressing alarms across a five-minute window. For about ninety minutes in the early hours, real faults were being filed and immediately silenced before the on-call engineer ever saw them.
The on-call engineer noticed because three customers had called in already by the time he checked his queue. We rolled back. Nobody lost service in a way that customers experienced as catastrophic, but two faults that should have been picked up within minutes weren't picked up for over an hour.
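For readers who want to picture the logic involved, here's a minimal sketch of a per-device, per-alarm-type suppression window. The names and the in-memory store are illustrative only; the comment marks the sort of small unit mix-up that produces exactly this kind of over-suppression:

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(seconds=60)   # the intended behaviour
# A mistake as small as writing timedelta(minutes=5) here -- or reading "5"
# as minutes in one place and seconds in another -- reproduces the kind of
# over-suppression described above.

_last_forwarded: dict[tuple[str, str], datetime] = {}   # (device_id, alarm_type) -> last alarm let through

def should_suppress(device_id: str, alarm_type: str, raised_at: datetime) -> bool:
    """Suppress an alarm only if the same device raised the same alarm type
    within the window since the last alarm we let through."""
    key = (device_id, alarm_type)
    last = _last_forwarded.get(key)
    if last is not None and (raised_at - last) < SUPPRESSION_WINDOW:
        return True            # duplicate of an alarm that already reached the queue
    _last_forwarded[key] = raised_at
    return False
```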
The lesson was unglamorous and entirely predictable: AI agents in production need the same change management discipline as any other piece of network infrastructure. Staged rollouts. Canary deployments. Shadow mode runs against historical data before anything touches live alarms. We knew this. We did it for the first deployment. We got cocky on the third.
The trust collapse with the L1 team
The most uncomfortable failure wasn't technical. In weeks five and six, the triage agent was getting things wrong often enough that the L1 engineers stopped trusting any of its classifications. They started re-triaging every ticket from scratch, which meant the AI was adding latency rather than removing it.
What we eventually figured out was that the engineers weren't reacting to the actual error rate. They were reacting to a small number of visibly stupid errors: tickets where the AI had clearly missed something obvious in the customer's text. One bad classification poisoned trust in twenty good ones. The fix was twofold: expose the agent's reasoning trace to the engineers so they could see why it had classified something a particular way, and close the feedback loop so a thumbs-down actually retrained behaviour within hours rather than weeks.
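Here's a rough illustration of what "show the reasoning, capture the correction" can look like as data structures. The field names and the `record_feedback` helper are assumptions made for the sake of the example, not the system we actually shipped:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TriageResult:
    ticket_id: str
    intent: str                    # e.g. "no_sync", "slow_speed", "billing"
    severity: str                  # e.g. "P1".."P4"
    route_to: str                  # queue the ticket is assigned to
    confidence: float
    reasoning: list[str] = field(default_factory=list)   # shown to the L1 engineer verbatim

@dataclass
class EngineerFeedback:
    ticket_id: str
    accepted: bool                 # thumbs-up / thumbs-down
    corrected_route: str | None = None
    submitted_at: datetime = field(default_factory=datetime.utcnow)

def record_feedback(result: TriageResult, fb: EngineerFeedback, feedback_log: list) -> None:
    """Append the correction alongside the reasoning that led to the original
    classification, so the next retraining or prompt-update cycle (hours, not
    weeks) can pick it up."""
    feedback_log.append({
        "ticket_id": result.ticket_id,
        "agent_route": result.route_to,
        "agent_reasoning": result.reasoning,
        "accepted": fb.accepted,
        "corrected_route": fb.corrected_route,
        "submitted_at": fb.submitted_at.isoformat(),
    })
```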
The metrics — what 90 days actually looked like
Here's the comparison table the operations director asked us to build at week twelve. These are real numbers, averaged across the final two weeks of the pilot versus the same two-week window from the previous quarter.
| Metric | Before | After 90 days |
|---|---|---|
| Mean time to ticket assignment | 14 min | 8 min 40s |
| Overnight call abandonment | 31% | 7% |
| False-positive alarm rate | 42% | 19% |
| L1 engineer NPS (internal) | +12 | +34 |
| Customer CSAT (post-call) | 4.2 / 5 | 4.4 / 5 |
The CSAT number is the one we watched most closely, because it was the one we were most prepared to see go backwards. It didn't. It nudged up. We think this is because the AI handles the easy calls so well that customers who would have waited 20 minutes for a human to answer at 04:00 now get answered in 8 seconds, and that experience swamps the small number of cases where the AI handed off awkwardly.
The honest answer to "can AI handle mission-critical NOC work?" is: yes, in a narrow scope, with the same operational discipline you'd apply to any other piece of network infrastructure.
What we'd tell another ISP starting from scratch
1. Pick workflows with bounded blast radius first. Overnight call handling and alarm correlation are reversible. Direct customer-impacting changes to network state are not. Earn the right to do the second by surviving the first.
2. Run shadow mode for two weeks before going live. Replay historical tickets and calls through the agent. Compare its decisions to what your team actually did (see the sketch after this list). The disagreements are where you'll learn what the agent is missing.
3. Cross-reference confidence against network-wide context. A 0.95 confidence score on a single ticket means nothing if there's a cluster-level event the agent can't see. Build the cluster check in before you let the agent close anything out.
4. Show your engineers the reasoning, not just the answer. Trust collapses fast when the AI gets a visible call wrong. Trust rebuilds when engineers can see the chain of logic and override it cleanly. Black-box agents do not survive contact with experienced operators.
5. Treat AI changes like network changes. Staged rollout. Canary. Rollback plan. The 02:14 incident wouldn't have happened if we'd applied the same discipline to the agent's logic update that we apply to any router config push.
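As referenced in point 2, here's a small sketch of a shadow-mode replay harness. `agent_classify`, the ticket fields, and `human_decisions` are placeholders for whatever your ticketing stack exposes:

```python
from collections import Counter

def shadow_replay(historical_tickets, agent_classify, human_decisions):
    """Replay historical tickets through the agent and tally agreement with
    what the team actually did; the disagreements are the study material."""
    tally = Counter()
    disagreements = []
    for ticket in historical_tickets:
        agent = agent_classify(ticket)            # agent's proposed routing
        human = human_decisions[ticket["id"]]     # what the team actually did
        if agent == human:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
            disagreements.append((ticket["id"], agent, human))
    return tally, disagreements
```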
Takeaways
- AI in a NOC works in narrow, bounded workflows where the data is clean and the failure mode is reversible. It does not work as a general-purpose replacement for L1 engineers.
- The wins are real and measurable — overnight call resolution, alarm noise reduction, ticket triage throughput. The 90-day numbers were good enough that the customer extended into a wider rollout.
- The failures were operational, not algorithmic. Confidence calibration, change management discipline, and trust-building with the human team were the three things we got wrong, and the three things any other ISP will get wrong if they're not warned.
- If a vendor tells you their AI is ready to run your NOC end-to-end, walk away. If they tell you it can take twenty percent of the work off your team's plate over ninety days, with a clear rollback plan and visible reasoning, that's a conversation worth having.
We'll show you the agent designs, the failure modes we logged, and the rollback plan that saved us at 02:14 — using your own NOC's call patterns and ticket data.
Book demo →