The Cold Email Disaster Recovery SOP Every Agency Should Have

Deliverability crises are a question of when, not if. Domains burn, inboxes get flagged, and clients get angry. The agencies that get through these moments without losing clients are the ones that documented a response process before the crisis hit.

Part 1: Prevention infrastructure

The best disaster recovery starts before disaster hits.

Monitoring baseline

Run weekly placement tests on every active sending domain so you catch problems before they become crises. A domain that has sat steadily in promotions for two weeks is a warning sign; a domain that suddenly drops to spam is a crisis. Weekly testing is what lets you act on the first before it turns into the second.

Backup inbox pool

Maintain a reserve of pre-warmed inboxes equal to 25–30% of your active inbox pool. These should be warmed, configured correctly, and ready to deploy within hours. This is your emergency capacity — don't use it for normal campaigns.
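As a quick sizing aid, the 25–30% rule can be turned into a small calculation. This is a minimal sketch: the function name `reserve_range` and the choice to round up are assumptions, not part of the SOP itself.

```python
def reserve_range(active_inboxes: int, low_pct: int = 25, high_pct: int = 30) -> tuple[int, int]:
    """Return (minimum, maximum) reserve inbox counts for an active pool.

    Uses integer arithmetic and rounds up, so a fractional inbox
    always becomes a whole extra inbox in reserve (an assumption).
    """
    if active_inboxes < 0:
        raise ValueError("active_inboxes must be non-negative")
    minimum = (active_inboxes * low_pct + 99) // 100
    maximum = (active_inboxes * high_pct + 99) // 100
    return minimum, maximum


print(reserve_range(40))  # (10, 12)
```

With 40 active inboxes, that means keeping 10 to 12 warmed inboxes in reserve.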

Infrastructure segregation

Never put all of one client's campaigns on the same domain. Spread each client across multiple domains and keep different clients on different domain pools. When one domain fails, it only affects part of one client's campaign — not everything.
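The segregation rule above is easy to audit if campaign assignments live in a spreadsheet or export. The sketch below assumes a hypothetical list of (client, campaign, domain) tuples; the function name and data shape are illustrative only.

```python
from collections import defaultdict


def segregation_issues(campaigns):
    """Flag domains shared across clients and clients sending from one domain.

    campaigns: iterable of (client, campaign, domain) tuples (hypothetical shape).
    """
    clients_by_domain = defaultdict(set)
    domains_by_client = defaultdict(set)
    for client, _campaign, domain in campaigns:
        clients_by_domain[domain].add(client)
        domains_by_client[client].add(domain)

    issues = []
    # Different clients should not share a domain.
    for domain, clients in sorted(clients_by_domain.items()):
        if len(clients) > 1:
            issues.append(f"{domain} is shared by clients: {', '.join(sorted(clients))}")
    # Each client should be spread across more than one domain.
    for client, domains in sorted(domains_by_client.items()):
        if len(domains) < 2:
            issues.append(f"{client} sends from a single domain: {next(iter(domains))}")
    return issues
```

An empty result means every client is spread across at least two domains and no domain is shared between clients.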

Documentation

Every client domain should have documented: auth configuration, sending limits, warmup date, sending history, and the current backup infrastructure designated for that client.
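One way to keep this documentation consistent is a structured record per domain with a completeness check. The field names below are assumptions chosen to mirror the list above, not a required schema.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class DomainRecord:
    """Per-domain documentation record (hypothetical field names)."""
    domain: str
    client: str
    auth_config: Optional[dict] = None        # e.g. SPF, DKIM selector, DMARC policy
    daily_send_limit: Optional[int] = None
    warmup_date: Optional[date] = None
    sending_history: list = field(default_factory=list)
    backup_inboxes: list = field(default_factory=list)

    def missing(self) -> list[str]:
        """Return the names of fields that are unset or empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, [], {})]


record = DomainRecord("a.example", "acme", daily_send_limit=30)
print(record.missing())
# ['auth_config', 'warmup_date', 'sending_history', 'backup_inboxes']
```

Running `missing()` across all records gives a quick list of documentation gaps to close before an incident forces the question.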

Part 2: Crisis detection

Alert triggers

Define what triggers a deliverability incident for your agency:

Open rate drops more than 30% week-over-week on any domain
Bounce rate exceeds 3% in any campaign
Placement test shows spam on any active sending domain
Client reports emails going to spam
ESP dashboard shows sending account warnings
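The triggers above can be written as a single check so they are applied the same way every time. This is a sketch under stated assumptions: the `DomainStats` fields are hypothetical, rates are fractions between 0 and 1, and the 30% open-rate drop is read as a relative drop rather than percentage points.

```python
from dataclasses import dataclass


@dataclass
class DomainStats:
    """Weekly snapshot for one sending domain (hypothetical fields)."""
    domain: str
    open_rate_last_week: float   # fraction, 0.0 to 1.0
    open_rate_this_week: float   # fraction, 0.0 to 1.0
    max_bounce_rate: float       # highest bounce rate across campaigns
    placement: str               # "inbox", "promotions", or "spam"
    client_reported_spam: bool = False
    esp_warning: bool = False


def incident_triggers(stats: DomainStats) -> list[str]:
    """Return the reasons this domain should be treated as an incident."""
    reasons = []
    if stats.open_rate_last_week > 0:
        # Relative week-over-week drop (assumed interpretation).
        drop = (stats.open_rate_last_week - stats.open_rate_this_week) / stats.open_rate_last_week
        if drop > 0.30:
            reasons.append("open rate dropped more than 30% week-over-week")
    if stats.max_bounce_rate > 0.03:
        reasons.append("bounce rate exceeds 3%")
    if stats.placement == "spam":
        reasons.append("placement test shows spam")
    if stats.client_reported_spam:
        reasons.append("client reported emails going to spam")
    if stats.esp_warning:
        reasons.append("ESP dashboard shows a sending account warning")
    return reasons
```

Any non-empty result opens an incident; the list of reasons doubles as the first entry in the incident log.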

Immediate response (first 30 minutes)

Pause all sequences on the affected infrastructure
Run the full diagnostic stack: auth checks, blacklist check, placement test, tracking domain check
Classify the issue: technical / reputation-recoverable / reputation-replace
Do not communicate with the client until classification is complete
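The auth-check part of the diagnostic stack can start with a simple scan of the TXT records already pulled from DNS. The sketch below assumes the records were fetched separately (for example with `dig`) and passed in as strings; the function name is hypothetical, and it only covers SPF and DMARC presence, not DKIM or full policy validation.

```python
def auth_record_problems(root_txt, dmarc_txt):
    """Check TXT records for basic SPF and DMARC presence.

    root_txt:  TXT record strings found at the sending domain itself.
    dmarc_txt: TXT record strings found at _dmarc.<domain>.
    """
    def clean(record):
        return record.strip().strip('"').lower()

    spf = [r for r in root_txt
           if clean(r) == "v=spf1" or clean(r).startswith("v=spf1 ")]
    dmarc = [r for r in dmarc_txt if clean(r).startswith("v=dmarc1")]

    problems = []
    if not spf:
        problems.append("no SPF record found")
    elif len(spf) > 1:
        # More than one SPF record on a domain causes SPF evaluation to fail.
        problems.append("multiple SPF records found")
    if not dmarc:
        problems.append("no DMARC record found")
    return problems
```

An empty list only means the records exist; a placement test is still needed to confirm that mail actually authenticates and lands.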

Part 3: Resolution paths

Path A: Technical fix (resolution in hours)

Fix the broken auth record or tracking configuration
Verify DNS propagation
Rerun placement test to confirm fix
Resume sending if placement test passes
Send client a brief update: issue identified and resolved, no campaign impact

Path B: Reputation recovery (resolution in 2–8 weeks)

Move campaign to backup infrastructure immediately
Submit blacklist delisting requests with documentation
Reduce or halt sending on damaged domain
Begin 2-week pause on damaged inboxes
Retest after 2 weeks; if clean, resume at 20% volume and ramp over 4 weeks
Send client a technical summary (see communication template below)
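The "resume at 20% volume and ramp over 4 weeks" step in Path B can be planned ahead as a weekly schedule. The sketch below assumes equal weekly steps and full volume in the week after the ramp ends; the SOP does not specify the ramp shape, so treat both as assumptions, along with the function name.

```python
def ramp_schedule(normal_daily_volume: int, start_pct: int = 20, ramp_weeks: int = 4):
    """Return (week, percent_of_normal, daily_volume) tuples for a linear ramp.

    Week 1 sends at start_pct; volume rises in equal steps and reaches
    100% in the week after the ramp period (assumed ramp shape).
    """
    schedule = []
    for step in range(ramp_weeks + 1):
        pct = start_pct + (100 - start_pct) * step // ramp_weeks
        schedule.append((step + 1, pct, normal_daily_volume * pct // 100))
    return schedule


for week, pct, volume in ramp_schedule(50):
    print(f"week {week}: {pct}% -> {volume}/day")
```

For a domain that normally sends 50 emails a day, this yields 10, 20, 30, and 40 per day across the four ramp weeks, then back to 50.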

Path C: Full replacement (resolution in hours with pre-warmed inboxes, or 4–6 weeks with fresh ones)

Deploy backup pre-warmed inboxes to client campaign
Migrate campaign sequences to new infrastructure
Run placement test on new infrastructure before first send
Begin blacklist removal on retired infrastructure (for future use)
Send client update: infrastructure upgraded, campaign continuing

Part 4: Client communication templates

Initial communication (within 2 hours of detection)

"We've identified a technical issue affecting deliverability on one of your sending domains. We're investigating and will have a full update within 2 hours. Sequences are paused on the affected domain while we assess."

Technical summary (within 24 hours)

"Root cause: [specific issue — e.g., DKIM key rotation by our ESP wasn't reflected in DNS]. Impact: emails sent from [domain] between [dates] may have had reduced placement rates. Resolution: [fix applied]. Campaign status: migrated to backup infrastructure, sends continuing normally. No action needed from you."

Part 5: Post-mortem

After every incident, document:

What happened and when it was detected
How long between failure and detection
How long between detection and resolution
Campaign impact (emails sent during failure, estimated placement drop)
What would have caught this earlier
What process change prevents this
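The two timing questions above are simple to compute if each incident is logged with timestamps. This is a minimal sketch; the `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps and impact for one deliverability incident (hypothetical fields)."""
    failed_at: datetime
    detected_at: datetime
    resolved_at: datetime
    emails_sent_during_failure: int

    @property
    def time_to_detect(self) -> timedelta:
        """How long the failure went unnoticed."""
        return self.detected_at - self.failed_at

    @property
    def time_to_resolve(self) -> timedelta:
        """How long resolution took once the failure was detected."""
        return self.resolved_at - self.detected_at


incident = Incident(
    failed_at=datetime(2024, 1, 1, 9, 0),
    detected_at=datetime(2024, 1, 1, 15, 0),
    resolved_at=datetime(2024, 1, 2, 9, 0),
    emails_sent_during_failure=400,
)
print(incident.time_to_detect)   # 6:00:00
print(incident.time_to_resolve)  # 18:00:00
```

Tracking these two numbers across incidents shows whether detection or resolution is the slower half of the response.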

Most agencies experience the same failure modes multiple times because they don't run post-mortems. Monitoring tells you that something broke; a documented incident review is what stops it from breaking the same way again.

Pre-crisis readiness checklist

Weekly placement tests scheduled for all active domains
Backup inbox pool maintained at 25–30% of active pool
Each client's domains documented with auth config and limits
Alert triggers defined and monitored
Client communication templates written and accessible
Path A / B / C resolution playbooks documented
Designated person responsible for incident response

Agencies that handle crises best are the ones with pre-warmed backup infrastructure already in place. WarmInboxes is one source of pre-warmed inboxes that campaigns can be migrated to within hours. The time to set this up is before a crisis, not during one.