SMTP Failover System #
Advanced multi-server email delivery with automatic failover protection.
🔄 Overview #
The SMTP Failover system ensures reliable email delivery by:
- Supporting up to 3 SMTP servers
- Automatically switching to backup servers when primary fails
- Monitoring server health continuously
- Implementing the circuit breaker pattern
- Logging all failover events
🏗️ Architecture #
Server Priority System #
Priority 1 (Primary)
↓ (fails)
Priority 2 (Secondary)
↓ (fails)
Priority 3 (Tertiary)
↓ (all fail)
Test Mode (logs only)
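The priority chain above can be sketched as a small selection function. This is only an illustration, not the plugin's actual code: the function name `pickServer` and the array keys `priority`, `enabled`, and `healthy` are assumed for the example.

```php
<?php
// Hypothetical sketch: choose the first usable server by priority.
// The array keys (name, priority, enabled, healthy) are assumptions.
function pickServer(array $servers): ?array
{
    // A lower priority number means "try this server first".
    usort($servers, fn (array $a, array $b): int => $a['priority'] <=> $b['priority']);

    foreach ($servers as $server) {
        if ($server['enabled'] && $server['healthy']) {
            return $server;
        }
    }

    // No usable server: the caller falls back to test mode (log only).
    return null;
}

$servers = [
    ['name' => 'Mailgun Backup', 'priority' => 2, 'enabled' => true, 'healthy' => true],
    ['name' => 'SendGrid Primary', 'priority' => 1, 'enabled' => true, 'healthy' => false],
    ['name' => 'Amazon SES Fallback', 'priority' => 3, 'enabled' => true, 'healthy' => true],
];

$chosen = pickServer($servers);
echo $chosen === null ? 'test mode' : $chosen['name'], PHP_EOL; // Mailgun Backup
```

With the primary marked unhealthy, the sketch returns the Priority 2 server; an empty or fully unhealthy list yields `null`, which corresponds to the test mode step.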
Components #
- SmtpFailoverManager: Orchestrates failover logic
- SmtpServer Model: Stores server configurations
- Health Check System: Monitors server status
- Circuit Breaker: Prevents cascading failures
- Failover Logs: Tracks all server switches
⚙️ Configuration #
Adding SMTP Servers #
Navigate to Settings → Email Settings:
Primary Server (Priority 1) #
Your main production server:
Name: SendGrid Primary
Host: smtp.sendgrid.net
Port: 587
Encryption: TLS
Priority: 1
Username: apikey
Password: your-api-key
From: noreply@yourdomain.com
Secondary Server (Priority 2) #
First backup option:
Name: Mailgun Backup
Host: smtp.mailgun.org
Port: 587
Encryption: TLS
Priority: 2
Username: your-username
Password: your-password
From: noreply@yourdomain.com
Tertiary Server (Priority 3) #
Last resort server:
Name: Amazon SES Fallback
Host: email-smtp.us-east-1.amazonaws.com
Port: 587
Encryption: TLS
Priority: 3
Username: your-ses-smtp-username
Password: your-ses-smtp-password
From: noreply@yourdomain.com
Server Management #
Testing Connections #
Always test each server after configuration:
- Click "Test Connection" button
- Verify response time
- Check for any errors
- Confirm test email received
Enabling/Disabling Servers #
- Enable: Server becomes available for use
- Disable: Server is skipped in failover chain
- Disabled servers won't receive health checks
Reordering Priorities #
Change server priority dynamically:
- Click "Priority" dropdown
- Select new priority level
- Other servers automatically adjust
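One way to picture the automatic adjustment is as a list operation: remove the server from its current slot, insert it at the new one, and let the remaining servers shift. The sketch below assumes servers are identified by name and that list position maps to priority; `movePriority` is a hypothetical helper, not part of the plugin.

```php
<?php
// Hypothetical sketch: move one server to a new priority and renumber the rest.
// $order is a list of server names, where index 0 corresponds to Priority 1.
function movePriority(array $order, string $name, int $newPriority): array
{
    // Remove the server from its current position.
    $order = array_values(array_filter($order, fn (string $n): bool => $n !== $name));

    // Clamp the target position to the valid range, then insert.
    $index = max(0, min($newPriority - 1, count($order)));
    array_splice($order, $index, 0, [$name]);

    return $order;
}

$order = movePriority(['SendGrid', 'Mailgun', 'SES'], 'SES', 1);
print_r($order); // SES first, then SendGrid, then Mailgun
```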
🔍 How Failover Works #
Normal Operation #
- Email Request: The application requests that a message be sent
- Primary Check: System checks if Priority 1 server is healthy
- Send Attempt: Email sent through primary server
- Success: Email delivered, stats updated
Failover Scenario #
- Primary Fails: Connection timeout or error
- Log Failure: Record failure in health log
- Check Secondary: Verify Priority 2 server status
- Switch Servers: Route email through secondary
- Log Failover: Record server switch event
- Send Email: Deliver through backup server
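The failover steps above can be sketched as a loop over the server chain. This is a simplified illustration under stated assumptions: `sendWithFailover` is a hypothetical function, and the `$send` callable stands in for the real SMTP transport, throwing a `RuntimeException` when delivery fails.

```php
<?php
// Hypothetical sketch of the failover loop. $send stands in for the real
// SMTP transport: it receives a server name and throws on failure.
function sendWithFailover(array $serverNames, callable $send): array
{
    $events = [];     // failure and failover log entries, in order
    $previous = null; // the server that failed most recently

    foreach ($serverNames as $name) {
        if ($previous !== null) {
            // Record the switch before attempting the next server.
            $events[] = "failover: {$previous} -> {$name}";
        }
        try {
            $send($name);
            return ['sent_via' => $name, 'events' => $events];
        } catch (RuntimeException $e) {
            $events[] = "failure: {$name} ({$e->getMessage()})";
            $previous = $name;
        }
    }

    // Every server failed: nothing was sent.
    return ['sent_via' => null, 'events' => $events];
}

$result = sendWithFailover(
    ['SendGrid Primary', 'Mailgun Backup'],
    function (string $name): void {
        if ($name === 'SendGrid Primary') {
            throw new RuntimeException('Connection timeout');
        }
    }
);

print_r($result);
```

Here the primary times out, so the result reports delivery through the backup server along with one failure entry and one failover entry. A `null` value for `sent_via` corresponds to the complete failure case.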
Complete Failure #
When all servers fail:
- Test Mode: System enters test mode
- Log Only: Emails logged but not sent
- Alert Admin: Notification sent (if configured)
- Queue Emails: Messages queued for retry
🏥 Health Monitoring #
Automatic Health Checks #
Health checks run every 5 minutes via cron:
// Scheduled in Plugin.php
$schedule->job(new SmtpHealthCheck())
    ->everyFiveMinutes()
    ->withoutOverlapping();
Health Check Process #
- Connection Test: Attempt SMTP connection
- Authentication: Verify credentials work
- Response Time: Measure connection speed
- Update Status: Mark server healthy/unhealthy
- Reset Failures: Clear count if successful
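A single health check pass might look roughly like the sketch below. The `$probe` callable is a stand-in for the real connection and authentication test, and `runHealthCheck` and the array keys are hypothetical names. Incrementing the failure count on a failed check is an assumption; the steps above only state that a success clears it.

```php
<?php
// Hypothetical sketch of one health check pass. $probe stands in for the
// real connection and authentication test and returns true on success.
function runHealthCheck(array $server, callable $probe): array
{
    $start = microtime(true);
    $ok = (bool) $probe($server);
    $elapsedMs = (microtime(true) - $start) * 1000;

    $server['healthy'] = $ok;
    $server['response_time_ms'] = round($elapsedMs, 1);
    // A success clears the failure count; a failure increments it (assumed).
    $server['failure_count'] = $ok ? 0 : $server['failure_count'] + 1;

    return $server;
}

$server = ['name' => 'SendGrid Primary', 'healthy' => false, 'failure_count' => 2];
$checked = runHealthCheck($server, fn (array $s): bool => true);
print_r($checked);
```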
Manual Health Checks #
Test server immediately:
- Go to Email Settings
- Click "Test Connection"
- Review response details
🔌 Circuit Breaker Pattern #
How It Works #
The circuit breaker stops the system from repeatedly retrying a server that keeps failing:
if ($server->failure_count >= 3) {
    // Circuit open - skip this server for the next 30 minutes
    $server->circuit_breaker_until = now()->addMinutes(30);
}
States #
- Closed (Normal)
  - Server operating normally
  - All requests routed through
  - Failure count: below the threshold (0 after a successful check)
- Open (Failed)
  - Server has failed 3+ times
  - All requests skip this server
  - Duration: 30 minutes
- Half-Open (Testing)
  - After timeout, allow one test
  - Success: Reset to Closed
  - Failure: Return to Open
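The three states can be derived from the failure count and the `circuit_breaker_until` timestamp. The following is a minimal sketch, not the plugin's implementation: `circuitState` is a hypothetical function, and timestamps are plain Unix seconds to keep the example self-contained.

```php
<?php
// Hypothetical sketch: derive the breaker state from the failure count and
// the time until which the circuit stays open. Timestamps are Unix seconds.
function circuitState(int $failureCount, ?int $openUntil, int $now, int $threshold = 3): string
{
    if ($failureCount < $threshold) {
        return 'closed';    // normal operation
    }
    if ($openUntil !== null && $now < $openUntil) {
        return 'open';      // still inside the timeout window: skip server
    }
    return 'half-open';     // timeout elapsed: allow one test request
}

$now = 1700000000;
$until = $now + 30 * 60; // 30-minute timeout

echo circuitState(1, null, $now), PHP_EOL;     // closed
echo circuitState(3, $until, $now), PHP_EOL;   // open
echo circuitState(3, $until, $until), PHP_EOL; // half-open
```

A successful test in the half-open state would reset the failure count, which puts the server back in the closed state on the next evaluation.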
Configuration #
// config/campaign.php
'smtp_failover' => [
    'circuit_breaker_threshold' => 3,         // Failures before opening
    'circuit_breaker_timeout' => 30,          // Minutes before retry
    'circuit_breaker_success_threshold' => 1, // Successes to close
],
📊 Monitoring & Logs #
Failover Events Log #
View in Email Settings dashboard:
| Time | From Server | To Server | Reason | Status |
|---|---|---|---|---|
| 2:30 PM | SendGrid | Mailgun | Connection timeout | Success |
| 1:15 PM | Mailgun | SES | Authentication failed | Success |
Health Check Logs #
Track server health history:
SELECT * FROM smtp_health_logs
WHERE server_id = 1
ORDER BY created_at DESC;
Metrics to Monitor #
- Failure Rate: Failures per hour
- Response Time: Average connection time
- Success Rate: Successful sends percentage
- Failover Frequency: How often switching occurs
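These metrics can be computed from a list of send attempts. The sketch below assumes a simple in-memory record shape (`ok`, `ms`, `failover`), which is illustrative only and not the plugin's log schema; `summariseAttempts` is a hypothetical helper.

```php
<?php
// Hypothetical sketch: summarise send attempts into the metrics listed above.
// Each attempt is ['ok' => bool, 'ms' => float, 'failover' => bool]; this
// shape is an assumption, not the plugin's log schema.
function summariseAttempts(array $attempts, float $hours): array
{
    $total = count($attempts);
    if ($total === 0 || $hours <= 0) {
        // Nothing to measure: report zeros instead of dividing by zero.
        return [
            'failure_rate_per_hour' => 0.0,
            'avg_response_ms' => 0.0,
            'success_rate_pct' => 0.0,
            'failovers_per_hour' => 0.0,
        ];
    }

    $failures = count(array_filter($attempts, fn (array $a): bool => !$a['ok']));
    $failovers = count(array_filter($attempts, fn (array $a): bool => $a['failover']));

    return [
        'failure_rate_per_hour' => $failures / $hours,
        'avg_response_ms' => array_sum(array_column($attempts, 'ms')) / $total,
        'success_rate_pct' => ($total - $failures) / $total * 100,
        'failovers_per_hour' => $failovers / $hours,
    ];
}

$attempts = [
    ['ok' => true, 'ms' => 100.0, 'failover' => false],
    ['ok' => false, 'ms' => 200.0, 'failover' => false],
    ['ok' => true, 'ms' => 300.0, 'failover' => true],
    ['ok' => true, 'ms' => 400.0, 'failover' => false],
];

print_r(summariseAttempts($attempts, 2.0));
```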
🚨 Failure Scenarios #
Connection Timeout #
Symptoms:
- Slow or no response from server
- Timeout errors in logs
Automatic Response:
- Increment failure count
- Switch to next server
- Log timeout event
Manual Fix:
- Check network connectivity
- Verify firewall rules
- Test with telnet
Authentication Failed #
Symptoms:
- 535 Authentication error
- Invalid credentials message
Automatic Response:
- Mark server as failed
- Try next server
- Alert administrator
Manual Fix:
- Verify username/password
- Check API key validity
- Review account status
Rate Limit Exceeded #
Symptoms:
- Transient 4xx SMTP replies such as 421 or 451 (HTTP APIs report the same condition as 429 Too Many Requests)
- Rate limit error messages
Automatic Response:
- Temporary server switch
- Queue messages for retry
- Implement backoff
Manual Fix:
- Increase rate limits
- Distribute load across servers
- Implement sending windows
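The "Implement backoff" step for rate limits is commonly done with exponential backoff: each retry waits twice as long as the previous one, up to a cap. The base delay, the cap, and the `backoffDelay` name below are illustrative assumptions, not values taken from the plugin.

```php
<?php
// Hypothetical sketch: exponential backoff with a cap, in seconds.
function backoffDelay(int $attempt, int $baseSeconds = 30, int $maxSeconds = 900): int
{
    // attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base, ...
    $delay = $baseSeconds * (2 ** max(0, $attempt - 1));

    // Never wait longer than the cap.
    return (int) min($delay, $maxSeconds);
}

foreach ([1, 2, 3, 6] as $attempt) {
    echo "attempt {$attempt}: ", backoffDelay($attempt), "s", PHP_EOL;
}
```

With these defaults the delays are 30, 60, and 120 seconds for the first three attempts, and the sixth attempt is capped at 900 seconds.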
IP Blacklisted #
Symptoms:
- 550 Blocked errors
- Reputation warnings
Automatic Response:
- Failover to clean IP
- Log reputation issue
Manual Fix:
- Check blacklists
- Request delisting
- Improve sending practices
🛠️ Best Practices #
Server Selection #
- Diversify Providers: Use different companies
  - Reduces single point of failure
  - Different infrastructure
  - Varied rate limits
- Geographic Distribution:
  - Primary: US East
  - Secondary: US West
  - Tertiary: Europe
- Cost Optimization:
  - Primary: Bulk discount provider
  - Secondary: Pay-per-use
  - Tertiary: Free tier/credits
Configuration Tips #
- Consistent From Address: Use same sender across servers
- Matching SPF/DKIM: Configure for all servers
- Similar Rate Limits: Prevent overwhelming backups
- Regular Testing: Monthly failover drills
Monitoring Strategy #
- Set Alerts:
  if ($failoverCount > 5) { Log::alert('Excessive failovers detected'); }
- Track Patterns:
  - Time of day failures
  - Specific error types
  - Recovery times
- Review Logs Weekly:
  - Identify recurring issues
  - Optimize server order
  - Plan capacity upgrades
🔧 Advanced Configuration #
Custom Failover Logic #
Extend the failover manager:
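class CustomFailoverManager extends SmtpFailoverManager
{
    protected function selectServer($servers)
    {
        // Custom logic for server selection
        // e.g., based on time of day, recipient domain, etc.

        // Otherwise fall back to the default selection
        // (assumes the parent class implements selectServer)
        return parent::selectServer($servers);
    }
}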
Webhook Integration #
Get notified of failover events:
Event::listen('campaign.smtp.failover', function ($event) {
    Http::post('https://your-webhook.com/failover', [
        'from_server' => $event->fromServer->name,
        'to_server' => $event->toServer->name,
        'reason' => $event->reason,
    ]);
});
Load Balancing #
Distribute load across healthy servers:
'failover_mode' => 'load_balance', // or 'failover'
'load_balance_ratio' => [
    1 => 0.6, // 60% to primary
    2 => 0.3, // 30% to secondary
    3 => 0.1, // 10% to tertiary
],
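To illustrate how such ratios could drive server choice, the sketch below performs a weighted pick over the healthy servers only. Rescaling the ratios when a server is unhealthy is an assumption about the intended behaviour, and `pickWeighted` is a hypothetical function rather than the plugin's API.

```php
<?php
// Hypothetical sketch: pick a server priority according to the configured
// ratios, skipping unhealthy servers. $roll is a number in [0, 1).
function pickWeighted(array $ratios, array $healthyPriorities, float $roll): ?int
{
    // Keep only the ratios of servers that are currently healthy.
    $active = array_intersect_key($ratios, array_flip($healthyPriorities));
    $total = array_sum($active);
    if ($total <= 0) {
        return null; // no healthy server to balance across
    }

    // Walk the cumulative distribution, rescaled to the healthy servers.
    $threshold = $roll * $total;
    $cumulative = 0.0;
    foreach ($active as $priority => $weight) {
        $cumulative += $weight;
        if ($threshold < $cumulative) {
            return $priority;
        }
    }

    return array_key_last($active); // guards against float rounding
}

$ratios = [1 => 0.6, 2 => 0.3, 3 => 0.1];
$roll = mt_rand() / (mt_getrandmax() + 1); // uniform in [0, 1)

echo 'Selected priority: ', pickWeighted($ratios, [1, 2, 3], $roll), PHP_EOL;
```

Passing the roll in as a parameter keeps the function deterministic, which makes the distribution easy to verify with fixed values.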
📈 Performance Impact #
Failover Overhead #
- Detection time: <1 second
- Switch time: <2 seconds
- Total delay: 2-3 seconds per failover
Optimization Tips #
- Quick Timeouts: Set aggressive connection timeouts
- Parallel Checks: Test backup servers preemptively
- Smart Caching: Cache healthy server status
- Queue Priority: Prioritize transactional emails
🔍 Troubleshooting #
All Servers Failing #
- Check Network: Verify outbound SMTP allowed
- Test Manually: Use telnet to test connections
- Review Credentials: Ensure all passwords current
- Check Blacklists: Verify IPs not blocked
Frequent Failovers #
- Increase Timeouts: Allow more connection time
- Review Rate Limits: May be hitting limits
- Check Server Load: Servers may be overloaded
- Optimize Batch Size: Reduce emails per batch
Emails Not Sending #
- Verify Test Mode: Ensure not in test-only mode
- Check Queue: Look for stuck queue jobs
- Review Logs: Check for specific errors
- Test Individual Servers: Isolate problem server
📊 Reporting #
Generate Failover Report #
-- Daily failover summary
SELECT
    DATE(created_at) AS date,
    from_server_id,
    to_server_id,
    COUNT(*) AS failover_count,
    AVG(response_time) AS avg_response
FROM smtp_failover_logs
GROUP BY DATE(created_at), from_server_id, to_server_id
ORDER BY date DESC;
Server Reliability Score #
$reliability = ($successfulSends / $totalAttempts) * 100;
$uptime = ($healthyChecks / $totalChecks) * 100;
$score = ($reliability * 0.7) + ($uptime * 0.3);
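As a worked example, the formula can be wrapped in a function that also guards against division by zero when a server has no recorded attempts or checks. The function name and the sample figures are illustrative; 288 checks corresponds to one day of five-minute health checks.

```php
<?php
// Hypothetical sketch: the reliability score formula with zero-division guards.
function reliabilityScore(int $successfulSends, int $totalAttempts, int $healthyChecks, int $totalChecks): float
{
    $reliability = $totalAttempts > 0 ? $successfulSends / $totalAttempts * 100 : 0.0;
    $uptime = $totalChecks > 0 ? $healthyChecks / $totalChecks * 100 : 0.0;

    // 70% weight on delivery success, 30% on health check uptime.
    return $reliability * 0.7 + $uptime * 0.3;
}

// 950 of 1000 sends succeeded; 280 of 288 health checks passed.
$score = reliabilityScore(950, 1000, 280, 288);
echo number_format($score, 2), PHP_EOL; // 95.67
```

Here reliability is 95% and uptime is about 97.22%, so the score is 66.5 + 29.17, or roughly 95.67.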
🎯 Optimization Strategies #
Peak Time Management #
Configure different servers for peak times:
$hour = now()->hour;
if ($hour >= 9 && $hour < 17) {
    // Business hours (09:00-16:59) - use high-capacity server
    $priority = ['sendgrid', 'ses', 'mailgun'];
} else {
    // Off-peak - use cost-effective options
    $priority = ['ses', 'mailgun', 'sendgrid'];
}
Geographic Routing #
Route based on recipient location:
if ($recipient->country === 'US') {
    // Use US-based servers
} elseif ($recipient->country === 'EU') {
    // Use EU-compliant servers ('EU' here stands for any EU member state)
}