SMTP Failover System #

Advanced multi-server email delivery with automatic failover protection.

🔄 Overview #

The SMTP Failover system ensures reliable email delivery by:

  • Supporting up to 3 SMTP servers
  • Automatically switching to a backup server when the primary fails
  • Monitoring server health continuously
  • Implementing the circuit breaker pattern
  • Logging all failover events

🏗️ Architecture #

Server Priority System #

Priority 1 (Primary)
    ↓ (fails)
Priority 2 (Secondary)
    ↓ (fails)
Priority 3 (Tertiary)
    ↓ (all fail)
Test Mode (logs only)
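
The priority chain above can be sketched as a small selection function. This is a minimal illustration rather than the plugin's actual code: the array keys (`name`, `priority`, `enabled`, `healthy`) and the function name `pickServer` are assumptions made for the example.

```php
<?php
// Hypothetical sketch: pick the highest-priority usable server,
// or return null to signal "test mode" (log only).
function pickServer(array $servers): ?array
{
    // Keep only servers that are enabled and currently healthy.
    $usable = array_filter($servers, function (array $s): bool {
        return $s['enabled'] && $s['healthy'];
    });

    // Lower priority number wins (1 = primary). usort also reindexes from 0.
    usort($usable, function (array $a, array $b): int {
        return $a['priority'] <=> $b['priority'];
    });

    return $usable[0] ?? null;
}

$servers = [
    ['name' => 'SendGrid Primary', 'priority' => 1, 'enabled' => true, 'healthy' => false],
    ['name' => 'Mailgun Backup', 'priority' => 2, 'enabled' => true, 'healthy' => true],
    ['name' => 'Amazon SES Fallback', 'priority' => 3, 'enabled' => true, 'healthy' => true],
];

$chosen = pickServer($servers);
echo $chosen === null ? "Test mode (logs only)\n" : "Using {$chosen['name']}\n";
```

With the primary marked unhealthy, the sketch falls through to the Priority 2 server; an empty or fully unhealthy list yields `null`, which stands in for test mode.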

Components #

  1. SmtpFailoverManager: Orchestrates failover logic
  2. SmtpServer Model: Stores server configurations
  3. Health Check System: Monitors server status
  4. Circuit Breaker: Prevents cascading failures
  5. Failover Logs: Tracks all server switches

⚙️ Configuration #

Adding SMTP Servers #

Navigate to Settings → Email Settings:

Primary Server (Priority 1) #

Your main production server:

Name: SendGrid Primary
Host: smtp.sendgrid.net
Port: 587
Encryption: TLS
Priority: 1
Username: apikey
Password: your-api-key
From: noreply@yourdomain.com

Secondary Server (Priority 2) #

First backup option:

Name: Mailgun Backup
Host: smtp.mailgun.org
Port: 587
Encryption: TLS
Priority: 2
Username: your-username
Password: your-password
From: noreply@yourdomain.com

Tertiary Server (Priority 3) #

Last resort server:

Name: Amazon SES Fallback
Host: email-smtp.us-east-1.amazonaws.com
Port: 587
Encryption: TLS
Priority: 3
Username: aws-access-key
Password: aws-secret-key
From: noreply@yourdomain.com

Server Management #

Testing Connections #

Always test each server after configuration:

  1. Click the "Test Connection" button
  2. Verify the reported response time
  3. Check for any errors
  4. Confirm the test email was received

Enabling/Disabling Servers #

  • Enable: Server becomes available for use
  • Disable: Server is skipped in failover chain
  • Disabled servers won't receive health checks

Reordering Priorities #

Change server priority dynamically:

  1. Click the "Priority" dropdown
  2. Select the new priority level
  3. Other servers automatically adjust
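
One way to picture how the other servers "automatically adjust" is as a move within an ordered list. The sketch below is an assumption about the behavior, not the plugin's implementation; `movePriority` and the list-of-names representation are hypothetical.

```php
<?php
// Hypothetical sketch: $order is a list of server names, index 0 = Priority 1.
function movePriority(array $order, string $name, int $newPriority): array
{
    $from = array_search($name, $order, true);
    if ($from === false) {
        return $order; // unknown server: leave the order unchanged
    }

    // Clamp the target into the valid range 1..count.
    $newPriority = max(1, min(count($order), $newPriority));

    array_splice($order, $from, 1);                      // remove from the old slot
    array_splice($order, $newPriority - 1, 0, [$name]);  // insert at the new slot

    return $order;
}

$order = ['SendGrid Primary', 'Mailgun Backup', 'Amazon SES Fallback'];
$order = movePriority($order, 'Amazon SES Fallback', 1);

foreach ($order as $i => $name) {
    echo ($i + 1) . ". $name\n";
}
```

Moving the tertiary server to Priority 1 shifts the former primary and secondary down by one position each.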

🔍 How Failover Works #

Normal Operation #

  1. Email Request: The application requests that a message be sent
  2. Primary Check: System checks if Priority 1 server is healthy
  3. Send Attempt: Email sent through primary server
  4. Success: Email delivered, stats updated

Failover Scenario #

  1. Primary Fails: Connection timeout or error
  2. Log Failure: Record failure in health log
  3. Check Secondary: Verify Priority 2 server status
  4. Switch Servers: Route email through secondary
  5. Log Failover: Record server switch event
  6. Send Email: Deliver through backup server
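
The failover steps above can be sketched as a loop over servers in priority order. This is a simplified illustration under stated assumptions: `sendWithFailover` is a hypothetical name, the `$send` callable stands in for the real SMTP transport and is assumed to throw a `RuntimeException` on failure, and the log is a plain array rather than a database table.

```php
<?php
// Hypothetical sketch: try each server in priority order until one succeeds.
// Returns the name of the server that delivered, or null if all failed.
function sendWithFailover(array $serverNames, callable $send, array &$log): ?string
{
    $previous = null; // last server that failed, if any

    foreach ($serverNames as $name) {
        try {
            $send($name);
            if ($previous !== null) {
                $log[] = "failover: $previous -> $name"; // record the switch
            }
            return $name;
        } catch (RuntimeException $e) {
            $log[] = "failure: $name ({$e->getMessage()})"; // record the failure
            $previous = $name;
        }
    }

    $log[] = 'all servers failed: entering test mode';
    return null;
}

$log = [];
$used = sendWithFailover(
    ['SendGrid Primary', 'Mailgun Backup'],
    function (string $name): void {
        if ($name === 'SendGrid Primary') {
            throw new RuntimeException('Connection timeout');
        }
    },
    $log
);

print_r($log);
```

Injecting the transport as a callable keeps the failover decision separate from the sending code, which also makes failover drills easy to simulate.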

Complete Failure #

When all servers fail:

  1. Test Mode: System enters test mode
  2. Log Only: Emails logged but not sent
  3. Alert Admin: Notification sent (if configured)
  4. Queue Emails: Messages queued for retry

🏥 Health Monitoring #

Automatic Health Checks #

Health checks run every 5 minutes via cron:

// Scheduled in Plugin.php
$schedule->job(new SmtpHealthCheck())
    ->everyFiveMinutes()
    ->withoutOverlapping();

Health Check Process #

  1. Connection Test: Attempt SMTP connection
  2. Authentication: Verify credentials work
  3. Response Time: Measure connection speed
  4. Update Status: Mark server healthy/unhealthy
  5. Reset Failures: Clear count if successful
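
A single health check pass might look roughly like the following. The field names (`healthy`, `failure_count`, `response_ms`) and the function `runHealthCheck` are assumptions for illustration; the `$probe` callable stands in for the real connection and authentication test so the logic can be read (and run) without a network.

```php
<?php
// Hypothetical sketch of one health check pass for a single server.
// $probe returns true when connection and authentication both succeed.
function runHealthCheck(array $server, callable $probe): array
{
    $start = microtime(true);
    $ok = (bool) $probe($server);

    // Response time in milliseconds, measured around the probe call.
    $server['response_ms'] = (int) round((microtime(true) - $start) * 1000);

    if ($ok) {
        $server['healthy'] = true;
        $server['failure_count'] = 0; // reset the count on success
    } else {
        $server['healthy'] = false;
        $server['failure_count']++;
    }

    return $server;
}

$server = ['name' => 'Mailgun Backup', 'healthy' => false, 'failure_count' => 2];
$server = runHealthCheck($server, function (array $s): bool {
    return true; // pretend the server answered and accepted the credentials
});

echo $server['healthy'] ? "healthy\n" : "unhealthy\n";
```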

Manual Health Checks #

Test a server immediately:

  1. Go to Email Settings
  2. Click "Test Connection"
  3. Review response details

🔌 Circuit Breaker Pattern #

How It Works #

Prevents repeated attempts to failed servers:

if ($server->failure_count >= 3) {
    // Circuit open - skip server
    $server->circuit_breaker_until = now()->addMinutes(30);
}

States #

  1. Closed (Normal)

    • Server operating normally
    • All requests routed through
    • Failure count: below the threshold (fewer than 3)
  2. Open (Failed)

    • Server has failed 3+ times
    • All requests skip this server
    • Duration: 30 minutes
  3. Half-Open (Testing)

    • After timeout, allow one test
    • Success: Reset to Closed
    • Failure: Return to Open
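
The three states can be modeled as a small state machine. The class below is a sketch, not the plugin's code: the class and method names are hypothetical, the defaults (3 failures, 30 minutes) mirror the values quoted in this section, and the current time is passed in as a Unix timestamp to keep the example deterministic. For brevity it does not limit the half-open state to exactly one in-flight trial.

```php
<?php
// Hypothetical sketch of a circuit breaker with closed / open / half-open states.
class CircuitBreaker
{
    private $failures = 0;
    private $openUntil = null; // Unix timestamp, or null while closed
    private $threshold;
    private $timeoutSeconds;

    public function __construct(int $threshold = 3, int $timeoutMinutes = 30)
    {
        $this->threshold = $threshold;
        $this->timeoutSeconds = $timeoutMinutes * 60;
    }

    public function state(int $now): string
    {
        if ($this->openUntil === null) {
            return 'closed';
        }
        // Before the timeout expires the server is skipped; afterwards a trial is allowed.
        return $now < $this->openUntil ? 'open' : 'half-open';
    }

    public function allowsRequest(int $now): bool
    {
        return $this->state($now) !== 'open';
    }

    public function recordSuccess(): void
    {
        // Any success closes the circuit and clears the failure count.
        $this->failures = 0;
        $this->openUntil = null;
    }

    public function recordFailure(int $now): void
    {
        $this->failures++;
        // Reaching the threshold, or failing a half-open trial, (re)opens the circuit.
        if ($this->openUntil !== null || $this->failures >= $this->threshold) {
            $this->openUntil = $now + $this->timeoutSeconds;
        }
    }
}

$breaker = new CircuitBreaker();
$now = time();
$breaker->recordFailure($now);
$breaker->recordFailure($now);
$breaker->recordFailure($now);
echo $breaker->state($now), "\n";
```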

Configuration #

// config/campaign.php
'smtp_failover' => [
    'circuit_breaker_threshold' => 3,      // Failures before opening
    'circuit_breaker_timeout' => 30,       // Minutes before retry
    'circuit_breaker_success_threshold' => 1 // Successes to close
]

📊 Monitoring & Logs #

Failover Events Log #

View in Email Settings dashboard:

Time    | From Server | To Server | Reason                | Status
2:30 PM | SendGrid    | Mailgun   | Connection timeout    | Success
1:15 PM | Mailgun     | SES       | Authentication failed | Success

Health Check Logs #

Track server health history:

SELECT * FROM smtp_health_logs 
WHERE server_id = 1 
ORDER BY created_at DESC;

Metrics to Monitor #

  1. Failure Rate: Failures per hour
  2. Response Time: Average connection time
  3. Success Rate: Successful sends percentage
  4. Failover Frequency: How often switching occurs
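
As a rough illustration, the four metrics above could be derived from a list of send attempts. The record shape (`ok`, `response_ms`, `failover`) and the function name `summarizeAttempts` are assumptions for this sketch and do not reflect the actual log schema.

```php
<?php
// Hypothetical sketch: each attempt is
// ['ok' => bool, 'response_ms' => int, 'failover' => bool].
function summarizeAttempts(array $attempts, float $hours): array
{
    $total = count($attempts);
    if ($total === 0 || $hours <= 0) {
        // Nothing to measure: report zeros instead of dividing by zero.
        return [
            'failures_per_hour' => 0.0,
            'avg_response_ms' => 0.0,
            'success_rate' => 0.0,
            'failovers_per_hour' => 0.0,
        ];
    }

    $failures = count(array_filter($attempts, function (array $a): bool {
        return !$a['ok'];
    }));
    $failovers = count(array_filter($attempts, function (array $a): bool {
        return $a['failover'];
    }));

    return [
        'failures_per_hour' => $failures / $hours,
        'avg_response_ms' => (float) (array_sum(array_column($attempts, 'response_ms')) / $total),
        'success_rate' => ($total - $failures) / $total * 100, // percentage
        'failovers_per_hour' => $failovers / $hours,
    ];
}

$attempts = [
    ['ok' => true, 'response_ms' => 100, 'failover' => false],
    ['ok' => true, 'response_ms' => 200, 'failover' => false],
    ['ok' => false, 'response_ms' => 300, 'failover' => false],
    ['ok' => true, 'response_ms' => 400, 'failover' => true],
];

print_r(summarizeAttempts($attempts, 2.0));
```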

🚨 Failure Scenarios #

Connection Timeout #

Symptoms:

  • Slow or no response from server
  • Timeout errors in logs

Automatic Response:

  • Increment failure count
  • Switch to next server
  • Log timeout event

Manual Fix:

  • Check network connectivity
  • Verify firewall rules
  • Test with telnet

Authentication Failed #

Symptoms:

  • 535 Authentication error
  • Invalid credentials message

Automatic Response:

  • Mark server as failed
  • Try next server
  • Alert administrator

Manual Fix:

  • Verify username/password
  • Check API key validity
  • Review account status

Rate Limit Exceeded #

Symptoms:

  • SMTP 4xx temporary-failure replies (for example 421 or 454); HTTP API integrations return 429 Too Many Requests instead
  • Rate limit error messages

Automatic Response:

  • Temporary server switch
  • Queue messages for retry
  • Implement backoff

Manual Fix:

  • Increase rate limits
  • Distribute load across servers
  • Implement sending windows
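
The "implement backoff" response can be illustrated with a capped exponential delay. The base delay, cap, and function name below are assumptions chosen for the example, not documented plugin settings.

```php
<?php
// Hypothetical sketch: exponential backoff with an upper bound.
// attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base, and so on.
function backoffDelaySeconds(int $attempt, int $baseSeconds = 30, int $maxSeconds = 900): int
{
    $delay = $baseSeconds * (2 ** max(0, $attempt - 1));

    // Never wait longer than the cap, however many attempts have been made.
    return (int) min($delay, $maxSeconds);
}

foreach ([1, 2, 3, 6] as $attempt) {
    echo "attempt $attempt: " . backoffDelaySeconds($attempt) . "s\n";
}
```

Doubling the wait between retries gives a throttled provider time to recover, while the cap keeps queued messages from being delayed indefinitely.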

IP Blacklisted #

Symptoms:

  • 550 Blocked errors
  • Reputation warnings

Automatic Response:

  • Failover to clean IP
  • Log reputation issue

Manual Fix:

  • Check blacklists
  • Request delisting
  • Improve sending practices
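
To tie the scenarios above together, a failure could be classified from the SMTP reply code before choosing a response. This is a simplified sketch: the category names and `classifyFailure` are hypothetical, and real providers use a wider range of codes (a 550 reply, for instance, can also mean an unknown mailbox rather than a blocked sender).

```php
<?php
// Hypothetical sketch: map an SMTP reply code (null = no reply at all)
// to one of the failure scenarios described in this section.
function classifyFailure(?int $replyCode): string
{
    if ($replyCode === null) {
        return 'timeout';     // no response from the server
    }
    if ($replyCode === 535) {
        return 'auth_failed'; // credentials rejected
    }
    if ($replyCode === 550) {
        return 'blocked';     // treated here as a reputation or policy block
    }
    if ($replyCode >= 400 && $replyCode < 500) {
        return 'temporary';   // transient failure, including throttling
    }
    return 'other';
}

foreach ([null, 535, 550, 421] as $code) {
    echo var_export($code, true) . ' => ' . classifyFailure($code) . "\n";
}
```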

🛠️ Best Practices #

Server Selection #

  1. Diversify Providers: Use different companies

    • Reduces single point of failure
    • Different infrastructure
    • Varied rate limits
  2. Geographic Distribution:

    • Primary: US East
    • Secondary: US West
    • Tertiary: Europe
  3. Cost Optimization:

    • Primary: Bulk discount provider
    • Secondary: Pay-per-use
    • Tertiary: Free tier/credits

Configuration Tips #

  1. Consistent From Address: Use same sender across servers
  2. Matching SPF/DKIM: Configure for all servers
  3. Similar Rate Limits: Prevent overwhelming backups
  4. Regular Testing: Monthly failover drills

Monitoring Strategy #

  1. Set Alerts:

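    if ($failoverCount > 5) {
        // Log facade (Laravel); swap in your own notification channel as needed
        Log::alert('Excessive failovers detected', ['count' => $failoverCount]);
    }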
  2. Track Patterns:

    • Time of day failures
    • Specific error types
    • Recovery times
  3. Review Logs Weekly:

    • Identify recurring issues
    • Optimize server order
    • Plan capacity upgrades

🔧 Advanced Configuration #

Custom Failover Logic #

Extend the failover manager:

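class CustomFailoverManager extends SmtpFailoverManager
{
    protected function selectServer($servers)
    {
        // Custom logic for server selection
        // e.g., based on time of day, recipient domain, etc.

        // Fall back to the default behavior when no custom rule applies
        // (assumes the parent class implements selectServer()).
        return parent::selectServer($servers);
    }
}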

Webhook Integration #

Get notified of failover events:

Event::listen('campaign.smtp.failover', function($event) {
    Http::post('https://your-webhook.com/failover', [
        'from_server' => $event->fromServer->name,
        'to_server' => $event->toServer->name,
        'reason' => $event->reason
    ]);
});

Load Balancing #

Distribute load across healthy servers:

'failover_mode' => 'load_balance', // or 'failover'
'load_balance_ratio' => [
    1 => 0.6,  // 60% to primary
    2 => 0.3,  // 30% to secondary
    3 => 0.1   // 10% to tertiary
]
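
A ratio such as 60/30/10 is typically applied with a weighted random pick. The sketch below shows one way this could work; it is not the plugin's implementation. `pickByRatio` is a hypothetical name, the ratios array is assumed to be non-empty, and the random roll is passed in so that the behavior is reproducible.

```php
<?php
// Hypothetical sketch: choose a priority level according to configured shares.
// $ratios maps priority => share; $roll is a number in the range [0, 1).
function pickByRatio(array $ratios, float $roll): int
{
    $cumulative = 0.0;
    foreach ($ratios as $priority => $share) {
        $cumulative += $share;
        if ($roll < $cumulative) {
            return $priority;
        }
    }

    // Guard against rounding (shares summing to slightly under 1):
    // fall back to the last configured priority.
    return array_key_last($ratios);
}

$ratios = [1 => 0.6, 2 => 0.3, 3 => 0.1];

// In real use the roll would be random, for example:
$roll = mt_rand() / (mt_getrandmax() + 1);
echo 'Selected priority: ' . pickByRatio($ratios, $roll) . "\n";
```

In practice the pick would be made only among servers that are currently healthy, with the shares of unavailable servers redistributed to the rest.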

📈 Performance Impact #

Failover Overhead #

  • Detection time: <1 second
  • Switch time: <2 seconds
  • Total delay: up to 3 seconds per failover

Optimization Tips #

  1. Quick Timeouts: Set aggressive connection timeouts
  2. Parallel Checks: Test backup servers preemptively
  3. Smart Caching: Cache healthy server status
  4. Queue Priority: Prioritize transactional emails

🔍 Troubleshooting #

All Servers Failing #

  1. Check Network: Verify outbound SMTP allowed
  2. Test Manually: Use telnet to test connections
  3. Review Credentials: Ensure all passwords current
  4. Check Blacklists: Verify IPs not blocked

Frequent Failovers #

  1. Increase Timeouts: Allow more connection time
  2. Review Rate Limits: May be hitting limits
  3. Check Server Load: Servers may be overloaded
  4. Optimize Batch Size: Reduce emails per batch

Emails Not Sending #

  1. Verify Test Mode: Ensure not in test-only mode
  2. Check Queue: Look for stuck queue jobs
  3. Review Logs: Check for specific errors
  4. Test Individual Servers: Isolate problem server

📊 Reporting #

Generate Failover Report #

-- Daily failover summary
SELECT 
    DATE(created_at) as date,
    from_server_id,
    to_server_id,
    COUNT(*) as failover_count,
    AVG(response_time) as avg_response
FROM smtp_failover_logs
GROUP BY DATE(created_at), from_server_id, to_server_id
ORDER BY date DESC;

Server Reliability Score #

$reliability = ($successfulSends / $totalAttempts) * 100;
$uptime = ($healthyChecks / $totalChecks) * 100;
$score = ($reliability * 0.7) + ($uptime * 0.3);
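
The same formula can be wrapped in a function that guards against division by zero when a server has no recorded attempts or checks yet. The function name and the choice to treat "no data" as 0 are assumptions for this sketch. As a worked example, a server with 90 of 100 sends succeeding and 80 of 100 health checks passing scores 0.7 × 90 + 0.3 × 80 = 87.

```php
<?php
// Hypothetical sketch: weighted reliability score on a 0-100 scale.
function reliabilityScore(
    int $successfulSends,
    int $totalAttempts,
    int $healthyChecks,
    int $totalChecks
): float {
    // Treat "no data yet" as 0 rather than dividing by zero.
    $reliability = $totalAttempts > 0 ? $successfulSends / $totalAttempts * 100 : 0.0;
    $uptime = $totalChecks > 0 ? $healthyChecks / $totalChecks * 100 : 0.0;

    // 70% weight on delivery success, 30% on health check uptime.
    return $reliability * 0.7 + $uptime * 0.3;
}

echo round(reliabilityScore(90, 100, 80, 100), 2), "\n";
```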

🎯 Optimization Strategies #

Peak Time Management #

Configure different servers for peak times:

$hour = now()->hour;

if ($hour >= 9 && $hour < 17) {
    // Business hours (09:00-16:59) - use high-capacity server
    $priority = ['sendgrid', 'ses', 'mailgun'];
} else {
    // Off-peak - use cost-effective options
    $priority = ['ses', 'mailgun', 'sendgrid'];
}

Geographic Routing #

Route based on recipient location:

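// Illustrative subset of EU country codes; extend as needed
$euCountries = ['DE', 'FR', 'IT', 'ES', 'NL'];

if ($recipient->country === 'US') {
    // Use US-based servers
} elseif (in_array($recipient->country, $euCountries, true)) {
    // Use EU-compliant servers
}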
