Worker Performance Monitoring #

Complete guide to monitoring and optimizing queue worker performance for maintaining SLA compliance.

Note: This content is derived from the original docs/worker-monitoring.md file shipped with the plugin. It covers detailed worker monitoring beyond the basic performance optimization in the Performance Guide.

Overview #

The Server Monitor plugin includes comprehensive worker capacity monitoring to ensure SLA compliance for both free (5-minute) and premium (1-minute) endpoint checks. The system provides real-time monitoring, proactive alerts, and scaling recommendations.

Worker Performance Dashboard #

Access Requirements #

URL: /albrightlabs/servermonitor/workerperformance

Access Control: Albright Labs staff only

  • Must have @albrightlabs.com email address
  • Requires albrightlabs.servermonitor.view_performance permission

Dashboard Features #

Real-time SLA Compliance:

  • Premium plan compliance (1-minute SLA)
  • Free plan compliance (5-minute SLA)
  • Current worker counts and recommendations
  • Queue backlog monitoring

Performance Metrics:

  • Last hour dispatch performance
  • 24-hour SLA breach trending
  • Job execution response times
  • Critical dispatch detection

Scaling Recommendations:

  • Current vs recommended worker counts
  • Action required alerts
  • Step-by-step scaling instructions

Key Performance Indicators #

SLA Compliance Thresholds #

Warning Levels:

  • Good: ≥95% compliance (green status)
  • Warning: 85-94% compliance (yellow status)
  • Critical: <85% compliance (red status)
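
These thresholds are simple to script against; a minimal sketch (helper name hypothetical, integer percentages for brevity):

# Map a compliance percentage to the status levels above (helper name hypothetical)
sla_status() {
    if [ "$1" -ge 95 ]; then echo "good"
    elif [ "$1" -ge 85 ]; then echo "warning"
    else echo "critical"
    fi
}

sla_status 97   # good
sla_status 82   # critical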

Premium SLA (1-minute):

  • Servers must be checked within 60 seconds of dispatch
  • Target: 95% of checks completed within SLA
  • Alert: Compliance drops below 95%

Free SLA (5-minute):

  • Servers must be checked within 300 seconds of dispatch
  • Target: 95% of checks completed within SLA
  • Alert: Compliance drops below 95%

Queue Backlog Thresholds #

Premium Queue:

  • Normal: 0-29 jobs waiting
  • Warning: 30-49 jobs waiting
  • Critical: 50+ jobs waiting

Free Queue:

  • Normal: 0-59 jobs waiting
  • Warning: 60-99 jobs waiting
  • Critical: 100+ jobs waiting

Calculations Queue:

  • Normal: 0-9 jobs waiting
  • Warning: 10+ jobs waiting
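
To compare live backlogs against these thresholds from the shell, a minimal sketch (queue key names follow the redis-cli examples later in this guide):

#!/bin/bash
# backlog-check.sh - report queues that exceed the thresholds above
PREMIUM=$(redis-cli llen queues:premium)
FREE=$(redis-cli llen queues:free)

[ "$PREMIUM" -ge 50 ] && echo "CRITICAL: premium backlog at $PREMIUM jobs"
[ "$PREMIUM" -ge 30 ] && [ "$PREMIUM" -lt 50 ] && echo "WARNING: premium backlog at $PREMIUM jobs"
[ "$FREE" -ge 100 ] && echo "CRITICAL: free backlog at $FREE jobs"
[ "$FREE" -ge 60 ] && [ "$FREE" -lt 100 ] && echo "WARNING: free backlog at $FREE jobs"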

Automated Monitoring #

Capacity Monitoring Job #

Frequency: Every 5 minutes via scheduled job
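
The schedule relies on the standard Laravel scheduler; assuming the usual setup, a single cron entry drives it (application path is a placeholder):

# Standard Laravel scheduler entry; the capacity job itself fires every 5 minutes
* * * * * cd /path/to/your/app && php artisan schedule:run >> /dev/null 2>&1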

What It Monitors:

The MonitorWorkerCapacityJob checks:

  • SLA compliance rates
  • Queue backlog sizes
  • Dispatch performance metrics
  • Worker response times

Alert Triggers:

  • SLA compliance below thresholds
  • Queue backlogs exceeding limits
  • Critical dispatch times
  • Extended response times
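
To exercise these triggers without waiting for the 5-minute schedule, the job can be dispatched by hand; a sketch, assuming the class sits under the plugin's Jobs namespace (the exact namespace may differ):

# Dispatch the capacity check immediately (class namespace is an assumption)
php artisan tinker --execute="dispatch(new \Albrightlabs\Servermonitor\Jobs\MonitorWorkerCapacityJob());"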

Alert System #

Email Recipients: All users with @albrightlabs.com email addresses

Alert Types:

  • Critical: Immediate action required (SLA breach)
  • Warning: Performance degradation detected
  • Info: Scaling recommendations

Alert Content:

Subject: CRITICAL: Server Monitor Worker Capacity Alert

CRITICAL ALERTS:
- Premium SLA breach! Only 82% of servers checked within 1 minute.
- Premium queue backlog at 67 jobs!

IMMEDIATE ACTION REQUIRED:
1. Check Worker Performance Dashboard: [URL]
2. Add more workers by updating Supervisor configuration
3. Increase 'numprocs' for premium workers
4. Reload Supervisor configuration

Detailed instructions available in dashboard.

Worker Capacity Guidelines #

Premium Workers (1-minute SLA) #

Capacity Formula:

Workers Needed = (Total Premium Servers ÷ 60) × Avg Processing Time × 1.5

Example Calculations:

  • 120 premium servers
  • 2-second average processing time
  • Safety factor: 1.5x
Workers = (120 ÷ 60) × 2 × 1.5 = 6 workers minimum

Performance Targets:

  • Each worker: ~30 checks per minute (at the 2-second average above)
  • Response time: <45 seconds average
  • Queue backlog: <30 jobs
  • SLA compliance: ≥95%

Free Workers (5-minute SLA) #

Capacity Formula:

Workers Needed = (Total Free Servers ÷ 300) × Avg Processing Time × 1.2

Example Calculations:

  • 600 free servers
  • 3-second average processing time
  • Safety factor: 1.2x
Workers = (600 ÷ 300) × 3 × 1.2 = 7.2 → 8 workers minimum
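
Both formulas are the same arithmetic: arrival rate (servers ÷ check interval in seconds) × average processing time × a safety factor, rounded up. A minimal sketch reproducing the two worked examples:

#!/bin/bash
# worker-calc.sh - apply the capacity formulas above and round up
calc() {  # usage: calc SERVERS INTERVAL_SECONDS AVG_SECONDS SAFETY_FACTOR
    awk -v s="$1" -v i="$2" -v t="$3" -v f="$4" \
        'BEGIN { w = (s / i) * t * f; if (w > int(w)) w = int(w) + 1; print w }'
}

calc 120 60 2 1.5    # premium example: 6
calc 600 300 3 1.2   # free example: 7.2 rounded up to 8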

Performance Targets:

  • Each worker: ~20 checks per minute (at the 3-second average above)
  • Response time: <4 minutes average
  • Queue backlog: <60 jobs
  • SLA compliance: ≥95%

Scaling Procedures #

When to Scale Up #

Immediate Scaling Required:

  • SLA compliance drops below 85%
  • Queue backlog exceeds critical thresholds
  • Multiple consecutive SLA breaches
  • Dashboard shows red/critical status

Proactive Scaling Indicators:

  • SLA compliance in the 85-94% warning zone
  • Queue backlog approaching thresholds
  • Response times increasing
  • New servers added to monitoring

Step-by-Step Scaling #

1. Access Server:

ssh user@yourserver.com

2. Edit Supervisor Configuration:

sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf

3. Increase Worker Count:

# Before (2 premium workers)
[program:servermonitor-premium]
numprocs=2

# After (4 premium workers)
[program:servermonitor-premium]
numprocs=4

4. Apply Changes:

sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart servermonitor-premium:*

5. Verify Changes:

sudo supervisorctl status | grep servermonitor-premium

Should show 4 workers running:

servermonitor-premium:servermonitor-premium_00   RUNNING
servermonitor-premium:servermonitor-premium_01   RUNNING
servermonitor-premium:servermonitor-premium_02   RUNNING
servermonitor-premium:servermonitor-premium_03   RUNNING

Scaling Down #

When to Scale Down:

  • Consistently low queue backlogs (<10 jobs)
  • SLA compliance consistently >98%
  • Server count has decreased
  • During low-usage periods

Gradual Reduction:

# Scale down gradually (4 → 3 workers)
sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf
# Change numprocs=3

sudo supervisorctl reread
sudo supervisorctl update

Performance Metrics #

Real-Time Metrics #

Dashboard Metrics (updated every 30 seconds):

  • Current SLA compliance percentages
  • Active worker counts
  • Queue backlog sizes
  • Recent SLA breaches (last 6 hours)
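
The same figures can be polled from a terminal using the worker status API described under API Monitoring below; a minimal sketch:

# Poll the worker status API at the dashboard's 30-second cadence
watch -n 30 'curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" | jq "{status, sla_compliance, queue_status}"'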

Historical Metrics:

  • 24-hour SLA breach trends
  • Hourly breakdown of performance issues
  • Worker scaling history
  • Peak load analysis

API Monitoring #

Worker Status API:

curl "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY"

Response Format:

{
    "success": true,
    "status": "healthy|warning|critical",
    "timestamp": "2025-01-01T12:00:00Z",
    "sla_compliance": {
        "premium": {"percentage": 98.5, "checked": 74, "total": 75},
        "free": {"percentage": 100.0, "checked": 60, "total": 60}
    },
    "queue_status": {"premium": 5, "free": 12},
    "recent_breaches": 2,
    "dashboard_url": "https://yoursite.com/.../workerperformance"
}
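
For Nagios-style checks where only the exit code matters, a minimal sketch:

# Exit non-zero unless workers report healthy (handy in cron or CI health checks)
STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" | jq -r '.status')
[ "$STATUS" = "healthy" ] || { echo "Worker status: $STATUS"; exit 2; }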

Console Monitoring #

Check Capacity Command:

php artisan servermonitor:check-capacity

Sample Output:

Checking worker capacity...

Quick Summary:
==============
Premium Queue: 15 jobs
Free Queue: 42 jobs
SLA Breaches (last hour): 3

⚠️  ACTION REQUIRED: Worker capacity needs attention!
Visit the Worker Performance Dashboard for details:
https://yoursite.com/albrightlabs/servermonitor/workerperformance

Troubleshooting Performance Issues #

High Queue Backlog #

Symptoms:

  • Dashboard shows queue warnings
  • SLA compliance dropping
  • Increased response times

Diagnosis:

# Check actual queue sizes
redis-cli llen queues:premium
redis-cli llen queues:free

# Check worker status
sudo supervisorctl status | grep servermonitor

# Check worker logs for errors
tail -f storage/logs/worker-premium.log

Solutions:

  1. Add more workers (primary solution)
  2. Check for stuck jobs: php artisan queue:failed
  3. Restart workers: sudo supervisorctl restart servermonitor-premium:*
  4. Clear stuck jobs: php artisan queue:clear (caution!)

SLA Breaches #

Common Causes:

  • Insufficient worker capacity
  • Network issues to monitored endpoints
  • Database performance problems
  • Queue driver issues

Investigation Steps:

# Check recent breaches
tail -100 storage/logs/laravel.log | grep "SLA BREACH"

# Check worker response times
grep "response_time_seconds" storage/logs/laravel.log | tail -20

# Check for slow endpoints
grep "timeout" storage/logs/worker-premium.log

Memory Issues #

Symptoms:

  • Workers consuming excessive memory
  • Workers crashing or restarting
  • Overall system slowdown

Monitoring:

# Check worker memory usage (RSS in KB, highest first; bracket trick excludes the grep itself)
ps aux | grep "[q]ueue:work" | awk '{print $6, $11}' | sort -nr

# Monitor over time
watch 'ps aux | grep "queue:work" | grep -v grep'

Solutions:

# Restart workers to clear memory
sudo supervisorctl restart servermonitor-premium:*

# Reduce max-jobs per worker
# Edit supervisor config: --max-jobs=500 (reduce from 1000)

# Add memory limits
# Edit supervisor config: -d memory_limit=256M

Advanced Monitoring #

Custom Alerts #

Integration with External Monitoring:

#!/bin/bash
# custom-monitoring.sh

STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
HEALTH=$(echo "$STATUS" | jq -r '.status')

case $HEALTH in
    "critical")
        # Send to PagerDuty, Slack, etc.
        curl -X POST https://hooks.slack.com/... \
            -d '{"text":"CRITICAL: ServerMonitor worker capacity alert!"}'
        ;;
    "warning")
        # Send warning notification
        echo "WARNING: Worker capacity issues detected" | mail admin@company.com
        ;;
    "healthy")
        # All good, no action needed
        ;;
esac

Automated Scaling #

Auto-scaling Script (advanced):

#!/bin/bash
# auto-scale.sh - Run every 5 minutes

API_STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
PREMIUM_QUEUE=$(echo "$API_STATUS" | jq -r '.queue_status.premium')
FREE_QUEUE=$(echo "$API_STATUS" | jq -r '.queue_status.free')

# Scale up premium workers if queue is backing up
if [ "$PREMIUM_QUEUE" -gt 50 ]; then
    # Add 2 more premium workers
    CURRENT_WORKERS=$(supervisorctl status | grep servermonitor-premium | wc -l)
    NEW_COUNT=$((CURRENT_WORKERS + 2))

    # Update supervisor config programmatically, scoped to the premium program
    # block so other worker programs keep their own numprocs
    sed -i "/^\[program:servermonitor-premium\]/,/^\[/ s/^numprocs=.*/numprocs=$NEW_COUNT/" \
        /etc/supervisor/conf.d/servermonitor-workers.conf
    supervisorctl reread
    supervisorctl update

    echo "Scaled premium workers to $NEW_COUNT due to queue backlog: $PREMIUM_QUEUE"
fi

# Scale down if queue is consistently low
if [ "$PREMIUM_QUEUE" -lt 5 ] && [ "$FREE_QUEUE" -lt 10 ]; then
    # Consider scaling down (implement with more sophisticated logic)
    echo "Queues are low, consider scaling down workers"
fi

Best Practices #

Monitoring Schedule #

Daily:

  • Check Worker Performance Dashboard
  • Review SLA compliance trends
  • Monitor queue backlogs

Weekly:

  • Analyze performance patterns
  • Plan capacity for upcoming changes
  • Review scaling events

Monthly:

  • Assess overall performance trends
  • Plan infrastructure improvements
  • Review alert thresholds

Capacity Planning #

Before Adding Servers:

  1. Check current capacity utilization
  2. Calculate additional worker requirements
  3. Plan scaling timeline
  4. Communicate with team

Server Growth Planning:

Current: 100 premium servers, 2 workers (50 servers/worker)
Adding: 50 more premium servers
New total: 150 premium servers
Workers needed: 150 ÷ 50 = 3 workers minimum
Recommended: 4 workers (with safety margin)
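
The same back-of-the-envelope plan in script form (the 50 servers-per-worker ratio comes from the example above; substitute your own measured ratio):

#!/bin/bash
# plan-workers.sh - size workers from a measured servers-per-worker ratio
SERVERS=150        # planned premium server count
PER_WORKER=50      # observed capacity per worker
MIN=$(( (SERVERS + PER_WORKER - 1) / PER_WORKER ))   # ceiling division: 3
echo "Minimum: $MIN workers, recommended: $((MIN + 1)) (safety margin)"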

Performance Optimization #

Worker Efficiency:

  • Monitor individual worker performance
  • Identify and optimize slow endpoints
  • Use connection pooling for database
  • Implement proper error handling

Infrastructure Optimization:

  • Use Redis as the queue driver (see the sketch after this list)
  • Optimize database queries
  • Implement proper caching
  • Monitor network latency
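
For the Redis queue driver point above, a minimal sketch assuming a standard Laravel-style .env (values are placeholders):

# .env - queue and Redis settings (standard Laravel keys; values are placeholders)
QUEUE_CONNECTION=redis
REDIS_HOST=127.0.0.1
REDIS_PASSWORD=null
REDIS_PORT=6379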

Integration with Other Systems #

External Monitoring #

Prometheus Integration:

# Export metrics for Prometheus
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
    | jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring)'
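
If node_exporter runs with the textfile collector enabled, the same output can be written to a .prom file on a cron schedule (collector path is an assumption; match your node_exporter flags):

# Emit both gauges into a node_exporter textfile collector directory (path is an assumption)
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
    | jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring),
             "servermonitor_sla_compliance{type=\"free\"} " + (.sla_compliance.free.percentage | tostring)' \
    > /var/lib/node_exporter/textfile_collector/servermonitor.prom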

Grafana Dashboard:

  • Import worker metrics
  • Create SLA compliance graphs
  • Set up alert thresholds
  • Monitor trends over time

Log Aggregation #

ELK Stack Integration:

# Send worker logs to Elasticsearch (index name and host are examples)
tail -f storage/logs/worker-premium.log | \
    while IFS= read -r line; do
        jq -n --arg msg "$line" \
            '{message: $msg, timestamp: now, service: "servermonitor-premium"}' | \
        curl -s -X POST "http://elasticsearch:9200/servermonitor/_doc" \
            -H "Content-Type: application/json" -d @-
    done
