Worker Performance Monitoring #
Complete guide to monitoring and optimizing queue worker performance for maintaining SLA compliance.
Note: This content is derived from the original
docs/worker-monitoring.md file that was included with the plugin. It describes detailed worker monitoring capabilities beyond the basic performance optimization covered in the Performance Guide.
Overview #
The Server Monitor plugin includes comprehensive worker capacity monitoring to ensure SLA compliance for both free (5-minute) and premium (1-minute) endpoint checks. The system provides real-time monitoring, proactive alerts, and scaling recommendations.
Worker Performance Dashboard #
Access Requirements #
URL: /albrightlabs/servermonitor/workerperformance
Access Control: Albright Labs staff only
- Must have an @albrightlabs.com email address
- Requires the albrightlabs.servermonitor.view_performance permission
Dashboard Features #
Real-time SLA Compliance:
- Premium plan compliance (1-minute SLA)
- Free plan compliance (5-minute SLA)
- Current worker counts and recommendations
- Queue backlog monitoring
Performance Metrics:
- Last hour dispatch performance
- 24-hour SLA breach trending
- Job execution response times
- Critical dispatch detection
Scaling Recommendations:
- Current vs recommended worker counts
- Action required alerts
- Step-by-step scaling instructions
Key Performance Indicators #
SLA Compliance Thresholds #
Warning Levels:
- Good: ≥95% compliance (green status)
- Warning: 85-94% compliance (yellow status)
- Critical: <85% compliance (red status)
Premium SLA (1-minute):
- Servers must be checked within 60 seconds of dispatch
- Target: 95% of checks completed within SLA
- Alert: Compliance drops below 95%
Free SLA (5-minute):
- Servers must be checked within 300 seconds of dispatch
- Target: 95% of checks completed within SLA
- Alert: Compliance drops below 95%
Queue Backlog Thresholds #
Premium Queue:
- Normal: 0-30 jobs waiting
- Warning: 30-50 jobs waiting
- Critical: 50+ jobs waiting
Free Queue:
- Normal: 0-60 jobs waiting
- Warning: 60-100 jobs waiting
- Critical: 100+ jobs waiting
Calculations Queue:
- Normal: 0-10 jobs waiting
- Warning: 10+ jobs waiting
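These thresholds can also be checked from the command line. Below is a minimal shell sketch (not part of the plugin), assuming the Redis queue driver and the queues:premium / queues:free keys shown in the troubleshooting section later in this guide:
#!/bin/bash
# queue-thresholds.sh - compare live queue depths against the documented thresholds
# (assumes Redis as the queue driver; key names match the troubleshooting section)
PREMIUM=$(redis-cli llen queues:premium)
FREE=$(redis-cli llen queues:free)

if [ "$PREMIUM" -ge 50 ] || [ "$FREE" -ge 100 ]; then
  echo "CRITICAL: premium backlog=$PREMIUM, free backlog=$FREE"
elif [ "$PREMIUM" -ge 30 ] || [ "$FREE" -ge 60 ]; then
  echo "WARNING: premium backlog=$PREMIUM, free backlog=$FREE"
else
  echo "OK: premium backlog=$PREMIUM, free backlog=$FREE"
fi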
Automated Monitoring #
Capacity Monitoring Job #
Frequency: Every 5 minutes via scheduled job
What It Monitors:
// MonitorWorkerCapacityJob checks:
- SLA compliance rates
- Queue backlog sizes
- Dispatch performance metrics
- Worker response times
Alert Triggers:
- SLA compliance below thresholds
- Queue backlogs exceeding limits
- Critical dispatch times
- Extended response times
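Because the capacity job runs on a schedule, alerts will only fire if the Laravel scheduler itself is being triggered. A quick sanity check, assuming a standard Laravel cron setup (the project path is illustrative):
# Confirm the scheduler cron entry exists
crontab -l | grep "schedule:run"
# Expected form: * * * * * cd /var/www/yoursite && php artisan schedule:run >> /dev/null 2>&1

# List registered scheduled tasks (recent Laravel versions); the capacity monitor should appear here
php artisan schedule:list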
Alert System #
Email Recipients: All users with @albrightlabs.com email addresses
Alert Types:
- Critical: Immediate action required (SLA breach)
- Warning: Performance degradation detected
- Info: Scaling recommendations
Alert Content:
Subject: CRITICAL: Server Monitor Worker Capacity Alert
CRITICAL ALERTS:
- Premium SLA breach! Only 82% of servers checked within 1 minute.
- Premium queue backlog at 67 jobs!
IMMEDIATE ACTION REQUIRED:
1. Check Worker Performance Dashboard: [URL]
2. Add more workers by updating Supervisor configuration
3. Increase 'numprocs' for premium workers
4. Reload Supervisor configuration
Detailed instructions available in dashboard.
Worker Capacity Guidelines #
Premium Workers (1-minute SLA) #
Capacity Formula:
Workers Needed = (Total Premium Servers ÷ 60) × Avg Processing Time × 1.5
Example Calculations:
- 120 premium servers
- 2-second average processing time
- Safety factor: 1.5x
Workers = (120 ÷ 60) × 2 × 1.5 = 6 workers minimum
Performance Targets:
- Each worker: ~60 checks per minute
- Response time: <45 seconds average
- Queue backlog: <30 jobs
- SLA compliance: ≥95%
Free Workers (5-minute SLA) #
Capacity Formula:
Workers Needed = (Total Free Servers ÷ 300) × Avg Processing Time × 1.2
Example Calculations:
- 600 free servers
- 3-second average processing time
- Safety factor: 1.2x
Workers = (600 ÷ 300) × 3 × 1.2 = 7.2 → 8 workers minimum (rounded up)
Performance Targets:
- Each worker: ~12 checks per minute
- Response time: <4 minutes average
- Queue backlog: <60 jobs
- SLA compliance: ≥95%
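Both formulas round up to the next whole worker. If you want to re-run the arithmetic as server counts change, here is a minimal shell sketch reproducing the calculations above (the helper function is illustrative, not part of the plugin):
# capacity.sh - reproduce the worker capacity formulas above
# usage: workers_needed <servers> <sla_window_seconds> <avg_processing_seconds> <safety_factor>
workers_needed() {
  awk -v s="$1" -v w="$2" -v p="$3" -v f="$4" \
    'BEGIN { n = (s / w) * p * f; if (n > int(n)) n = int(n) + 1; print n }'
}

workers_needed 120 60 2 1.5    # premium example above -> 6
workers_needed 600 300 3 1.2   # free example above    -> 8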
Scaling Procedures #
When to Scale Up #
Immediate Scaling Required:
- SLA compliance drops below 85%
- Queue backlog exceeds critical thresholds
- Multiple consecutive SLA breaches
- Dashboard shows red/critical status
Proactive Scaling Indicators:
- SLA compliance 90-95% (warning zone)
- Queue backlog approaching thresholds
- Response times increasing
- New servers added to monitoring
Step-by-Step Scaling #
1. Access Server:
ssh user@yourserver.com
2. Edit Supervisor Configuration:
sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf
3. Increase Worker Count:
# Before (2 premium workers)
[program:servermonitor-premium]
numprocs=2
# After (4 premium workers)
[program:servermonitor-premium]
numprocs=4
4. Apply Changes:
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart servermonitor-premium:*
5. Verify Changes:
sudo supervisorctl status | grep servermonitor-premium
Should show 4 workers running:
servermonitor-premium:servermonitor-premium_00 RUNNING
servermonitor-premium:servermonitor-premium_01 RUNNING
servermonitor-premium:servermonitor-premium_02 RUNNING
servermonitor-premium:servermonitor-premium_03 RUNNING
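Once the new workers are running, confirm the backlog is actually draining. A quick check, assuming the Redis queue driver and the queue key used elsewhere in this guide:
# Watch the premium backlog fall as the extra workers pick up jobs
watch 'redis-cli llen queues:premium'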
Scaling Down #
When to Scale Down:
- Consistently low queue backlogs (<10 jobs)
- SLA compliance consistently >98%
- Server count has decreased
- During low-usage periods
Gradual Reduction:
# Scale down gradually (4 → 3 workers)
sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf
# Change numprocs=3
sudo supervisorctl reread
sudo supervisorctl update
Performance Metrics #
Real-Time Metrics #
Dashboard Metrics (updated every 30 seconds):
- Current SLA compliance percentages
- Active worker counts
- Queue backlog sizes
- Recent SLA breaches (last 6 hours)
Historical Metrics:
- 24-hour SLA breach trends
- Hourly breakdown of performance issues
- Worker scaling history
- Peak load analysis
API Monitoring #
Worker Status API:
curl "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY"
Response Format:
{
"success": true,
"status": "healthy|warning|critical",
"timestamp": "2025-01-01T12:00:00Z",
"sla_compliance": {
"premium": {"percentage": 98.5, "checked": 74, "total": 75},
"free": {"percentage": 100.0, "checked": 60, "total": 60}
},
"queue_status": {"premium": 5, "free": 12},
"recent_breaches": 2,
"dashboard_url": "https://yoursite.com/.../workerperformance"
}
Console Monitoring #
Check Capacity Command:
php artisan servermonitor:check-capacity
Sample Output:
Checking worker capacity...
Quick Summary:
==============
Premium Queue: 15 jobs
Free Queue: 42 jobs
SLA Breaches (last hour): 3
⚠️ ACTION REQUIRED: Worker capacity needs attention!
Visit the Worker Performance Dashboard for details:
https://yoursite.com/albrightlabs/servermonitor/workerperformance
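The same command can be run from cron if you want a periodic console record alongside the automated email alerts; a minimal sketch (the project path, schedule, and log location are illustrative):
# Append a capacity summary to a log every 10 minutes
*/10 * * * * cd /var/www/yoursite && php artisan servermonitor:check-capacity >> storage/logs/capacity-check.log 2>&1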
Troubleshooting Performance Issues #
High Queue Backlog #
Symptoms:
- Dashboard shows queue warnings
- SLA compliance dropping
- Increased response times
Diagnosis:
# Check actual queue sizes
redis-cli llen queues:premium
redis-cli llen queues:free
# Check worker status
sudo supervisorctl status | grep servermonitor
# Check worker logs for errors
tail -f storage/logs/worker-premium.log
Solutions:
- Add more workers (primary solution)
- Check for stuck jobs: php artisan queue:failed
- Restart workers: sudo supervisorctl restart servermonitor-premium:*
- Clear stuck jobs: php artisan queue:clear (use with caution!)
SLA Breaches #
Common Causes:
- Insufficient worker capacity
- Network issues to monitored endpoints
- Database performance problems
- Queue driver issues
Investigation Steps:
# Check recent breaches
tail -100 storage/logs/laravel.log | grep "SLA BREACH"
# Check worker response times
grep "response_time_seconds" storage/logs/laravel.log | tail -20
# Check for slow endpoints
grep "timeout" storage/logs/worker-premium.log
Memory Issues #
Symptoms:
- Workers consuming excessive memory
- Workers crashing or restarting
- Overall system slowdown
Monitoring:
# Check worker memory usage
ps aux | grep "queue:work" | awk '{print $6, $11}' | sort -nr
# Monitor over time
watch 'ps aux | grep "queue:work" | grep -v grep'
Solutions:
# Restart workers to clear memory
sudo supervisorctl restart servermonitor-premium:*
# Reduce max-jobs per worker
# Edit supervisor config: --max-jobs=500 (reduce from 1000)
# Add memory limits
# Edit supervisor config: -d memory_limit=256M
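Putting those two adjustments together, the premium worker program in Supervisor might look like the sketch below. The artisan path, queue name, and option values are illustrative, not the plugin's shipped defaults:
# Illustrative example - adjust paths, queue name, and counts for your environment
[program:servermonitor-premium]
process_name=%(program_name)s_%(process_num)02d
command=php -d memory_limit=256M /var/www/yoursite/artisan queue:work --queue=premium --max-jobs=500 --tries=3
numprocs=4
autostart=true
autorestart=true
user=www-data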
Advanced Monitoring #
Custom Alerts #
Integration with External Monitoring:
#!/bin/bash
# custom-monitoring.sh
STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
HEALTH=$(echo $STATUS | jq -r '.status')
case $HEALTH in
"critical")
# Send to PagerDuty, Slack, etc.
curl -X POST https://hooks.slack.com/... \
-d '{"text":"CRITICAL: ServerMonitor worker capacity alert!"}'
;;
"warning")
# Send warning notification
echo "WARNING: Worker capacity issues detected" | mail admin@company.com
;;
"healthy")
# All good, no action needed
;;
esac
Automated Scaling #
Auto-scaling Script (advanced):
#!/bin/bash
# auto-scale.sh - Run every 5 minutes
API_STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
PREMIUM_QUEUE=$(echo $API_STATUS | jq -r '.queue_status.premium')
FREE_QUEUE=$(echo $API_STATUS | jq -r '.queue_status.free')
# Scale up premium workers if queue is backing up
if [ "$PREMIUM_QUEUE" -gt 50 ]; then
# Add 2 more premium workers
CURRENT_WORKERS=$(supervisorctl status | grep servermonitor-premium | wc -l)
NEW_COUNT=$((CURRENT_WORKERS + 2))
# Update the Supervisor config programmatically, scoped to the premium program's
# section so other programs' numprocs lines are left untouched
sed -i "/^\[program:servermonitor-premium\]/,/^\[/ s/^numprocs=.*/numprocs=$NEW_COUNT/" /etc/supervisor/conf.d/servermonitor-workers.conf
supervisorctl reread
supervisorctl update
echo "Scaled premium workers to $NEW_COUNT due to queue backlog: $PREMIUM_QUEUE"
fi
# Scale down if queue is consistently low
if [ "$PREMIUM_QUEUE" -lt 5 ] && [ "$FREE_QUEUE" -lt 10 ]; then
# Consider scaling down (implement with more sophisticated logic)
echo "Queues are low, consider scaling down workers"
fi
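If you adopt a script like this, run it from root's crontab at the cadence noted in its header (it edits /etc/supervisor and calls supervisorctl), and keep its output for auditing scaling decisions; the paths are illustrative:
# root crontab entry: run the auto-scaler every 5 minutes and log its decisions
*/5 * * * * /usr/local/bin/auto-scale.sh >> /var/log/servermonitor-autoscale.log 2>&1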
Best Practices #
Monitoring Schedule #
Daily:
- Check Worker Performance Dashboard
- Review SLA compliance trends
- Monitor queue backlogs
Weekly:
- Analyze performance patterns
- Plan capacity for upcoming changes
- Review scaling events
Monthly:
- Assess overall performance trends
- Plan infrastructure improvements
- Review alert thresholds
Capacity Planning #
Before Adding Servers:
- Check current capacity utilization
- Calculate additional worker requirements
- Plan scaling timeline
- Communicate with team
Server Growth Planning:
Current: 100 premium servers, 2 workers (50 servers/worker)
Adding: 50 more premium servers
New total: 150 premium servers
Workers needed: 150 ÷ 50 = 3 workers minimum
Recommended: 4 workers (with safety margin)
Performance Optimization #
Worker Efficiency:
- Monitor individual worker performance
- Identify and optimize slow endpoints
- Use connection pooling for database
- Implement proper error handling
Infrastructure Optimization:
- Use Redis as the queue driver (see the sketch after this list)
- Optimize database queries
- Implement proper caching
- Monitor network latency
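For the queue driver recommendation, switching to Redis is a standard Laravel setting; a minimal sketch of the relevant .env entries (host and port will differ per environment):
# .env - use Redis as the queue driver (standard Laravel settings)
QUEUE_CONNECTION=redis
REDIS_HOST=127.0.0.1
REDIS_PORT=6379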
Integration with Other Systems #
External Monitoring #
Prometheus Integration:
# Export metrics for Prometheus
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
| jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring)'
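One way to get that metric into Prometheus without writing a dedicated exporter is the node_exporter textfile collector; a sketch, assuming node_exporter runs with --collector.textfile.directory=/var/lib/node_exporter:
# Write the metric atomically so node_exporter never reads a partial file
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
  | jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring)' \
  > /var/lib/node_exporter/servermonitor.prom.$$ \
  && mv /var/lib/node_exporter/servermonitor.prom.$$ /var/lib/node_exporter/servermonitor.prom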
Grafana Dashboard:
- Import worker metrics
- Create SLA compliance graphs
- Set up alert thresholds
- Monitor trends over time
Log Aggregation #
ELK Stack Integration:
# Send worker logs to Elasticsearch
tail -f storage/logs/worker-premium.log | \
while read line; do
echo "$line" | \
jq -R -s '{message: ., timestamp: now, service: "servermonitor-premium"}' | \
curl -X POST "http://elasticsearch:9200/servermonitor/_doc" \
-H "Content-Type: application/json" -d @-
done