Worker Performance Monitoring #
Complete guide to monitoring and optimizing queue worker performance for maintaining SLA compliance.
Note: This content is derived from the original
docs/worker-monitoring.md file that was included with the plugin. It describes detailed worker monitoring capabilities beyond the basic performance optimization covered in the Performance Guide.
Overview #
The Server Monitor plugin includes comprehensive worker capacity monitoring to ensure SLA compliance for both free (5-minute) and premium (1-minute) endpoint checks. The system provides real-time monitoring, proactive alerts, and scaling recommendations.
Worker Performance Dashboard #
Access Requirements #
URL: /albrightlabs/servermonitor/workerperformance
Access Control: Albright Labs staff only
- Must have an @albrightlabs.com email address
- Requires the albrightlabs.servermonitor.view_performance permission
Dashboard Features #
Real-time SLA Compliance:
- Premium plan compliance (1-minute SLA)
- Free plan compliance (5-minute SLA)
- Current worker counts and recommendations
- Queue backlog monitoring
Performance Metrics:
- Last hour dispatch performance
- 24-hour SLA breach trending
- Job execution response times
- Critical dispatch detection
Scaling Recommendations:
- Current vs recommended worker counts
- Action required alerts
- Step-by-step scaling instructions
Key Performance Indicators #
SLA Compliance Thresholds #
Warning Levels:
- Good: ≥95% compliance (green status)
- Warning: 85-94% compliance (yellow status)
- Critical: <85% compliance (red status)
Premium SLA (1-minute):
- Servers must be checked within 60 seconds of dispatch
- Target: 95% of checks completed within SLA
- Alert: Compliance drops below 95%
Free SLA (5-minute):
- Servers must be checked within 300 seconds of dispatch
- Target: 95% of checks completed within SLA
- Alert: Compliance drops below 95%
Queue Backlog Thresholds #
Premium Queue:
- Normal: 0-30 jobs waiting
- Warning: 30-50 jobs waiting
- Critical: 50+ jobs waiting
Free Queue:
- Normal: 0-60 jobs waiting
- Warning: 60-100 jobs waiting
- Critical: 100+ jobs waiting
Calculations Queue:
- Normal: 0-10 jobs waiting
- Warning: 10+ jobs waiting
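These thresholds can also be checked from the command line. Below is a minimal shell sketch (not part of the plugin), assuming the Redis queue driver and the queues:premium / queues:free keys shown in the troubleshooting section later in this guide:
#!/bin/bash
# queue-thresholds.sh - compare live queue depths against the documented thresholds
# (assumes Redis as the queue driver; key names match the troubleshooting section)
PREMIUM=$(redis-cli llen queues:premium)
FREE=$(redis-cli llen queues:free)

if [ "$PREMIUM" -ge 50 ] || [ "$FREE" -ge 100 ]; then
  echo "CRITICAL: premium backlog=$PREMIUM, free backlog=$FREE"
elif [ "$PREMIUM" -ge 30 ] || [ "$FREE" -ge 60 ]; then
  echo "WARNING: premium backlog=$PREMIUM, free backlog=$FREE"
else
  echo "OK: premium backlog=$PREMIUM, free backlog=$FREE"
fi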
Automated Monitoring #
Capacity Monitoring Job #
Frequency: Every 5 minutes via scheduled job
What It Monitors:
// MonitorWorkerCapacityJob checks:
- SLA compliance rates
- Queue backlog sizes
- Dispatch performance metrics
- Worker response times
Alert Triggers:
- SLA compliance below thresholds
- Queue backlogs exceeding limits
- Critical dispatch times
- Extended response times
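Because the capacity job runs on a schedule, alerts will only fire if the Laravel scheduler itself is being triggered. A quick sanity check, assuming a standard Laravel cron setup (the project path is illustrative):
# Confirm the scheduler cron entry exists
crontab -l | grep "schedule:run"
# Expected form: * * * * * cd /var/www/yoursite && php artisan schedule:run >> /dev/null 2>&1

# List registered scheduled tasks (recent Laravel versions); the capacity monitor should appear here
php artisan schedule:list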
Alert System #
Email Recipients: All users with @albrightlabs.com email addresses
Alert Types:
- Critical: Immediate action required (SLA breach)
- Warning: Performance degradation detected
- Info: Scaling recommendations
Alert Content:
Subject: CRITICAL: Server Monitor Worker Capacity Alert
CRITICAL ALERTS:
- Premium SLA breach! Only 82% of servers checked within 1 minute.
- Premium queue backlog at 67 jobs!
IMMEDIATE ACTION REQUIRED:
1. Check Worker Performance Dashboard: [URL]
2. Add more workers by updating Supervisor configuration
3. Increase 'numprocs' for premium workers
4. Reload Supervisor configuration
Detailed instructions available in dashboard.
Worker Capacity Guidelines #
Premium Workers (1-minute SLA) #
Capacity Formula:
Workers Needed = (Total Premium Servers ÷ 60) × Avg Processing Time × 1.5
Example Calculations:
- 120 premium servers
- 2-second average processing time
- Safety factor: 1.5x
Workers = (120 ÷ 60) × 2 × 1.5 = 6 workers minimum
Performance Targets:
- Each worker: ~60 checks per minute
- Response time: <45 seconds average
- Queue backlog: <30 jobs
- SLA compliance: ≥95%
Free Workers (5-minute SLA) #
Capacity Formula:
Workers Needed = (Total Free Servers ÷ 300) × Avg Processing Time × 1.2
Example Calculations:
- 600 free servers
- 3-second average processing time
- Safety factor: 1.2x
Workers = (600 ÷ 300) × 3 × 1.2 = 7.2 → 8 workers minimum (rounded up)
Performance Targets:
- Each worker: ~12 checks per minute
- Response time: <4 minutes average
- Queue backlog: <60 jobs
- SLA compliance: ≥95%
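Both formulas round up to the next whole worker. If you want to re-run the arithmetic as server counts change, here is a minimal shell sketch reproducing the calculations above (the helper function is illustrative, not part of the plugin):
# capacity.sh - reproduce the worker capacity formulas above
# usage: workers_needed <servers> <sla_window_seconds> <avg_processing_seconds> <safety_factor>
workers_needed() {
  awk -v s="$1" -v w="$2" -v p="$3" -v f="$4" \
    'BEGIN { n = (s / w) * p * f; if (n > int(n)) n = int(n) + 1; print n }'
}

workers_needed 120 60 2 1.5    # premium example above -> 6
workers_needed 600 300 3 1.2   # free example above    -> 8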
Scaling Procedures #
When to Scale Up #
Immediate Scaling Required:
- SLA compliance drops below 85%
- Queue backlog exceeds critical thresholds
- Multiple consecutive SLA breaches
- Dashboard shows red/critical status
Proactive Scaling Indicators:
- SLA compliance 90-95% (warning zone)
- Queue backlog approaching thresholds
- Response times increasing
- New servers added to monitoring
Step-by-Step Scaling #
1. Access Server:
ssh user@yourserver.com
2. Edit Supervisor Configuration:
sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf
3. Increase Worker Count:
# Before (2 premium workers)
[program:servermonitor-premium]
numprocs=2
# After (4 premium workers)
[program:servermonitor-premium]
numprocs=4
4. Apply Changes:
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart servermonitor-premium:*
5. Verify Changes:
sudo supervisorctl status | grep servermonitor-premium
Should show 4 workers running:
servermonitor-premium:servermonitor-premium_00 RUNNING
servermonitor-premium:servermonitor-premium_01 RUNNING
servermonitor-premium:servermonitor-premium_02 RUNNING
servermonitor-premium:servermonitor-premium_03 RUNNING
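Once the new workers are running, confirm the backlog is actually draining. A quick check, assuming the Redis queue driver and the queue key used elsewhere in this guide:
# Watch the premium backlog fall as the extra workers pick up jobs
watch 'redis-cli llen queues:premium'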
Scaling Down #
When to Scale Down:
- Consistently low queue backlogs (<10 jobs)
- SLA compliance consistently >98%
- Server count has decreased
- During low-usage periods
Gradual Reduction:
# Scale down gradually (4 → 3 workers)
sudo nano /etc/supervisor/conf.d/servermonitor-workers.conf
# Change numprocs=3
sudo supervisorctl reread
sudo supervisorctl update
Performance Metrics #
Real-Time Metrics #
Dashboard Metrics (updated every 30 seconds):
- Current SLA compliance percentages
- Active worker counts
- Queue backlog sizes
- Recent SLA breaches (last 6 hours)
Historical Metrics:
- 24-hour SLA breach trends
- Hourly breakdown of performance issues
- Worker scaling history
- Peak load analysis
API Monitoring #
Worker Status API:
curl "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY"
Response Format:
{
"success": true,
"status": "healthy|warning|critical",
"timestamp": "2025-01-01T12:00:00Z",
"sla_compliance": {
"premium": {"percentage": 98.5, "checked": 74, "total": 75},
"free": {"percentage": 100.0, "checked": 60, "total": 60}
},
"queue_status": {"premium": 5, "free": 12},
"recent_breaches": 2,
"dashboard_url": "https://yoursite.com/.../workerperformance"
}
Console Monitoring #
Check Capacity Command:
php artisan servermonitor:check-capacity
Sample Output:
Checking worker capacity...
Quick Summary:
==============
Premium Queue: 15 jobs
Free Queue: 42 jobs
SLA Breaches (last hour): 3
⚠️ ACTION REQUIRED: Worker capacity needs attention!
Visit the Worker Performance Dashboard for details:
https://yoursite.com/albrightlabs/servermonitor/workerperformance
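The same command can be run from cron if you want a periodic console record alongside the automated email alerts; a minimal sketch (the project path, schedule, and log location are illustrative):
# Append a capacity summary to a log every 10 minutes
*/10 * * * * cd /var/www/yoursite && php artisan servermonitor:check-capacity >> storage/logs/capacity-check.log 2>&1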
Troubleshooting Performance Issues #
High Queue Backlog #
Symptoms:
- Dashboard shows queue warnings
- SLA compliance dropping
- Increased response times
Diagnosis:
# Check actual queue sizes
redis-cli llen queues:premium
redis-cli llen queues:free
# Check worker status
sudo supervisorctl status | grep servermonitor
# Check worker logs for errors
tail -f storage/logs/worker-premium.log
Solutions:
- Add more workers (primary solution)
- Check for stuck jobs: php artisan queue:failed
- Restart workers: sudo supervisorctl restart servermonitor-premium:*
- Clear stuck jobs: php artisan queue:clear (use with caution!)
SLA Breaches #
Common Causes:
- Insufficient worker capacity
- Network issues to monitored endpoints
- Database performance problems
- Queue driver issues
Investigation Steps:
# Check recent breaches
tail -100 storage/logs/laravel.log | grep "SLA BREACH"
# Check worker response times
grep "response_time_seconds" storage/logs/laravel.log | tail -20
# Check for slow endpoints
grep "timeout" storage/logs/worker-premium.log
Memory Issues #
Symptoms:
- Workers consuming excessive memory
- Workers crashing or restarting
- Overall system slowdown
Monitoring:
# Check worker memory usage
ps aux | grep "queue:work" | awk '{print $6, $11}' | sort -nr
# Monitor over time
watch 'ps aux | grep "queue:work" | grep -v grep'
Solutions:
# Restart workers to clear memory
sudo supervisorctl restart servermonitor-premium:*
# Reduce max-jobs per worker
# Edit supervisor config: --max-jobs=500 (reduce from 1000)
# Add memory limits
# Edit supervisor config: -d memory_limit=256M
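Putting those two adjustments together, the premium worker program in Supervisor might look like the sketch below. The artisan path, queue name, and option values are illustrative, not the plugin's shipped defaults:
# Illustrative example - adjust paths, queue name, and counts for your environment
[program:servermonitor-premium]
process_name=%(program_name)s_%(process_num)02d
command=php -d memory_limit=256M /var/www/yoursite/artisan queue:work --queue=premium --max-jobs=500 --tries=3
numprocs=4
autostart=true
autorestart=true
user=www-data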
Advanced Monitoring #
Custom Alerts #
Integration with External Monitoring:
#!/bin/bash
# custom-monitoring.sh
STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
HEALTH=$(echo $STATUS | jq -r '.status')
case $HEALTH in
"critical")
# Send to PagerDuty, Slack, etc.
curl -X POST https://hooks.slack.com/... \
-d '{"text":"CRITICAL: ServerMonitor worker capacity alert!"}'
;;
"warning")
# Send warning notification
echo "WARNING: Worker capacity issues detected" | mail admin@company.com
;;
"healthy")
# All good, no action needed
;;
esac
Automated Scaling #
Auto-scaling Script (advanced):
#!/bin/bash
# auto-scale.sh - Run every 5 minutes
API_STATUS=$(curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY")
PREMIUM_QUEUE=$(echo $API_STATUS | jq -r '.queue_status.premium')
FREE_QUEUE=$(echo $API_STATUS | jq -r '.queue_status.free')
# Scale up premium workers if queue is backing up
if [ "$PREMIUM_QUEUE" -gt 50 ]; then
# Add 2 more premium workers
CURRENT_WORKERS=$(supervisorctl status | grep servermonitor-premium | wc -l)
NEW_COUNT=$((CURRENT_WORKERS + 2))
# Update the Supervisor config programmatically, scoped to the premium program's
# section so other programs' numprocs lines are left untouched
sed -i "/^\[program:servermonitor-premium\]/,/^\[/ s/^numprocs=.*/numprocs=$NEW_COUNT/" /etc/supervisor/conf.d/servermonitor-workers.conf
supervisorctl reread
supervisorctl update
echo "Scaled premium workers to $NEW_COUNT due to queue backlog: $PREMIUM_QUEUE"
fi
# Scale down if queue is consistently low
if [ "$PREMIUM_QUEUE" -lt 5 ] && [ "$FREE_QUEUE" -lt 10 ]; then
# Consider scaling down (implement with more sophisticated logic)
echo "Queues are low, consider scaling down workers"
fi
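If you adopt a script like this, run it from root's crontab at the cadence noted in its header (it edits /etc/supervisor and calls supervisorctl), and keep its output for auditing scaling decisions; the paths are illustrative:
# root crontab entry: run the auto-scaler every 5 minutes and log its decisions
*/5 * * * * /usr/local/bin/auto-scale.sh >> /var/log/servermonitor-autoscale.log 2>&1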
Best Practices #
Monitoring Schedule #
Daily:
- Check Worker Performance Dashboard
- Review SLA compliance trends
- Monitor queue backlogs
Weekly:
- Analyze performance patterns
- Plan capacity for upcoming changes
- Review scaling events
Monthly:
- Assess overall performance trends
- Plan infrastructure improvements
- Review alert thresholds
Capacity Planning #
Before Adding Servers:
- Check current capacity utilization
- Calculate additional worker requirements
- Plan scaling timeline
- Communicate with team
Server Growth Planning:
Current: 100 premium servers, 2 workers (50 servers/worker)
Adding: 50 more premium servers
New total: 150 premium servers
Workers needed: 150 ÷ 50 = 3 workers minimum
Recommended: 4 workers (with safety margin)
Performance Optimization #
Worker Efficiency:
- Monitor individual worker performance
- Identify and optimize slow endpoints
- Use connection pooling for database
- Implement proper error handling
Infrastructure Optimization:
- Use Redis as the queue driver (see the sketch after this list)
- Optimize database queries
- Implement proper caching
- Monitor network latency
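For the queue driver recommendation, switching to Redis is a standard Laravel setting; a minimal sketch of the relevant .env entries (host and port will differ per environment):
# .env - use Redis as the queue driver (standard Laravel settings)
QUEUE_CONNECTION=redis
REDIS_HOST=127.0.0.1
REDIS_PORT=6379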
Integration with Other Systems #
External Monitoring #
Prometheus Integration:
# Export metrics for Prometheus
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
| jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring)'
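One way to get that metric into Prometheus without writing a dedicated exporter is the node_exporter textfile collector; a sketch, assuming node_exporter runs with --collector.textfile.directory=/var/lib/node_exporter:
# Write the metric atomically so node_exporter never reads a partial file
curl -s "https://yoursite.com/api/servermonitor/worker-status/YOUR_API_KEY" \
  | jq -r '"servermonitor_sla_compliance{type=\"premium\"} " + (.sla_compliance.premium.percentage | tostring)' \
  > /var/lib/node_exporter/servermonitor.prom.$$ \
  && mv /var/lib/node_exporter/servermonitor.prom.$$ /var/lib/node_exporter/servermonitor.prom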
Grafana Dashboard:
- Import worker metrics
- Create SLA compliance graphs
- Set up alert thresholds
- Monitor trends over time
Log Aggregation #
ELK Stack Integration:
# Send worker logs to Elasticsearch
tail -f storage/logs/worker-premium.log | \
while read line; do
echo "$line" | \
jq -R -s '{message: ., timestamp: now, service: "servermonitor-premium"}' | \
curl -X POST "http://elasticsearch:9200/servermonitor/_doc" \
-H "Content-Type: application/json" -d @-
done