Kubernetes Components and Health Monitoring
Learn about the various Kubernetes pods and jobs in self-hosted Opal, along with health monitoring recommendations.
This guide provides high-level recommendations for self-hosted Opal customers on the various Kubernetes pods and jobs, their schedules, and recommended health monitoring practices.
Overview
Opal consists of several types of Kubernetes workloads:
- Deployments: Long-running pods that handle API requests and background task processing
- CronJobs: Scheduled jobs that run periodically to sync data, clean up resources, and process background tasks
- Jobs: One-time jobs that run during upgrades
Long-Running Deployments
These are long-running pods that run continuously. We recommend verifying that pods are in the “Running” state and watching for unexpected restarts. Refer below for specific deployments and recommendations.
1. Web Backend opal-web
Purpose: Main API server handling HTTP requests from the frontend and external integrations.
Health Monitoring:
- Ensure pods are running and ready (not in CrashLoopBackOff, Error, or Pending states)
- Monitor pod restart count
Recommended Alerts:
- Pods not running or ready
- Pod restart count increasing
- Pods stuck in error or pending states
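If you run Prometheus with kube-state-metrics, alerting rules along these lines can cover the recommendations above. This is a sketch, not Opal's shipped configuration: the `opal` namespace, severities, and thresholds are assumptions to adjust for your environment.

```yaml
# Sketch of Prometheus alerting rules for the opal-web deployment.
# Assumes kube-state-metrics is installed; namespace and thresholds
# are placeholders.
groups:
  - name: opal-web
    rules:
      - alert: OpalWebPodsNotReady
        expr: |
          kube_deployment_status_replicas_ready{deployment="opal-web", namespace="opal"}
            < kube_deployment_spec_replicas{deployment="opal-web", namespace="opal"}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "opal-web has fewer ready pods than desired"
      - alert: OpalWebPodRestarting
        expr: |
          increase(kube_pod_container_status_restarts_total{pod=~"opal-web-.*", namespace="opal"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "opal-web pod restarted more than 3 times in the past hour"
```

The same two rules can be cloned for the event consumer and task worker deployments by changing the `deployment` and `pod` selectors.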
2. Event Consumers opal-web-event-consumers
Purpose: Processes events from external systems and internal event streams.
Health Monitoring:
- Ensure pods are running and ready
Recommended Alerts:
- Pods not running or ready
- Pod restart count increasing
3. Task Workers opal-web-task-workers
Purpose: Processes background tasks from specialized queues. Different task worker types handle different operations such as general async work, event streaming, sync operations, and propagation tasks.
Health Monitoring:
- Ensure all task worker pods are running and ready
- Monitor restart counts across all task worker deployments
Recommended Alerts:
- Any task worker pods not running or ready
- Pod restart count increasing
- Task worker pods stuck in error or pending states
Note: Some task workers have longer termination grace periods to allow long-running tasks to complete.
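The longer grace period mentioned in the note corresponds to `terminationGracePeriodSeconds` in the pod spec, which controls how long Kubernetes waits between SIGTERM and SIGKILL during pod shutdown. A minimal illustration (the 600-second value and container details are examples, not Opal's actual settings):

```yaml
# Illustrative pod spec fragment — the 600s value is an example,
# not Opal's actual configuration.
spec:
  terminationGracePeriodSeconds: 600  # allow in-flight tasks to finish before SIGKILL
  containers:
    - name: task-worker
      image: opal-web_backend  # per this guide, all components share one image
```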
Scheduled CronJobs
These are automated jobs that run periodically on a fixed schedule to sync data and clean up resources. Generally, we recommend monitoring that each job completes successfully within its expected timeframe (typically 2x its schedule interval) and setting up alerts for job failures. Refer below for specific CronJobs.
1. Sync Jobs
Purpose: Performs sync of user and resource data from connected systems (IDPs, cloud providers, HR systems).
Schedules:
- Regular sync: Every 4 hours (0 */4 * * *)
- Daily sync: Daily at 9:30 AM UTC (30 9 * * *)
- High-frequency sync: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120-180 seconds
Health Monitoring:
- Ensure regular sync completes successfully within 8 hours (2x schedule interval)
- Ensure daily sync completes successfully within 48 hours (2x schedule interval)
- Ensure high-frequency sync completes successfully within 20 minutes (4x schedule interval)
Recommended Alerts:
- No successful regular sync completion in the past 8 hours
- No successful daily sync completion in the past 48 hours
- No successful high-frequency sync completion in the past 20 minutes
- Sync job failure (exit code != 0)
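With kube-state-metrics, staleness and failure alerts for these jobs can be sketched as below. The CronJob names (`regular-sync`, the `.*sync.*` pattern) and the `opal` namespace are assumptions — substitute the actual resource names from your cluster:

```yaml
# Sketch: alert when a sync CronJob has not succeeded recently, or any
# sync Job has failed. Resource names and namespace are placeholders.
groups:
  - name: opal-sync-jobs
    rules:
      - alert: OpalRegularSyncStale
        expr: |
          time() - kube_cronjob_status_last_successful_time{cronjob="regular-sync", namespace="opal"} > 8 * 3600
        labels:
          severity: critical
        annotations:
          summary: "No successful regular sync in the past 8 hours"
      - alert: OpalSyncJobFailed
        expr: |
          kube_job_status_failed{job_name=~".*sync.*", namespace="opal"} > 0
        labels:
          severity: warning
        annotations:
          summary: "A sync job has failed"
```

The same `kube_cronjob_status_last_successful_time` pattern applies to the event streaming, recommendations, metrics, and cleanup CronJobs below, with the thresholds each section lists.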
2. Event Streaming Jobs
Purpose: Manages event stream operations including publishing events to external systems, requeuing failed messages, deactivating unused streams, and cleaning up old messages.
Schedules:
- Event stream producer: Every 1 minute (*/1 * * * *)
- Event stream re-queuer: Every 15 minutes (*/15 * * * *)
- Event stream deactivator: Every 30 minutes (*/30 * * * *)
- Event stream messages cleanup: Daily at 12:00 PM UTC (0 12 * * *)
- Event stream notifier: Daily at 1:00 PM UTC (0 13 * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure event stream producer completes successfully within 2 minutes (2x schedule interval)
- Ensure event stream re-queuer completes successfully within 30 minutes (2x schedule interval)
- Ensure event stream deactivator completes successfully within 60 minutes (2x schedule interval)
- Ensure event stream cleanup and notifier complete successfully within 48 hours (2x schedule interval)
Recommended Alerts:
- No successful event stream producer completion in the past 2 minutes
- No successful event stream re-queuer completion in the past 30 minutes
- No successful event stream deactivator completion in the past 60 minutes
- No successful event stream cleanup/notifier completion in the past 48 hours
- Event streaming job failure (exit code != 0)
3. Recommendations recommendations-subscores
Purpose: Calculates and updates recommendation subscores for resources and groups to support access recommendations and risk analysis.
Schedule: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 10 minutes (2x schedule interval)
Recommended Alerts:
- No successful completion in the past 10 minutes
- Job failure (exit code != 0)
4. Metrics Collection metrics-collector
Purpose: Collects and aggregates metrics for reporting and analytics.
Schedule: Daily at 6:30 AM UTC (30 6 * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 48 hours (2x schedule interval)
Recommended Alerts:
- No successful completion in the past 48 hours
- Job failure (exit code != 0)
5. Scheduled Tasks Cleanup scheduled-tasks-cleanup
Purpose: Cleans up old completed scheduled tasks from the database.
Schedule: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 48 hours
Recommended Alerts:
- No successful completion in the past 48 hours
- Job failure (exit code != 0)
One-Time Jobs
These jobs execute once to perform critical setup tasks. You should monitor that these jobs complete without errors during the upgrade window.
Oneoff oneoff
Purpose: Runs one-time database migrations and setup tasks during Helm upgrades.
Trigger: Runs automatically as a post-upgrade Helm hook
Health Monitoring:
- Monitor job completion status
- Check for job failures during upgrades
Recommended Alerts:
- Job failure during upgrades (exit code != 0)
- Job running longer than expected
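Since the guide describes `oneoff` as a post-upgrade Helm hook, a typical hook Job carries annotations like the following. This is an illustrative shape only — the hook policies and container details are assumptions, and the chart Opal ships may differ:

```yaml
# Illustrative post-upgrade hook Job — annotation values and container
# details are assumptions, not Opal's actual chart.
apiVersion: batch/v1
kind: Job
metadata:
  name: oneoff
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 0          # fail fast so a broken migration surfaces immediately
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oneoff
          image: opal-web_backend
```

Because hook Jobs run during `helm upgrade`, a failure here typically surfaces as a failed upgrade, so alerting on failed Jobs in the namespace during upgrade windows is usually sufficient.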
General Health Monitoring Recommendations
For All Deployments
- Pod Status: Monitor for pods in CrashLoopBackOff, Error, or Pending states
- Resource Usage: CPU usage > 80% of request, Memory usage > 80% of limit
- Restart Count: Alert if a pod restarts more than 3 times in an hour
- Availability: Ensure pods are running as expected
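The resource-usage thresholds above can be expressed as Prometheus rules joining cAdvisor usage metrics with kube-state-metrics requests/limits. This is a sketch assuming both metric sources are available; the namespace is a placeholder:

```yaml
# Sketch: resource-usage alerts for the 80% thresholds above.
# Assumes cAdvisor (kubelet) and kube-state-metrics metrics; the
# namespace label is a placeholder.
groups:
  - name: opal-resource-usage
    rules:
      - alert: OpalContainerMemoryHigh
        expr: |
          container_memory_working_set_bytes{namespace="opal", container!=""}
            / on (namespace, pod, container)
              kube_pod_container_resource_limits{resource="memory", namespace="opal"} > 0.8
        for: 10m
        annotations:
          summary: "Container memory above 80% of its limit"
      - alert: OpalContainerCPUHigh
        expr: |
          rate(container_cpu_usage_seconds_total{namespace="opal", container!=""}[5m])
            / on (namespace, pod, container)
              kube_pod_container_resource_requests{resource="cpu", namespace="opal"} > 0.8
        for: 10m
        annotations:
          summary: "Container CPU above 80% of its request"
```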
For All CronJobs
- Execution Status: Ensure jobs have completed successfully within 2x-4x their schedule interval
- Failure Monitoring: Alert on job failures (exit code != 0)
- Concurrency: All CronJobs use the Forbid concurrency policy - ensure previous jobs complete before new ones start
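The concurrency policy and starting deadlines described in this guide map onto standard CronJob spec fields. A sketch using the scheduled-tasks-cleanup job as an example (the container details are illustrative, not Opal's actual manifest):

```yaml
# Sketch of a CronJob with the Forbid policy and a 120s starting
# deadline, per this guide. Container details are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduled-tasks-cleanup
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid      # skip a run if the previous one is still going
  startingDeadlineSeconds: 120   # count a run as missed if it can't start within 120s
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: opal-web_backend
```

With `Forbid`, a long-overrunning job silently skips subsequent runs - which is exactly why the "no successful completion within 2x the interval" alerts above matter.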
Database Migrations
Monitor for database migration failures or delays during pod startup and upgrades, as these may indicate database performance issues.
Common Issues to Watch For
- Database Connection Issues: All components depend on PostgreSQL. Monitor database connectivity.
- Redis Connection Issues: Task workers and event consumers depend on Redis. Monitor Redis connectivity.
- Resource Constraints: High memory/CPU usage may cause pods to be evicted or OOMKilled.
- Network Issues: Pods need network access to external systems (IDPs, cloud providers) for syncing.
Monitoring Best Practices
- Set up alerts for all critical components (web backend, sync jobs)
- Monitor logs for error patterns and exceptions
- Track metrics for processing times and error rates
- Set up dashboards for:
- Pod health and resource usage
- CronJob execution status and duration
- Application metrics (request rates, error rates, latency)
- Use Kubernetes events to monitor for scheduling issues, pod evictions, etc.
Additional Notes
- Every pod, including all CronJobs, includes database migrations as init containers
- All components use the same Docker image (opal-web_backend) with different command arguments
- Health endpoints (/api/health and /api/readiness) are configured on all deployments via liveness and readiness probes
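A probe configuration using those two endpoints would look roughly like this. The port, timings, and thresholds are assumptions - check your chart values for the actual settings:

```yaml
# Illustrative probes hitting the documented health endpoints.
# Port and timing values are placeholders.
livenessProbe:
  httpGet:
    path: /api/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```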
