Kubernetes Components and Health Monitoring
Learn about the various Kubernetes pods and jobs in self-hosted Opal, along with health monitoring recommendations.
This guide provides high-level recommendations for self-hosted Opal customers on the various Kubernetes pods and jobs, their schedules, and recommended health monitoring practices.
Overview
Opal consists of several types of Kubernetes workloads:
- Deployments: Long-running pods that handle API requests and background task processing
- CronJobs: Scheduled jobs that run periodically to sync data, clean up resources, and process background tasks
- Jobs: One-time jobs that run during upgrades
Long-Running Deployments
These are long-running pods that run continuously. We recommend verifying that pods are in the “Running” state and watching for unexpected restarts. Refer below for specific deployments and recommendations.
1. Web Backend opal-web
Purpose: Main API server handling HTTP requests from the frontend and external integrations.
Health Monitoring:
- Ensure pods are running and ready (not in CrashLoopBackOff, Error, or Pending states)
- Monitor pod restart count
Recommended Alerts:
- Pods not running or ready
- Pod restart count increasing
- Pods stuck in error or pending states
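If you run Prometheus with kube-state-metrics, alerting rules along these lines can cover the recommendations above. This is a sketch, not Opal's shipped configuration: the `opal` namespace, severities, and thresholds are assumptions to adjust for your environment.

```yaml
# Sketch of Prometheus alerting rules for the opal-web deployment.
# Assumes kube-state-metrics is installed; namespace and thresholds
# are placeholders.
groups:
  - name: opal-web
    rules:
      - alert: OpalWebPodsNotReady
        expr: |
          kube_deployment_status_replicas_ready{deployment="opal-web", namespace="opal"}
            < kube_deployment_spec_replicas{deployment="opal-web", namespace="opal"}
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "opal-web has fewer ready pods than desired"
      - alert: OpalWebPodRestarting
        expr: |
          increase(kube_pod_container_status_restarts_total{pod=~"opal-web-.*", namespace="opal"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "opal-web pod restarted more than 3 times in the past hour"
```

The same two rules can be cloned for the event consumer and task worker deployments by changing the `deployment` and `pod` selectors.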
2. Event Consumers opal-web-event-consumers
Purpose: Processes events from external systems and internal event streams.
Health Monitoring:
- Ensure pods are running and ready
Recommended Alerts:
- Pods not running or ready
- Pod restart count increasing
3. Task Workers opal-web-task-workers
Purpose: Processes background tasks from specialized queues. Different task worker types handle different operations such as general async work, event streaming, sync operations, and propagation tasks.
Health Monitoring:
- Ensure all task worker pods are running and ready
- Monitor restart counts across all task worker deployments
Recommended Alerts:
- Any task worker pods not running or ready
- Pod restart count increasing
- Task worker pods stuck in error or pending states
Note: Some task workers have longer termination grace periods to allow long-running tasks to complete.
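The longer grace period mentioned in the note corresponds to `terminationGracePeriodSeconds` in the pod spec, which controls how long Kubernetes waits between SIGTERM and SIGKILL during pod shutdown. A minimal illustration (the 600-second value and container details are examples, not Opal's actual settings):

```yaml
# Illustrative pod spec fragment — the 600s value is an example,
# not Opal's actual configuration.
spec:
  terminationGracePeriodSeconds: 600  # allow in-flight tasks to finish before SIGKILL
  containers:
    - name: task-worker
      image: opal-web_backend  # per this guide, all components share one image
```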
Scheduled CronJobs
These are automated jobs that run periodically on a fixed schedule to sync data and clean up resources. Generally, we recommend monitoring that each job completes successfully within its expected timeframe (typically 2x its schedule interval) and setting up alerts for job failures. Refer below for specific CronJobs.
1. Sync Jobs
Purpose: Performs sync of user and resource data from connected systems (IDPs, cloud providers, HR systems).
Schedules:
- Regular sync: Every 4 hours (0 */4 * * *)
- Daily sync: Daily at 9:30 AM UTC (30 9 * * *)
- High-frequency sync: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120-180 seconds
Health Monitoring:
- Ensure regular sync completes successfully within 8 hours (2x schedule interval)
- Ensure daily sync completes successfully within 48 hours (2x schedule interval)
- Ensure high-frequency sync completes successfully within 20 minutes (4x schedule interval)
Recommended Alerts:
- No successful regular sync completion in the past 8 hours
- No successful daily sync completion in the past 48 hours
- No successful high-frequency sync completion in the past 20 minutes
- Sync job failure (exit code != 0)
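With kube-state-metrics, staleness and failure alerts for these jobs can be sketched as below. The CronJob names (`regular-sync`, the `.*sync.*` pattern) and the `opal` namespace are assumptions — substitute the actual resource names from your cluster:

```yaml
# Sketch: alert when a sync CronJob has not succeeded recently, or any
# sync Job has failed. Resource names and namespace are placeholders.
groups:
  - name: opal-sync-jobs
    rules:
      - alert: OpalRegularSyncStale
        expr: |
          time() - kube_cronjob_status_last_successful_time{cronjob="regular-sync", namespace="opal"} > 8 * 3600
        labels:
          severity: critical
        annotations:
          summary: "No successful regular sync in the past 8 hours"
      - alert: OpalSyncJobFailed
        expr: |
          kube_job_status_failed{job_name=~".*sync.*", namespace="opal"} > 0
        labels:
          severity: warning
        annotations:
          summary: "A sync job has failed"
```

The same `kube_cronjob_status_last_successful_time` pattern applies to the event streaming, recommendations, metrics, and cleanup CronJobs below, with the thresholds each section lists.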
2. Event Streaming Jobs
Purpose: Manages event stream operations including publishing events to external systems, requeuing failed messages, deactivating unused streams, and cleaning up old messages.
Schedules:
- Event stream producer: Every 1 minute (*/1 * * * *)
- Event stream re-queuer: Every 15 minutes (*/15 * * * *)
- Event stream deactivator: Every 30 minutes (*/30 * * * *)
- Event stream messages cleanup: Daily at 12:00 PM UTC (0 12 * * *)
- Event stream notifier: Daily at 1:00 PM UTC (0 13 * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure event stream producer completes successfully within 2 minutes (2x schedule interval)
- Ensure event stream re-queuer completes successfully within 30 minutes (2x schedule interval)
- Ensure event stream deactivator completes successfully within 60 minutes (2x schedule interval)
- Ensure event stream cleanup and notifier complete successfully within 48 hours (2x schedule interval)
Recommended Alerts:
- No successful event stream producer completion in the past 2 minutes
- No successful event stream re-queuer completion in the past 30 minutes
- No successful event stream deactivator completion in the past 60 minutes
- No successful event stream cleanup/notifier completion in the past 48 hours
- Event streaming job failure (exit code != 0)
3. Recommendations recommendations-subscores
Purpose: Calculates and updates recommendation subscores for resources and groups to support access recommendations and risk analysis.
Schedule: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 10 minutes (2x schedule interval)
Recommended Alerts:
- No successful completion in the past 10 minutes
- Job failure (exit code != 0)
4. Metrics Collection metrics-collector
Purpose: Collects and aggregates metrics for reporting and analytics.
Schedule: Daily at 6:30 AM UTC (30 6 * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 48 hours (2x schedule interval)
Recommended Alerts:
- No successful completion in the past 48 hours
- Job failure (exit code != 0)
5. Scheduled Tasks Cleanup scheduled-tasks-cleanup
Purpose: Cleans up old completed scheduled tasks from the database.
Schedule: Every 5 minutes (*/5 * * * *)
Starting Deadline: 120 seconds
Health Monitoring:
- Ensure job completes successfully within 48 hours
Recommended Alerts:
- No successful completion in the past 48 hours
- Job failure (exit code != 0)
One-Time Jobs
These jobs execute once to perform critical setup tasks. You should monitor that these jobs complete without errors during the upgrade window.
Oneoff oneoff
Purpose: Runs one-time database migrations and setup tasks during Helm upgrades.
Trigger: Runs automatically as a post-upgrade Helm hook
Health Monitoring:
- Monitor job completion status
- Check for job failures during upgrades
Recommended Alerts:
- Job failure during upgrades (exit code != 0)
- Job running longer than expected
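Since the guide describes `oneoff` as a post-upgrade Helm hook, a typical hook Job carries annotations like the following. This is an illustrative shape only — the hook policies and container details are assumptions, and the chart Opal ships may differ:

```yaml
# Illustrative post-upgrade hook Job — annotation values and container
# details are assumptions, not Opal's actual chart.
apiVersion: batch/v1
kind: Job
metadata:
  name: oneoff
  annotations:
    "helm.sh/hook": post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 0          # fail fast so a broken migration surfaces immediately
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oneoff
          image: opal-web_backend
```

Because hook Jobs run during `helm upgrade`, a failure here typically surfaces as a failed upgrade, so alerting on failed Jobs in the namespace during upgrade windows is usually sufficient.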
General Health Monitoring Recommendations
For All Deployments
- Pod Status: Monitor for pods in CrashLoopBackOff, Error, or Pending states
- Resource Usage: CPU usage > 80% of request, Memory usage > 80% of limit
- Restart Count: Alert if a pod restarts more than 3 times in an hour
- Availability: Ensure pods are running as expected
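The resource-usage thresholds above can be expressed as Prometheus rules joining cAdvisor usage metrics with kube-state-metrics requests/limits. This is a sketch assuming both metric sources are available; the namespace is a placeholder:

```yaml
# Sketch: resource-usage alerts for the 80% thresholds above.
# Assumes cAdvisor (kubelet) and kube-state-metrics metrics; the
# namespace label is a placeholder.
groups:
  - name: opal-resource-usage
    rules:
      - alert: OpalContainerMemoryHigh
        expr: |
          container_memory_working_set_bytes{namespace="opal", container!=""}
            / on (namespace, pod, container)
              kube_pod_container_resource_limits{resource="memory", namespace="opal"} > 0.8
        for: 10m
        annotations:
          summary: "Container memory above 80% of its limit"
      - alert: OpalContainerCPUHigh
        expr: |
          rate(container_cpu_usage_seconds_total{namespace="opal", container!=""}[5m])
            / on (namespace, pod, container)
              kube_pod_container_resource_requests{resource="cpu", namespace="opal"} > 0.8
        for: 10m
        annotations:
          summary: "Container CPU above 80% of its request"
```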
For All CronJobs
- Execution Status: Ensure jobs have completed successfully within 2x-4x their schedule interval
- Failure Monitoring: Alert on job failures (exit code != 0)
- Concurrency: All CronJobs use the Forbid concurrency policy - ensure previous jobs complete before new ones start
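The concurrency policy and starting deadlines described in this guide map onto standard CronJob spec fields. A sketch using the scheduled-tasks-cleanup job as an example (the container details are illustrative, not Opal's actual manifest):

```yaml
# Sketch of a CronJob with the Forbid policy and a 120s starting
# deadline, per this guide. Container details are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduled-tasks-cleanup
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid      # skip a run if the previous one is still going
  startingDeadlineSeconds: 120   # count a run as missed if it can't start within 120s
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: opal-web_backend
```

With `Forbid`, a long-overrunning job silently skips subsequent runs - which is exactly why the "no successful completion within 2x the interval" alerts above matter.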
Database Migrations
Monitor for database migration failures or delays during pod startup and upgrades, as these may indicate database performance issues.
Common Issues to Watch For
- Database Connection Issues: All components depend on PostgreSQL. Monitor database connectivity.
- Redis Connection Issues: Task workers and event consumers depend on Redis. Monitor Redis connectivity.
- Resource Constraints: High memory/CPU usage may cause pods to be evicted or OOMKilled.
- Network Issues: Pods need network access to external systems (IDPs, cloud providers) for syncing.
Monitoring Best Practices
- Set up alerts for all critical components (web backend, sync jobs)
- Monitor logs for error patterns and exceptions
- Track metrics for processing times and error rates
- Set up dashboards for:
- Pod health and resource usage
- CronJob execution status and duration
- Application metrics (request rates, error rates, latency)
- Use Kubernetes events to monitor for scheduling issues, pod evictions, etc.
Additional Notes
- Every pod, including all CronJobs, includes database migrations as init containers
- All components use the same Docker image (opal-web_backend) with different command arguments
- Health endpoints (/api/health and /api/readiness) are configured on all deployments via liveness and readiness probes
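A probe configuration using those two endpoints would look roughly like this. The port, timings, and thresholds are assumptions - check your chart values for the actual settings:

```yaml
# Illustrative probes hitting the documented health endpoints.
# Port and timing values are placeholders.
livenessProbe:
  httpGet:
    path: /api/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```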
