Architecture Overview

OpenAICE follows a layered architecture with strict separation between data collection (adapters), state management (bus), decision-making (policy engine), and safety (guardrails).

System Architecture

graph TD
    subgraph Adapters["🔌 Adapters (Edge)"]
        PROM[Prometheus]
        K8S[Kubernetes API]
        SLURM[Slurm CLI/REST]
        GPU[dcgm-exporter]
        REPLAY[Replay Files]
    end

    subgraph Core["⚙️ Core Engine"]
        NORM[Normalizer]
        BUS[State Bus]
        CLASS[Workload Classifier]
        POLICY[Policy Engine]
        REC[Recommender]
        GUARD[Guardrails]
        AUDIT[Audit Log]
    end

    subgraph Output["📤 Output"]
        CLI[CLI / Rich Tables]
        API[FastAPI REST]
        LOG[Audit JSONL]
    end

    PROM --> NORM
    K8S --> NORM
    SLURM --> NORM
    GPU --> NORM
    REPLAY --> NORM

    NORM -->|State Fragments| BUS
    BUS -->|Canonical Entities| CLASS
    CLASS --> POLICY
    POLICY -->|Recommendations| REC
    REC --> GUARD
    GUARD --> AUDIT
    GUARD --> CLI
    GUARD --> API
    GUARD --> LOG

Pipeline Flow

The control plane operates as a sequential pipeline on each evaluation cycle:

sequenceDiagram
    participant A as Adapters
    participant N as Normalizer
    participant S as State Bus
    participant C as Classifier
    participant P as Policy Engine
    participant G as Guardrails
    participant O as Output

    A->>N: Raw records
    N->>S: State fragments (validated)
    S->>S: Merge by entity_id
    S->>C: Canonical entities
    C->>C: Classify workload type
    C->>P: Classified entities
    P->>P: Match rules × entities
    P->>G: Candidate recommendations
    G->>G: Confidence, freshness, cooldown, blast-radius
    G->>O: Approved recommendations + explanations
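
The sequence above can be read as plain function composition: each stage consumes the previous stage's output. The sketch below illustrates that shape only. Every function name, field name, and threshold in it is a hypothetical stand-in, not the actual OpenAICE API.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the stages in the diagram above.
Stage = Callable[[list], list]


def run_cycle(raw_records: Iterable[dict], stages: List[Stage]) -> list:
    """Run one evaluation cycle by feeding each stage's output to the next."""
    data = list(raw_records)
    for stage in stages:
        data = stage(data)
    return data


def normalize(records):
    # Stand-in for validation: drop records that carry no entity_id.
    return [r for r in records if "entity_id" in r]


def merge(fragments):
    # Stand-in for the state bus: fold fragments together by entity_id.
    merged = {}
    for fragment in fragments:
        merged.setdefault(fragment["entity_id"], {}).update(fragment)
    return list(merged.values())


def classify(entities):
    # Toy classifier: anything serving requests counts as online inference.
    return [
        {**e, "workload": "online_inference" if e.get("qps", 0) > 0 else "hpc_research"}
        for e in entities
    ]


def evaluate(entities):
    # Toy policy: flag entities whose GPU utilization is below 20%.
    return [
        {"entity_id": e["entity_id"], "action": "review",
         "confidence": e.get("confidence", 0.0)}
        for e in entities
        if e.get("gpu_util", 1.0) < 0.2
    ]


def guard(candidates):
    # Toy guardrail: keep only candidates with confidence of at least 0.8.
    return [c for c in candidates if c["confidence"] >= 0.8]


raw = [
    {"entity_id": "a", "gpu_util": 0.1},
    {"entity_id": "a", "confidence": 0.9},
    {"gpu_util": 0.05},  # no entity_id, dropped by normalize
    {"entity_id": "b", "gpu_util": 0.1, "confidence": 0.5},
]
approved = run_cycle(raw, [normalize, merge, classify, evaluate, guard])
```

Because the stages share no state beyond their inputs and outputs, each one can be tested in isolation with hand-written records.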

Key Design Principles

1. Integration Logic Stays at the Edge

Adapters handle all tool-specific translation. The core engine never sees raw Prometheus metrics or Kubernetes API objects — only canonical entities.
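
As an illustration of that boundary, the sketch below translates one sample from a Prometheus instant-query response into a tool-agnostic fragment. The `StateFragment` fields, the `from_prometheus_sample` helper, the `gpu/` ID prefix, and the choice of the `UUID` label are all assumptions made for the example; the real types in `core/normalizer.py` may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateFragment:
    # Hypothetical fields; the real StateFragment may differ.
    entity_id: str
    source: str
    signal: str
    value: float
    observed_at: float  # Unix timestamp in seconds


def from_prometheus_sample(sample: dict) -> StateFragment:
    """Translate one instant-query sample into a canonical fragment."""
    labels = sample["metric"]
    # Prometheus returns [timestamp, "value-as-string"] for each sample.
    timestamp, raw_value = sample["value"]
    return StateFragment(
        entity_id=f"gpu/{labels['UUID']}",  # assumed ID scheme
        source="prometheus",
        signal="gpu_utilization",
        value=float(raw_value) / 100.0,  # percent -> fraction
        observed_at=float(timestamp),
    )


sample = {
    "metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "UUID": "GPU-1234"},
    "value": [1700000000.0, "85"],
}
fragment = from_prometheus_sample(sample)
```

Nothing downstream of the adapter needs to know that the value arrived as a string, in percent, or under a DCGM metric name.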

2. Policy is Data, Not Code

Decision rules live in policies/rules.yaml, not in Python source. This means:

  • Non-engineers can review and modify policies
  • Rules can be version-controlled independently
  • Different environments can use different rule sets
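
To make the idea concrete, here is a minimal sketch of a rule evaluator that treats rules purely as data. The rule shape (`id`, `applies_to`, `when`, `action`) is a hypothetical schema invented for this example, written as the Python structure a YAML loader would produce; the actual schema of policies/rules.yaml is not shown in this document.

```python
import operator

# Hypothetical rule schema; the real policies/rules.yaml may differ.
RULES = [
    {
        "id": "idle-gpu-reclaim",
        "applies_to": "online_inference",
        "when": [{"signal": "gpu_utilization", "op": "lt", "value": 0.10}],
        "action": "scale_down",
    },
]

# Comparison names a rule may use, mapped to their implementations.
OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
    "eq": operator.eq,
}


def matches(rule: dict, entity: dict) -> bool:
    """Return True when the entity's workload and signals satisfy the rule."""
    if rule["applies_to"] != entity["workload"]:
        return False
    signals = entity["signals"]
    return all(
        cond["signal"] in signals
        and OPS[cond["op"]](signals[cond["signal"]], cond["value"])
        for cond in rule["when"]
    )


def evaluate(rules: list, entities: list) -> list:
    """Cross every rule with every entity and collect the matches."""
    return [
        {"rule_id": r["id"], "entity_id": e["entity_id"], "action": r["action"]}
        for r in rules
        for e in entities
        if matches(r, e)
    ]


entities = [
    {"entity_id": "svc-a", "workload": "online_inference",
     "signals": {"gpu_utilization": 0.04}},
    {"entity_id": "svc-b", "workload": "online_inference",
     "signals": {"gpu_utilization": 0.70}},
    {"entity_id": "job-c", "workload": "hpc_research",
     "signals": {"gpu_utilization": 0.02}},
]
candidates = evaluate(RULES, entities)
```

Changing a threshold or adding a rule then touches only the data, never the evaluator.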

3. Safety is Non-Negotiable

Every recommendation must pass all four guardrails before it is surfaced:

| Guardrail | What it checks |
|---|---|
| Confidence threshold | Entity confidence ≥ minimum for the action |
| Data freshness | Telemetry data is recent enough to act on |
| Cooldown window | Enough time has passed since last action on this entity |
| Blast-radius limit | Total entities affected per cycle doesn't exceed the cap |
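
The checks above might be chained roughly as follows. This is a sketch under stated assumptions: the `Candidate` fields, the threshold values, and the first-come ordering of the blast-radius cap are all hypothetical, and `core/guardrails.py` may implement each check differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical fields chosen for this example.
    entity_id: str
    action: str
    confidence: float
    data_age_s: float           # age of the newest telemetry, in seconds
    since_last_action_s: float  # time since the last action on this entity


# Hypothetical thresholds; real values would come from configuration.
MIN_CONFIDENCE = {"scale_down": 0.8, "notify": 0.5}
MAX_DATA_AGE_S = 300
COOLDOWN_S = 900
MAX_PER_CYCLE = 2


def apply_guardrails(candidates):
    """Split candidates into approved ones and (candidate, reason) rejections."""
    approved, rejected = [], []
    for c in candidates:
        # Unknown actions default to a minimum of 1.0, so they are
        # rejected unless confidence is exactly 1.0.
        if c.confidence < MIN_CONFIDENCE.get(c.action, 1.0):
            rejected.append((c, "confidence below threshold"))
        elif c.data_age_s > MAX_DATA_AGE_S:
            rejected.append((c, "stale telemetry"))
        elif c.since_last_action_s < COOLDOWN_S:
            rejected.append((c, "cooldown active"))
        elif len(approved) >= MAX_PER_CYCLE:
            rejected.append((c, "blast-radius cap reached"))
        else:
            approved.append(c)
    return approved, rejected


approved, rejected = apply_guardrails([
    Candidate("a", "scale_down", 0.9, 10, 1000),
    Candidate("b", "scale_down", 0.5, 10, 1000),
    Candidate("c", "notify", 0.6, 600, 1000),
    Candidate("d", "notify", 0.6, 10, 60),
    Candidate("e", "notify", 0.6, 10, 1000),
    Candidate("f", "notify", 0.9, 10, 1000),
])
```

Returning the rejection reason alongside each rejected candidate keeps the guardrail decisions explainable, in line with the next principle. In this sketch the blast-radius cap admits candidates in arrival order; a real implementation could rank them first.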

4. Explainability by Default

Every recommendation includes:

  • rule_id — which YAML rule fired
  • reason — human-readable explanation
  • signals_used — what telemetry signals triggered it
  • objectives_impacted — which business objectives are affected
  • confidence_score — how confident the system is
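
One way to carry these fields is a small immutable record that serializes to a single JSON line, which also suits a JSONL audit trail. The field names come from the list above; the field types, the `to_jsonl_line` helper, and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class Recommendation:
    # Field names follow the list above; the types are assumptions.
    rule_id: str
    reason: str
    signals_used: Tuple[str, ...]
    objectives_impacted: Tuple[str, ...]
    confidence_score: float


def to_jsonl_line(rec: Recommendation) -> str:
    """Serialize one recommendation as a single line of JSON (no newline)."""
    return json.dumps(asdict(rec), sort_keys=True)


rec = Recommendation(
    rule_id="idle-gpu-reclaim",  # hypothetical rule name
    reason="GPU utilization stayed below 10% for the observation window",
    signals_used=("gpu_utilization",),
    objectives_impacted=("cost",),
    confidence_score=0.92,
)
line = to_jsonl_line(rec)
```

Tuples become JSON arrays on serialization, so the output stays readable by any JSONL consumer.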

Module Responsibilities

| Module | File | Responsibility |
|---|---|---|
| Normalizer | core/normalizer.py | Converts raw adapter records into typed StateFragment objects |
| State Bus | core/state_bus.py | In-memory store that merges fragments by entity_id, tracks freshness |
| Classifier | core/workload_classifier.py | Assigns scenario family (e.g., online_inference, hpc_research) |
| Policy Engine | core/policy_engine.py | Evaluates YAML rules against classified entities |
| Recommender | core/recommender.py | Scores recommendations and generates structured explanations |
| Guardrails | core/guardrails.py | Enforces safety constraints before any recommendation is surfaced |
| Audit Log | core/audit_log.py | Writes immutable JSONL records for every recommendation lifecycle event |
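
As one example of these responsibilities, the State Bus behavior described in the table (merge by entity_id, track freshness) could be sketched as below. The `StateBus` class, its method names, and its storage layout are hypothetical and are not taken from `core/state_bus.py`.

```python
import time


class StateBus:
    """Hypothetical in-memory bus: merges signals per entity_id and
    remembers when each entity was last observed."""

    def __init__(self):
        self._entities = {}

    def publish(self, entity_id, signals, observed_at):
        """Merge a fragment's signals into the entity's canonical state."""
        entity = self._entities.setdefault(
            entity_id, {"signals": {}, "last_seen": observed_at}
        )
        # A signal published later overwrites an earlier signal of the
        # same name, regardless of the timestamps involved.
        entity["signals"].update(signals)
        entity["last_seen"] = max(entity["last_seen"], observed_at)

    def age_seconds(self, entity_id, now=None):
        """Seconds since the entity's newest observation."""
        now = time.time() if now is None else now
        return now - self._entities[entity_id]["last_seen"]

    def snapshot(self):
        """Return a copy of the merged signals for every entity."""
        return {eid: dict(e["signals"]) for eid, e in self._entities.items()}


bus = StateBus()
bus.publish("job-1", {"gpu_utilization": 0.4}, observed_at=100.0)
bus.publish("job-1", {"queue_wait_s": 12}, observed_at=130.0)
bus.publish("job-2", {"gpu_utilization": 0.9}, observed_at=120.0)
```

The `age_seconds` value is what a freshness guardrail would compare against its maximum allowed data age.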