OpenAICE

Auto Infrastructure Configuration Engine

An adapter-based, recommendation-first control plane that unifies observability, orchestration, and policy across Kubernetes, Slurm, and hybrid AI infrastructure environments.

Adapter-Based Architecture

Pluggable adapters for Prometheus, Kubernetes, Slurm, GPU (dcgm-exporter), and more. Tool-specific logic stays at the edge — the core engine is tool-agnostic.
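
As a rough sketch of this pattern (the class and method names below are illustrative, not the actual OpenAICE API), each adapter normalizes its tool's signals so the core never sees tool-specific details:

```python
from abc import ABC, abstractmethod

class Adapter(ABC):
    """Hypothetical adapter interface: tool-specific logic stays at the edge."""

    @abstractmethod
    def collect(self) -> list[dict]:
        """Return normalized entity snapshots for the tool-agnostic core."""

class PrometheusAdapter(Adapter):
    def __init__(self, url: str):
        self.url = url

    def collect(self) -> list[dict]:
        # A real adapter would query the Prometheus HTTP API here;
        # a canned sample keeps this sketch self-contained.
        return [{"kind": "Service", "name": "inference-api",
                 "metrics": {"latency_p95_ms": 240.0}}]

def gather(adapters: list[Adapter]) -> list[dict]:
    """The core engine consumes one normalized stream, whatever the sources."""
    return [entity for a in adapters for entity in a.collect()]
```

Swapping Prometheus for Slurm or dcgm-exporter then means adding another `Adapter` subclass, with no change to the core.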

Safety-First Recommendations

Observe → Recommend → Approve → Auto-Act control ladder. Every recommendation carries confidence scores, risk levels, and structured explanations.
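
A minimal sketch of how such a ladder might gate execution (the enum names mirror the ladder above; the gating function itself is our assumption, not OpenAICE code):

```python
from enum import IntEnum

class ControlLevel(IntEnum):
    OBSERVE = 0    # collect telemetry only
    RECOMMEND = 1  # surface recommendations, never act
    APPROVE = 2    # act only after explicit human approval
    AUTO_ACT = 3   # act autonomously within policy bounds

def may_execute(level: ControlLevel, human_approved: bool) -> bool:
    """Hypothetical gate: actions run only at AUTO_ACT,
    or at APPROVE when a human has signed off."""
    if level == ControlLevel.AUTO_ACT:
        return True
    if level == ControlLevel.APPROVE:
        return human_approved
    return False
```

The ordering matters: lower rungs are strictly safer, so an operator can start at OBSERVE and climb only as trust in the recommendations grows.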

Canonical State Model

12 entity types normalize diverse infrastructure signals into a unified graph. One policy engine reasons across Kubernetes inference and Slurm HPC training.

Explainable Decisions

Every recommendation includes rule_id, reason, signals_used, confidence_score, and objectives_impacted. No black-box automation.
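
The fields listed above could be modeled roughly like this (the dataclass shape is illustrative; the field names and sample values come from the replay output shown later on this page):

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    """Illustrative shape of an explainable recommendation."""
    rule_id: str
    entity: str
    action: str
    risk: str                  # e.g. "low" | "medium" | "high"
    confidence_score: float    # 0.0 to 1.0
    reason: str
    signals_used: list[str] = field(default_factory=list)
    objectives_impacted: list[str] = field(default_factory=list)

rec = Recommendation(
    rule_id="queue-pressure-scale",  # hypothetical rule id
    entity="inference-api",
    action="scale_replicas",
    risk="medium",
    confidence_score=0.91,
    reason="p95 latency exceeded target and queue depth rising",
    signals_used=["latency_p95_ms", "queue_depth", "available_replicas"],
    objectives_impacted=["latency", "reliability"],
)
```

Because every field is populated at decision time, an operator can audit exactly which signals drove an action and which objectives it was meant to serve.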

YAML-Driven Policy Engine

5 built-in decision rules covering queue pressure scaling, GPU batching optimization, node quarantine, idle scale-to-zero, and low-confidence failsafe.
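
As an illustration only (the keys, thresholds, and rule id below are invented, not OpenAICE's actual schema), a queue-pressure rule in such a YAML policy might look like:

```yaml
# Hypothetical rule shape; field names are illustrative.
rules:
  - id: queue-pressure-scale
    match:
      kind: Service
    when:
      latency_p95_ms: "> 200"
      queue_depth: rising
    action: scale_replicas
    risk: medium
    min_confidence: 0.8
    objectives: [latency, reliability]
```

Keeping rules in YAML means operators can review, diff, and version policy changes without touching engine code.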

Telemetry Replay

Deterministic golden tests via recorded telemetry scenarios. Develop and validate without live infrastructure.
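
A golden test over recorded telemetry might look like the sketch below (the `evaluate` function and the inline scenario stand in for the real engine and for a recorded scenario file; both are assumptions for illustration):

```python
def evaluate(snapshot: dict) -> list[dict]:
    """Stand-in policy engine: recommend scaling when p95 latency breaches target."""
    recs = []
    for entity in snapshot["entities"]:
        metrics = entity.get("metrics", {})
        if metrics.get("latency_p95_ms", 0) > snapshot["targets"]["latency_p95_ms"]:
            recs.append({"entity": entity["name"], "action": "scale_replicas"})
    return recs

# Recorded telemetry stands in for a scenario directory on disk.
recorded = {
    "targets": {"latency_p95_ms": 200},
    "entities": [{"name": "inference-api",
                  "metrics": {"latency_p95_ms": 240.0}}],
}

golden = [{"entity": "inference-api", "action": "scale_replicas"}]
assert evaluate(recorded) == golden  # same input, same output, every run
```

Since the input is frozen, any change in the engine's output is a deliberate policy change or a regression, caught without touching live clusters.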


Quick Example

# Run the K8s inference queue pressure replay scenario
python -m openaice.cli.cli replay \
  --scenario examples/telemetry-replay/k8s-inference-queue-pressure
═══ OpenAICE Replay Results ═══
Scenario: k8s-inference-queue-pressure
Entities loaded: 3
Recommendations: 1

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID            ┃ Entity        ┃ Action         ┃ Risk   ┃ Confidence ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ rec-f4c1be0f  │ inference-api │ scale_replicas │ medium │       0.91 │
└───────────────┴───────────────┴────────────────┴────────┴────────────┘

Explanations:
  rec-f4c1be0f: p95 latency exceeded target and queue depth rising
    Signals: latency_p95_ms, queue_depth, available_replicas
    Objectives: latency, reliability

Scenario Coverage

OpenAICE covers 8 scenario families across the AI infrastructure landscape:

| Scenario Family      | Key Entities               | Example Workloads                 |
|----------------------|----------------------------|-----------------------------------|
| K8s Online Inference | Service, Deployment, GPU   | Model serving, API endpoints      |
| Batch Inference      | Job, Queue                 | Offline scoring, batch prediction |
| Distributed Training | Job, GPU, Node             | Multi-node training runs          |
| HPC / Research       | Job, Node, Queue, GPU      | Slurm-scheduled GPU research      |
| LLM Serving          | Service, ModelRevision     | vLLM, Triton, TGI endpoints       |
| Managed Cloud        | Service, Deployment        | Cloud ML platform endpoints       |
| Hybrid               | All entity types           | K8s + Slurm unified management    |
| Governance / Ops     | TenantScope, SchedulerDomain | Multi-tenant policy enforcement |

License

OpenAICE is licensed under the Apache 2.0 License.