OpenAICE

Auto Infrastructure Configuration Engine

An adapter-based, recommendation-first control plane that unifies observability, orchestration, and policy across Kubernetes, Slurm, and hybrid AI infrastructure environments.

Adapter-Based Architecture

Pluggable adapters for Prometheus, Kubernetes, Slurm, GPU (dcgm-exporter), and more. Tool-specific logic stays at the edge — the core engine is tool-agnostic.
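
As a rough sketch of this pattern (the class and method names below are illustrative, not the actual OpenAICE API), each adapter normalizes its tool's signals so the core never sees tool-specific details:

```python
from abc import ABC, abstractmethod

class Adapter(ABC):
    """Hypothetical adapter interface: tool-specific logic stays at the edge."""

    @abstractmethod
    def collect(self) -> list[dict]:
        """Return normalized entity snapshots for the tool-agnostic core."""

class PrometheusAdapter(Adapter):
    def __init__(self, url: str):
        self.url = url

    def collect(self) -> list[dict]:
        # A real adapter would query the Prometheus HTTP API here;
        # a canned sample keeps this sketch self-contained.
        return [{"kind": "Service", "name": "inference-api",
                 "metrics": {"latency_p95_ms": 240.0}}]

def gather(adapters: list[Adapter]) -> list[dict]:
    """The core engine consumes one normalized stream, whatever the sources."""
    return [entity for a in adapters for entity in a.collect()]
```

Swapping Prometheus for Slurm or dcgm-exporter then means adding another `Adapter` subclass, with no change to the core.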

Safety-First Recommendations

Observe → Recommend → Approve → Auto-Act control ladder. Every recommendation carries confidence scores, risk levels, and structured explanations.
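
A minimal sketch of how such a ladder might gate execution (the enum names mirror the ladder above; the gating function itself is our assumption, not OpenAICE code):

```python
from enum import IntEnum

class ControlLevel(IntEnum):
    OBSERVE = 0    # collect telemetry only
    RECOMMEND = 1  # surface recommendations, never act
    APPROVE = 2    # act only after explicit human approval
    AUTO_ACT = 3   # act autonomously within policy bounds

def may_execute(level: ControlLevel, human_approved: bool) -> bool:
    """Hypothetical gate: actions run only at AUTO_ACT,
    or at APPROVE when a human has signed off."""
    if level == ControlLevel.AUTO_ACT:
        return True
    if level == ControlLevel.APPROVE:
        return human_approved
    return False
```

The ordering matters: lower rungs are strictly safer, so an operator can start at OBSERVE and climb only as trust in the recommendations grows.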

Canonical State Model

12 entity types normalize diverse infrastructure signals into a unified graph. One policy engine reasons across Kubernetes inference and Slurm HPC training.

Explainable Decisions

Every recommendation includes rule_id, reason, signals_used, confidence_score, and objectives_impacted. No black-box automation.
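
The fields listed above could be modeled roughly like this (the dataclass shape is illustrative; the field names and sample values come from the replay output shown later on this page):

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    """Illustrative shape of an explainable recommendation."""
    rule_id: str
    entity: str
    action: str
    risk: str                  # e.g. "low" | "medium" | "high"
    confidence_score: float    # 0.0 to 1.0
    reason: str
    signals_used: list[str] = field(default_factory=list)
    objectives_impacted: list[str] = field(default_factory=list)

rec = Recommendation(
    rule_id="queue-pressure-scale",  # hypothetical rule id
    entity="inference-api",
    action="scale_replicas",
    risk="medium",
    confidence_score=0.91,
    reason="p95 latency exceeded target and queue depth rising",
    signals_used=["latency_p95_ms", "queue_depth", "available_replicas"],
    objectives_impacted=["latency", "reliability"],
)
```

Because every field is populated at decision time, an operator can audit exactly which signals drove an action and which objectives it was meant to serve.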

YAML-Driven Policy Engine

5 built-in decision rules covering queue pressure scaling, GPU batching optimization, node quarantine, idle scale-to-zero, and low-confidence failsafe.
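
As an illustration only (the keys, thresholds, and rule id below are invented, not OpenAICE's actual schema), a queue-pressure rule in such a YAML policy might look like:

```yaml
# Hypothetical rule shape; field names are illustrative.
rules:
  - id: queue-pressure-scale
    match:
      kind: Service
    when:
      latency_p95_ms: "> 200"
      queue_depth: rising
    action: scale_replicas
    risk: medium
    min_confidence: 0.8
    objectives: [latency, reliability]
```

Keeping rules in YAML means operators can review, diff, and version policy changes without touching engine code.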

Telemetry Replay

Deterministic golden tests via recorded telemetry scenarios. Develop and validate without live infrastructure.
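
A golden test over recorded telemetry might look like the sketch below (the `evaluate` function and the inline scenario stand in for the real engine and for a recorded scenario file; both are assumptions for illustration):

```python
def evaluate(snapshot: dict) -> list[dict]:
    """Stand-in policy engine: recommend scaling when p95 latency breaches target."""
    recs = []
    for entity in snapshot["entities"]:
        metrics = entity.get("metrics", {})
        if metrics.get("latency_p95_ms", 0) > snapshot["targets"]["latency_p95_ms"]:
            recs.append({"entity": entity["name"], "action": "scale_replicas"})
    return recs

# Recorded telemetry stands in for a scenario directory on disk.
recorded = {
    "targets": {"latency_p95_ms": 200},
    "entities": [{"name": "inference-api",
                  "metrics": {"latency_p95_ms": 240.0}}],
}

golden = [{"entity": "inference-api", "action": "scale_replicas"}]
assert evaluate(recorded) == golden  # same input, same output, every run
```

Since the input is frozen, any change in the engine's output is a deliberate policy change or a regression, caught without touching live clusters.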


Quick Example

# Run the K8s inference queue pressure replay scenario
python -m openaice.cli.cli replay \
  --scenario examples/telemetry-replay/k8s-inference-queue-pressure
═══ OpenAICE Replay Results ═══
Scenario: k8s-inference-queue-pressure
Entities loaded: 3
Recommendations: 1

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID            ┃ Entity        ┃ Action         ┃ Risk   ┃ Confidence ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ rec-f4c1be0f  │ inference-api │ scale_replicas │ medium │       0.91 │
└───────────────┴───────────────┴────────────────┴────────┴────────────┘

Explanations:
  rec-f4c1be0f: p95 latency exceeded target and queue depth rising
    Signals: latency_p95_ms, queue_depth, available_replicas
    Objectives: latency, reliability

Scenario Coverage

OpenAICE covers 8 scenario families across the AI infrastructure landscape:

| Scenario Family      | Key Entities               | Example Workloads                 |
|----------------------|----------------------------|-----------------------------------|
| K8s Online Inference | Service, Deployment, GPU   | Model serving, API endpoints      |
| Batch Inference      | Job, Queue                 | Offline scoring, batch prediction |
| Distributed Training | Job, GPU, Node             | Multi-node training runs          |
| HPC / Research       | Job, Node, Queue, GPU      | Slurm-scheduled GPU research      |
| LLM Serving          | Service, ModelRevision     | vLLM, Triton, TGI endpoints       |
| Managed Cloud        | Service, Deployment        | Cloud ML platform endpoints       |
| Hybrid               | All entity types           | K8s + Slurm unified management    |
| Governance / Ops     | TenantScope, SchedulerDomain | Multi-tenant policy enforcement |

License

OpenAICE is licensed under the Apache 2.0 License.