Canonical State Model
The canonical state model is the common representation that bridges diverse infrastructure systems and the policy engine. All adapters normalize their data into the entity types described here, so the policy engine never sees raw API payloads.
Entity Type Hierarchy
```mermaid
classDiagram
    class CanonicalEntity {
        +str entity_id
        +EntityType entity_type
        +str source_type
        +SchedulerDomainType scheduler_domain
        +WorkloadType workload_type
        +HealthState health_state
        +float confidence_score
        +datetime observed_at
    }
    class ServiceEntity {
        +float latency_p95_ms
        +float latency_p99_ms
        +float throughput
        +int queue_depth
        +float error_rate
    }
    class DeploymentEntity {
        +int replica_count
        +int available_replicas
        +int desired_replicas
        +float cpu_utilization
        +float memory_utilization
    }
    class GPUEntity {
        +float gpu_utilization
        +float gpu_memory_used
        +float temperature_celsius
        +int ecc_errors
    }
    class NodeEntity {
        +float cpu_utilization
        +float memory_utilization
        +int gpu_count
    }
    class JobEntity {
        +str job_state
        +int assigned_gpu_count
        +float gpu_utilization
    }
    CanonicalEntity <|-- ServiceEntity
    CanonicalEntity <|-- DeploymentEntity
    CanonicalEntity <|-- GPUEntity
    CanonicalEntity <|-- NodeEntity
    CanonicalEntity <|-- JobEntity
```
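The hierarchy in the diagram could be expressed in Python roughly as follows. This is a minimal sketch, not the project's actual definitions: the use of dataclasses, the default values, and the plain `str` stand-ins for `SchedulerDomainType` and `WorkloadType` are all assumptions made for illustration, and only two of the entity types are spelled out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EntityType(str, Enum):
    SERVICE = "service"
    DEPLOYMENT = "deployment"
    # Remaining entity types omitted for brevity.


class HealthState(str, Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    UNKNOWN = "unknown"


@dataclass
class CanonicalEntity:
    """Fields shared by every entity type (names taken from the diagram)."""
    entity_id: str
    entity_type: EntityType
    source_type: str
    # Assumption: plain strings stand in for SchedulerDomainType / WorkloadType.
    scheduler_domain: str = "standalone"
    workload_type: Optional[str] = None
    # Assumption: entities start as UNKNOWN with zero confidence.
    health_state: HealthState = HealthState.UNKNOWN
    confidence_score: float = 0.0
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class ServiceEntity(CanonicalEntity):
    """Service-specific metrics; None means no adapter has reported the field."""
    latency_p95_ms: Optional[float] = None
    latency_p99_ms: Optional[float] = None
    throughput: Optional[float] = None
    queue_depth: Optional[int] = None
    error_rate: Optional[float] = None


svc = ServiceEntity(
    entity_id="inference-api",
    entity_type=EntityType.SERVICE,
    source_type="prometheus",
    latency_p95_ms=142.0,
)
```

Making the type-specific fields optional reflects the fact that different adapters report different subsets of them; whether the real model does the same is not stated in this page.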
All 12 Entity Types
| Entity Type | Key Fields | Primary Source |
|---|---|---|
| `service` | latency, throughput, queue depth, error rate | Prometheus, Generic Serving |
| `deployment` | replicas (current/available/desired), resource utilization | Kubernetes |
| `gpu` | utilization, memory, temperature, ECC errors | dcgm-exporter |
| `node` | CPU, memory, GPU count, health state | Kubernetes, Slurm |
| `job` | job state, assigned GPUs, utilization | Slurm, Kubernetes |
| `queue` | pending/active jobs, queue depth | Slurm |
| `model_revision` | version, serving state, rollout phase | Kubernetes, Custom |
| `scheduler_domain` | domain type (k8s/slurm), capacity | Runtime adapters |
| `tenant_scope` | namespace/partition, quota, usage | Kubernetes, Slurm |
| `experiment_tracker` | run state, metrics, artifacts | MLflow, W&B (future) |
| `data_pipeline_stage` | stage state, throughput, lag | Custom (future) |
| `config_snapshot` | config hash, drift detection | Custom (future) |
Scheduler Domains
| Value | Infrastructure |
|---|---|
| `kubernetes` | K8s clusters |
| `slurm` | Slurm HPC clusters |
| `cloud_managed` | Cloud ML platforms |
| `standalone` | Bare-metal or VM deployments |
Health States
| State | Meaning | Policy Effect |
|---|---|---|
| `healthy` | Operating normally | No action needed |
| `warning` | Early degradation signals | Monitor more closely |
| `degraded` | Performance impacted | Recommend corrective action |
| `critical` | Immediate attention required | Urgent recommendation (high priority) |
| `unknown` | Insufficient data | Lower confidence score |
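One way to picture how a policy layer might consume these states is a simple lookup from state to effect. The sketch below is illustrative only: the `HealthState` enum, the effect identifiers, and the `policy_effect` helper are hypothetical names, and treating an unrecognized state string as `unknown` is an assumption rather than documented behavior.

```python
from enum import Enum


class HealthState(str, Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    UNKNOWN = "unknown"


# Hypothetical effect identifiers mirroring the "Policy Effect" column.
POLICY_EFFECT = {
    HealthState.HEALTHY: "no_action",
    HealthState.WARNING: "monitor_closely",
    HealthState.DEGRADED: "recommend_corrective_action",
    HealthState.CRITICAL: "urgent_recommendation",
    HealthState.UNKNOWN: "lower_confidence",
}


def policy_effect(state: str) -> str:
    """Map a health-state string to its policy effect.

    Assumption: strings that are not valid health states are treated
    as UNKNOWN instead of raising an error.
    """
    try:
        parsed = HealthState(state)
    except ValueError:
        parsed = HealthState.UNKNOWN
    return POLICY_EFFECT[parsed]
```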
State Fragments
Adapters don't create full entities directly. Instead, they emit state fragments, which the State Bus merges into entities:
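```python
from datetime import datetime, timezone

# StateFragment and EntityType are provided by the project's state model.
StateFragment(
    source_type="prometheus",
    entity_type=EntityType.SERVICE,
    entity_id="inference-api",
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    observed_at=datetime.now(timezone.utc),
    fields={"latency_p95_ms": 142.0, "queue_depth": 46},
    labels={"namespace": "prod", "cluster": "us-east-1"},
)
```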
The State Bus:
- Creates a new entity from the first fragment for an `entity_id`
- Merges subsequent fragments (newer data wins)
- Rejects stale fragments (older than the current entity)
- Tracks freshness per entity
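
The four behaviors listed above can be sketched as a small in-memory class. This is a simplified illustration under stated assumptions, not the actual State Bus: the `StateFragment` and `Entity` dataclasses here are stand-ins, `apply` and `age` are hypothetical method names, entities are keyed by `entity_id` alone, staleness is judged against a single per-entity timestamp, and a fragment with a timestamp equal to the entity's is accepted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


@dataclass
class StateFragment:
    """Stand-in for the project's fragment type (same field names)."""
    source_type: str
    entity_type: str
    entity_id: str
    observed_at: datetime
    fields: Dict[str, Any] = field(default_factory=dict)
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Entity:
    """Merged view of all fragments accepted for one entity_id."""
    entity_id: str
    entity_type: str
    observed_at: datetime
    fields: Dict[str, Any]
    labels: Dict[str, str]


class StateBus:
    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}

    def apply(self, frag: StateFragment) -> bool:
        """Merge a fragment; return False if it was rejected as stale."""
        current = self._entities.get(frag.entity_id)
        if current is None:
            # First fragment for this entity_id creates the entity.
            self._entities[frag.entity_id] = Entity(
                frag.entity_id, frag.entity_type, frag.observed_at,
                dict(frag.fields), dict(frag.labels),
            )
            return True
        if frag.observed_at < current.observed_at:
            # Stale: older than the current entity, so it is dropped.
            return False
        # Newer data wins: overwrite overlapping keys, keep the rest.
        current.fields.update(frag.fields)
        current.labels.update(frag.labels)
        current.observed_at = frag.observed_at
        return True

    def get(self, entity_id: str) -> Optional[Entity]:
        return self._entities.get(entity_id)

    def age(self, entity_id: str, now: datetime) -> Optional[timedelta]:
        """Freshness: time elapsed since the entity was last updated."""
        entity = self._entities.get(entity_id)
        return None if entity is None else now - entity.observed_at
```

Note that with a single timestamp per entity, a slow source can have its fragments rejected whenever a faster source has already advanced the entity. Tracking timestamps per field or per source would avoid that, at the cost of more bookkeeping; which approach the real State Bus takes is not stated here.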