Canonical State Model

The canonical state model is the shared schema that sits between the infrastructure systems being observed and the policy engine. Every adapter normalizes its data into the entity types described here, so the policy engine never sees raw API payloads.

Entity Type Hierarchy

classDiagram
    class CanonicalEntity {
        +str entity_id
        +EntityType entity_type
        +str source_type
        +SchedulerDomainType scheduler_domain
        +WorkloadType workload_type
        +HealthState health_state
        +float confidence_score
        +datetime observed_at
    }

    class ServiceEntity {
        +float latency_p95_ms
        +float latency_p99_ms
        +float throughput
        +int queue_depth
        +float error_rate
    }

    class DeploymentEntity {
        +int replica_count
        +int available_replicas
        +int desired_replicas
        +float cpu_utilization
        +float memory_utilization
    }

    class GPUEntity {
        +float gpu_utilization
        +float gpu_memory_used
        +float temperature_celsius
        +int ecc_errors
    }

    class NodeEntity {
        +float cpu_utilization
        +float memory_utilization
        +int gpu_count
    }

    class JobEntity {
        +str job_state
        +int assigned_gpu_count
        +float gpu_utilization
    }

    CanonicalEntity <|-- ServiceEntity
    CanonicalEntity <|-- DeploymentEntity
    CanonicalEntity <|-- GPUEntity
    CanonicalEntity <|-- NodeEntity
    CanonicalEntity <|-- JobEntity
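
As a rough illustration, the hierarchy in the diagram could be written with Python dataclasses. This is a sketch rather than the project's actual code: the enum-typed fields are simplified to plain strings, the `None` defaults are an assumption, the `workload_type` value is hypothetical, and only two of the subclasses are shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CanonicalEntity:
    """Fields shared by every entity type (names taken from the diagram)."""
    entity_id: str
    entity_type: str        # EntityType in the diagram; plain str here for brevity
    source_type: str
    scheduler_domain: str   # SchedulerDomainType in the diagram
    workload_type: str      # WorkloadType in the diagram
    health_state: str       # HealthState in the diagram
    confidence_score: float
    observed_at: datetime


@dataclass
class ServiceEntity(CanonicalEntity):
    # None defaults are an assumption: a source may not report every metric.
    latency_p95_ms: Optional[float] = None
    latency_p99_ms: Optional[float] = None
    throughput: Optional[float] = None
    queue_depth: Optional[int] = None
    error_rate: Optional[float] = None


@dataclass
class GPUEntity(CanonicalEntity):
    gpu_utilization: Optional[float] = None
    gpu_memory_used: Optional[float] = None
    temperature_celsius: Optional[float] = None
    ecc_errors: Optional[int] = None


svc = ServiceEntity(
    entity_id="inference-api",
    entity_type="service",
    source_type="prometheus",
    scheduler_domain="kubernetes",
    workload_type="inference",  # hypothetical value; WorkloadType members are not listed here
    health_state="healthy",
    confidence_score=0.9,
    observed_at=datetime.now(timezone.utc),
    latency_p95_ms=142.0,
    queue_depth=46,
)
```

Because every subclass inherits the common fields, code that only needs `entity_id`, `health_state`, or `observed_at` can accept any `CanonicalEntity` without knowing the concrete type.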

All 12 Entity Types

Entity Type	Key Fields	Primary Source
`service`	latency, throughput, queue depth, error rate	Prometheus, Generic Serving
`deployment`	replicas (current/available/desired), resource utilization	Kubernetes
`gpu`	utilization, memory, temperature, ECC errors	dcgm-exporter
`node`	CPU, memory, GPU count, health state	Kubernetes, Slurm
`job`	job state, assigned GPUs, utilization	Slurm, Kubernetes
`queue`	pending/active jobs, queue depth	Slurm
`model_revision`	version, serving state, rollout phase	Kubernetes, Custom
`scheduler_domain`	domain type (k8s/slurm), capacity	Runtime adapters
`tenant_scope`	namespace/partition, quota, usage	Kubernetes, Slurm
`experiment_tracker`	run state, metrics, artifacts	MLflow, W&B (future)
`data_pipeline_stage`	stage state, throughput, lag	Custom (future)
`config_snapshot`	config hash, drift detection	Custom (future)
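
One way to picture the identifiers in the table is as a string-valued enum. The sketch below assumes that shape. Only `EntityType.SERVICE` appears elsewhere in this document, so the other member names are guesses derived from the table values.

```python
from enum import Enum


class EntityType(str, Enum):
    """The twelve entity type identifiers from the table (member names assumed)."""
    SERVICE = "service"
    DEPLOYMENT = "deployment"
    GPU = "gpu"
    NODE = "node"
    JOB = "job"
    QUEUE = "queue"
    MODEL_REVISION = "model_revision"
    SCHEDULER_DOMAIN = "scheduler_domain"
    TENANT_SCOPE = "tenant_scope"
    EXPERIMENT_TRACKER = "experiment_tracker"
    DATA_PIPELINE_STAGE = "data_pipeline_stage"
    CONFIG_SNAPSHOT = "config_snapshot"


# Look up a member from the string an adapter would report.
parsed = EntityType("gpu")
```

Looking a member up by value raises `ValueError` for an unrecognized string, which gives an adapter a natural place to catch typos before a fragment is emitted.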

Scheduler Domains

Value	Infrastructure
`kubernetes`	K8s clusters
`slurm`	Slurm HPC clusters
`cloud_managed`	Cloud ML platforms
`standalone`	Bare-metal or VM deployments

Health States

State	Meaning	Policy Effect
`healthy`	Operating normally	No action needed
`warning`	Early degradation signals	Monitor more closely
`degraded`	Performance impacted	Recommend corrective action
`critical`	Immediate attention required	Urgent recommendation (high priority)
`unknown`	Insufficient data	Lower confidence score
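
The table can be read as a lookup from health state to policy effect. The following sketch assumes that reading. The effect labels are shorthand for the table's wording, and mapping unrecognized input to `unknown` is an assumption rather than documented behavior.

```python
from enum import Enum


class HealthState(str, Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    UNKNOWN = "unknown"


# Policy effect per state, abbreviated from the table above.
POLICY_EFFECT = {
    HealthState.HEALTHY: "none",
    HealthState.WARNING: "monitor",
    HealthState.DEGRADED: "recommend",
    HealthState.CRITICAL: "urgent_recommend",
    HealthState.UNKNOWN: "lower_confidence",
}


def policy_effect(raw_state: str) -> str:
    """Return the policy effect for a state string.

    Unrecognized strings are treated as UNKNOWN (an assumption of this sketch).
    """
    try:
        state = HealthState(raw_state)
    except ValueError:
        state = HealthState.UNKNOWN
    return POLICY_EFFECT[state]
```

For example, `policy_effect("critical")` yields `"urgent_recommend"`, while a misspelled state falls back to `"lower_confidence"` instead of raising.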

State Fragments

Adapters do not create full entities directly. Each adapter emits state fragments, partial observations of a single entity, which the State Bus merges into entities:

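from datetime import datetime, timezone

# StateFragment and EntityType come from the project's own package;
# their import path is not shown in this section.
fragment = StateFragment(
    source_type="prometheus",          # adapter that produced the observation
    entity_type=EntityType.SERVICE,
    entity_id="inference-api",
    # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12.
    observed_at=datetime.now(timezone.utc),
    fields={"latency_p95_ms": 142.0, "queue_depth": 46},  # only the fields this source reports
    labels={"namespace": "prod", "cluster": "us-east-1"},
)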

The State Bus:

- Creates a new entity from the first fragment it receives for an `entity_id`
- Merges subsequent fragments into that entity (newer data wins)
- Rejects stale fragments (older than the current entity)
- Tracks freshness per entity
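
The four behaviors listed above can be sketched in a few lines of Python. This is a minimal illustration and not the actual State Bus: `Fragment`, `Entity`, `ingest`, and `age` are hypothetical names, merging is assumed to happen field by field, and a fragment with a timestamp equal to the entity's is assumed to be accepted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


@dataclass
class Fragment:
    """Stand-in for StateFragment with only the fields this sketch needs."""
    entity_id: str
    observed_at: datetime
    fields: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Entity:
    entity_id: str
    observed_at: datetime
    fields: Dict[str, Any] = field(default_factory=dict)


class StateBus:
    def __init__(self) -> None:
        self.entities: Dict[str, Entity] = {}

    def ingest(self, frag: Fragment) -> bool:
        """Apply a fragment; return False if it was rejected as stale."""
        current = self.entities.get(frag.entity_id)
        if current is None:
            # First fragment for this entity_id creates the entity.
            self.entities[frag.entity_id] = Entity(
                frag.entity_id, frag.observed_at, dict(frag.fields)
            )
            return True
        if frag.observed_at < current.observed_at:
            # Older than the current entity: reject.
            return False
        # Newer data wins field by field; fields not in the fragment are kept.
        current.fields.update(frag.fields)
        current.observed_at = frag.observed_at
        return True

    def age(self, entity_id: str, now: datetime) -> timedelta:
        """Freshness: time elapsed since the entity was last updated."""
        return now - self.entities[entity_id].observed_at


bus = StateBus()
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
bus.ingest(Fragment("inference-api", t0, {"latency_p95_ms": 142.0, "queue_depth": 46}))
bus.ingest(Fragment("inference-api", t0 + timedelta(seconds=30), {"queue_depth": 51}))
# This fragment predates the entity's latest observation, so it is rejected.
accepted = bus.ingest(Fragment("inference-api", t0 - timedelta(seconds=30), {"queue_depth": 3}))
```

After these three calls the entity keeps `latency_p95_ms` from the first fragment and `queue_depth` from the second, and `accepted` is `False` for the stale one. A real implementation would also need to decide how to reconcile fragments from different sources that disagree, which this sketch leaves out.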