Knowledge Graph health & migration
The KG consolidation pipeline is a queue-driven background worker. When it gets stuck — embedder timeouts, schema drift, slow disk — the agent stops being able to write decisions to the graph. Five MCP admin tools surface the pipeline’s state and let an operator unblock it without touching the database file:
| Tool | Line (server.py) | Use it when… |
|---|---|---|
| okto_pulse_kg_health | 12484 | You want a single-call summary: queue depth, oldest pending age, dead-letter count, decay tick freshness, and relevance score health. |
| okto_pulse_kg_dead_letter_list | 12527 | Consolidations are failing and you need to see what bounced. |
| okto_pulse_kg_dead_letter_reprocess | 12570 | The underlying issue is fixed and you want to retry the failed entries. |
| okto_pulse_kg_migrate_schema | 12643 | You upgraded Pulse and need to bring the board’s graph.lbug to the current schema version. |
| okto_pulse_kg_tick_run_now | 12716 | You don’t want to wait for the daily decay tick to recompute relevance scores. |
Permissions are dotted-string flags in the granular registry. kg_health and kg_dead_letter_list are read-only and ride on the standard kg.query.* / kg.admin.settings_read flags most presets ship with. The two that mutate state — kg_dead_letter_reprocess and kg_migrate_schema (and the historical-consolidation CLI path) — are gated by kg.admin.historical_consolidation and the broader kg.admin.* namespace. Operator presets layer these in deliberately. Source: okto-pulse-core/src/okto_pulse/core/mcp/server.py:12484–12857 and core/infra/permissions.py:PERMISSION_REGISTRY under the kg.admin key. Citations: 80-pulse-feature-inventory.md:493–501.
For consolidation flow itself, see consolidation. For the schema, see overview.
okto_pulse_kg_health
The board’s pipeline health summary. The Pulse dashboard polls this every 30 seconds.
Input:
board_id: "brd_abc123"
The MCP tool returns a 12-field aggregate computed in-process (cheap to poll). Implemented in okto-pulse-core/src/okto_pulse/core/services/kg_health_service.py:get_kg_health. Real shape:
Output:
{
"queue_depth": 12,
"oldest_pending_age_s": 41.8,
"dead_letter_count": 0,
"total_nodes": 4178,
"default_score_count": 312,
"default_score_ratio": 0.0747,
"avg_relevance": 0.612,
"top_disconnected_nodes": [
{"node_id": "ent_91", "node_type": "Entity", "degree": 0},
{"node_id": "lrn_03", "node_type": "Learning", "degree": 0}
],
"schema_version": "1.0",
"contradict_warn_count": 2,
"last_decay_tick_at": "2026-05-07T03:00:00Z",
"nodes_recomputed_in_last_tick": 388
}
Field meanings (per docstring at server.py:12484 and the service implementation):
| Field | What it means |
|---|---|
| queue_depth | Pending consolidation rows in the SQLite queue. High values mean enqueues are landing faster than the worker drains them. |
| oldest_pending_age_s | Age, in seconds, of the oldest pending consolidation row. null when the queue is empty. |
| dead_letter_count | Rows that exceeded kg_queue_max_attempts (default 5). Inspect with kg_dead_letter_list. |
| total_nodes | Total node count in the board’s graph.lbug. |
| default_score_count / default_score_ratio | Nodes still at the default relevance_score (never recomputed). A ratio above ~0.7 means the decay tick isn’t keeping up; see kg_tick_run_now. |
| avg_relevance | Mean relevance_score across all nodes. |
| top_disconnected_nodes | Lowest-degree nodes (by edge count). Useful for spotting orphaned consolidations. |
| schema_version | Health response schema version. Use kg_migrate_schema or kg_schema_info when checking the graph schema itself. |
| contradict_warn_count | Running count of contradict_penalty cap events. A spike means the curator should reconcile. |
| last_decay_tick_at | Most recent decay-tick run. |
| nodes_recomputed_in_last_tick | Number of nodes recomputed by the most recent decay tick. |
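As a rough illustration of how an agent or script might act on these fields, here is a minimal Python sketch that turns a kg_health payload into warnings. The 0.7 ratio comes from the table above; the queue-depth limit and the function name triage_health are hypothetical and not part of Pulse.

```python
from typing import Any

# 0.7 is the ratio mentioned in the field table; the queue-depth
# limit is a hypothetical value chosen only for illustration.
DEFAULT_SCORE_RATIO_LIMIT = 0.7
QUEUE_DEPTH_LIMIT = 100


def triage_health(health: dict[str, Any]) -> list[str]:
    """Return human-readable warnings for a kg_health payload."""
    warnings: list[str] = []
    if health.get("dead_letter_count", 0) > 0:
        warnings.append("dead letters present: inspect with kg_dead_letter_list")
    if health.get("default_score_ratio", 0.0) > DEFAULT_SCORE_RATIO_LIMIT:
        warnings.append("decay tick is behind: consider kg_tick_run_now")
    if health.get("queue_depth", 0) > QUEUE_DEPTH_LIMIT:
        warnings.append("queue is backing up: worker may be stuck")
    return warnings


sample = {
    "queue_depth": 12,
    "oldest_pending_age_s": 41.8,
    "dead_letter_count": 0,
    "default_score_ratio": 0.0747,
}
print(triage_health(sample))  # prints []
```

Returning a list of messages rather than a single boolean keeps the individual signals visible, which matters because each one points at a different admin tool.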
CLI equivalent — note that the CLI uses a different layered check set rather than this 12-field aggregate:
okto-pulse verify-pipeline brd_abc123 --json
cli.py:683–744 (cmd_verify_pipeline) runs 5 layered checks: queue depth, graph file presence + node count, graph-vs-SQLite ref mirror, outbox staleness, global discovery file. Exit code 0 if healthy, 1 if any layer fails. Use the CLI for monitoring scripts; use kg_health from agents.
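For monitoring scripts, the exit-code contract above can be wrapped in a few lines of Python. This is a sketch only: it relies on the documented exit codes (0 healthy, 1 on any failing layer) and does not parse the --json output, whose shape is not described here. The function name and the injectable run parameter are hypothetical conveniences for testing.

```python
import subprocess
from typing import Any, Callable


def pipeline_is_healthy(
    board_id: str,
    run: Callable[..., Any] = subprocess.run,
) -> bool:
    """Run `okto-pulse verify-pipeline` and map its exit code to a bool."""
    cmd = ["okto-pulse", "verify-pipeline", board_id, "--json"]
    # check=False: a failing layer is signalled through exit code 1,
    # which we want to inspect rather than raise on.
    result = run(cmd, capture_output=True, text=True, check=False)
    return result.returncode == 0
```

A cron job or CI step could call pipeline_is_healthy("brd_abc123") and alert when it returns False.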
okto_pulse_kg_dead_letter_list
Consolidation entries that exceeded kg_queue_max_attempts (default 5) land in the dead-letter table. Pulse never auto-reprocesses them — an operator must inspect and replay.
Input:
board_id: "brd_abc123"
limit: 50
offset: 0
Output:
{
"board_id": "brd_abc123",
"total": 3,
"entries": [
{
"id": "dl_001",
"source_type": "spec",
"source_id": "spec_007",
"first_failed_at": "2026-05-06T22:11:04Z",
"attempts": 5,
"last_error": "embedder.timeout: sentence-transformers exceeded 30s",
"session_id": "kgs_01HV..."
},
{
"id": "dl_002",
"source_type": "card",
"source_id": "card_91",
"first_failed_at": "2026-05-07T01:22:18Z",
"attempts": 5,
"last_error": "graph.lock_timeout: failed to acquire write lock"
}
]
}
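When the list is long, grouping entries by error class helps separate one root cause from several. The sketch below assumes last_error follows the "<class>: <detail>" pattern seen in the sample output; that pattern is inferred from the example, not a documented contract, and error_classes is a hypothetical helper.

```python
from collections import Counter
from typing import Any


def error_classes(entries: list[dict[str, Any]]) -> Counter:
    """Count dead-letter entries by the part of last_error before the first colon."""
    counts: Counter = Counter()
    for entry in entries:
        # Assumption: last_error looks like "<class>: <detail>", as in the sample.
        prefix = entry.get("last_error", "").split(":", 1)[0].strip()
        counts[prefix or "unknown"] += 1
    return counts


entries = [
    {"id": "dl_001", "last_error": "embedder.timeout: sentence-transformers exceeded 30s"},
    {"id": "dl_002", "last_error": "graph.lock_timeout: failed to acquire write lock"},
]
print(error_classes(entries).most_common())
```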
okto_pulse_kg_dead_letter_reprocess
Move dead-letter entries back to the active queue for another attempt. Use after the underlying cause is resolved (embedder up, disk space available, schema migrated).
Input:
board_id: "brd_abc123"
entry_ids: ["dl_001", "dl_002"] # or omit to reprocess all
Output:
{
"board_id": "brd_abc123",
"requeued": 2,
"skipped": 0,
"entry_ids": ["dl_001", "dl_002"]
}
Reprocessing resets the attempts counter to 0 for each requeued entry. If the same root cause persists, the entries will land back in dead-letter after another kg_queue_max_attempts failures.
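Because entries that still have a broken root cause will simply fail again, it can be worth replaying only the ones whose cause is known to be fixed. A minimal sketch of building the tool input that way follows; build_reprocess_input and the prefix-matching rule are hypothetical, and only board_id and entry_ids come from the documented input.

```python
from typing import Any


def build_reprocess_input(
    board_id: str,
    entries: list[dict[str, Any]],
    fixed_error_prefix: str,
) -> dict[str, Any]:
    """Select dead-letter ids whose last_error starts with the fixed error class."""
    ids = [
        entry["id"]
        for entry in entries
        if entry.get("last_error", "").startswith(fixed_error_prefix)
    ]
    # Callers should skip the tool call when ids is empty: how the tool
    # treats an empty list (as opposed to an omitted one) is not documented here.
    return {"board_id": board_id, "entry_ids": ids}


entries = [
    {"id": "dl_001", "last_error": "embedder.timeout: sentence-transformers exceeded 30s"},
    {"id": "dl_002", "last_error": "graph.lock_timeout: failed to acquire write lock"},
]
payload = build_reprocess_input("brd_abc123", entries, "embedder.timeout")
print(payload)
```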
okto_pulse_kg_migrate_schema
Run schema migrations on a board’s graph.lbug. Use after upgrading okto-pulse-core to a version with a higher schema version than the file on disk.
Input:
board_id: "brd_abc123"
target_version: "0.3.3" # optional — defaults to the runtime's current schema version
dry_run: false
Output:
{
"board_id": "brd_abc123",
"from_version": "0.3.1",
"to_version": "0.3.3",
"migrations_applied": [
{"id": "0001_add_belongs_to_multi_pair", "took_ms": 412},
{"id": "0002_hnsw_metric_to_cosine", "took_ms": 1840}
],
"node_count_before": 4178,
"node_count_after": 4178,
"ok": true
}
Migrations are idempotent: a board already at target_version returns migrations_applied: [] and ok: true.
Always take a copy of ~/.okto-pulse/boards/{board_id}/graph.lbug before running a migration with dry_run: false. Migrations rewrite the file in place. The okto-pulse kg backfill --apply flow is a safer rebuild path when a migration corrupts data.
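The backup step can be scripted. The following Python sketch copies graph.lbug to a timestamped sibling file; the path layout is the one given above, while the backup naming scheme and the backup_graph helper are hypothetical. It assumes graph.lbug is a single file and that nothing is writing to it during the copy.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def backup_graph(board_id: str, boards_root: Optional[Path] = None) -> Path:
    """Copy graph.lbug to a timestamped sibling file and return the backup path."""
    root = boards_root or Path.home() / ".okto-pulse" / "boards"
    source = root / board_id / "graph.lbug"
    if not source.is_file():
        raise FileNotFoundError(f"no graph file at {source}")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical naming scheme; any safe location works equally well.
    target = source.with_name(f"graph.lbug.{stamp}.bak")
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```

Calling backup_graph("brd_abc123") before a non-dry-run migration gives a file that can be copied back if the in-place rewrite goes wrong.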
okto_pulse_kg_tick_run_now
Trigger the decay tick worker immediately instead of waiting for the schedule (default daily, kg_decay_tick_interval_minutes = 1440). The tick recomputes relevance_score for nodes whose last_recomputed_at is older than kg_decay_tick_staleness_days (default 7).
Input:
board_id: "brd_abc123"
Output:
{
"board_id": "brd_abc123",
"started_at": "2026-05-07T15:30:11Z",
"completed_at": "2026-05-07T15:30:14Z",
"nodes_recomputed": 312,
"nodes_skipped_fresh": 3866
}
The decay formula is documented in kg/workers/kg_decay_tick.py. Note that find_similar_decisions uses a separate search reranking formula at retrieval time — do not conflate the two (80-pulse-feature-inventory.md:957).
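The selection rule the tick applies (recompute only nodes whose last_recomputed_at is older than the staleness window) can be sketched in a few lines of Python. This illustrates the rule only, not the decay formula or the worker's actual code; treating a missing timestamp as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_recomputed_at: Optional[datetime],
    now: datetime,
    staleness_days: int = 7,
) -> bool:
    """True when a node is due for recomputation under the staleness rule."""
    if last_recomputed_at is None:
        # Assumption: never-recomputed nodes count as stale.
        return True
    return now - last_recomputed_at > timedelta(days=staleness_days)


now = datetime(2026, 5, 7, 15, 30, tzinfo=timezone.utc)
print(is_stale(now - timedelta(days=8), now))  # True
print(is_stale(now - timedelta(days=2), now))  # False
```

Under this rule, nodes_skipped_fresh in the output corresponds to nodes for which the check returns False.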
Hot-reloadable settings
You don’t need to restart Pulse to change pipeline tuning:
| Setting group | Hot-reload mechanism |
|---|---|
| kg_queue_* (worker count, claim timeout, max attempts, alert threshold, recovery scan) | APScheduler re-reads the value with a 5-second debounce. No action required. |
| kg_decay_tick_* (interval, staleness, max age) | PUT /settings/runtime; applies on the next tick boundary. |
Source: 80-pulse-feature-inventory.md:790. The full settings table lives in Knowledge Graph and the inventory.
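To make the "5-second debounce" concrete, here is a generic Python sketch of a trailing debounce: a changed value only takes effect once it has stayed unchanged for the delay. This is an illustration of the general technique under that assumption, not Pulse's implementation, and it does not use APScheduler; the class name and the injected clock are hypothetical.

```python
from typing import Any, Optional


class DebouncedSetting:
    """Expose a new value only after it has been unchanged for delay_s seconds."""

    def __init__(self, initial: Any, delay_s: float = 5.0) -> None:
        self.active = initial
        self._pending: Any = initial
        self._pending_since: Optional[float] = None
        self._delay_s = delay_s

    def observe(self, value: Any, now: float) -> Any:
        """Feed the latest stored value; return the value currently in effect."""
        if value != self._pending:
            # A new write restarts the debounce window.
            self._pending = value
            self._pending_since = now
        if (
            self._pending_since is not None
            and now - self._pending_since >= self._delay_s
        ):
            self.active = self._pending
            self._pending_since = None
        return self.active
```

The practical consequence for an operator is that several quick edits to a kg_queue_* value collapse into one change, applied a few seconds after the last edit.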
CLI fallbacks
The MCP tools are the primary surface, but three CLI commands cover deeper recovery scenarios:
okto-pulse verify-pipeline <board_id>
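cli.py:683–744. Runs the 5 layered checks (queue depth, graph file presence and node count, graph-vs-SQLite ref mirror, outbox staleness, global discovery file) rather than the 12-field kg_health aggregate, and exits 1 if any layer fails, which makes it useful in CI and monitoring scripts.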
okto-pulse verify-pipeline brd_abc123 --json
okto-pulse kg backfill <board_id>
cli.py:747–902. Runs the Layer 1 deterministic KG worker against every artifact on the board.
# Dry run: report what would be emitted, no writes
okto-pulse kg backfill brd_abc123
# Apply: enqueue all artifacts, drain the consolidation queue
okto-pulse kg backfill brd_abc123 --apply
# Filter to one artifact type
okto-pulse kg backfill brd_abc123 --apply --artifact-type spec
This is the recovery path when the graph is structurally out of sync (e.g., after a partial migration or a manual file restore). It rebuilds the deterministic skeleton; it does not replay cognitive-agent decisions.
okto-pulse kg dedup-entities <board_id>
cli.py:908–936. Consolidate duplicate nodes per (node_type, source_artifact_ref).
okto-pulse kg dedup-entities brd_abc123 --dry-run
okto-pulse kg dedup-entities brd_abc123
kg dedup-entities writes by default. Always run with --dry-run first.
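The grouping key used for deduplication, (node_type, source_artifact_ref), can be illustrated with a small Python sketch that finds candidate duplicate groups in a list of node records. The record shape and the duplicate_groups helper are hypothetical; how the CLI chooses which node survives a merge is not covered here.

```python
from collections import defaultdict
from typing import Any


def duplicate_groups(nodes: list[dict[str, Any]]) -> dict[tuple[str, str], list[str]]:
    """Group node ids by (node_type, source_artifact_ref); keep groups with >1 member."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for node in nodes:
        key = (node["node_type"], node["source_artifact_ref"])
        groups[key].append(node["node_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


nodes = [
    {"node_id": "ent_1", "node_type": "Entity", "source_artifact_ref": "spec_007"},
    {"node_id": "ent_2", "node_type": "Entity", "source_artifact_ref": "spec_007"},
    {"node_id": "lrn_1", "node_type": "Learning", "source_artifact_ref": "spec_007"},
]
print(duplicate_groups(nodes))  # {('Entity', 'spec_007'): ['ent_1', 'ent_2']}
```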
Underlying REST endpoints
The MCP tools wrap REST endpoints exposed by the API server. They are documented here for completeness — most callers should prefer the MCP tools.
| Method | Path | Equivalent MCP tool |
|---|---|---|
| GET | /kg/health | kg_health |
| GET | /kg/queue/health | kg_health (subset) |
| POST | /kg/tick/run-now | kg_tick_run_now |
| GET | /kg/dead-letter/... | kg_dead_letter_list |
| POST | /kg/dead-letter/reprocess | kg_dead_letter_reprocess |
Source: 80-pulse-feature-inventory.md:729–735.
Common operational scenarios
“Consolidations stopped landing”
- Call kg_health and check queue_depth, oldest_pending_age_s, and dead_letter_count.
- If dead_letter_count > 0, call kg_dead_letter_list to see error reasons.
- Fix the root cause (embedder, disk, schema mismatch).
- Call kg_dead_letter_reprocess with the entry ids.
- Re-call kg_health. Expect dead_letter_count: 0 and queue_depth draining.
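The steps of this scenario can be sketched as one Python function, to be run after the root cause has been fixed. The call_tool parameter is a hypothetical stand-in for whatever MCP client the agent uses (tool name and arguments in, parsed result out); only the tool names and fields shown earlier on this page are taken from the documentation. The sketch replays the first page of 50 entries only.

```python
from typing import Any, Callable

# Hypothetical client signature: (tool_name, arguments) -> parsed JSON result.
ToolCall = Callable[[str, dict[str, Any]], dict[str, Any]]


def replay_dead_letters(call_tool: ToolCall, board_id: str) -> dict[str, Any]:
    """Check health, replay dead letters if any, and return the final health."""
    args = {"board_id": board_id}
    health = call_tool("okto_pulse_kg_health", args)
    if health["dead_letter_count"] == 0:
        return health  # nothing to replay
    listing = call_tool(
        "okto_pulse_kg_dead_letter_list", {**args, "limit": 50, "offset": 0}
    )
    ids = [entry["id"] for entry in listing["entries"]]
    call_tool("okto_pulse_kg_dead_letter_reprocess", {**args, "entry_ids": ids})
    # Re-check: dead_letter_count should now be 0 and queue_depth should drain.
    return call_tool("okto_pulse_kg_health", args)
```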
“Just upgraded Pulse, dashboard shows schema drift”
- Inspect the graph schema with kg_schema_info or run the migration tool when a schema error points to drift.
- Take a backup of graph.lbug.
- Call kg_migrate_schema with default target_version (= runtime version).
- Re-check the graph schema. Expect it to match the runtime schema.
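After the migration call, the response itself offers a few quick sanity signals. The Python sketch below checks them against the output shape shown under okto_pulse_kg_migrate_schema. The helper name is hypothetical, and the rule that node counts should be unchanged is an assumption drawn from the sample output, not a documented guarantee.

```python
from typing import Any


def migration_problems(result: dict[str, Any], expected_version: str) -> list[str]:
    """Return reasons to distrust a kg_migrate_schema result; empty means it looks fine."""
    problems: list[str] = []
    if not result.get("ok", False):
        problems.append("migration reported ok=false")
    if result.get("to_version") != expected_version:
        problems.append("to_version does not match the expected schema version")
    # Assumption: schema migrations should not add or drop nodes.
    if result.get("node_count_before") != result.get("node_count_after"):
        problems.append("node count changed during migration")
    return problems


result = {
    "from_version": "0.3.1",
    "to_version": "0.3.3",
    "node_count_before": 4178,
    "node_count_after": 4178,
    "ok": True,
}
print(migration_problems(result, "0.3.3"))  # prints []
```

A non-empty list is a cue to restore the backup or fall back to the kg backfill path rather than continue.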
“Relevance scores look stale”
- Call kg_tick_run_now. Expect nodes_recomputed > 0.
- If it always reports 0 nodes recomputed, every node is still inside the kg_decay_tick_staleness_days window (default 7). Lower that setting so nodes qualify for recomputation sooner, or check that nodes are being touched at all.
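“Need to rebuild the deterministic skeleton from scratch”
- CLI only: run okto-pulse kg backfill <board_id> (dry run) and review the report, then run okto-pulse kg backfill <board_id> --apply.
- Call kg_health to confirm the queue has drained, then run okto-pulse verify-pipeline <board_id> to confirm graph_node_refs is balanced. The graph-vs-SQLite ref mirror check is part of the CLI's layered checks, not the 12-field kg_health response.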
Next steps
Consolidation
The 7 transactional write primitives the queue is feeding.
Archive & retention
Cascading entity archive, supersedence as soft-archive, and KG retention policy.