ID Strategy¶

Goals¶

Stability: IDs must remain stable across machines, paths, and indexing time
Determinism: Same artifacts + same config ⇒ identical IDs
Collision resistance: Use cryptographic hashing
Privacy: Avoid embedding sensitive raw strings; hash normalized representations instead
Traceability: IDs must be reproducible from stored index fields

Canonical Hashing¶

Hash Algorithm¶

Primary: SHA-256
ID encoding: lowercase hex
ID format: <prefix>_<hex12> for compact IDs exposed via MCP, where <hex12> is the first 12 hex characters of the digest; store the full SHA-256 internally
Example: t_ab12cd34ef56

Rationale: Clients get short IDs, while the store retains the full hash for collision detection and audit.
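The relationship between the full digest and the exposed ID can be sketched as follows. This is a minimal illustration, not the canonical implementation; the helper names `sha256_hex` and `compact_id` are assumptions made for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the full SHA-256 digest as lowercase hex (64 characters)."""
    return hashlib.sha256(data).hexdigest()


def compact_id(prefix: str, full_hex: str, length: int = 12) -> str:
    """Build the exposed ID from the first `length` hex characters of the digest."""
    return f"{prefix}_{full_hex[:length]}"


full = sha256_hex(b"example input")   # stored internally
short = compact_id("t", full)         # exposed to clients, e.g. t_<12 hex chars>
```

Keeping `length` as a parameter makes it straightforward to expose a longer prefix later if two records ever share the same 12 characters.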

Canonical Serialization¶

All hash inputs must use a canonical serialization:

JSON with:
sorted keys
UTF-8 encoding
no whitespace
Normalize strings (see below) before serialization
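A minimal sketch of these serialization rules using the Python standard library is shown below. The function name `canonical_json` matches the pseudocode used elsewhere in this document, but the body is an assumption rather than a copy of the reference implementation.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys, no insignificant whitespace, UTF-8 bytes."""
    text = json.dumps(
        obj,
        sort_keys=True,            # key order never depends on insertion order
        separators=(",", ":"),     # no spaces after commas or colons
        ensure_ascii=False,        # keep non-ASCII as-is, then encode as UTF-8
    )
    return text.encode("utf-8")


payload = {"suite": "smoke", "seed": None}
print(canonical_json(payload))  # b'{"seed":null,"suite":"smoke"}'
digest = hashlib.sha256(canonical_json(payload)).hexdigest()
```

Because the keys are sorted, two dictionaries with the same content but different insertion order hash to the same value.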

String Normalization¶

Applies to all ID inputs:

Trim leading/trailing whitespace
Normalize line endings to \n
Lowercase where the semantic field is case-insensitive (e.g., framework name, simulator vendor)
Replace platform-dependent path separators (\ → /)
Remove volatile substrings (see "Volatile stripping")
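The first four rules could be sketched as below. The function name and the `case_insensitive` and `is_path` flags are hypothetical: the rules do not say how a caller marks a field as case-insensitive or path-like, so this example makes both explicit. Volatile stripping is left out here.

```python
def normalize_string(value: str, *, case_insensitive: bool = False,
                     is_path: bool = False) -> str:
    """Apply whitespace, line-ending, case, and separator normalization."""
    # Normalize line endings first so a trailing "\r" is trimmed as well.
    text = value.replace("\r\n", "\n").replace("\r", "\n")
    text = text.strip()
    if is_path:
        # Assumption: separators are only rewritten for path-like fields.
        text = text.replace("\\", "/")
    if case_insensitive:
        text = text.lower()
    return text


print(normalize_string("  Questa \r\n", case_insensitive=True))  # questa
print(normalize_string("tb\\src\\top.sv", is_path=True))         # tb/src/top.sv
```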

Volatile Stripping (for stability)¶

Remove/normalize values known to vary between runs but not semantically part of identity:

Absolute filesystem prefixes (replace with <ROOT> before hashing)
Hostnames (replace with <HOST>)
Timestamps (replace with <TS>)
Random temporary directory names (replace with <TMP>)

Stripping is applied only where those values might leak into identity fields (typically evidence paths and raw messages).
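One way to sketch the placeholder substitution is shown below. The regular expressions are illustrative assumptions (ISO-8601-style timestamps, temporary directories under `/tmp`); a real deployment would need patterns that match its own log formats.

```python
import re
from typing import Optional

# Hypothetical patterns, chosen only for this example.
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
)
_TMPDIR = re.compile(r"/tmp/[A-Za-z0-9._-]+")


def strip_volatile(text: str, root: Optional[str] = None,
                   hostname: Optional[str] = None) -> str:
    """Replace run-specific substrings with stable placeholders."""
    if root:
        text = text.replace(root.rstrip("/"), "<ROOT>")
    if hostname:
        text = text.replace(hostname, "<HOST>")
    text = _TMPDIR.sub("<TMP>", text)
    text = _TIMESTAMP.sub("<TS>", text)
    return text


line = "/home/ci/ws/logs/run.log on build-07 at 2024-05-01T12:30:00Z"
print(strip_volatile(line, root="/home/ci/ws", hostname="build-07"))
# <ROOT>/logs/run.log on <HOST> at <TS>
```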

run_id Strategy¶

Preferred (CI-backed) run_id¶

If CI metadata exists:

run_id_full = sha256(canonical_json({
    "suite": <suite>,
    "ci_system": <ci.system>,
    "ci_build_id": <ci.build_id>,
    "ci_job_url": <normalized_url(ci.job_url)>
}))
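A runnable sketch of the CI-backed variant follows. The document does not define `normalized_url`, so the `normalize_url` helper here is an assumption: it lowercases the host, drops the query string and fragment, and strips a trailing slash.

```python
import hashlib
import json
from urllib.parse import urlsplit, urlunsplit


def canonical_json(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def normalize_url(url: str) -> str:
    """Hypothetical URL normalization: keep scheme, host, and path only."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), "", ""))


def ci_run_id_full(suite: str, ci_system: str, ci_build_id: str,
                   ci_job_url: str) -> str:
    payload = {
        "suite": suite,
        "ci_system": ci_system,
        "ci_build_id": ci_build_id,
        "ci_job_url": normalize_url(ci_job_url),
    }
    return hashlib.sha256(canonical_json(payload)).hexdigest()


full = ci_run_id_full("smoke", "jenkins", "42",
                      "https://CI.example.com/job/42/?token=abc")
run_id = "r_" + full[:12]
```

With this normalization, cosmetic differences in the job URL (host case, trailing slash, query parameters) do not change the resulting ID.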

Fallback (artifact-backed) run_id¶

If CI metadata is absent:

# Use stable fingerprint of the artifact set
artifact_manifest_hash = sha256(concat(sorted([
    relative_path + ":" + file_sha256
    for each file in the artifact set
])))

run_id_full = sha256(canonical_json({
    "suite": <suite>,
    "artifact_manifest": <artifact_manifest_hash>
}))
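The manifest fingerprint could be computed along these lines. This is a sketch under stated assumptions: the artifact set is taken to be every regular file under one root directory, and the sorted entries are concatenated without a separator, mirroring the pseudocode above.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large logs are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def artifact_manifest_hash(root: Path) -> str:
    # Relative POSIX paths keep the result independent of the checkout location.
    entries = sorted(
        f"{p.relative_to(root).as_posix()}:{file_sha256(p)}"
        for p in root.rglob("*") if p.is_file()
    )
    return hashlib.sha256("".join(entries).encode("utf-8")).hexdigest()


# Two copies of the same artifacts in different directories hash identically.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for d in (a, b):
        (Path(d) / "logs").mkdir()
        (Path(d) / "logs" / "sim.log").write_text("UVM_ERROR sample\n")
    same = artifact_manifest_hash(Path(a)) == artifact_manifest_hash(Path(b))
print(same)  # True
```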

Exposed run_id¶

run_id = "r_" + hex12(run_id_full)

test_id Strategy¶

Inputs¶

A test_id should uniquely identify a test instance within a run.

Hash input:

{
  "run_id_full": "<run_id_full: full sha256 hex of the run>",
  "framework": "uvm|cocotb|sv_unit|unknown",
  "test_name": "<normalized test name>",
  "seed": <int|null>,
  "simulator_vendor": "<normalized vendor|null>",
  "simulator_version": "<normalized version|null>",
  "dut_top": "<normalized top|null>"
}

Notes:

seed is included when present because it materially changes behavior in DV
If your environment has an explicit test GUID (some harnesses do), include it as test_guid and optionally omit the other fields

Computation¶

test_id_full = sha256(canonical_json(inputs))
test_id = "t_" + hex12(test_id_full)
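Put together, the computation might look like the sketch below. The function name `compute_test_id` and its signature are assumptions for illustration; inputs are expected to be normalized already.

```python
import hashlib
import json
from typing import Optional, Tuple


def canonical_json(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def compute_test_id(run_id_full: str, framework: str, test_name: str,
                    seed: Optional[int] = None,
                    simulator_vendor: Optional[str] = None,
                    simulator_version: Optional[str] = None,
                    dut_top: Optional[str] = None) -> Tuple[str, str]:
    """Return (test_id_full, test_id) for already-normalized inputs."""
    inputs = {
        "run_id_full": run_id_full,
        "framework": framework,
        "test_name": test_name,
        "seed": seed,
        "simulator_vendor": simulator_vendor,
        "simulator_version": simulator_version,
        "dut_top": dut_top,
    }
    full = hashlib.sha256(canonical_json(inputs)).hexdigest()
    return full, "t_" + full[:12]


run_full = hashlib.sha256(b"example run").hexdigest()
full_a, id_a = compute_test_id(run_full, "uvm", "axi_burst_test", seed=1)
full_b, id_b = compute_test_id(run_full, "uvm", "axi_burst_test", seed=2)
```

The two calls differ only in `seed`, so they yield different full hashes, which is the intended behavior for reruns of the same test with different seeds.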

failure_id Strategy¶

What a Failure Event Is¶

A FailureEvent is a normalized record of something that went wrong, typically derived from:

UVM report lines
cocotb exception traces
assertion failure entries
compile/elab errors

Inputs¶

Failures must be uniquely addressable within a test, but stable across re-indexing of the same artifacts.

Hash input:

{
  "test_id_full": "<test_id_full: full sha256 hex of the test>",
  "severity": "info|warning|error|fatal",
  "category": "...",
  "summary_norm": "<normalized summary>",
  "component_norm": "<normalized component|null>",
  "phase_norm": "<normalized phase|null>",
  "time_bucket": "<time bucket|null>",
  "evidence_fingerprint": "<optional>"
}

time_bucket¶

To reduce instability from minor timestamp differences, use bucketing:

If time is available: time_bucket = floor(time_ns / 1000) (1 µs buckets)
If only a log line exists, without a time: time_bucket = null
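As a small sketch, the bucketing rule reduces to integer division. The `bucket_ns` parameter is an assumption added so the bucket width can be tuned. Note that two times on either side of a bucket boundary still land in different buckets.

```python
from typing import Optional


def time_bucket(time_ns: Optional[int], bucket_ns: int = 1000) -> Optional[int]:
    """Floor a simulation time in ns to its bucket index; None if no time."""
    if time_ns is None:
        return None
    return time_ns // bucket_ns


print(time_bucket(1_234_567))  # 1234
print(time_bucket(1_234_999))  # 1234 (same 1 µs bucket)
print(time_bucket(1_235_000))  # 1235 (next bucket)
print(time_bucket(None))       # None
```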

evidence_fingerprint¶

Optional but useful when multiple identical summaries exist:

evidence_fingerprint = sha256(concat(sorted([
    path + ":" + str(start_line) + ":" + str(end_line)
    for each evidence span
])))

Only include if evidence exists; otherwise omit.
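A sketch of the fingerprint follows, assuming each evidence span is a `(path, start_line, end_line)` tuple with an already-normalized path. Returning `None` for an empty list is this example's way of signaling that the field should be omitted.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def evidence_fingerprint(spans: Iterable[Tuple[str, int, int]]) -> Optional[str]:
    """Hash the sorted evidence spans; None means 'omit the field'."""
    entries = sorted(f"{path}:{start}:{end}" for path, start, end in spans)
    if not entries:
        return None
    return hashlib.sha256("".join(entries).encode("utf-8")).hexdigest()


fp1 = evidence_fingerprint([("logs/sim.log", 10, 12), ("logs/sim.log", 40, 41)])
fp2 = evidence_fingerprint([("logs/sim.log", 40, 41), ("logs/sim.log", 10, 12)])
print(fp1 == fp2)  # True: order of discovery does not matter
```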

Computation¶

failure_id_full = sha256(canonical_json(inputs))
failure_id = "f_" + hex12(failure_id_full)
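The sketch below shows one way to assemble the hash input, including the rule that `evidence_fingerprint` is left out of the object entirely when no evidence exists. The function name and signature are assumptions for illustration.

```python
import hashlib
import json
from typing import Optional, Tuple


def canonical_json(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def compute_failure_id(test_id_full: str, severity: str, category: str,
                       summary_norm: str,
                       component_norm: Optional[str] = None,
                       phase_norm: Optional[str] = None,
                       time_bucket: Optional[int] = None,
                       evidence_fingerprint: Optional[str] = None
                       ) -> Tuple[str, str]:
    """Return (failure_id_full, failure_id) for already-normalized inputs."""
    inputs = {
        "test_id_full": test_id_full,
        "severity": severity,
        "category": category,
        "summary_norm": summary_norm,
        "component_norm": component_norm,
        "phase_norm": phase_norm,
        "time_bucket": time_bucket,
    }
    # Omit the key (rather than storing null) when there is no evidence.
    if evidence_fingerprint is not None:
        inputs["evidence_fingerprint"] = evidence_fingerprint
    full = hashlib.sha256(canonical_json(inputs)).hexdigest()
    return full, "f_" + full[:12]


test_full = hashlib.sha256(b"example test").hexdigest()
full_a, id_a = compute_failure_id(test_full, "error", "assertion",
                                  "scoreboard mismatch", time_bucket=1234)
```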

signature_id Strategy (Regression Clustering)¶

Purpose¶

A FailureSignature clusters failures across tests/runs.

Inputs¶

Signature should represent the type of failure, not the instance.

Hash input:

{
  "category": "...",
  "summary_signature": "<signature-normalized summary>",
  "protocol": "<protocol tag|null>",
  "component_role": "<optional normalized role|null>"
}

signature-normalized summary¶

Apply stronger normalization than summary_norm:

Replace hex literals: 0x[0-9a-fA-F]+ → <HEX>
Replace decimal numbers: \b\d+\b → <NUM>
Replace time values (number plus unit): 123ns, 45 us → <TIME>; apply this before the decimal rule, otherwise 45 us becomes <NUM> us
Replace paths: /.../file.sv → <PATH>
Replace instance paths: tb.top.env.agent[3].drv → <INST> (optional, configurable)
Collapse whitespace
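These replacements could be sketched with regular expressions as below. The exact patterns and the order of application are assumptions: paths and instance paths are handled first, and time values before bare decimals, so that earlier rules do not destroy what later rules need to match.

```python
import re

# Illustrative patterns only; tune them to the log formats in use.
_PATH = re.compile(r"(?:/[\w.+-]+)+\.\w+")
_INST = re.compile(r"\b[A-Za-z_]\w*(?:\[\d+\])?(?:\.[A-Za-z_]\w*(?:\[\d+\])?)+")
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_TIME = re.compile(r"\b\d+(?:\.\d+)?\s*(?:fs|ps|ns|us|ms|s)\b")
_NUM = re.compile(r"\b\d+\b")
_WS = re.compile(r"\s+")


def signature_normalize(summary: str, replace_instances: bool = True) -> str:
    """Reduce a failure summary to its type by masking instance-specific values."""
    text = _PATH.sub("<PATH>", summary)
    if replace_instances:            # optional, per the rule list above
        text = _INST.sub("<INST>", text)
    text = _HEX.sub("<HEX>", text)
    text = _TIME.sub("<TIME>", text)
    text = _NUM.sub("<NUM>", text)
    return _WS.sub(" ", text).strip()


msg = ("Mismatch at  0xDEADBEEF after 123ns in tb.top.env.agent[3].drv: "
       "expected 42 (see /proj/tb/env/scoreboard.sv)")
print(signature_normalize(msg))
# Mismatch at <HEX> after <TIME> in <INST>: expected <NUM> (see <PATH>)
```

Two failures that differ only in address, time, instance index, or expected value then share one `summary_signature` and therefore one signature_id.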

Computation¶

signature_id_full = sha256(canonical_json(inputs))
signature_id = "s_" + hex12(signature_id_full)

Collision Handling¶

Store full hashes (*_id_full) in the index
If two different records yield the same short ID (hex12 collision), the server must:
still disambiguate internally by full hash
expose a longer prefix (configurable) for those IDs or include full hash in details responses
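In practice, a 12-hex-character SHA-256 prefix (48 bits) is typically sufficient: by the birthday bound, the probability of any collision among 1 million IDs sharing a prefix is roughly 0.2%, and it reaches about 50% near 20 million IDs. Collision handling must therefore exist.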
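The sketch below illustrates both points: a birthday-bound estimate for a 48-bit prefix, and one possible way to expose a longer prefix for colliding records. Both function names are hypothetical, and the quadratic scan is only suitable for illustration; an indexed store would look up prefixes directly.

```python
import math
from typing import Dict, Iterable


def collision_probability(n_ids: int, bits: int = 48) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n_ids * (n_ids - 1) / 2 ** (bits + 1))


def assign_short_ids(prefix: str, full_hashes: Iterable[str],
                     min_len: int = 12) -> Dict[str, str]:
    """Map each full hash to the shortest unique prefix of at least min_len."""
    unique = sorted(set(full_hashes))
    result = {}
    for h in unique:
        n = min_len
        # Lengthen the exposed prefix until no other record shares it.
        while n < len(h) and any(o != h and o[:n] == h[:n] for o in unique):
            n += 1
        result[h] = f"{prefix}_{h[:n]}"
    return result


a = "ab12cd34ef56" + "0" * 52
b = "ab12cd34ef56" + "1" + "0" * 51   # collides with `a` on the first 12 chars
c = "ffffffffffff" + "0" * 52
ids = assign_short_ids("t", [a, b, c])
print(ids[a], ids[b], ids[c])
# t_ab12cd34ef560 t_ab12cd34ef561 t_ffffffffffff
```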

Implementation Reference¶

See sentinel_dv/ids.py for the canonical implementation.