QUANTARA • QUANTUM-RESISTANT L1
Incident response on Quantara Devnet-0
How we classify incidents, communicate with validators, and move from detection to containment, recovery, and postmortem.
Incident response for Quantara Devnet-0
A shared playbook for validators, infra operators, and builders when something looks wrong on Devnet-0 — from the first alert to the final postmortem.
Devnet-0 is where we discover and fix problems early. Incidents are expected. What matters is that we detect them quickly, communicate clearly, and recover in a controlled way.
This document defines how Quantara classifies incidents, how we expect validators and operators to respond, and how this process fits with the Rollback & Recovery doc and the Security checklist.
Treat this as the operational counterpart to the Devnet-0 overview and Validator runbook — those explain the steady state; this one covers “when things go sideways.”
When in doubt, assume it might be a real incident. Capture evidence, check /status, and share what you're seeing — even if it later turns out to be benign.
Devnet-0 is a non-financial rehearsal network. We intentionally push new runtimes, node versions, and tooling to find issues before public testnet and mainnet.
• RPC: wss://rpc.devnet-0.quantara.xyz
• Explorer: https://explorer.devnet-0.quantara.xyz
Last updated: 2025-11-23 22:00 UTC. The Status page is the canonical source for live incident state, maintenance windows, and advisories.
1 • Classification
How we classify incidents on Devnet-0
A shared severity language keeps everyone aligned on urgency and expectations — even on a devnet.
Sev-1 — Network-wide impact
• Finality stalled or blocks not being produced.
• Majority of validators unable to stay in sync.
• Critical consensus or runtime bug suspected.
• Requires coordinated response from core team.
Sev-2 — Partial degradation
• Some validators / RPCs degraded, but chain healthy.
• Performance issues (high latency, resource spikes).
• Non-critical bugs impacting a subset of operators.
• Workarounds exist, but we still want a fix.
Sev-3 — Local / cosmetic
• Single validator down or misconfigured.
• Explorer or wallet UI glitch with no chain impact.
• Intermittent RPC errors that self-resolve quickly.
• Good candidates for GitHub issues and follow-up.
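The three tiers above can be read as a small decision function. The following is a minimal sketch, not Quantara tooling: the `Observation` fields are hypothetical, "majority" is taken to mean more than half of the validator set, and "some" degraded nodes is taken to mean more than one.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of what an operator sees (not a Quantara type)."""
    finality_stalled: bool = False
    blocks_stalled: bool = False
    consensus_bug_suspected: bool = False
    validators_out_of_sync: int = 0
    validators_total: int = 1
    degraded_nodes: int = 0  # validators or RPCs degraded while the chain is healthy


def suggest_severity(obs: Observation) -> str:
    """Suggest an initial severity; the core team may re-classify later."""
    # Assumption: "majority" means strictly more than half of all validators.
    majority_out_of_sync = obs.validators_out_of_sync * 2 > obs.validators_total
    if (obs.finality_stalled or obs.blocks_stalled
            or obs.consensus_bug_suspected or majority_out_of_sync):
        return "Sev-1"
    # Assumption: more than one degraded node counts as partial degradation.
    if obs.degraded_nodes > 1:
        return "Sev-2"
    # A single node, or a UI glitch with no chain impact, is local.
    return "Sev-3"
```

The function only suggests a starting point. A human still decides the final severity, and a suspected consensus bug always wins over the node counts.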
2 • First response
First five minutes when you see something weird
Whether you’re a validator or running supporting infra, these are the steps we expect you to take before diving into deep debugging.
2.1 — Confirm the signal
• Check if /status reports an incident or maintenance.
• Compare your block height with reference RPC / explorer.
• Verify that alerts are not just firing from noisy rules.
• Check a second vantage point (another node, region, or tool).
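The height comparison above can be sketched as follows. This assumes you have already fetched your own height and one or more reference heights (from the reference RPC, the explorer, or a second node) by whatever means your tooling provides. The function names and the `max_lag` default of 5 blocks are illustrative choices, not Quantara-defined values.

```python
from statistics import median


def height_lag(local_height: int, reference_heights: list[int]) -> int:
    """Blocks by which the local node trails the median reference height."""
    if not reference_heights:
        raise ValueError("need at least one reference height")
    # The median keeps one stale or faulty vantage point from skewing the result.
    return int(median(reference_heights)) - local_height


def looks_like_real_signal(local_height: int, reference_heights: list[int],
                           max_lag: int = 5) -> bool:
    """True when the local node trails the references by more than max_lag blocks."""
    return height_lag(local_height, reference_heights) > max_lag
```

Using several reference heights is the code equivalent of checking a second vantage point: one lagging reference alone should not trigger or suppress a report.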
2.2 — Capture evidence
• Note current time, block height, and peer count.
• Grab a short log snippet around the first error.
• Capture key metrics (CPU, RAM, disk, network).
• Save any error messages exactly as they appear.
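One way to capture the items above in a single artifact is sketched below. It assumes log lines are already in memory and that errors contain the literal marker `ERROR`; both the marker and the snapshot field names are hypothetical, so adjust them to your node's log format.

```python
import json
from datetime import datetime, timezone


def log_snippet(lines: list[str], marker: str = "ERROR", context: int = 3) -> list[str]:
    """Return the lines around the first line containing marker, or [] if none."""
    for i, line in enumerate(lines):
        if marker in line:
            return lines[max(0, i - context): i + context + 1]
    return []


def evidence_snapshot(block_height: int, peer_count: int,
                      log_lines: list[str], metrics: dict[str, float]) -> str:
    """Bundle time, height, peers, logs, and metrics into one JSON document."""
    snapshot = {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "block_height": block_height,
        "peer_count": peer_count,
        "log_snippet": log_snippet(log_lines),  # copied verbatim, never paraphrased
        "metrics": metrics,  # e.g. CPU, RAM, disk, network readings
    }
    return json.dumps(snapshot, indent=2)
```

Recording the timestamp in UTC makes it easier to line up snapshots from operators in different regions.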
2.3 — Share a concise report
• Post in the validator / ops channel with your findings.
• Include time, node role (validator / RPC / sentry), and region.
• Attach logs / metrics or links where safe to do so.
• Suggest your initial severity (Sev-1/2/3) if you can.
A good first incident message answers: “what changed, when, where, and how bad does it look?” Perfection is not required — clarity is.
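A first report with those fields could be rendered like this. The message layout and the `format_report` helper are illustrative assumptions; Quantara does not prescribe a specific template here.

```python
def format_report(summary: str, observed_at_utc: str, node_role: str,
                  region: str, severity: str, evidence_links: list[str]) -> str:
    """Render a short first-incident message for the validator / ops channel."""
    if node_role not in {"validator", "RPC", "sentry"}:
        raise ValueError(f"unknown node role: {node_role}")
    if severity not in {"Sev-1", "Sev-2", "Sev-3"}:
        raise ValueError(f"unknown severity: {severity}")
    lines = [
        f"[{severity} suggested] {summary}",   # what changed and how bad it looks
        f"When: {observed_at_utc}",            # when
        f"Where: {node_role} node, {region}",  # where
    ]
    # Only include links that are safe to share in the channel.
    lines += [f"Evidence: {link}" for link in evidence_links]
    return "\n".join(lines)
```

Validating the role and severity up front keeps reports comparable across operators, which matters more than polished wording.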
3 • Communication
Who says what, where, and when
Even on a devnet, we want predictable communication patterns — both inside Quantara and across the validator set.
3.1 — Channels & sources of truth
• /status: canonical incident state, timelines, and summaries.
• Validator / ops channels: real-time coordination and updates.
• Docs: long-lived guidance, updated post-incident.
• Social / public feeds: used selectively for larger events.
3.2 — Typical timeline
1) Detection: alert fires or operator observes an anomaly.
2) Triage: severity assigned, initial scope determined.
3) Containment: temporary mitigations applied.
4) Recovery: permanent fix rolled out and verified.
5) Review: incident logged and postmortem drafted.
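The five stages above can be modeled as an ordered set of phases. This sketch assumes a strictly linear progression, which is a simplification: a real incident may return to triage if the scope changes. The `Phase` and `next_phase` names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    CONTAINMENT = 3
    RECOVERY = 4
    REVIEW = 5


def next_phase(current: Phase) -> Phase:
    """Advance to the following phase; REVIEW is terminal."""
    if current is Phase.REVIEW:
        raise ValueError("REVIEW is the final phase")
    return Phase(current.value + 1)
```

Tracking the current phase explicitly makes it harder to skip straight from detection to a fix without assigning a severity first.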
4 • Symptom playbooks
Common incident patterns & first actions
These patterns cover most incidents you’ll see on Devnet-0. Use them as a starting point while we evolve more detailed runbooks.
4.1 — Chain not finalizing / blocks stalled
• Confirm stall via explorer and multiple RPC endpoints.
• Check for consensus-related errors in node logs.
• Verify you're on the canonical chain (hash / spec).
• Treat as Sev-1 until downgraded by core team.
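A stall check over sampled finalized heights might look like the sketch below. It assumes you poll the finalized height yourself and pass in `(unix_time, finalized_height)` pairs; the 60-second `max_age_s` default is an arbitrary placeholder, not a Devnet-0 parameter.

```python
def finality_stalled(samples: list[tuple[float, int]], max_age_s: float = 60.0) -> bool:
    """samples: (unix_time, finalized_height) pairs, oldest first.

    Returns True when the finalized height has not advanced for at least
    max_age_s seconds, measured up to the most recent sample.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    latest_time, latest_height = samples[-1]
    # Walk backwards to find when the current height was first observed.
    first_seen = latest_time
    for t, h in reversed(samples):
        if h != latest_height:
            break
        first_seen = t
    return latest_time - first_seen >= max_age_s
```

Run the same check against samples from more than one endpoint before reporting a Sev-1, since a single frozen RPC can look identical to a stalled chain.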
4.2 — Node stuck or constantly behind
• Compare height with reference nodes and explorer.
• Check disk, CPU, RAM, and network saturation.
• Review logs for repeated I/O or DB-related errors.
• If localized, treat as Sev-2/3 and consider a rebuild or snapshot restore.
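To tell a node that is catching up from one that is stuck, it helps to look at how the height lag changes over time. The sketch below assumes you record the lag (reference height minus local height) at regular intervals; the labels and the `tolerance` default of 2 blocks are illustrative.

```python
def sync_trend(lags: list[int], tolerance: int = 2) -> str:
    """Classify a series of height lags (reference minus local), oldest first."""
    if not lags:
        raise ValueError("need at least one lag sample")
    if max(lags) <= tolerance:
        return "in sync"
    # The lag shrank between the first and last sample: the node is recovering.
    if len(lags) >= 2 and lags[-1] < lags[0]:
        return "catching up"
    # Flat or growing lag: check resources and logs, consider a snapshot restore.
    return "not recovering"
```

A "not recovering" result is the cue to look at disk, CPU, RAM, and network saturation before deciding on a rebuild.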
4.3 — RPC / wallet issues
• Test multiple methods (health, system, chain RPCs).
• Determine if issue is specific to one RPC or region.
• Check for CORS, rate limiting, or TLS errors.
• Coordinate with other operators to confirm scope.
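Once you have probe results from several endpoints, scoping the problem is a matter of counting which endpoints fail. This sketch assumes you collect the results yourself as a mapping of endpoint name to per-method success flags; the `failure_scope` helper and its return labels are hypothetical.

```python
def failure_scope(results: dict[str, dict[str, bool]]) -> str:
    """results maps endpoint -> {method name: succeeded?}.

    Returns "none", "single endpoint", "multiple endpoints", or "all endpoints".
    """
    # An endpoint counts as failing if any probed method failed on it.
    failing = [ep for ep, methods in results.items() if not all(methods.values())]
    if not failing:
        return "none"
    if len(failing) == len(results) and len(results) > 1:
        return "all endpoints"
    if len(failing) == 1:
        return "single endpoint"
    return "multiple endpoints"
```

With only one endpoint probed, a failure is reported as "single endpoint" because there is no second vantage point to confirm a wider problem.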
More detailed flows live in the Rollback & Recovery doc, especially for controlled rollbacks and version pinning.
5 • Rollback & recovery
When we decide to roll back or pin a version
Most incidents resolve forward — but sometimes the safest move is to roll back to a known-good runtime or node binary.
5.1 — When rollback is on the table
• New runtime causes consensus or finality instability.
• Node version introduces severe performance regression.
• Data corruption suspected after a specific upgrade.
• Recovery forward would be slower / riskier than rollback.
5.2 — Follow the rollback runbook
• Read the Rollback & Recovery doc before you attempt any coordinated rollback.
• Never roll back alone if the rest of the network is moving forward.
• Always record versions, hashes, and timing of rollback steps.
• After recovery, verify metrics and chain state match expectations.
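Recording versions, hashes, and timing can be as simple as an append-only log. The sketch below is one possible shape, not the format used in the Rollback & Recovery doc: `RollbackLog`, its fields, and the choice of SHA-256 for artifact hashes are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a binary or runtime blob read from disk."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RollbackLog:
    """Append-only record of rollback steps (hypothetical structure)."""
    from_version: str
    to_version: str
    entries: list[dict] = field(default_factory=list)

    def record(self, step: str, artifact_hash: str) -> None:
        """Append one step with a UTC timestamp and the hash of what was deployed."""
        self.entries.append({
            "at_utc": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "artifact_hash": artifact_hash,
        })
```

A log like this doubles as input for the post-incident record, since it already captures the concrete actions taken and when.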
6 • Post-incident review
Turn every incident into an upgrade
The goal is not zero incidents; it’s zero repeat incidents with the same root cause.
6.1 — Minimum incident record
• Short description, date, and severity.
• Impacted components (validators, RPCs, explorers, users).
• Root cause (once understood) and triggering conditions.
• Concrete actions taken during response and recovery.
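The minimum record maps naturally onto a small data structure. This is a sketch with hypothetical field names; root cause and triggering conditions are optional because they are only filled in once understood.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentRecord:
    description: str
    date: str                       # e.g. an ISO 8601 date string
    severity: str                   # "Sev-1", "Sev-2", or "Sev-3"
    impacted_components: list[str]  # validators, RPCs, explorers, users
    actions_taken: list[str]        # concrete response and recovery steps
    root_cause: Optional[str] = None             # filled in once understood
    triggering_conditions: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Names of required fields that are still empty."""
        required = {
            "description": self.description,
            "date": self.date,
            "severity": self.severity,
            "impacted_components": self.impacted_components,
            "actions_taken": self.actions_taken,
        }
        return [name for name, value in required.items() if not value]
```

Calling `missing_fields()` before closing an incident gives a quick check that the record meets the minimum, even while the root cause is still under investigation.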
6.2 — Follow-up actions
• Update runbooks and checklists with lessons learned.
• Adjust alerts, dashboards, and thresholds if needed.
• File or update issues in the relevant code repositories.
• Share a summary with the validator / ops community.
For longer-lived incidents, we write the postmortem in a shared format based on the Postmortem template.
Next steps
Practice now, so mainnet incidents feel familiar
If Devnet-0 incidents feel routine rather than chaotic, you're well prepared for public testnet and mainnet.
Keep this page alongside the Validator runbook, Security checklist, and Rollback & Recovery docs. Together they form the core of Quantara's operational handbook for early networks.
The strongest Devnet-0 operators are the ones who treat every incident as a chance to upgrade their systems, tooling, and habits. That mindset is exactly what we're building Quantara with.