QUANTARA • QUANTUM-RESISTANT L1

Incident response on Quantara Devnet-0

How we classify incidents, communicate with validators, and move from detection to containment, recovery, and postmortem.

A shared playbook for validators, infra operators, and builders when something looks wrong on Devnet-0 — from the first alert to the final postmortem.

Devnet-0 is where we discover and fix problems early. Incidents are expected. What matters is that we detect them quickly, communicate clearly, and recover in a controlled way.

This document defines how Quantara classifies incidents, how we expect validators and operators to respond, and how this playbook fits with the Rollback & Recovery doc and the Security checklist.

Treat this as the operational counterpart to the Devnet-0 overview and Validator runbook — those explain the steady state; this one covers “when things go sideways.”

Devnet-0 • Incident response • QTR • 12 decimals • SS58=73

When in doubt, assume it might be a real incident. Capture evidence, check /status, and share what you're seeing — even if it later turns out to be benign.

Current network: Devnet-0

Devnet-0 is a non-financial rehearsal network. We intentionally push new runtimes, node versions, and tooling to find issues before public testnet and mainnet.

• RPC: wss://rpc.devnet-0.quantara.xyz

• Explorer: https://explorer.devnet-0.quantara.xyz

Last updated: 2025-11-23 22:00 UTC. The Status page is the canonical source for live incident state, maintenance windows, and advisories.

1 • Classification

How we classify incidents on Devnet-0

A shared severity language keeps everyone aligned on urgency and expectations — even on a devnet.

Sev-1 — Network-wide impact

• Finality stalled or blocks not being produced.
• Majority of validators unable to stay in sync.
• Critical consensus or runtime bug suspected.
• Requires coordinated response from core team.

Sev-2 — Partial degradation

• Some validators / RPCs degraded, but chain healthy.
• Performance issues (high latency, resource spikes).
• Non-critical bugs impacting a subset of operators.
• Workarounds exist, but we still want a fix.

Sev-3 — Local / cosmetic

• Single validator down or misconfigured.
• Explorer or wallet UI glitch with no chain impact.
• Intermittent RPC errors that self-resolve quickly.
• Good candidates for GitHub issues and follow-up.
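
The three levels above can be turned into a rough first-pass suggestion. The sketch below is only an illustration: the Observation fields and the exact cut-offs (for example, "more than half of validators" for Sev-1) are assumptions drawn from the bullets, not an official Quantara rule set.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # network-wide impact
    SEV2 = 2  # partial degradation
    SEV3 = 3  # local / cosmetic


@dataclass
class Observation:
    # Hypothetical fields that mirror the bullets above.
    finality_stalled: bool = False
    blocks_stalled: bool = False
    consensus_bug_suspected: bool = False
    validators_out_of_sync: int = 0
    validators_total: int = 0
    rpcs_persistently_degraded: int = 0


def suggest_severity(obs: Observation) -> Severity:
    """Return an initial severity suggestion; the core team may change it."""
    majority_out_of_sync = (
        obs.validators_total > 0
        and obs.validators_out_of_sync * 2 > obs.validators_total
    )
    if (
        obs.finality_stalled
        or obs.blocks_stalled
        or obs.consensus_bug_suspected
        or majority_out_of_sync
    ):
        return Severity.SEV1
    # More than one validator, or any persistently degraded RPC.
    if obs.validators_out_of_sync > 1 or obs.rpcs_persistently_degraded > 0:
        return Severity.SEV2
    # A single validator, a UI glitch, or self-resolving errors.
    return Severity.SEV3
```

The output is meant as the "initial severity" an operator proposes in a report; it does not replace triage by the core team.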

2 • First response

First five minutes when you see something weird

Whether you’re a validator or running supporting infra, these are the steps we expect you to take before diving into deep debugging.

2.1 — Confirm the signal

• Check if /status reports an incident or maintenance.
• Compare your block height with reference RPC / explorer.
• Verify that alerts are not just firing from noisy rules.
• Check a second vantage point (another node, region, or tool).
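
As a minimal sketch of the height comparison, the helper below measures how far a local node trails a set of reference heights. It takes the heights as plain integers, so how you fetch them (reference RPC, explorer, another node) is left open. The 5-block tolerance is an assumed placeholder, not a Devnet-0 parameter.

```python
from statistics import median_high
from typing import List


def height_lag(local_height: int, reference_heights: List[int]) -> int:
    """Blocks by which the local node trails the median reference height."""
    if not reference_heights:
        raise ValueError("need at least one reference height")
    # The median guards against a single stale or runaway reference.
    return median_high(reference_heights) - local_height


def looks_behind(
    local_height: int,
    reference_heights: List[int],
    tolerance_blocks: int = 5,  # assumed tolerance; tune for your setup
) -> bool:
    return height_lag(local_height, reference_heights) > tolerance_blocks
```

Using several references rather than one follows the "second vantage point" step: if the references disagree with each other, that is itself worth reporting.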

2.2 — Capture evidence

• Note current time, block height, and peer count.
• Grab a short log snippet around the first error.
• Capture key metrics (CPU, RAM, disk, network).
• Save any error messages exactly as they appear.
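
For the log snippet, one possible approach is to keep a few lines of context on either side of the first error. The sketch below assumes log lines are already loaded as strings and that errors carry the marker "ERROR"; your node's log format may differ, so treat the marker as a parameter to adjust.

```python
from typing import List


def snippet_around_first_error(
    lines: List[str],
    marker: str = "ERROR",  # assumed marker; adjust to your log format
    before: int = 3,
    after: int = 5,
) -> List[str]:
    """Return the first line containing `marker` plus surrounding context."""
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - before)
            return lines[start : index + after + 1]
    return []  # no matching line found
```

The lines are returned unmodified, which keeps error messages exactly as they appeared.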

2.3 — Share a concise report

• Post in the validator / ops channel with your findings.
• Include time, node role (validator / RPC / sentry), and region.
• Attach logs / metrics or links where safe to do so.
• Suggest your initial severity (Sev-1/2/3) if you can.

A good first incident message answers: “what changed, when, where, and how bad does it look?” Perfection is not required — clarity is.
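
A report with these fields could be assembled as shown below. This is a sketch, not a required format: the FirstReport field names and the message layout are hypothetical, chosen only to cover time, node role, region, height, peers, and a suggested severity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class FirstReport:
    observed_at: datetime          # timezone-aware timestamp
    node_role: str                 # "validator", "rpc", or "sentry"
    region: str
    block_height: int
    peer_count: int
    summary: str                   # what changed and how bad it looks
    suggested_severity: Optional[str] = None  # "Sev-1", "Sev-2", "Sev-3"
    evidence_links: List[str] = field(default_factory=list)


def format_report(report: FirstReport) -> str:
    """Render a short plain-text message for the validator / ops channel."""
    severity = report.suggested_severity or "Sev-?"
    when = report.observed_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        f"[{severity}] {report.summary}",
        f"time: {when}",
        f"node: {report.node_role} ({report.region})",
        f"height: {report.block_height}, peers: {report.peer_count}",
    ]
    if report.evidence_links:
        lines.append("evidence: " + ", ".join(report.evidence_links))
    return "\n".join(lines)
```

Leaving the severity as "Sev-?" when unsure keeps the message honest while still getting it posted quickly.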

3 • Communication

Who says what, where, and when

Even on a devnet, we want predictable communication patterns — both inside Quantara and across the validator set.

3.1 — Channels & sources of truth

• /status — canonical incident state, timelines, and summaries.
• Validator / ops channels — real-time coordination and updates.
• Docs — long-lived guidance, updated post-incident.
• Social / public feeds — used selectively for larger events.

3.2 — Typical timeline

1) Detection — alert fires or operator observes an anomaly.
2) Triage — severity assigned, initial scope determined.
3) Containment — temporary mitigations applied.
4) Recovery — permanent fix rolled out and verified.
5) Review — incident logged and postmortem drafted.
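
The five phases can be tracked as a simple ordered log with timestamps. The sketch below enforces strict forward order, which is a simplifying assumption; a real incident may loop back from recovery to containment, and the class names here are illustrative only.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    CONTAINMENT = 3
    RECOVERY = 4
    REVIEW = 5


class IncidentTimeline:
    """Records when each phase was entered, in the order listed above."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Phase, datetime]] = []

    @property
    def current(self) -> Optional[Phase]:
        return self.entries[-1][0] if self.entries else None

    def enter(self, phase: Phase, at: datetime) -> None:
        last = self.current
        expected_value = 1 if last is None else last.value + 1
        if phase.value != expected_value:
            raise ValueError(f"cannot enter {phase.name} after {last}")
        self.entries.append((phase, at))
```

The recorded timestamps are the raw material for the timeline section of a later postmortem.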

4 • Symptom playbooks

Common incident patterns & first actions

These patterns cover most incidents you’ll see on Devnet-0. Use them as a starting point while we evolve more detailed runbooks.

4.1 — Chain not finalizing / blocks stalled

• Confirm stall via explorer and multiple RPC endpoints.
• Check for consensus-related errors in node logs.
• Verify you're on the canonical chain (hash / spec).
• Treat as Sev-1 until downgraded by core team.
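
To tell "blocks stalled" apart from "finality stalled", it helps to compare two samples of best and finalized height taken some time apart. The sketch below assumes you collect those samples yourself; the 60-second minimum window is a placeholder, since this document does not state Devnet-0's block time.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    at_seconds: float  # monotonic or epoch seconds when the sample was taken
    best: int          # best (head) block height
    finalized: int     # finalized block height


def diagnose(samples: List[Sample], min_window_seconds: float = 60.0) -> str:
    """Compare the first and last sample and describe chain progress."""
    if len(samples) < 2:
        return "insufficient-data"
    first, last = samples[0], samples[-1]
    if last.at_seconds - first.at_seconds < min_window_seconds:
        return "insufficient-data"
    if last.best <= first.best:
        return "blocks-stalled"
    if last.finalized <= first.finalized:
        return "finality-stalled"
    return "progressing"
```

Running the same check against more than one endpoint matches the first bullet: a stall seen from a single node may be local rather than network-wide.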

4.2 — Node stuck or constantly behind

• Compare height with reference nodes and explorer.
• Check disk, CPU, RAM, and network saturation.
• Review logs for repeated I/O or DB-related errors.
• If localized, treat as Sev-2/3 and consider rebuild or snapshot restore.
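
A quick way to make the resource check repeatable is to compare utilization figures against fixed limits. The sketch below takes percentages from whatever monitoring you already run; the threshold values are assumptions for illustration, not Quantara recommendations.

```python
from typing import Dict, List, Optional

# Hypothetical utilization thresholds in percent; tune them for your hardware.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "cpu": 90.0,
    "ram": 90.0,
    "disk": 85.0,
    "network": 90.0,
}


def saturated_resources(
    utilization: Dict[str, float],
    thresholds: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return the names of resources at or above their threshold, sorted."""
    limits = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    return sorted(
        name
        for name, limit in limits.items()
        if utilization.get(name, 0.0) >= limit  # missing metrics count as 0
    )
```

An empty result alongside a growing height lag points away from local hardware and toward the log review in the third bullet.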

4.3 — RPC / wallet issues

• Test multiple methods (health, system, chain RPCs).
• Determine if issue is specific to one RPC or region.
• Check for CORS, rate limiting, or TLS errors.
• Coordinate with other operators to confirm scope.
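
The sketch below shows one way to probe several methods and then decide whether a failure is isolated to one endpoint. The method names follow the Substrate JSON-RPC convention, which is an assumption here (the SS58 prefix hints at it, but confirm against your node). Endpoint labels and the result format are hypothetical; sending the requests is left to your own client.

```python
import json
from typing import Dict, List, Tuple

# Assumed method names in the Substrate JSON-RPC style; verify on your node.
PROBE_METHODS = ["system_health", "system_version", "chain_getHeader"]


def build_request(method: str, request_id: int) -> str:
    """Serialize a parameterless JSON-RPC 2.0 request."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": []}
    )


def classify_scope(results: Dict[str, Dict[str, bool]]) -> Tuple[str, List[str]]:
    """`results` maps an endpoint label to {method: succeeded}."""
    if not results:
        raise ValueError("probe at least one endpoint")
    failing = sorted(
        endpoint
        for endpoint, methods in results.items()
        if not all(methods.values())
    )
    if not failing:
        return "healthy", failing
    if len(failing) == len(results):
        return "all-endpoints", failing
    return "isolated", failing
```

With only one endpoint probed, a failure always reads as "all-endpoints", so the classification is only meaningful with results from two or more endpoints or regions.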

More detailed flows live in the Rollback & Recovery doc, especially for controlled rollbacks and version pinning.

5 • Rollback & recovery

When we decide to roll back or pin a version

Most incidents resolve forward — but sometimes the safest move is to roll back to a known-good runtime or node binary.

5.1 — When rollback is on the table

• New runtime causes consensus or finality instability.
• Node version introduces severe performance regression.
• Data corruption suspected after a specific upgrade.
• Rolling forward would be slower or riskier than rolling back.

5.2 — Follow the rollback runbook

• Read the Rollback & Recovery doc before you attempt any coordinated rollback.
• Never roll back alone if the rest of the network is moving forward.
• Always record versions, hashes, and timing of rollback steps.
• After recovery, verify metrics and chain state match expectations.
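
Recording versions, hashes, and timing is easier with a fixed record shape. The sketch below is one possible layout, not the format used by the Rollback & Recovery doc: the RollbackStep fields are hypothetical, and version strings are whatever your binaries or runtimes report.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class RollbackStep:
    at: datetime           # timezone-aware time the step was performed
    action: str            # free-text description of the step
    from_version: str
    to_version: str
    artifact_sha256: str   # hash of the binary or runtime blob installed


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def to_jsonl(steps: List[RollbackStep]) -> str:
    """Serialize steps as one JSON object per line, timestamps in UTC."""
    rows = []
    for step in steps:
        row = asdict(step)
        row["at"] = step.at.astimezone(timezone.utc).isoformat()
        rows.append(json.dumps(row, sort_keys=True))
    return "\n".join(rows)
```

One line per step keeps the log easy to append to during the rollback and easy to attach to the incident record afterwards.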

6 • Post-incident review

Turn every incident into an upgrade

The goal is not zero incidents; it’s zero repeat incidents with the same root cause.

6.1 — Minimum incident record

• Short description, date, and severity.
• Impacted components (validators, RPCs, explorers, users).
• Root cause (once understood) and triggering conditions.
• Concrete actions taken during response and recovery.
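
The minimum record can be checked for completeness before an incident is closed. The sketch below assumes a simple in-memory structure; the field names are hypothetical and map one-to-one onto the bullets above, with the root cause left optional until it is understood.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentRecord:
    description: str
    date: str                       # for example "2025-11-23"
    severity: str                   # "Sev-1", "Sev-2", or "Sev-3"
    impacted_components: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None            # filled in once understood
    triggering_conditions: Optional[str] = None
    actions_taken: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of required fields that are still empty."""
        required = (
            "description",
            "date",
            "severity",
            "impacted_components",
            "actions_taken",
        )
        return [name for name in required if not getattr(self, name)]

    def is_closed_out(self) -> bool:
        """True once required fields are filled and a root cause is recorded."""
        return not self.missing_fields() and bool(self.root_cause)
```

Separating "missing fields" from "closed out" lets a record be filed during the incident and completed after the root cause is known.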

6.2 — Follow-up actions

• Update runbooks and checklists with lessons learned.
• Adjust alerts, dashboards, and thresholds if needed.
• File or update issues in the relevant code repositories.
• Share a summary with the validator / ops community.

For longer-lived incidents, we align on a shared postmortem format inspired by the Postmortem template.

Next steps

Practice now, so mainnet incidents feel familiar

If Devnet-0 incidents feel routine — not chaotic — you’re in the right place for public testnet and mainnet.

Keep this page close to the Validator runbook, Security checklist and Rollback & Recovery docs. Together they form the core of Quantara's operational handbook for early networks.

The strongest Devnet-0 operators are the ones who treat every incident as a chance to upgrade their systems, tooling, and habits. That mindset is exactly what we're building Quantara with.