# Incident Response Runbook

> **SSoT reference**: Notification routing (severity / channel / mention / format) is defined in
> [notification-strategy.md](notification-strategy.md). This runbook covers only
> the procedure when a P0/P1 fires (triage + first response + escalation).
>
> Keep severity levels consistent with notification-strategy.md:
> - SEV1 ≡ P0 (Critical, respond within 15 minutes, `#parky-alerts` + `@here`)
> - SEV2 ≡ P1 (High, respond within 4 hours, `#parky-ops`)
> - SEV3 ≡ P2 (Info, next business day, `#parky-deploys`)

Triage and first-response procedures for incidents in Parky's production and development environments.

## Triage

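1. **Identify the blast radius**
   - Target: which of `parky.co.jp` / `api.parky.co.jp` / `admin.parky.co.jp` / `owner.parky.co.jp` / `dev-*` is affected
   - Symptom: full outage / partial loss of functionality / response latency / elevated error rate
   - Record the start time (JST)
2. **Determine severity**
   - **SEV1**: full production outage, payment errors, possible data corruption
   - **SEV2**: partial production outage, some users affected
   - **SEV3**: dev environment failure, monitoring alerts only
3. **Escalate SEV1/2 immediately** (Slack #parky-ops / the responsible person's Discord)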
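
The first triage step can be partly scripted. The sketch below assumes that each production hostname answers an HTTPS GET on `/` (a dedicated health path, if one exists, would be a better target) and prints an HTTP status and total latency per host next to a JST timestamp. The function names `jst_now`, `probe`, and `probe_all` are illustrative and not part of any existing tooling.

```bash
# Sketch only: snapshot of the production hostnames for the incident timeline.
# Assumption: "/" is a meaningful path to probe on every host.

# Current time in JST. The POSIX TZ string "JST-9" means UTC+9 and needs no tzdata.
jst_now() {
  TZ=JST-9 date '+%Y-%m-%d %H:%M:%S %Z'
}

# Print "<host> <http_code> <seconds>". curl reports 000 when it cannot connect.
probe() {
  local host result
  host=$1
  result=$(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{time_total}' "https://${host}/") || true
  echo "${host} ${result:-000 0}"
}

# Probe every production hostname listed in the triage step.
probe_all() {
  local host
  echo "probe started at $(jst_now)"
  for host in parky.co.jp api.parky.co.jp admin.parky.co.jp owner.parky.co.jp; do
    probe "$host"
  done
}
```

After sourcing the snippet, `probe_all` gives a one-screen snapshot that can be pasted into the incident thread. A `000` status means curl could not connect at all, which points toward a full outage rather than an application-level error.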

## First-response commands

### Cloudflare Workers logs (BFF `api/`)

```bash
cd parky/api
wrangler tail --env dev   # or --env prod
```

### GitHub Actions status

```bash
gh run list --limit 10
gh run view <run-id> --log-failed
```

### Supabase DB connectivity check

```bash
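# The single quotes defer expansion of $SUPABASE_DB_URL to the child shell,
# so psql sees the value that op run injected rather than the parent shell's value.
OP_SERVICE_ACCOUNT_TOKEN=$(cat ~/.op/sa_token.txt) op run -- \
  sh -c 'psql "$SUPABASE_DB_URL" -c "select 1"'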
```

### Cloudflare Pages deployment status

- Dashboard: https://dash.cloudflare.com/ → Pages → target project → Deployments

### When a notification may not have been delivered (check the DLQ)

In cases such as "no alert showed up in Discord" or "a Sentry alert rule fired but nobody noticed", check `admin.notification_failures` (the Layer 3 persistence target in [notification-strategy.md §7](./notification-strategy.md)).

```sql
-- Notifications that failed delivery in the last 24h (channel / severity / error message)
SELECT attempted_at, channel, severity, title, http_status, retry_count, error_message
  FROM admin.notification_failures
 WHERE attempted_at >= NOW() - INTERVAL '24 hours'
 ORDER BY attempted_at DESC
 LIMIT 50;

-- Check whether the email backup was sent for P0 (cases where Discord was down and the alert fell back to email)
SELECT attempted_at, title, fallback_email_sent, error_message
  FROM admin.notification_failures
 WHERE severity = 'P0' AND status = 'failed'
 ORDER BY attempted_at DESC;

-- When Discord itself may be down: aggregate by HTTP status
SELECT http_status, COUNT(*) AS n
  FROM admin.notification_failures
 WHERE attempted_at >= NOW() - INTERVAL '1 hour'
 GROUP BY http_status
 ORDER BY n DESC;
```

The DLQ keeps the full Discord payload as JSONB, so a manual replay is possible when needed (re-POST the payload to the webhook URL with curl). The weekly automated digest is posted to `#parky-insights` every Monday at 18:00 JST.
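
A manual replay could look roughly like the sketch below. It rests on several assumptions that this runbook does not confirm: that the JSONB column is named `payload`, that rows can be selected by an `id` column, and that `SUPABASE_DB_URL` and `DISCORD_WEBHOOK_URL` are set in the environment. Check the actual table definition before using it. `fetch_payload` and `replay_payload` are hypothetical helper names.

```bash
# Sketch only: re-POST one stored payload from the DLQ to a Discord webhook.
# Hypothetical names: column `payload`, column `id`, env var DISCORD_WEBHOOK_URL.

# Print the stored payload of one failure row as raw JSON.
# The query goes through stdin because psql does not interpolate :'id' with -c.
fetch_payload() {
  psql "$SUPABASE_DB_URL" -X -q -At -v ON_ERROR_STOP=1 -v id="$1" <<'SQL'
SELECT payload FROM admin.notification_failures WHERE id = :'id';
SQL
}

# Fetch the payload for one row and POST it to the webhook unchanged.
replay_payload() {
  local payload
  payload=$(fetch_payload "$1") || return 1
  if [ -z "$payload" ]; then
    echo "no payload found for id $1" >&2
    return 1
  fi
  # --fail-with-body (curl 7.76+) exits non-zero on HTTP >= 400 but still prints the error body.
  curl --fail-with-body -sS -H 'Content-Type: application/json' \
    --data-binary "$payload" "$DISCORD_WEBHOOK_URL"
}
```

Usage would be `replay_payload <id>`, with the id taken from the queries above once `id` is added to the select list. Replaying one row at a time keeps a still-failing webhook from being hit repeatedly.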

## Escalation path

1. First responder (whoever noticed) → post the situation in #parky-ops
2. For SEV1/2, page the tech lead immediately
3. If user impact continues → decide whether to notify stakeholders

## After the incident

- Postmortem: create `docs/ops/postmortems/YYYY-MM-DD-<title>.md` (template: [postmortem-template.md](./postmortem-template.md))
- Prevention: write AIs (action items) in SMART form, stating Owner / due date / verification method
- Update migrations / runbooks / monitoring

TODO:
- [x] Prepare the postmortem template: [postmortem-template.md](./postmortem-template.md)
- [ ] Document the on-call rotation (weekly rota / handover procedure)
- [ ] Evaluate PagerDuty / Opsgenie integration: run on the single Sentry → Slack hop for now, adopt once incident frequency rises

## Related docs

- [notification-strategy.md](./notification-strategy.md): notification strategy SSoT (Severity / channel / failure handling)
- [postmortem-template.md](./postmortem-template.md): standard postmortem template (Blameless / SMART)
- [slo-error-budget.md](./slo-error-budget.md): SLO / Error Budget definitions + burn rate alerts
- [sentry-setup.md](./sentry-setup.md): Sentry setup status + rollout steps for the remaining channels
- [opentelemetry-setup.md](./opentelemetry-setup.md): OTel + Honeycomb rollout steps
- [secret-rotation.md](./secret-rotation.md): secret rotation procedure
- [synthetic-healthcheck.md](./synthetic-healthcheck.md): external healthcheck (detects worker outages)