This runbook focuses on what to check first during an incident, not every implementation detail. The goal is to speed up diagnosis and separate operational issues from feature development work.
Minimum operational routine
Before assuming a complex incident, make these three checks a habit:
- Run npm run env:check in the relevant environment.
- Check GET /api/health.
- Match the user symptom against the features that are actually enabled in your product.
Incident playbooks
Health check shows degraded
Check first:
- which features are enabled but still have missing_env,
- whether the database check is failing while admin or payments are active,
- whether the feature is actually supposed to be off and only the env or copy was not adjusted.
Escalate when:
- env values look correct but the database still reports ok: false,
- the degraded status appears without a clear configuration change.
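The first two checks above can be sketched as a small triage helper. The response shape (status, database.ok, features with enabled and missing_env) and the function name triageHealth are assumptions made for illustration; compare them with what GET /api/health actually returns before relying on this.

```typescript
// Hypothetical shape of the GET /api/health response; field names are assumptions.
type FeatureStatus = { enabled: boolean; missing_env: string[] };
type HealthReport = {
  status: "ok" | "degraded";
  database: { ok: boolean };
  features: Record<string, FeatureStatus>;
};

// Returns human-readable findings in the order the checklist suggests.
function triageHealth(report: HealthReport): string[] {
  const findings: string[] = [];
  // 1. Features that are enabled but still have missing env values.
  for (const [name, feature] of Object.entries(report.features)) {
    if (feature.enabled && feature.missing_env.length > 0) {
      findings.push(`${name}: enabled but missing ${feature.missing_env.join(", ")}`);
    }
  }
  // 2. Database check failing while admin or payments are active.
  const needsDb = ["admin", "payments"].filter((n) => report.features[n]?.enabled);
  if (!report.database.ok && needsDb.length > 0) {
    findings.push(`database check failing while ${needsDb.join(" and ")} active`);
  }
  return findings;
}
```

An empty result on a degraded report suggests the third check: a feature that was meant to be off, with only env or copy left unadjusted.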
Users cannot log in or reach the dashboard
Check first:
- Supabase public env,
- auth redirect URLs,
- whether auth was intentionally turned off,
- whether the problem only happens after email verify, OAuth, or password reset.
Also check:
- /auth/error,
- /auth/confirm,
- the Supabase Auth dashboard.
Payment succeeded in the gateway but the subscription is still inactive
Check first:
- the provider webhook URL,
- signature verification,
- the payments table,
- the subscriptions table,
- the webhook_events table,
- the admin audit trail.
Webhooks are duplicated or keep retrying
Check first:
- whether the same event is already recorded in webhook_events,
- whether the latest status is processing, processed, or failed,
- whether the provider is retrying because the app did not return a stable 200.
Keep in mind:
- the repo already uses an idempotent event claim flow,
- duplicate delivery does not automatically mean duplicate payments.
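To see why a duplicate delivery need not produce a duplicate payment, here is a minimal sketch of an idempotent claim flow. It is not the repo's implementation: the in-memory Map stands in for the webhook_events table, and claimEvent and handleDelivery are hypothetical names. Answering a duplicate with 200 is also an assumption about how the app acknowledges events it has already claimed.

```typescript
type EventStatus = "processing" | "processed" | "failed";

// In-memory stand-in for the webhook_events table. A real flow would rely on a
// unique constraint on the event id so only one delivery can claim it.
const webhookEvents = new Map<string, EventStatus>();

// Returns true only for the delivery that should run the handler.
function claimEvent(eventId: string): boolean {
  const status = webhookEvents.get(eventId);
  if (status === "processing" || status === "processed") return false; // duplicate
  webhookEvents.set(eventId, "processing"); // new event, or a retry after failure
  return true;
}

// Returns the HTTP status the webhook route would answer with.
function handleDelivery(eventId: string, apply: () => void): number {
  if (!claimEvent(eventId)) return 200; // acknowledge so the provider stops retrying
  try {
    apply();
    webhookEvents.set(eventId, "processed");
    return 200;
  } catch {
    webhookEvents.set(eventId, "failed");
    return 500; // the provider retries, and the failed event can be claimed again
  }
}
```

In this sketch a row stuck in processing means a handler started but never finished, which is worth checking before assuming the provider is at fault.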
Contact form is not sending email
Check first:
- RESEND_API_KEY,
- EMAIL_FROM,
- sender domain verification in Resend,
- whether CONTACT_EMAIL points to the right inbox,
- whether the error looks like 429, 503, or 500.
Keep in mind:
- 503 means the feature is not ready,
- 429 means rate limiting,
- 500 means provider-side send failure.
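The status codes above map onto a first check fairly directly. The helper below is only a sketch: contactFormHint is a hypothetical name, and pairing each code with a specific item from the checklist is an inference from this runbook, not documented behaviour of the route.

```typescript
// Maps a contact-form HTTP status to the first thing worth checking.
function contactFormHint(status: number): string {
  switch (status) {
    case 503:
      return "feature not ready: check RESEND_API_KEY, EMAIL_FROM and CONTACT_EMAIL";
    case 429:
      return "rate limited: wait, then review how often the form is being submitted";
    case 500:
      return "provider-side send failure: check sender domain verification in Resend";
    default:
      return "not one of the documented contact-form failures: check application logs";
  }
}
```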
Admin panel is missing or the user role is wrong
Check first:
- NEXT_PUBLIC_ENABLE_ADMIN,
- SUPABASE_SERVICE_ROLE_KEY,
- the user_roles table,
- whether the bootstrap email in ADMIN_EMAILS has logged in at least once.
Keep in mind:
- the source of truth for admin access is user_roles,
- ADMIN_EMAILS is only an early bootstrap mechanism, not the permanent role system.
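The relationship between ADMIN_EMAILS and user_roles can be sketched as follows. This is an illustration under assumptions, not the repo's code: it assumes ADMIN_EMAILS is a comma-separated list and that the admin row is seeded at login, and all function names are hypothetical. A Map stands in for the user_roles table.

```typescript
type Role = "admin" | "user";

// Parses an ADMIN_EMAILS-style value (assumed to be comma-separated).
function parseAdminEmails(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((email) => email.trim().toLowerCase())
      .filter((email) => email.length > 0),
  );
}

// On login: seed an admin row for a bootstrap email that has no role row yet.
function bootstrapRoleOnLogin(
  userId: string,
  email: string,
  roleRows: Map<string, Role>, // stand-in for the user_roles table
  adminEmails: Set<string>,
): void {
  if (!roleRows.has(userId) && adminEmails.has(email.toLowerCase())) {
    roleRows.set(userId, "admin");
  }
}

// Access decisions read user_roles only; ADMIN_EMAILS is never consulted here.
function isAdmin(userId: string, roleRows: Map<string, Role>): boolean {
  return roleRows.get(userId) === "admin";
}
```

Under these assumptions, a bootstrap email that has never logged in has no row in user_roles, so the admin panel stays hidden even though the env value looks correct.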
AI route returns 429 or 503
Check first:
- the provider key matching AI_DEFAULT_PROVIDER,
- whether the user is logged in,
- the user’s current plan,
- monthly usage in ai_usage,
- the per-5-minute request limit.
Keep in mind:
- 503 usually means the provider is not configured,
- 429 can mean request throttling or exhausted monthly quota.
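Because 429 has two possible causes, it helps to picture the order in which the checks might run. The sketch below is an assumption-heavy illustration: checkAiRequest, its input fields, the reason strings, and the order of the checks are all hypothetical, and the limits are passed in rather than taken from the project's real configuration.

```typescript
const WINDOW_MS = 5 * 60 * 1000; // the per-5-minute window

type AiDecision =
  | { status: 200 }
  | { status: 429; reason: "request_throttled" | "monthly_quota_exhausted" }
  | { status: 503; reason: "provider_not_configured" };

function checkAiRequest(input: {
  providerKey: string | undefined; // key matching AI_DEFAULT_PROVIDER
  recentRequestTimes: number[]; // epoch ms of the user's recent requests
  monthlyUsage: number; // usage already recorded in ai_usage this month
  monthlyQuota: number; // allowance for the user's current plan
  maxPerWindow: number; // per-5-minute request limit
  now: number; // current time in epoch ms
}): AiDecision {
  if (!input.providerKey) {
    return { status: 503, reason: "provider_not_configured" };
  }
  const inWindow = input.recentRequestTimes.filter((t) => input.now - t < WINDOW_MS).length;
  if (inWindow >= input.maxPerWindow) {
    return { status: 429, reason: "request_throttled" };
  }
  if (input.monthlyUsage >= input.monthlyQuota) {
    return { status: 429, reason: "monthly_quota_exhausted" };
  }
  return { status: 200 };
}
```

A throttled user recovers within minutes, while an exhausted quota persists until the month rolls over or the plan changes, so checking ai_usage separates the two cases quickly.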
Avatar upload fails or does not update
Check first:
- the user is authenticated,
- the file is under 2 MB,
- the file type is allowed,
- the avatars bucket and storage policies exist,
- POST /api/profile/avatar returns a signed upload URL correctly.
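The size and type checks can be reproduced locally before suspecting the bucket or its policies. This is a minimal sketch: the allow-list of MIME types is an assumption (the runbook does not name the allowed types), "under 2 MB" is read here as 2 * 1024 * 1024 bytes, and validateAvatar is a hypothetical name.

```typescript
// "Under 2 MB" interpreted as 2 MiB; confirm against the route handler.
const MAX_AVATAR_BYTES = 2 * 1024 * 1024;
// Assumed allow-list; the real list lives in the application code.
const ALLOWED_AVATAR_TYPES = ["image/jpeg", "image/png", "image/webp"];

// Returns an error message, or null when the file passes both checks.
function validateAvatar(file: { size: number; type: string }): string | null {
  if (file.size >= MAX_AVATAR_BYTES) {
    return "file must be under 2 MB";
  }
  if (!ALLOWED_AVATAR_TYPES.includes(file.type)) {
    return `file type ${file.type} is not allowed`;
  }
  return null;
}
```

If a file passes both checks and the upload still fails, the remaining suspects are authentication, the avatars bucket and its storage policies, and the signed upload URL returned by POST /api/profile/avatar.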
When an issue is still operational, not a new bug
The problem is often still operational, not a new bug, when:
- env values changed recently,
- provider dashboards were just switched to the production domain,
- a feature was just turned on or off,
- copy or navigation still points to flows that are not actually active.
When to escalate directly to a developer
Bring in a developer faster when:
- the production build fails,
- the database check fails without env changes,
- webhooks keep failing for valid events,
- admin or profile writes break even though the service role looks correct,
- users report data that is no longer consistent across tables.