CAPTCHA outages don't make the front page. They're usually short and regional, and most operators don't notice until users start complaining that they can't log in. But the failure mode is uniquely painful: your auth pages, sign-up forms, and password resets stop working all at once, and your support inbox fills with people who think your site is broken.

If you've never thought about what happens when your CAPTCHA vendor returns a 5xx, this post is for you.

What "down" actually looks like

A CAPTCHA outage rarely means the vendor's homepage goes 503. It usually looks like one of:

| Symptom | What's actually happening |
| --- | --- |
| Widget loads but never returns a token | Backend challenge service is degraded |
| Widget script 404s or hangs | CDN edge issue in one region |
| Verify endpoint returns 500 / timeout | Verification service overloaded |
| Tokens come back but verify always rejects | Stale signing key, regional split-brain |
| Mobile SDK throws on init | App-key lookup service down |

The user-facing result is the same in every case: your form looks broken. Whether the vendor's status page admits to an incident or not is a separate question.

Fail-open vs fail-closed

The single most important decision you make about your CAPTCHA integration is what happens when the verify call fails. Two camps:

Fail-closed. If verification can't be reached, reject the request. Safer against bots, but if your vendor goes down, your sign-up form goes down with it.

Fail-open. If verification can't be reached, allow the request through (often with extra logging or a soft-flag for later review). Site keeps working, but for the duration of the outage you have no bot defense.

There is no universally right answer. A high-value action — payment, account creation with credit, password change — should usually fail-closed because letting bots through during the outage is worse than the brief downtime. A low-value action — newsletter signup, contact form — should usually fail-open because frustrating real users is worse than a few extra spam submissions.

The wrong move is to never decide. Most teams ship fail-closed by accident (they wrote if (!valid) return 403; and didn't think about the timeout case) and only discover their choice during the next outage.
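
One way to make the decision explicit is a per-action policy table that refuses to answer for an action nobody has classified. The sketch below is illustrative only: the action names and the helpers `failModeFor` and `allowWhenUnreachable` are hypothetical, and the open/closed assignments simply mirror the guidance above.

```typescript
type FailMode = 'open' | 'closed';

// Hypothetical action names; adjust to the forms you actually run.
const FAIL_MODES = new Map<string, FailMode>([
  ['payment', 'closed'],
  ['account_creation', 'closed'],
  ['password_change', 'closed'],
  ['newsletter_signup', 'open'],
  ['contact_form', 'open'],
]);

// Unknown actions throw, so a new form cannot ship without an explicit choice.
function failModeFor(action: string): FailMode {
  const mode = FAIL_MODES.get(action);
  if (mode === undefined) {
    throw new Error(`no fail mode configured for action "${action}"`);
  }
  return mode;
}

// Answers one question: if verification is unreachable, do we let this through?
function allowWhenUnreachable(action: string): boolean {
  return failModeFor(action) === 'open';
}
```

Throwing on an unclassified action is a design choice, not a requirement: it turns "we never decided" into a failing test rather than a surprise during an outage.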

Code: explicit timeout handling

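Here's a Node example that fails open with logging when verification times out, errors, or returns a 5xx, and fails closed when verification rejects:

ts
const VERIFY_TIMEOUT_MS = 3000;

type VerifyResult = { ok: boolean; reason: string };

async function verifyCaptcha(token: string, ip: string): Promise<VerifyResult> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), VERIFY_TIMEOUT_MS);

  try {
    const res = await fetch('https://apiv1.captcha.la/v1/challenge/verify', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-App-Key': process.env.APP_KEY!,
        'X-App-Secret': process.env.APP_SECRET!,
      },
      body: JSON.stringify({ token, client_ip: ip }),
      signal: ctrl.signal,
    });
    // A 5xx is an operational failure, not a verdict on the token
    if (res.status >= 500) throw new Error(`verify returned HTTP ${res.status}`);
    const j = await res.json();
    const valid = j.code === 0 && j.data?.valid === true;
    // The vendor answered: a "no" here is a real rejection, so fail closed
    return { ok: valid, reason: valid ? 'verified' : 'rejected' };
  } catch (e) {
    // Timeout, network error, 5xx, or unparseable body: fail open.
    // Log loudly so you know your vendor is having a bad day
    console.error('captcha_verify_unreachable', { err: String(e) });
    return { ok: true, reason: 'fail_open_on_error' };
  } finally {
    // Runs on every path, and keeps the timeout armed while the body is read
    clearTimeout(timer);
  }
}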

Key detail: distinguish "verification said no" from "verification couldn't be reached." A rejection is a real signal. A timeout is an operational event. They should be handled differently and logged differently.
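
As a rough illustration of that distinction, the three outcomes can be modelled as separate cases so that a rejection and an unreachable vendor never share a code path or a log event. The `VerifyOutcome` type, the `decide` function, the event names, and the choice of 503 for a fail-closed outage are all assumptions made for this sketch, not part of any vendor API.

```typescript
type FailMode = 'open' | 'closed';

type VerifyOutcome =
  | { kind: 'verified' }
  | { kind: 'rejected' }
  | { kind: 'unreachable'; error: string };

interface Decision {
  status: number;                        // HTTP status to return to the client
  logLevel: 'info' | 'warn' | 'error';
  event: string;                         // hypothetical log event name
}

function decide(outcome: VerifyOutcome, failMode: FailMode): Decision {
  switch (outcome.kind) {
    case 'verified':
      return { status: 200, logLevel: 'info', event: 'captcha_verified' };
    case 'rejected':
      // A real signal about the request: block it, log at a normal level.
      return { status: 403, logLevel: 'warn', event: 'captcha_rejected' };
    case 'unreachable':
      // An operational event: always log as an error, then apply the form's policy.
      return failMode === 'open'
        ? { status: 200, logLevel: 'error', event: 'captcha_unreachable_fail_open' }
        : { status: 503, logLevel: 'error', event: 'captcha_unreachable_fail_closed' };
  }
}
```

Returning 503 rather than 403 in the fail-closed case tells clients and dashboards that the request was not judged, only deferred.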

Detecting an incident before users do

A CAPTCHA outage looks identical to a bug in your integration from the outside. Both produce "users can't log in." Two signals help you tell them apart:

  1. Verify success rate. Track the rate of code === 0 && valid === true divided by all verify attempts, broken down by minute. A drop from ~95% to ~10% over 60 seconds is an incident, not a bug. If you don't have this metric you'll always be reactive.
  2. Vendor status page in your alerting. Subscribe to your CAPTCHA vendor's status RSS or webhook and route it into the same Slack channel as your own alerts. Most teams find out about vendor incidents from a customer email, which is far too slow.

CaptchaLa publishes incidents to a status RSS that you can wire into PagerDuty or Slack in two minutes. We'd rather you find out from us than from your users.
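
The verify success rate from the first signal can be tracked with very little code. Below is a minimal in-memory sketch, assuming a single process and per-minute buckets; `VerifyRateTracker`, `looksLikeIncident`, and the 0.5 drop threshold are hypothetical names and values chosen for illustration, and a real deployment would more likely emit these counts to an existing metrics system.

```typescript
const MINUTE_MS = 60_000;

class VerifyRateTracker {
  private buckets = new Map<number, { ok: number; total: number }>();
  private keepMinutes: number;

  constructor(keepMinutes: number = 60) {
    this.keepMinutes = keepMinutes;
  }

  // Call once per verify attempt. `success` means code === 0 && valid === true;
  // timeouts and errors count as attempts that did not succeed.
  record(success: boolean, nowMs: number = Date.now()): void {
    const minute = Math.floor(nowMs / MINUTE_MS);
    const bucket = this.buckets.get(minute) ?? { ok: 0, total: 0 };
    bucket.total += 1;
    if (success) bucket.ok += 1;
    this.buckets.set(minute, bucket);
    // Drop buckets that have aged out of the retention window.
    this.buckets.forEach((_value, key) => {
      if (key <= minute - this.keepMinutes) this.buckets.delete(key);
    });
  }

  // Success rate for one minute bucket, or null if there were no attempts.
  rate(minute: number): number | null {
    const bucket = this.buckets.get(minute);
    return bucket && bucket.total > 0 ? bucket.ok / bucket.total : null;
  }
}

// Flags a minute-over-minute drop of at least `minDrop` (as a fraction).
function looksLikeIncident(
  previous: number | null,
  current: number | null,
  minDrop: number = 0.5,
): boolean {
  if (previous === null || current === null) return false;
  return previous - current >= minDrop;
}
```

With these assumed defaults, a fall from roughly 95% to roughly 10% between adjacent minutes trips the check, while ordinary minute-to-minute noise does not.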

The recovery checklist

When verification starts returning errors:

  1. Confirm with the vendor's status page (don't just assume).
  2. Switch the affected forms to fail-open if they aren't already, or surface a banner: "Verification is temporarily slow — please retry in a minute."
  3. After the incident, aggressively rate-limit accounts created during the outage window and review them for anomalies. Bots that were probing your forms may have noticed that verification stopped blocking them.
  4. File the incident in your post-mortem tracker with three questions: what's our fail mode, how did we find out, and how long did users see broken UI?
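
The post-incident review in step 3 is easier if each signup records how it got past verification. Here is a small sketch, assuming you persist the `reason` value from the verify step alongside each signup; the `Signup` shape and `signupsToReview` are hypothetical.

```typescript
interface Signup {
  id: string;
  createdAt: Date;
  captchaReason: string; // assumed to be stored at signup time, e.g. 'verified'
}

// Selects signups that were admitted without verification, plus anything
// created inside the outage window (inclusive), for manual or automated review.
function signupsToReview(
  signups: Signup[],
  outageStart: Date,
  outageEnd: Date,
): Signup[] {
  const start = outageStart.getTime();
  const end = outageEnd.getTime();
  return signups.filter((s) => {
    const t = s.createdAt.getTime();
    const inWindow = t >= start && t <= end;
    return inWindow || s.captchaReason === 'fail_open_on_error';
  });
}
```

Including the whole window, not just the fail-open signups, also catches cases where the outage started before your logging noticed it.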

A quieter benefit of multi-region

Some CAPTCHA vendors run a single region. Some run multi-region with regional failover. The latter degrade more gracefully — instead of a global outage, you get a slow region that eventually rebalances. CaptchaLa runs verify endpoints in multiple regions so a single zone failure doesn't cascade into a global outage.

This isn't a feature most buyers ask about during evaluation, but it's the one that matters most at 3 a.m. on a Saturday.

The takeaway

Pick a fail mode per form, log timeouts separately from rejections, and wire your vendor's status into your alerting. Outages are rare; the cost of being unprepared is not.

Articles are CC BY 4.0 — feel free to quote with attribution