How to Monitor ChatGPT API Availability in Production

Your app depends on ChatGPT. When it's down, your users know before you do.

Here's how to stay ahead of outages with real-time monitoring.

Why Monitor LLM APIs?

ChatGPT went down on May 11, 2024 for 1.5 hours. Users saw errors immediately. Support got hammered. Revenue dropped.

With monitoring, you'd have:

✅ Alerts before your users complain
✅ Data to show customers "we detected it"
✅ Fallback logic to degrade gracefully
✅ Historical uptime to evaluate reliability

Let's build it.

Approach 1: Simple Polling (DIY, 15 mins)

Ping ChatGPT every 60 seconds and log the result.

// lib/chatgpt-health-check.ts
export async function checkChatGPTHealth(): Promise<{
  status: "operational" | "degraded" | "outage";
  latencyMs: number;
  error?: string;
}> {
  const startTime = performance.now();

  try {
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini", // Cheaper model for health checks
        messages: [{ role: "user", content: "Respond with: OK" }],
        max_tokens: 5,
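      }),
      // Abort hung requests so the check always returns; a timeout lands in the catch block as an outage
      signal: AbortSignal.timeout(30_000),
    });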

    const latencyMs = Math.round(performance.now() - startTime);

    if (!response.ok) {
      if (response.status === 429) {
        return { status: "degraded", latencyMs, error: "Rate limited" };
      }
      if (response.status >= 500) {
        return { status: "outage", latencyMs, error: `Server error: ${response.status}` };
      }
      return { status: "degraded", latencyMs, error: `HTTP ${response.status}` };
    }

    if (latencyMs > 10_000) {
      return { status: "degraded", latencyMs, error: "Slow response" };
    }

    return { status: "operational", latencyMs };
  } catch (error) {
    const latencyMs = Math.round(performance.now() - startTime);
    return {
      status: "outage",
      latencyMs,
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }
}

Use it in your app:

// pages/api/check-health.ts
import type { NextApiRequest, NextApiResponse } from "next";
import { checkChatGPTHealth } from "@/lib/chatgpt-health-check";

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const health = await checkChatGPTHealth();
  
  if (health.status === "outage") {
    // Alert: ChatGPT is down
    console.error("🔴 ChatGPT DOWN:", health.error);
    // Send Slack alert, PagerDuty, etc.
  }

  res.json(health);
}

Cost: ~$0.0001 per check (minimal tokens). One check every 60 seconds is 1,440 checks/day, or about 43,200 checks/month, which comes to roughly $4.32/month at that rate.
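To sanity-check the cost for your own polling interval, here is a minimal sketch of the arithmetic. The function name and the 30-day month are illustrative assumptions, and the per-check cost is whatever you measure against current token pricing.

```typescript
// Rough monthly cost of a polling health check.
// costPerCheckUsd is an input you supply; it depends on current token pricing.
function estimateMonthlyCostUsd(
  costPerCheckUsd: number,
  intervalSeconds: number,
  daysPerMonth = 30, // assumption: a 30-day month
): { checksPerDay: number; checksPerMonth: number; monthlyCostUsd: number } {
  if (intervalSeconds <= 0) {
    throw new RangeError("intervalSeconds must be positive");
  }
  const checksPerDay = Math.floor(86_400 / intervalSeconds);
  const checksPerMonth = checksPerDay * daysPerMonth;
  return {
    checksPerDay,
    checksPerMonth,
    monthlyCostUsd: checksPerMonth * costPerCheckUsd,
  };
}

// One check per minute at ~$0.0001 per check
const estimate = estimateMonthlyCostUsd(0.0001, 60);
console.log(estimate.checksPerDay, estimate.monthlyCostUsd.toFixed(2));
```

Halving the interval doubles the bill, so pick the slowest interval that still catches outages quickly enough for you.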

Approach 2: Cron Job + Database Logging

Run health checks on a schedule (cron-style schedulers bottom out at one run per minute) and track uptime trends.

// scripts/monitor-llm-apis.ts
import { db } from "@/lib/db";
import { checkChatGPTHealth } from "@/lib/chatgpt-health-check";

export async function logHealthCheck() {
  const health = await checkChatGPTHealth();
  const status = health.status.toUpperCase();

  // Read the most recent stored check before writing the new one
  const previousCheck = await db.statusCheck.findFirst({
    where: { providerId: "openai-id" },
    orderBy: { checkedAt: "desc" },
  });

  await db.statusCheck.create({
    data: {
      providerId: "openai-id", // Your provider ID
      modelId: "gpt-4o-mini-id",
      status,
      latencyMs: health.latencyMs,
      errorMessage: health.error,
      checkedAt: new Date(),
    },
  });

  // Alert on status change (the very first check has nothing to compare against)
  if (previousCheck && previousCheck.status !== status) {
    console.log(`⚠️ Status changed: ${previousCheck.status} → ${status}`);
    // Send alert
  }
}

Schedule it:

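Vercel Cron: Every minute at most, since cron expressions have one-minute granularity (requires Pro plan)
Render: Call it from the polling loop of a long-running worker
AWS Lambda: Scheduled EventBridge rule (minimum rate is one minute)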

Cost: Same per-check API cost as polling, plus one small database row per check, which should fit within a free database tier such as Supabase's at this volume.
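Once checks are in the database, the uptime trend is a simple aggregate over the rows. The sketch below assumes rows shaped like the ones logHealthCheck writes (uppercase status, latencyMs, checkedAt). The type names and summarizeUptime are illustrative, and counting "degraded" as up is a choice you may want to change.

```typescript
type CheckStatus = "OPERATIONAL" | "DEGRADED" | "OUTAGE";

interface StatusCheckRow {
  status: CheckStatus;
  latencyMs: number;
  checkedAt: Date;
}

interface UptimeSummary {
  totalChecks: number;
  uptimePercent: number | null; // share of checks that were not OUTAGE
  p95LatencyMs: number | null;
}

function summarizeUptime(rows: StatusCheckRow[]): UptimeSummary {
  if (rows.length === 0) {
    // No data is not the same as 100% uptime
    return { totalChecks: 0, uptimePercent: null, p95LatencyMs: null };
  }

  const upCount = rows.filter((row) => row.status !== "OUTAGE").length;

  // Nearest-rank 95th percentile over sorted latencies
  const latencies = rows.map((row) => row.latencyMs).sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * latencies.length);

  return {
    totalChecks: rows.length,
    uptimePercent: (upCount / rows.length) * 100,
    p95LatencyMs: latencies[rank - 1] ?? null,
  };
}
```

Feed it the rows for a given window (last 24 hours, last 7 days) to get the numbers for a status page or weekly report.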

Approach 3: Graceful Degradation (Production-Ready)

When ChatGPT is down, fall back to Claude or a local cache.

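// lib/llm-fallback.ts
// callChatGPT, callClaude, and returnCachedResponse are your own wrappers (not shown);
// each takes a prompt and resolves to a string, or throws on failure.
type Provider = { name: string; fn: (prompt: string) => Promise<string> };

const PROVIDER_TIMEOUT_MS = 5000;

export async function getCompletion(prompt: string) {
  const providers: Provider[] = [
    { name: "chatgpt", fn: callChatGPT },
    { name: "claude", fn: callClaude },
    { name: "cache", fn: returnCachedResponse },
  ];

  for (const provider of providers) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // Note: the race stops waiting after the timeout but does not cancel the slow request
      const result = await Promise.race([
        provider.fn(prompt),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error("Timeout")), PROVIDER_TIMEOUT_MS);
        }),
      ]);
      return { result, provider: provider.name };
    } catch (error) {
      console.warn(`${provider.name} failed:`, error);
      // Fall through to the next provider
    } finally {
      clearTimeout(timer); // Don't leave the timer pending once the race settles
    }
  }

  throw new Error("All providers failed");
}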

Benefits:

Users never see "API is down" errors
Automatic failover to secondary providers
Cached responses as last resort
Your app stays functional even during outages
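The "cache" provider in the chain above is left undefined, so here is one way it could look: a small in-memory cache with a maximum age, keyed on the normalized prompt. ResponseCache, makeCachedProvider, and the normalization rule are all assumptions for illustration. An exact-match cache only helps when prompts repeat, and in-memory state is lost on restart.

```typescript
interface CacheEntry {
  value: string;
  storedAt: number;
}

class ResponseCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly maxAgeMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting
  constructor(maxAgeMs: number, now: () => number = () => Date.now()) {
    this.maxAgeMs = maxAgeMs;
    this.now = now;
  }

  private static key(prompt: string): string {
    return prompt.trim().toLowerCase();
  }

  set(prompt: string, value: string): void {
    this.entries.set(ResponseCache.key(prompt), { value, storedAt: this.now() });
  }

  get(prompt: string): string | undefined {
    const key = ResponseCache.key(prompt);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.maxAgeMs) {
      this.entries.delete(key); // Expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
}

// Wraps the cache as a provider: a miss throws, so the fallback loop moves on
function makeCachedProvider(cache: ResponseCache) {
  return async (prompt: string): Promise<string> => {
    const hit = cache.get(prompt);
    if (hit === undefined) throw new Error("Cache miss");
    return hit;
  };
}
```

Populate the cache with cache.set(prompt, result) whenever a live provider succeeds, then pass makeCachedProvider(cache) as the last entry in the provider list.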

Approach 4: Use an Uptime Service (Managed, $0)

Don't want to build monitoring? Use IsItDown.ai (free tier).

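// In your app, check our real-time API:
async function checkChatGPTStatus(): Promise<boolean> {
  const res = await fetch("https://isitdown.ai/api/status");
  if (!res.ok) {
    throw new Error(`Status API returned HTTP ${res.status}`);
  }
  const data = await res.json();

  const chatgpt = data.providers.find((p: { slug: string }) => p.slug === "openai");
  if (!chatgpt) {
    throw new Error('Provider "openai" not found in status response');
  }

  if (chatgpt.status === "operational") {
    return true; // Safe to call
  } else {
    console.warn(`ChatGPT is ${chatgpt.status}, consider using fallback`);
    return false;
  }
}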

Advantages:

No infrastructure to maintain
Real data from actual API calls
Weekly uptime reports
Free (we have an API)

Best Practices

1. Monitor Early, Act Fast

// Alert threshold: 3 consecutive failures = alert immediately
let failureCount = 0;
const ALERT_THRESHOLD = 3;

if (health.status === "outage") {
  failureCount++;
  if (failureCount >= ALERT_THRESHOLD) {
    alertTeam(); // PagerDuty, Slack, SMS
    failureCount = 0; // Reset after alert
  }
} else {
  failureCount = 0;
}
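The snippet above resets the counter after alerting, so a long outage pages the team again every three checks. If you would rather get one alert per incident plus a recovery notice, a small state tracker is one option. OutageTracker and its event names are hypothetical. The state lives in memory, so on serverless platforms you would need to persist it (for example, alongside the logged checks).

```typescript
type HealthStatus = "operational" | "degraded" | "outage";
type AlertEvent = "outage-confirmed" | "recovered" | null;

class OutageTracker {
  private consecutiveFailures = 0;
  private alerted = false;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Feed in each check result; returns an event only when something should be sent
  record(status: HealthStatus): AlertEvent {
    if (status === "outage") {
      this.consecutiveFailures++;
      if (!this.alerted && this.consecutiveFailures >= this.threshold) {
        this.alerted = true;
        return "outage-confirmed"; // Fires once per incident
      }
      return null;
    }

    // Any non-outage result ends the failure streak
    this.consecutiveFailures = 0;
    if (this.alerted) {
      this.alerted = false;
      return "recovered";
    }
    return null;
  }
}

const tracker = new OutageTracker(3);
const statuses: HealthStatus[] = ["outage", "outage", "outage", "outage", "operational"];
for (const status of statuses) {
  const event = tracker.record(status);
  if (event) console.log(`${status} -> ${event}`);
}
```

A single blip never alerts, a sustained outage alerts once, and the first healthy check afterwards closes the incident.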

2. Avoid Cascading Failures

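// Don't hammer a failing API
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function smartRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error; // Keep it so it can be rethrown after the final attempt
      if (i < maxRetries - 1) {
        // Exponential backoff: 1s, then 2s (no wait after the last attempt)
        await sleep(Math.pow(2, i) * 1000);
      }
    }
  }
  throw lastError;
}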
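Backoff limits how hard a single request retries, but every incoming request still tries the failing API at least once. A circuit breaker goes a step further and skips the call entirely for a cooldown period after repeated failures. This is a minimal sketch of that pattern, not something from a specific library. The class name, thresholds, and injectable clock are assumptions.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 3,
    cooldownMs = 30_000,
    now: () => number = () => Date.now(), // injectable for tests
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Call before each request; false means skip the API and use a fallback
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // Cooldown elapsed: let trial requests through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures++;
    // A failed trial request reopens immediately; otherwise wait for the threshold
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

In the fallback chain, you would check canRequest() before calling a provider and report the outcome with recordSuccess() or recordFailure(), keeping one breaker instance per provider.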

3. Log Everything

// Store failures for analysis
await db.incidentLog.create({
  data: {
    provider: "openai",
    status: health.status,
    latencyMs: health.latencyMs,
    error: health.error,
    timestamp: new Date(),
    context: { userId, action: "generate_response" },
  },
});

4. Set Realistic Thresholds

// Don't alert on every 429 (rate limit)
if (response.status === 429) {
  // Expected during load spikes, not an emergency
  return { status: "degraded", latencyMs };
}

// Alert on 5xx (server error)
if (response.status >= 500) {
  return { status: "outage", latencyMs }; // Alert!
}

// Alert on timeout > 30s
if (latencyMs > 30_000) {
  return { status: "outage", latencyMs }; // Likely down
}
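The three fragments above can be folded into one pure function, which makes the thresholds easy to unit test and tune. This is a sketch: classifyCheck, the Thresholds shape, and the shouldAlert flag are illustrative names, the 10-second "slow" cutoff is borrowed from Approach 1, and treating other 4xx responses as non-alerting "degraded" follows the earlier health check rather than any official guidance.

```typescript
type HealthStatus = "operational" | "degraded" | "outage";

interface Thresholds {
  slowMs: number; // above this, a successful response counts as degraded
  timeoutMs: number; // above this, treat the API as down
}

interface Classification {
  status: HealthStatus;
  shouldAlert: boolean;
  reason?: string;
}

const DEFAULT_THRESHOLDS: Thresholds = { slowMs: 10_000, timeoutMs: 30_000 };

function classifyCheck(
  httpStatus: number,
  latencyMs: number,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): Classification {
  if (httpStatus >= 500) {
    return { status: "outage", shouldAlert: true, reason: `Server error: ${httpStatus}` };
  }
  if (latencyMs > thresholds.timeoutMs) {
    return { status: "outage", shouldAlert: true, reason: "Timed out" };
  }
  if (httpStatus === 429) {
    // Expected during load spikes, not an emergency
    return { status: "degraded", shouldAlert: false, reason: "Rate limited" };
  }
  if (httpStatus >= 400) {
    // Often a problem on your side (bad key, bad request), so no page
    return { status: "degraded", shouldAlert: false, reason: `HTTP ${httpStatus}` };
  }
  if (latencyMs > thresholds.slowMs) {
    return { status: "degraded", shouldAlert: false, reason: "Slow response" };
  }
  return { status: "operational", shouldAlert: false };
}
```

Because the function has no I/O, you can replay historical checks through it to see how a threshold change would have affected alert volume.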

Cost Comparison

Approach	Setup	Monthly Cost	Reliability
DIY Polling	15 mins	~$4 (1 check/min at ~$0.0001 each)	Depends on you
Cron + DB	2 hours	~$4 plus database writes	Good if you run it
Fallback Logic	4 hours	$0.20	Excellent (auto-recovery)
IsItDown.ai	15 mins	$0	High (shared infrastructure)
Third-party (Datadog)	30 mins	$300+/mo	Enterprise-grade

Summary

Start here: Approach 1 (simple polling) takes 15 minutes.
Scale to: Approach 3 (fallback logic) once real users depend on your app.
Monitor with: IsItDown.ai (free) for trend data and reports.

Real-time monitoring isn't optional — it's how you stay operational when APIs fail.

Questions?
Check IsItDown.ai for live data
Browse the code on GitHub

Published by the Is It Down AI Team.