How to Monitor ChatGPT API Availability in Production
Your app depends on ChatGPT. When it's down, your users know before you do.
Here's how to stay ahead of outages with real-time monitoring.
Why Monitor LLM APIs?
ChatGPT went down on May 11, 2024 for 1.5 hours. Users saw errors immediately. Support got hammered. Revenue dropped.
With monitoring, you'd have:
- ✅ Alerts before your users complain
- ✅ Data to show customers "we detected it"
- ✅ Fallback logic to degrade gracefully
- ✅ Historical uptime to evaluate reliability
Let's build it.
Approach 1: Simple Polling (DIY, 15 mins)
Ping ChatGPT every 60 seconds and log the result.
// lib/chatgpt-health-check.ts
export async function checkChatGPTHealth(): Promise<{
  status: "operational" | "degraded" | "outage";
  latencyMs: number;
  error?: string;
}> {
  const startTime = performance.now();
  try {
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini", // Cheaper model for health checks
        messages: [{ role: "user", content: "Respond with: OK" }],
        max_tokens: 5,
      }),
      // Abort hung requests so the check itself cannot stall; an abort lands in the catch block
      signal: AbortSignal.timeout(30_000),
    });
    const latencyMs = Math.round(performance.now() - startTime);
    if (!response.ok) {
      if (response.status === 429) {
        return { status: "degraded", latencyMs, error: "Rate limited" };
      }
      if (response.status >= 500) {
        return { status: "outage", latencyMs, error: `Server error: ${response.status}` };
      }
      return { status: "degraded", latencyMs, error: `HTTP ${response.status}` };
    }
    if (latencyMs > 10_000) {
      return { status: "degraded", latencyMs, error: "Slow response" };
    }
    return { status: "operational", latencyMs };
  } catch (error) {
    // Network failure, DNS error, or the 30-second timeout above
    const latencyMs = Math.round(performance.now() - startTime);
    return {
      status: "outage",
      latencyMs,
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }
}
Use it in your app:
// pages/api/check-health.ts
import type { NextApiRequest, NextApiResponse } from "next";
import { checkChatGPTHealth } from "@/lib/chatgpt-health-check";
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const health = await checkChatGPTHealth();
  if (health.status === "outage") {
    // Alert: ChatGPT is down
    console.error("🔴 ChatGPT DOWN:", health.error);
    // Send Slack alert, PagerDuty, etc.
  }
  res.json(health);
}
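The handler above only runs when something requests it, so you still need a timer to get a check every 60 seconds. Here's a minimal sketch, assuming a long-running Node process rather than a serverless function. `startPolling` and its parameters are hypothetical names, not part of any library.

```typescript
type Health = {
  status: "operational" | "degraded" | "outage";
  latencyMs: number;
  error?: string;
};

// Hypothetical helper: runs `check` repeatedly, waiting `intervalMs` between the end
// of one check and the start of the next, so slow checks never overlap.
function startPolling(
  check: () => Promise<Health>,
  onResult: (health: Health) => void,
  intervalMs = 60_000,
): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = async () => {
    if (stopped) return;
    try {
      onResult(await check());
    } catch (error) {
      // checkChatGPTHealth catches its own errors, but a custom check might not
      onResult({ status: "outage", latencyMs: 0, error: String(error) });
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  };

  void tick(); // first check starts immediately

  // Call the returned function to stop polling
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

You might start it with `const stop = startPolling(checkChatGPTHealth, (h) => console.log(h))` and call `stop()` on shutdown.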
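Cost: ~$0.0001 per check at most (minimal tokens). One check every 60 seconds is 1,440 checks/day (about 43,200/month), so treat that as an upper bound of a few dollars a month; your actual bill depends on current gpt-4o-mini token pricing and is likely lower.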
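The monthly figure is just the check interval multiplied by the per-check price. Here's a small sketch for plugging in your own numbers. `estimateMonthlyCostUsd` is a hypothetical helper, and the $0.0001 per-check price is this article's rough estimate, not a quoted OpenAI rate.

```typescript
// Hypothetical helper: monthly cost of a fixed-interval health check.
function estimateMonthlyCostUsd(
  intervalSeconds: number,
  costPerCheckUsd: number,
  daysPerMonth = 30,
): number {
  const checksPerDay = (24 * 60 * 60) / intervalSeconds; // 86,400 seconds in a day
  return checksPerDay * daysPerMonth * costPerCheckUsd;
}

// One check every 60 seconds is 1,440 checks/day
console.log(estimateMonthlyCostUsd(60, 0.0001).toFixed(2)); // "4.32"
```

Halving the interval doubles the bill, so pick the longest interval that still detects outages fast enough for you.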
Approach 2: Cron Job + Database Logging
Run health checks every 30 seconds and track uptime trends.
// scripts/monitor-llm-apis.ts
import { db } from "@/lib/db";
import { checkChatGPTHealth } from "@/lib/chatgpt-health-check";
export async function logHealthCheck() {
  const health = await checkChatGPTHealth();
  const status = health.status.toUpperCase();
  // Read the most recent stored check before inserting the new one
  const previousCheck = await db.statusCheck.findFirst({
    where: { providerId: "openai-id" },
    orderBy: { checkedAt: "desc" },
  });
  await db.statusCheck.create({
    data: {
      providerId: "openai-id", // Your provider ID
      modelId: "gpt-4o-mini-id",
      status,
      latencyMs: health.latencyMs,
      errorMessage: health.error,
      checkedAt: new Date(),
    },
  });
  // Alert on status change (skip the very first run, when there is nothing to compare)
  if (previousCheck && previousCheck.status !== status) {
    console.log(`⚠️ Status changed: ${previousCheck.status} → ${status}`);
    // Send alert
  }
}
Schedule it:
- Vercel Cron: Cron expressions have one-minute granularity, so every minute is the fastest schedule (sub-daily schedules require the Pro plan)
- Render: Include in your pinger loop
- AWS Lambda: Scheduled EventBridge rule
Cost: Same as polling, plus database writes (negligible at this volume; a free database tier such as Supabase's may cover them).
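Once the rows are in the database, the uptime trend is a simple aggregate over them. Here's a minimal sketch that works on rows you've already fetched. `StatusRow` and `summarizeUptime` are hypothetical names, and counting "degraded" checks as up is an assumption you may want to change.

```typescript
type StatusRow = {
  status: "OPERATIONAL" | "DEGRADED" | "OUTAGE";
  latencyMs: number;
  checkedAt: Date;
};

type UptimeSummary = {
  uptimePct: number | null; // null when there are no rows
  avgLatencyMs: number | null;
  checks: number;
};

// Hypothetical helper: summarizes rows already fetched from the statusCheck table.
function summarizeUptime(rows: StatusRow[]): UptimeSummary {
  if (rows.length === 0) {
    return { uptimePct: null, avgLatencyMs: null, checks: 0 };
  }
  // Assumption: DEGRADED still counts as "up"; only OUTAGE counts against uptime
  const up = rows.filter((r) => r.status !== "OUTAGE").length;
  const totalLatency = rows.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    uptimePct: Math.round((up / rows.length) * 10_000) / 100, // two decimal places
    avgLatencyMs: Math.round(totalLatency / rows.length),
    checks: rows.length,
  };
}
```

Pass it the last 24 hours or 30 days of rows to get the matching uptime window.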
Approach 3: Graceful Degradation (Production-Ready)
When ChatGPT is down, fall back to Claude or local cache.
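// lib/llm-fallback.ts
// callChatGPT, callClaude, and returnCachedResponse are your own wrappers, defined elsewhere.
// Assumption: each takes a prompt and resolves to a string, or throws on failure.
type Provider = { name: string; fn: (prompt: string) => Promise<string> };
export async function getCompletion(prompt: string) {
  const providers: Provider[] = [
    { name: "chatgpt", fn: callChatGPT },
    { name: "claude", fn: callClaude },
    { name: "cache", fn: returnCachedResponse },
  ];
  for (const provider of providers) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      const result = await Promise.race([
        provider.fn(prompt),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error("Timeout")), 5000);
        }),
      ]);
      return { result, provider: provider.name };
    } catch (error) {
      console.warn(`${provider.name} failed:`, error);
      continue; // Try next provider
    } finally {
      clearTimeout(timer); // Don't leave the timer running once the race settles
    }
  }
  throw new Error("All providers failed");
}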
Benefits:
- Users rarely see "API is down" errors (only when every provider fails)
- Automatic failover to secondary providers
- Cached responses as last resort
- Your app stays functional even during outages
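The cache is the last link in that chain, so it helps to see what it could look like. Below is a minimal in-memory sketch, assuming exact-match prompts and a single server process. `rememberResponse`, `MAX_AGE_MS`, and this version of `returnCachedResponse` are illustrative, not the author's implementation.

```typescript
// Hypothetical in-memory cache keyed by the exact prompt text.
const responseCache = new Map<string, { response: string; storedAt: number }>();
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // assumption: day-old answers are still acceptable

// Call this after every successful live completion.
function rememberResponse(prompt: string, response: string, now = Date.now()): void {
  responseCache.set(prompt, { response, storedAt: now });
}

// Throws on a miss or a stale entry, so a fallback loop can move on
// (or report that every provider failed).
async function returnCachedResponse(prompt: string, now = Date.now()): Promise<string> {
  const entry = responseCache.get(prompt);
  if (!entry) throw new Error("Cache miss");
  if (now - entry.storedAt > MAX_AGE_MS) throw new Error("Cached response is stale");
  return entry.response;
}
```

An exact-match cache only helps with repeated prompts, and with several server instances you'd likely swap the `Map` for a shared store.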
Approach 4: Use an Uptime Service (Managed, $0)
Don't want to build monitoring? Use IsItDown.ai (free tier).
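// In your app, check our real-time API:
async function checkChatGPTStatus(): Promise<boolean> {
  const res = await fetch("https://isitdown.ai/api/status");
  if (!res.ok) {
    throw new Error(`Status API returned HTTP ${res.status}`);
  }
  const data = await res.json();
  const chatgpt = data.providers.find(
    (p: { slug: string; status: string }) => p.slug === "openai"
  );
  if (chatgpt?.status === "operational") {
    return true; // Safe to call
  } else {
    // A missing provider entry is treated as "unknown", not as operational
    console.warn(`ChatGPT is ${chatgpt?.status ?? "unknown"}, consider using fallback`);
    return false;
  }
}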
Advantages:
- No infrastructure to maintain
- Real data from actual API calls
- Weekly uptime reports
- Free (we have an API)
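If you gate every user request on a status lookup, you add a network round trip to each one. Here's a sketch of one way around that: wrap the check in a short-lived cache. `cacheStatusCheck` and its parameters are hypothetical names. Treating an unreachable status service as "safe to call" is an assumption, not documented IsItDown.ai behavior.

```typescript
// Hypothetical wrapper: remembers the last answer for `ttlMs` so each user
// request doesn't trigger its own status lookup.
function cacheStatusCheck(
  checkStatus: () => Promise<boolean>,
  ttlMs = 30_000,
  now: () => number = Date.now,
): () => Promise<boolean> {
  let cached: { value: boolean; fetchedAt: number } | undefined;

  return async () => {
    if (cached !== undefined && now() - cached.fetchedAt < ttlMs) {
      return cached.value; // still fresh, skip the network call
    }
    let value: boolean;
    try {
      value = await checkStatus();
    } catch {
      // Assumption: if the status service itself is unreachable,
      // don't block the primary API call on it
      value = true;
    }
    cached = { value, fetchedAt: now() };
    return value;
  };
}
```

You'd create it once at startup, for example `const isChatGPTUp = cacheStatusCheck(checkChatGPTStatus)`, and await `isChatGPTUp()` before each call.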
Best Practices
1. Monitor Early, Act Fast
// Alert threshold: 3 consecutive failures = alert immediately
let failureCount = 0;
const ALERT_THRESHOLD = 3;
if (health.status === "outage") {
  failureCount++;
  if (failureCount >= ALERT_THRESHOLD) {
    alertTeam(); // PagerDuty, Slack, SMS
    failureCount = 0; // Reset after alert
  }
} else {
  failureCount = 0;
}
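The snippet above pages again after every third failure during a long outage. If you'd rather get one alert per incident plus a recovery notice, here's a sketch of that variant. `createFailureTracker` is a hypothetical helper, and you'd wire its return value to your own alerting.

```typescript
type CheckStatus = "operational" | "degraded" | "outage";

// Hypothetical tracker: returns "alert" once when the threshold is first reached,
// "recovered" once when a non-outage check follows an alerted incident, else null.
function createFailureTracker(threshold = 3) {
  let consecutiveFailures = 0;
  let alerted = false;

  return (status: CheckStatus): "alert" | "recovered" | null => {
    if (status === "outage") {
      consecutiveFailures++;
      if (consecutiveFailures >= threshold && !alerted) {
        alerted = true;
        return "alert";
      }
      return null; // below threshold, or this incident was already reported
    }
    // Any non-outage result ends the streak
    consecutiveFailures = 0;
    if (alerted) {
      alerted = false;
      return "recovered";
    }
    return null;
  };
}
```

Create the tracker once, outside your check loop, so the counter survives between checks.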
2. Avoid Cascading Failures
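// Don't hammer a failing API
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
async function smartRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error; // `error` is scoped to this catch block, so keep a copy
      if (i < maxRetries - 1) {
        // Exponential backoff between attempts: 1s, then 2s
        await sleep(Math.pow(2, i) * 1000);
      }
    }
  }
  throw lastError; // All attempts failed; surface the last error
}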
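Fixed backoff delays have one weakness: if many of your servers fail at the same moment, they all retry at the same moment too. A common refinement is to randomize each delay, often called "full jitter". Here's a sketch of that variant, which isn't what the code above does. `backoffDelayMs` is a hypothetical helper, and the 30-second cap is an arbitrary assumption.

```typescript
// Full jitter: pick a random delay between 0 and the capped exponential delay.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  baseMs = 1_000,
  maxMs = 30_000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}
```

To use it in a retry loop like the one above, you'd swap the fixed delay for `await sleep(backoffDelayMs(i))`.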
3. Log Everything
// Store failures for analysis
await db.incidentLog.create({
  data: {
    provider: "openai",
    status: health.status,
    latencyMs: health.latencyMs,
    error: health.error,
    timestamp: new Date(),
    context: { userId, action: "generate_response" },
  },
});
4. Set Realistic Thresholds
// Don't alert on every 429 (rate limit)
if (response.status === 429) {
  // Expected during load spikes, not an emergency
  return { status: "degraded", latencyMs };
}
// Alert on 5xx (server error)
if (response.status >= 500) {
  return { status: "outage", latencyMs }; // Alert!
}
// Alert on timeout > 30s
if (latencyMs > 30_000) {
  return { status: "outage", latencyMs }; // Likely down
}
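Thresholds are easier to tune and test when they live in one pure function. Here's a sketch that combines the rules above with the 10-second "slow response" cutoff from Approach 1. `classifyHealth` is a hypothetical name, and the check order (429 first) follows the snippet above.

```typescript
type HealthStatus = "operational" | "degraded" | "outage";

// Hypothetical helper: maps an HTTP status and measured latency to a health status.
function classifyHealth(httpStatus: number, latencyMs: number): HealthStatus {
  if (httpStatus === 429) return "degraded"; // rate limited: expected under load
  if (httpStatus >= 500) return "outage"; // server error: alert
  if (latencyMs > 30_000) return "outage"; // so slow it is effectively down
  if (httpStatus < 200 || httpStatus >= 300) return "degraded"; // other non-2xx responses
  if (latencyMs > 10_000) return "degraded"; // working, but slow
  return "operational";
}
```

Because it takes plain numbers, you can unit-test each threshold without calling the API.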
Cost Comparison
| Approach | Setup | Monthly Cost | Reliability |
|---|---|---|---|
| DIY Polling | 15 mins | $0.01 | Depends on you |
| Cron + DB | 2 hours | $0.20 | Good if you run it |
| Fallback Logic | 4 hours | $0.20 | Excellent (auto-recovery) |
| IsItDown.ai | 15 mins | $0 | High (shared infrastructure) |
| Third-party (Datadog) | 30 mins | $300+/mo | Enterprise-grade |
Summary
Start here: Approach 1 (simple polling) takes 15 minutes.
Scale to: Approach 3 (fallback logic) when you have multiple users.
Monitor with: IsItDown.ai (free) for trend data and reports.
Real-time monitoring isn't optional. It's how you stay operational when APIs fail.
Questions?
Check IsItDown.ai for live data
Browse the code on GitHub
Published by the Is It Down AI Team.