Webhooks power real-time integrations between services, but network failures, server downtime, and processing errors are inevitable. When a webhook delivery fails, what happens next? Understanding webhook retry logic is critical for building reliable integrations that survive temporary outages without losing events or processing duplicates.
This guide covers how webhook providers implement retry mechanisms, how to handle retries gracefully with idempotency patterns, and best practices for ensuring your webhook endpoints are production-ready.
Why Webhooks Fail
Webhook delivery failures occur for many reasons:
- Network issues: Timeouts, DNS failures, connection resets
- Server downtime: Your application is deploying, restarting, or experiencing an outage
- Processing errors: Database deadlocks, external API timeouts, out-of-memory errors
- Rate limiting: Your server is overwhelmed and rejecting requests
- Configuration errors: Incorrect endpoint URLs, firewall rules blocking requests
- Temporary unavailability: Cloud provider issues, load balancer health checks failing
Even well-architected systems experience failures. Webhook retry logic provides fault tolerance by automatically reattempting delivery when transient issues occur.
How Webhook Retries Work
When a webhook provider sends an event to your endpoint, it expects an HTTP response within a timeout window (typically 5-30 seconds). Based on the response, the provider decides whether to retry:
- Successful delivery (HTTP 200-299): Event marked as delivered, no retry
- Temporary failure (HTTP 5xx, timeout, network error): Event queued for retry
- Permanent failure (HTTP 4xx): Event marked as failed, no retry (usually)
Most providers implement exponential backoff, increasing the wait time between retries to avoid overwhelming a recovering system:
Attempt 1: Immediate
Attempt 2: 1 second later
Attempt 3: 2 seconds later
Attempt 4: 4 seconds later
Attempt 5: 8 seconds later
...
After exhausting retries (anywhere from a few attempts within an hour to dozens spread over several days, depending on the provider), events are marked as permanently failed.
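In code, that schedule amounts to a capped exponential delay inside a retry loop. A minimal sketch; backoffDelayMs, deliverWithRetry, the base delay, and the 12-hour cap are illustrative choices here, not any specific provider's values:
// Capped exponential backoff with a little jitter (illustrative values, not a provider's actual schedule)
function backoffDelayMs(attempt, baseMs = 1000, capMs = 12 * 60 * 60 * 1000) {
  const exponential = baseMs * Math.pow(2, attempt - 1);
  const capped = Math.min(exponential, capMs);
  // Up to 10% random jitter so retries for many events don't land in lockstep
  return Math.round(capped + Math.random() * capped * 0.1);
}
async function deliverWithRetry(sendFn, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await sendFn(); // any 2xx response resolves here
    } catch (error) {
      if (attempt === maxAttempts) throw error; // mark the event permanently failed
      const delayMs = backoffDelayMs(attempt);
      console.log(`Attempt ${attempt} failed, retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}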
Provider Retry Comparison
Different webhook providers implement varying retry strategies:
| Provider | Retry Attempts | Retry Window | Backoff Strategy | 4xx Retries |
|---|---|---|---|---|
| Stripe | ~25 attempts | 3 days | Exponential (max 12 hours) | No |
| GitHub | 3 attempts | 1 hour | Linear (5 minutes) | No |
| Twilio | 3 attempts | 24 hours | Exponential | No |
| Shopify | 19 attempts | 48 hours | Exponential (max 12 hours) | No |
| Square | 10 attempts | 3 days | Exponential | No |
| PayPal | ~8 attempts | 4 days | Exponential | No |
| Slack | 3 attempts | 1 hour | Exponential | No |
| Mailgun | 5 attempts | 8 hours | Linear | No |
Key takeaway: Always check your specific provider's documentation. Retry behavior varies significantly, affecting how you design error handling and idempotency.
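One practical consequence of these differences: your idempotency-key TTL (covered below) should outlive the provider's retry window. A hedged sketch that derives the TTL from the windows listed in the table; RETRY_WINDOW_HOURS and idempotencyTtlSeconds are hypothetical helpers, and you should verify the numbers against each provider's documentation before relying on them:
// Hypothetical helper that sizes idempotency-key TTLs from a provider's retry window.
// The hours below mirror the comparison table above; confirm them against the provider's docs.
const RETRY_WINDOW_HOURS = {
  stripe: 72,
  github: 1,
  twilio: 24,
  shopify: 48,
  square: 72,
  paypal: 96,
  slack: 1,
  mailgun: 8
};
function idempotencyTtlSeconds(provider, marginHours = 24) {
  // Keep keys longer than the retry window so late retries are still recognized as duplicates
  const windowHours = RETRY_WINDOW_HOURS[provider] || 72; // default to 3 days if unknown
  return (windowHours + marginHours) * 3600;
}
For Stripe this yields 345600 seconds (4 days), the same TTL used in the Redis examples below.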
Implementing Idempotency
Since providers retry failed webhooks, your endpoint may receive the same event multiple times. Idempotency ensures processing an event multiple times produces the same result as processing it once.
Event ID Tracking
Every webhook provider includes a unique event ID with each delivery (e.g., Stripe's id field in the payload, GitHub's X-GitHub-Delivery header). Track processed event IDs to detect duplicates:
Database pattern:
CREATE TABLE processed_webhooks (
event_id VARCHAR(255) PRIMARY KEY,
provider VARCHAR(50) NOT NULL,
event_type VARCHAR(100),
processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_processed_at (processed_at)
);
Node.js implementation:
const express = require('express');
const app = express();
// Parse JSON bodies so req.body contains the webhook payload
app.use(express.json());
// Assumes a `db` query helper and processStripeEvent() are defined elsewhere
app.post('/webhooks/stripe', async (req, res) => {
const eventId = req.body.id;
try {
// Check if already processed
const existing = await db.query(
'SELECT event_id FROM processed_webhooks WHERE event_id = ?',
[eventId]
);
if (existing.length > 0) {
console.log(`Duplicate webhook ${eventId}, skipping`);
return res.status(200).json({ received: true, duplicate: true });
}
// Process webhook
await processStripeEvent(req.body);
// Mark as processed
await db.query(
'INSERT INTO processed_webhooks (event_id, provider, event_type) VALUES (?, ?, ?)',
[eventId, 'stripe', req.body.type]
);
res.status(200).json({ received: true });
} catch (error) {
console.error('Webhook processing error:', error);
// Return 500 to trigger provider retry
res.status(500).json({ error: 'Processing failed' });
}
});
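One caveat with the check-then-insert pattern above: two retries that arrive at the same moment can both pass the SELECT before either INSERT runs. Because event_id is the primary key, you can close that window by inserting first and treating a duplicate-key error as "already processed". A sketch, assuming a MySQL-style driver that surfaces ER_DUP_ENTRY; claimWebhook is a hypothetical helper:
// Insert-first idempotency check: the PRIMARY KEY on event_id makes the claim atomic
async function claimWebhook(eventId, provider, eventType) {
  try {
    await db.query(
      'INSERT INTO processed_webhooks (event_id, provider, event_type) VALUES (?, ?, ?)',
      [eventId, provider, eventType]
    );
    return true; // first delivery of this event
  } catch (error) {
    if (error.code === 'ER_DUP_ENTRY') return false; // duplicate delivery, already claimed
    throw error;
  }
}
If processing fails after the claim, delete the row (or record a failure status) so the provider's next retry isn't skipped as a duplicate.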
Redis Implementation
For high-throughput systems, Redis provides faster idempotency checking with automatic TTL:
const Redis = require('ioredis');
const redis = new Redis();
app.post('/webhooks/stripe', async (req, res) => {
const eventId = req.body.id;
const lockKey = `webhook:processed:${eventId}`;
try {
// Atomic check-and-set with 4-day TTL (longer than Stripe's 3-day retry window)
const wasSet = await redis.set(lockKey, '1', 'EX', 345600, 'NX');
if (!wasSet) {
console.log(`Duplicate webhook ${eventId}, skipping`);
return res.status(200).json({ received: true, duplicate: true });
}
// Process webhook
await processStripeEvent(req.body);
res.status(200).json({ received: true });
} catch (error) {
// Delete lock on processing failure to allow retry
await redis.del(lockKey);
console.error('Webhook processing error:', error);
res.status(500).json({ error: 'Processing failed' });
}
});
Redis advantages:
- O(1) duplicate detection
- Automatic TTL prevents unbounded growth
- Atomic operations prevent race conditions
- No database cleanup jobs needed
Python Implementation with Django
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.core.cache import cache
import json
@csrf_exempt
def stripe_webhook(request):
    payload = json.loads(request.body)
    event_id = payload['id']
    cache_key = f'webhook_processed_{event_id}'

    # Skip events we've already handled (cache is Redis-backed)
    if cache.get(cache_key):
        return JsonResponse({'received': True, 'duplicate': True})

    try:
        # Process webhook (process_stripe_event is defined elsewhere)
        process_stripe_event(payload)

        # Mark as processed with a 4-day TTL (longer than Stripe's 3-day retry window)
        cache.set(cache_key, True, 345600)

        return JsonResponse({'received': True})
    except Exception as e:
        # Return 500 to trigger a provider retry
        return JsonResponse({'error': str(e)}, status=500)
PHP Implementation
<?php
// Using Redis for idempotency
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$payload = json_decode(file_get_contents('php://input'), true);
$eventId = $payload['id'];
$lockKey = "webhook:processed:$eventId";
// Check if already processed
if ($redis->get($lockKey)) {
http_response_code(200);
echo json_encode(['received' => true, 'duplicate' => true]);
exit;
}
try {
// Process webhook
processStripeEvent($payload);
// Mark as processed with 4-day TTL
$redis->setex($lockKey, 345600, '1');
http_response_code(200);
echo json_encode(['received' => true]);
} catch (Exception $e) {
error_log('Webhook processing error: ' . $e->getMessage());
http_response_code(500);
echo json_encode(['error' => 'Processing failed']);
}
Handling Retries Gracefully
Return 200 for Successful Receipt
Respond with HTTP 200 as soon as you've received and validated the webhook, even if processing isn't complete:
app.post('/webhooks/stripe', async (req, res) => {
const eventId = req.body.id;
// Immediately return 200 to prevent retry
res.status(200).json({ received: true });
// Process asynchronously
processWebhookAsync(eventId, req.body).catch(error => {
console.error('Async processing failed:', error);
// Log to monitoring system, add to dead letter queue, etc.
});
});
Queue-Based Pattern
For complex processing, enqueue webhooks and process them asynchronously:
const Bull = require('bull');
const webhookQueue = new Bull('webhooks', {
redis: { host: '127.0.0.1', port: 6379 }
});
app.post('/webhooks/stripe', async (req, res) => {
const eventId = req.body.id;
// Check idempotency
const isDuplicate = await redis.get(`webhook:processed:${eventId}`);
if (isDuplicate) {
return res.status(200).json({ received: true, duplicate: true });
}
// Add to queue
await webhookQueue.add({
eventId: eventId,
provider: 'stripe',
payload: req.body
}, {
attempts: 3,
backoff: { type: 'exponential', delay: 2000 }
});
res.status(200).json({ received: true, queued: true });
});
// Worker processes queue
webhookQueue.process(async (job) => {
const { eventId, payload } = job.data;
// Mark as processing
await redis.set(`webhook:processed:${eventId}`, '1', 'EX', 345600);
// Process webhook
await processStripeEvent(payload);
});
Benefits:
- Webhook endpoint returns quickly, preventing timeouts
- Failed processing can retry with backoff
- Queue provides visibility into processing status
- Scales independently from web servers
Duplicate Detection at Multiple Levels
Implement idempotency checks at multiple stages:
- Endpoint level: Prevent processing duplicate webhooks
- Business logic level: Prevent duplicate operations (e.g., charging twice)
- Database level: Use unique constraints to prevent duplicate records
async function processPaymentSucceeded(paymentIntent) {
const idempotencyKey = paymentIntent.id;
try {
// Insert with idempotency constraint
await db.query(
`INSERT INTO payments (payment_intent_id, amount, status, created_at)
VALUES (?, ?, ?, NOW())`,
[idempotencyKey, paymentIntent.amount, 'succeeded']
);
// Update order status
await db.query(
`UPDATE orders SET payment_status = 'paid' WHERE payment_intent_id = ?`,
[idempotencyKey]
);
} catch (error) {
// Duplicate key error is expected and safe
if (error.code === 'ER_DUP_ENTRY') {
console.log('Payment already recorded');
return;
}
throw error;
}
}
Client-Side Retry Logic
When provider retries are insufficient or you need guaranteed delivery, implement your own retry mechanism:
const cron = require('node-cron');
// Run every hour
cron.schedule('0 * * * *', async () => {
// Fetch recent events from provider
const events = await stripe.events.list({
created: { gte: Math.floor(Date.now() / 1000) - 86400 } // Last 24 hours
});
for (const event of events.data) {
// Check if processed
const processed = await redis.get(`webhook:processed:${event.id}`);
if (!processed) {
console.log(`Missing event ${event.id}, processing now`);
await processStripeEvent(event);
await redis.set(`webhook:processed:${event.id}`, '1', 'EX', 345600);
}
}
});
When to implement:
- Critical events (payments, account changes)
- Providers with short retry windows
- High-reliability requirements
- Historical event reconciliation
Dead Letter Queues
Capture permanently failed webhooks for manual review:
async function processStripeEvent(payload) {
try {
// Process webhook
await handleEvent(payload);
} catch (error) {
console.error('Webhook processing failed:', error);
// After max retries, save to DLQ
if (isUnrecoverable(error)) {
await db.query(
`INSERT INTO webhook_dlq (event_id, provider, payload, error_message, created_at)
VALUES (?, ?, ?, ?, NOW())`,
[payload.id, 'stripe', JSON.stringify(payload), error.message]
);
// Alert team
await sendAlert({
title: 'Webhook permanently failed',
eventId: payload.id,
error: error.message
});
}
throw error; // Re-throw to trigger provider retry
}
}
DLQ table schema:
CREATE TABLE webhook_dlq (
id INT AUTO_INCREMENT PRIMARY KEY,
event_id VARCHAR(255) UNIQUE,
provider VARCHAR(50),
event_type VARCHAR(100),
payload JSON,
error_message TEXT,
retry_count INT DEFAULT 0,
last_retry_at TIMESTAMP NULL,
resolved_at TIMESTAMP NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_provider_created (provider, created_at),
INDEX idx_resolved (resolved_at)
);
Monitoring Retry Metrics
Track webhook reliability with these metrics:
Key metrics:
- Success rate (successful / total deliveries)
- Retry rate (retried / total deliveries)
- Average time to success
- Permanent failure rate
- Duplicate detection rate
Implementation:
async function recordWebhookMetrics(eventId, provider, status, retryCount) {
await db.query(
`INSERT INTO webhook_metrics (event_id, provider, status, retry_count, timestamp)
VALUES (?, ?, ?, ?, NOW())`,
[eventId, provider, status, retryCount]
);
}
// Query for dashboard
async function getWebhookStats(provider, hours = 24) {
return await db.query(`
SELECT
COUNT(*) as total,
SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successful,
SUM(CASE WHEN status = 'duplicate' THEN 1 ELSE 0 END) as duplicates,
SUM(CASE WHEN retry_count > 0 THEN 1 ELSE 0 END) as retried,
AVG(retry_count) as avg_retries
FROM webhook_metrics
WHERE provider = ? AND timestamp > NOW() - INTERVAL ? HOUR
`, [provider, hours]);
}
Alerting thresholds (see the check sketch after this list):
- Success rate drops below 95%
- Retry rate exceeds 10%
- Permanent failures exceed 1%
- Processing time exceeds 5 seconds (median)
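A scheduled job can turn these thresholds into alerts. A minimal sketch built on the getWebhookStats and sendAlert helpers shown earlier, assuming db.query returns rows directly as in the other examples; tune the thresholds and window to your traffic:
// Run periodically (e.g., from the same node-cron scheduler used earlier)
async function checkWebhookHealth(provider) {
  const rows = await getWebhookStats(provider, 24);
  const stats = rows[0];
  if (!stats || !stats.total) return; // no traffic in the window, nothing to evaluate

  const successRate = stats.successful / stats.total;
  const retryRate = stats.retried / stats.total;

  if (successRate < 0.95) {
    await sendAlert({ title: `Webhook success rate below 95% for ${provider}`, successRate });
  }
  if (retryRate > 0.1) {
    await sendAlert({ title: `Webhook retry rate above 10% for ${provider}`, retryRate });
  }
}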
Troubleshooting Common Retry Issues
Issue 1: Webhook Storms
Symptom: Thousands of webhooks arriving simultaneously after a brief outage
Solution:
- Implement rate limiting at endpoint level
- Use queue-based processing to smooth traffic
- Scale workers horizontally during recovery
- Consider provider rate limit settings
const rateLimit = require('express-rate-limit');
const webhookLimiter = rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 100, // 100 requests per minute
message: 'Too many webhooks, please retry',
statusCode: 503, // Service unavailable triggers retry
});
app.post('/webhooks/stripe', webhookLimiter, handleStripeWebhook);
Issue 2: Idempotency Key Collisions
Symptom: Different events with same ID (rare but possible)
Solution: Use composite keys including provider and timestamp:
const idempotencyKey = `${provider}:${eventId}:${timestamp}`;
await redis.set(`webhook:processed:${idempotencyKey}`, '1', 'EX', 345600);
Issue 3: Processing Timeouts
Symptom: Webhooks timing out before processing completes
Solution:
- Return 200 immediately, process asynchronously
- Optimize database queries
- Use connection pooling
- Increase timeout limits (if provider allows)
Issue 4: Database Deadlocks
Symptom: Concurrent webhook processing causes database locks
Solution:
- Process webhooks in order per resource (e.g., per customer)
- Use optimistic locking
- Implement retry with backoff for deadlocks
- Reduce transaction scope
// Simple sleep helper used for backing off between attempts
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function processWithRetry(event, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
await processEvent(event);
return;
} catch (error) {
if (error.code === 'ER_LOCK_DEADLOCK' && attempt < maxRetries) {
await sleep(Math.pow(2, attempt) * 100); // Exponential backoff
continue;
}
throw error;
}
}
}
Best Practices Summary
- Always implement idempotency: Use event IDs to prevent duplicate processing
- Return 200 quickly: Respond within 5 seconds, process asynchronously if needed
- Use appropriate status codes: 200 for success, 5xx for retriable failures, 4xx for permanent errors
- Store idempotency keys longer than retry window: Keep keys for at least provider's full retry period
- Log everything: Record all webhook receipts, processing attempts, and failures
- Monitor continuously: Track success rates, retry rates, and processing times
- Implement dead letter queues: Capture and review permanently failed events
- Test retry scenarios: Simulate failures in staging to validate behavior (see the sketch after this list)
- Use queue-based processing: Scale processing independently from webhook receipt
- Implement alerts: Get notified when success rates drop or failures spike
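For the testing point above, a quick way to exercise the duplicate path in staging is to replay one payload twice against your endpoint and confirm the second delivery is acknowledged without reprocessing. A hedged sketch using Node 18+'s built-in fetch, a made-up event ID, and a local URL; it targets endpoints shaped like the examples in this guide (no signature verification):
// Replays one payload twice to confirm the idempotency check short-circuits the second delivery
async function testDuplicateDelivery(url) {
  const payload = { id: 'evt_test_duplicate_1', type: 'payment_intent.succeeded' };
  const send = () => fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });

  const first = await send();
  const second = await send();
  const secondBody = await second.json();

  console.log('First delivery status:', first.status);
  console.log('Second delivery flagged as duplicate:', secondBody.duplicate === true);
}
testDuplicateDelivery('http://localhost:3000/webhooks/stripe');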
Conclusion
Webhook retry logic is essential for reliable integrations. Providers automatically retry failed deliveries, but your application must handle retries gracefully with idempotency checks to prevent duplicate processing. By implementing the patterns in this guide (event ID tracking, queue-based processing, proper status codes, and dead letter queues), you'll build webhook endpoints that survive temporary failures without data loss or duplication.
The key is designing for failure from the start: webhooks will fail, retries will happen, and your code must handle both scenarios correctly. With proper idempotency, monitoring, and error handling, your webhook integrations will be production-ready and reliable.
Need help implementing secure, reliable webhook integrations? Our team specializes in building production-ready API integrations with proper error handling, monitoring, and security. Contact us for a consultation or explore our developer tools for webhook testing and debugging resources.