
Webhook Retry Logic: Handling Failures and Reliability

Learn how webhook providers implement retry logic, how to handle failures gracefully with idempotency patterns, and best practices for building reliable webhook integrations that survive network issues and temporary outages.

By InventiveHQ Team

Webhooks power real-time integrations between services, but network failures, server downtime, and processing errors are inevitable. When a webhook delivery fails, what happens next? Understanding webhook retry logic is critical for building reliable integrations that survive temporary outages without losing events or processing duplicates.

This guide covers how webhook providers implement retry mechanisms, how to handle retries gracefully with idempotency patterns, and best practices for ensuring your webhook endpoints are production-ready.

Why Webhooks Fail

Webhook delivery failures occur for many reasons:

  • Network issues: Timeouts, DNS failures, connection resets
  • Server downtime: Your application is deploying, restarting, or experiencing an outage
  • Processing errors: Database deadlocks, external API timeouts, out-of-memory errors
  • Rate limiting: Your server is overwhelmed and rejecting requests
  • Configuration errors: Incorrect endpoint URLs, firewall rules blocking requests
  • Temporary unavailability: Cloud provider issues, load balancer health checks failing

Even well-architected systems experience failures. Webhook retry logic provides fault tolerance by automatically reattempting delivery when transient issues occur.

How Webhook Retries Work

When a webhook provider sends an event to your endpoint, it expects an HTTP response within a timeout window (typically 5-30 seconds). Based on the response, the provider decides whether to retry:

  • Successful delivery (HTTP 200-299): Event marked as delivered, no retry
  • Temporary failure (HTTP 5xx, timeout, network error): Event queued for retry
  • Permanent failure (HTTP 4xx): Event marked as failed, no retry (usually)

Most providers implement exponential backoff - increasing wait times between retries to avoid overwhelming recovering systems:

Attempt 1: Immediate
Attempt 2: 1 second later
Attempt 3: 2 seconds later
Attempt 4: 4 seconds later
Attempt 5: 8 seconds later
...

After exhausting retries (anywhere from 3 to roughly 25 attempts over an hour to several days, depending on the provider), events are marked as permanently failed.
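The decision rules and backoff schedule above can be sketched in a few lines. This is an illustrative model, not any specific provider's implementation; the 12-hour cap mirrors the maximum Stripe documents:

```javascript
// Illustrative sketch of provider-side retry decisions.
function shouldRetry(httpStatus) {
  if (httpStatus >= 200 && httpStatus < 300) return false; // delivered
  if (httpStatus >= 400 && httpStatus < 500) return false; // permanent failure
  return true; // 5xx (and, in practice, timeouts/network errors): retry
}

// Delay before a given attempt, following the doubling schedule above.
function backoffDelaySeconds(attempt, maxDelaySeconds = 43200) {
  if (attempt <= 1) return 0;               // first attempt is immediate
  const delay = Math.pow(2, attempt - 2);   // 1s, 2s, 4s, 8s, ...
  return Math.min(delay, maxDelaySeconds);  // cap growth (e.g. 12 hours)
}
```

Real providers typically add random jitter on top of this schedule so that retries from many endpoints don't synchronize.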

Provider Retry Comparison

Different webhook providers implement varying retry strategies:

| Provider | Retry Attempts | Retry Window | Backoff Strategy           | 4xx Retries |
|----------|----------------|--------------|----------------------------|-------------|
| Stripe   | ~25 attempts   | 3 days       | Exponential (max 12 hours) | No          |
| GitHub   | 3 attempts     | 1 hour       | Linear (5 minutes)         | No          |
| Twilio   | 3 attempts     | 24 hours     | Exponential                | No          |
| Shopify  | 19 attempts    | 48 hours     | Exponential (max 12 hours) | No          |
| Square   | 10 attempts    | 3 days       | Exponential                | No          |
| PayPal   | ~8 attempts    | 4 days       | Exponential                | No          |
| Slack    | 3 attempts     | 1 hour       | Exponential                | No          |
| Mailgun  | 5 attempts     | 8 hours      | Linear                     | No          |

Key takeaway: Always check your specific provider's documentation. Retry behavior varies significantly, affecting how you design error handling and idempotency.

Implementing Idempotency

Since providers retry failed webhooks, your endpoint may receive the same event multiple times. Idempotency ensures processing an event multiple times produces the same result as processing it once.

Event ID Tracking

Most webhook providers include a unique event ID in the payload or headers (e.g., Stripe's id field, GitHub's X-GitHub-Delivery header). Track processed event IDs to detect duplicates:

Database pattern:

CREATE TABLE processed_webhooks (
  event_id VARCHAR(255) PRIMARY KEY,
  provider VARCHAR(50) NOT NULL,
  event_type VARCHAR(100),
  processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_processed_at (processed_at)
);

Node.js implementation:

const express = require('express');
const app = express();

app.use(express.json()); // Parse JSON bodies so req.body is available

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  try {
    // Check if already processed
    const existing = await db.query(
      'SELECT event_id FROM processed_webhooks WHERE event_id = ?',
      [eventId]
    );

    if (existing.length > 0) {
      console.log(`Duplicate webhook ${eventId}, skipping`);
      return res.status(200).json({ received: true, duplicate: true });
    }

    // Process webhook
    await processStripeEvent(req.body);

    // Mark as processed
    await db.query(
      'INSERT INTO processed_webhooks (event_id, provider, event_type) VALUES (?, ?, ?)',
      [eventId, 'stripe', req.body.type]
    );

    res.status(200).json({ received: true });
  } catch (error) {
    console.error('Webhook processing error:', error);
    // Return 500 to trigger provider retry
    res.status(500).json({ error: 'Processing failed' });
  }
});

Redis Implementation

For high-throughput systems, Redis provides faster idempotency checking with automatic TTL:

const Redis = require('ioredis');
const redis = new Redis();

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;
  const lockKey = `webhook:processed:${eventId}`;

  try {
    // Atomic check-and-set with 4-day TTL (longer than Stripe's 3-day retry window)
    const wasSet = await redis.set(lockKey, '1', 'EX', 345600, 'NX');

    if (!wasSet) {
      console.log(`Duplicate webhook ${eventId}, skipping`);
      return res.status(200).json({ received: true, duplicate: true });
    }

    // Process webhook
    await processStripeEvent(req.body);

    res.status(200).json({ received: true });
  } catch (error) {
    // Delete lock on processing failure to allow retry
    await redis.del(lockKey);
    console.error('Webhook processing error:', error);
    res.status(500).json({ error: 'Processing failed' });
  }
});

Redis advantages:

  • O(1) duplicate detection
  • Automatic TTL prevents unbounded growth
  • Atomic operations prevent race conditions
  • No database cleanup jobs needed

Python Implementation with Django

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.core.cache import cache
import json

@csrf_exempt
def stripe_webhook(request):
    payload = json.loads(request.body)  # Parse once, reuse below
    event_id = payload['id']
    cache_key = f'webhook_processed_{event_id}'

    # Check cache (Redis) with 4-day TTL
    if cache.get(cache_key):
        return JsonResponse({'received': True, 'duplicate': True})

    try:
        # Process webhook
        process_stripe_event(payload)

        # Mark as processed
        cache.set(cache_key, True, 345600)  # 4 days in seconds

        return JsonResponse({'received': True})
    except Exception as e:
        # Return 500 to trigger retry
        return JsonResponse({'error': str(e)}, status=500)

PHP Implementation

<?php
// Using Redis for idempotency
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$payload = json_decode(file_get_contents('php://input'), true);
$eventId = $payload['id'];
$lockKey = "webhook:processed:$eventId";

// Atomic check-and-set with 4-day TTL; NX fails if the key already exists,
// so concurrent deliveries of the same event can't both pass the check
if (!$redis->set($lockKey, '1', ['nx', 'ex' => 345600])) {
    http_response_code(200);
    echo json_encode(['received' => true, 'duplicate' => true]);
    exit;
}

try {
    // Process webhook
    processStripeEvent($payload);

    http_response_code(200);
    echo json_encode(['received' => true]);
} catch (Exception $e) {
    // Release the lock so the provider's retry can be processed
    $redis->del($lockKey);
    error_log('Webhook processing error: ' . $e->getMessage());
    http_response_code(500);
    echo json_encode(['error' => 'Processing failed']);
}

Handling Retries Gracefully

Return 200 for Successful Receipt

Respond with HTTP 200 as soon as you've received and validated the webhook, even if processing isn't complete:

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  // Immediately return 200 to prevent retry
  res.status(200).json({ received: true });

  // Process asynchronously
  processWebhookAsync(eventId, req.body).catch(error => {
    console.error('Async processing failed:', error);
    // Log to monitoring system, add to dead letter queue, etc.
  });
});

Queue-Based Pattern

For complex processing, enqueue webhooks and process them asynchronously:

const Bull = require('bull');
const webhookQueue = new Bull('webhooks', {
  redis: { host: '127.0.0.1', port: 6379 }
});

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  // Check idempotency
  const isDuplicate = await redis.get(`webhook:processed:${eventId}`);
  if (isDuplicate) {
    return res.status(200).json({ received: true, duplicate: true });
  }

  // Add to queue
  await webhookQueue.add({
    eventId: eventId,
    provider: 'stripe',
    payload: req.body
  }, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2000 }
  });

  res.status(200).json({ received: true, queued: true });
});

// Worker processes queue
webhookQueue.process(async (job) => {
  const { eventId, payload } = job.data;

  // Process webhook first
  await processStripeEvent(payload);

  // Mark as processed only after success, so failed jobs stay retriable
  await redis.set(`webhook:processed:${eventId}`, '1', 'EX', 345600);
});

Benefits:

  • Webhook endpoint returns quickly, preventing timeouts
  • Failed processing can retry with backoff
  • Queue provides visibility into processing status
  • Scales independently from web servers

Duplicate Detection at Multiple Levels

Implement idempotency checks at multiple stages:

  1. Endpoint level: Prevent processing duplicate webhooks
  2. Business logic level: Prevent duplicate operations (e.g., charging twice)
  3. Database level: Use unique constraints to prevent duplicate records

async function processPaymentSucceeded(paymentIntent) {
  const idempotencyKey = paymentIntent.id;

  try {
    // Insert with idempotency constraint
    await db.query(
      `INSERT INTO payments (payment_intent_id, amount, status, created_at)
       VALUES (?, ?, ?, NOW())`,
      [idempotencyKey, paymentIntent.amount, 'succeeded']
    );

    // Update order status
    await db.query(
      `UPDATE orders SET payment_status = 'paid' WHERE payment_intent_id = ?`,
      [idempotencyKey]
    );

  } catch (error) {
    // Duplicate key error is expected and safe
    if (error.code === 'ER_DUP_ENTRY') {
      console.log('Payment already recorded');
      return;
    }
    throw error;
  }
}

Client-Side Retry Logic

When provider retries are insufficient or you need guaranteed delivery, implement your own retry mechanism:

const cron = require('node-cron');

// Run every hour
cron.schedule('0 * * * *', async () => {
  // List recent events; iterating with for-await lets stripe-node
  // auto-paginate past the default page size
  const events = stripe.events.list({
    created: { gte: Math.floor(Date.now() / 1000) - 86400 }, // Last 24 hours
    limit: 100
  });

  for await (const event of events) {
    // Check if processed
    const processed = await redis.get(`webhook:processed:${event.id}`);

    if (!processed) {
      console.log(`Missing event ${event.id}, processing now`);
      await processStripeEvent(event);
      await redis.set(`webhook:processed:${event.id}`, '1', 'EX', 345600);
    }
  }
});

When to implement:

  • Critical events (payments, account changes)
  • Providers with short retry windows
  • High-reliability requirements
  • Historical event reconciliation

Dead Letter Queues

Capture permanently failed webhooks for manual review:

async function processStripeEvent(payload) {
  try {
    // Process webhook
    await handleEvent(payload);
  } catch (error) {
    console.error('Webhook processing failed:', error);

    // After max retries, save to DLQ
    // (isUnrecoverable is your own classifier for permanent, non-retriable errors)
    if (isUnrecoverable(error)) {
      await db.query(
        `INSERT INTO webhook_dlq (event_id, provider, payload, error_message, created_at)
         VALUES (?, ?, ?, ?, NOW())`,
        [payload.id, 'stripe', JSON.stringify(payload), error.message]
      );

      // Alert team
      await sendAlert({
        title: 'Webhook permanently failed',
        eventId: payload.id,
        error: error.message
      });
    }

    throw error; // Re-throw to trigger provider retry
  }
}

DLQ table schema:

CREATE TABLE webhook_dlq (
  id INT AUTO_INCREMENT PRIMARY KEY,
  event_id VARCHAR(255) UNIQUE,
  provider VARCHAR(50),
  event_type VARCHAR(100),
  payload JSON,
  error_message TEXT,
  retry_count INT DEFAULT 0,
  last_retry_at TIMESTAMP NULL,
  resolved_at TIMESTAMP NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_provider_created (provider, created_at),
  INDEX idx_resolved (resolved_at)
);

Monitoring Retry Metrics

Track webhook reliability with these metrics:

Key metrics:

  • Success rate (successful / total deliveries)
  • Retry rate (retried / total deliveries)
  • Average time to success
  • Permanent failure rate
  • Duplicate detection rate

Implementation:

async function recordWebhookMetrics(eventId, provider, status, retryCount) {
  await db.query(
    `INSERT INTO webhook_metrics (event_id, provider, status, retry_count, timestamp)
     VALUES (?, ?, ?, ?, NOW())`,
    [eventId, provider, status, retryCount]
  );
}

// Query for dashboard
async function getWebhookStats(provider, hours = 24) {
  return await db.query(`
    SELECT
      COUNT(*) as total,
      SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successful,
      SUM(CASE WHEN status = 'duplicate' THEN 1 ELSE 0 END) as duplicates,
      SUM(CASE WHEN retry_count > 0 THEN 1 ELSE 0 END) as retried,
      AVG(retry_count) as avg_retries
    FROM webhook_metrics
    WHERE provider = ? AND timestamp > NOW() - INTERVAL ? HOUR
  `, [provider, hours]);
}

Alerting thresholds:

  • Success rate drops below 95%
  • Retry rate exceeds 10%
  • Permanent failures exceed 1%
  • Processing time exceeds 5 seconds (median)
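As a sketch, those thresholds can be evaluated directly against aggregated stats; the field names used here (permanentFailures, medianProcessingMs) are assumptions, not columns defined in the queries above:

```javascript
// Evaluate webhook stats against the alerting thresholds listed above.
// Returns an array of alert messages; an empty array means healthy.
function checkAlertThresholds(stats) {
  const { total, successful, retried, permanentFailures, medianProcessingMs } = stats;
  const alerts = [];
  if (total === 0) return alerts; // nothing delivered, nothing to alert on

  if (successful / total < 0.95) alerts.push('Success rate below 95%');
  if (retried / total > 0.10) alerts.push('Retry rate above 10%');
  if (permanentFailures / total > 0.01) alerts.push('Permanent failures above 1%');
  if (medianProcessingMs > 5000) alerts.push('Median processing time above 5s');
  return alerts;
}
```

Run this on the output of getWebhookStats (plus a latency percentile) and forward any non-empty result to your paging or chat integration.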

Troubleshooting Common Retry Issues

Issue 1: Webhook Storms

Symptom: Thousands of webhooks arriving simultaneously after a brief outage

Solution:

  • Implement rate limiting at endpoint level
  • Use queue-based processing to smooth traffic
  • Scale workers horizontally during recovery
  • Consider provider rate limit settings

const rateLimit = require('express-rate-limit');

const webhookLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100, // 100 requests per minute
  message: 'Too many webhooks, please retry',
  statusCode: 503, // Service unavailable triggers retry
});

app.post('/webhooks/stripe', webhookLimiter, handleStripeWebhook);

Issue 2: Idempotency Key Collisions

Symptom: Different events with same ID (rare but possible)

Solution: Scope keys by provider so identical IDs from different providers cannot collide:

const idempotencyKey = `${provider}:${eventId}`;
await redis.set(`webhook:processed:${idempotencyKey}`, '1', 'EX', 345600);

Avoid including the delivery timestamp in the key: retried deliveries arrive at different times, so a time-based key would defeat duplicate detection.

Issue 3: Processing Timeouts

Symptom: Webhooks timing out before processing completes

Solution:

  • Return 200 immediately, process asynchronously
  • Optimize database queries
  • Use connection pooling
  • Increase timeout limits (if provider allows)
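If some step must run inline, one way to keep it from holding the connection open is to race it against a deadline. A minimal sketch (the helper name and usage are our own, not a library API):

```javascript
// Reject a promise if it doesn't settle within `ms` milliseconds,
// so a slow dependency fails fast instead of stalling the webhook response.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage inside a handler: bound a database call to 3 seconds
// await withTimeout(db.query('SELECT ...'), 3000, 'order lookup');
```

Note the underlying operation keeps running after the timeout fires; this bounds your response time, not the work itself.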

Issue 4: Database Deadlocks

Symptom: Concurrent webhook processing causes database locks

Solution:

  • Process webhooks in order per resource (e.g., per customer)
  • Use optimistic locking
  • Implement retry with backoff for deadlocks
  • Reduce transaction scope

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processWithRetry(event, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await processEvent(event);
      return;
    } catch (error) {
      if (error.code === 'ER_LOCK_DEADLOCK' && attempt < maxRetries) {
        await sleep(Math.pow(2, attempt) * 100); // Exponential backoff: 200ms, 400ms
        continue;
      }
      throw error;
    }
  }
}

Best Practices Summary

  1. Always implement idempotency: Use event IDs to prevent duplicate processing
  2. Return 200 quickly: Respond within 5 seconds, process asynchronously if needed
  3. Use appropriate status codes: 200 for success, 5xx for retriable failures, 4xx for permanent errors
  4. Store idempotency keys longer than retry window: Keep keys for at least provider's full retry period
  5. Log everything: Record all webhook receipts, processing attempts, and failures
  6. Monitor continuously: Track success rates, retry rates, and processing times
  7. Implement dead letter queues: Capture and review permanently failed events
  8. Test retry scenarios: Simulate failures in staging to validate behavior
  9. Use queue-based processing: Scale processing independently from webhook receipt
  10. Implement alerts: Get notified when success rates drop or failures spike

Conclusion

Webhook retry logic is essential for reliable integrations. Providers automatically retry failed deliveries, but your application must handle retries gracefully with idempotency checks to prevent duplicate processing. By implementing the patterns in this guide - event ID tracking, queue-based processing, proper status codes, and dead letter queues - you'll build webhook endpoints that survive temporary failures without data loss or duplication.

The key is designing for failure from the start: webhooks will fail, retries will happen, and your code must handle both scenarios correctly. With proper idempotency, monitoring, and error handling, your webhook integrations will be production-ready and reliable.

Need help implementing secure, reliable webhook integrations? Our team specializes in building production-ready API integrations with proper error handling, monitoring, and security. Contact us for a consultation or explore our developer tools for webhook testing and debugging resources.
