
Webhook Retry Logic: Handling Failures and Reliability

Learn how webhook providers implement retry logic, how to handle failures gracefully with idempotency patterns, and best practices for building reliable webhook integrations that survive network issues and temporary outages.

By InventiveHQ Team

Webhooks power real-time integrations between services, but network failures, server downtime, and processing errors are inevitable. When a webhook delivery fails, what happens next? Understanding webhook retry logic is critical for building reliable integrations that survive temporary outages without losing events or processing duplicates.

This guide covers how webhook providers implement retry mechanisms, how to handle retries gracefully with idempotency patterns, and best practices for ensuring your webhook endpoints are production-ready.

Why Webhooks Fail

Webhook delivery failures occur for many reasons:

  • Network issues: Timeouts, DNS failures, connection resets
  • Server downtime: Your application is deploying, restarting, or experiencing an outage
  • Processing errors: Database deadlocks, external API timeouts, out-of-memory errors
  • Rate limiting: Your server is overwhelmed and rejecting requests
  • Configuration errors: Incorrect endpoint URLs, firewall rules blocking requests
  • Temporary unavailability: Cloud provider issues, load balancer health checks failing

Even well-architected systems experience failures. Webhook retry logic provides fault tolerance by automatically reattempting delivery when transient issues occur.

How Webhook Retries Work

When a webhook provider sends an event to your endpoint, it expects an HTTP response within a timeout window (typically 5-30 seconds). Based on the response, the provider decides whether to retry:

  • Successful delivery (HTTP 200-299): Event marked as delivered, no retry
  • Temporary failure (HTTP 5xx, timeout, network error): Event queued for retry
  • Permanent failure (HTTP 4xx): Event marked as failed, no retry (usually)

Most providers implement exponential backoff - increasing wait times between retries to avoid overwhelming recovering systems:

Attempt 1: Immediate
Attempt 2: 1 second later
Attempt 3: 2 seconds later
Attempt 4: 4 seconds later
Attempt 5: 8 seconds later
...

After exhausting retries (anywhere from 3 to roughly 25 attempts over an hour to several days, depending on the provider), events are marked as permanently failed.
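The decision rules and backoff schedule above can be sketched in a few lines. This is an illustrative model, not any specific provider's implementation; the 12-hour cap mirrors the maximum Stripe documents:

```javascript
// Illustrative sketch of provider-side retry decisions.
function shouldRetry(httpStatus) {
  if (httpStatus >= 200 && httpStatus < 300) return false; // delivered
  if (httpStatus >= 400 && httpStatus < 500) return false; // permanent failure
  return true; // 5xx (and, in practice, timeouts/network errors): retry
}

// Delay before a given attempt, following the doubling schedule above.
function backoffDelaySeconds(attempt, maxDelaySeconds = 43200) {
  if (attempt <= 1) return 0;               // first attempt is immediate
  const delay = Math.pow(2, attempt - 2);   // 1s, 2s, 4s, 8s, ...
  return Math.min(delay, maxDelaySeconds);  // cap growth (e.g. 12 hours)
}
```

Real providers typically add random jitter on top of this schedule so that retries from many endpoints don't synchronize.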

Provider Retry Comparison

Different webhook providers implement varying retry strategies:

| Provider | Retry Attempts | Retry Window | Backoff Strategy           | 4xx Retries |
|----------|----------------|--------------|----------------------------|-------------|
| Stripe   | ~25 attempts   | 3 days       | Exponential (max 12 hours) | No          |
| GitHub   | 3 attempts     | 1 hour       | Linear (5 minutes)         | No          |
| Twilio   | 3 attempts     | 24 hours     | Exponential                | No          |
| Shopify  | 19 attempts    | 48 hours     | Exponential (max 12 hours) | No          |
| Square   | 10 attempts    | 3 days       | Exponential                | No          |
| PayPal   | ~8 attempts    | 4 days       | Exponential                | No          |
| Slack    | 3 attempts     | 1 hour       | Exponential                | No          |
| Mailgun  | 5 attempts     | 8 hours      | Linear                     | No          |

Key takeaway: Always check your specific provider's documentation. Retry behavior varies significantly, affecting how you design error handling and idempotency.

Implementing Idempotency

Since providers retry failed webhooks, your endpoint may receive the same event multiple times. Idempotency ensures processing an event multiple times produces the same result as processing it once.

Event ID Tracking

Most webhook providers include a unique event ID in the payload or headers (e.g., Stripe's id field, GitHub's X-GitHub-Delivery header). Track processed event IDs to detect duplicates:

Database pattern:

CREATE TABLE processed_webhooks (
  event_id VARCHAR(255) PRIMARY KEY,
  provider VARCHAR(50) NOT NULL,
  event_type VARCHAR(100),
  processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_processed_at (processed_at)
);

Node.js implementation:

const express = require('express');
const app = express();

app.use(express.json()); // Parse JSON bodies so req.body is available

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  try {
    // Check if already processed
    const existing = await db.query(
      'SELECT event_id FROM processed_webhooks WHERE event_id = ?',
      [eventId]
    );

    if (existing.length > 0) {
      console.log(`Duplicate webhook ${eventId}, skipping`);
      return res.status(200).json({ received: true, duplicate: true });
    }

    // Process webhook
    await processStripeEvent(req.body);

    // Mark as processed
    await db.query(
      'INSERT INTO processed_webhooks (event_id, provider, event_type) VALUES (?, ?, ?)',
      [eventId, 'stripe', req.body.type]
    );

    res.status(200).json({ received: true });
  } catch (error) {
    console.error('Webhook processing error:', error);
    // Return 500 to trigger provider retry
    res.status(500).json({ error: 'Processing failed' });
  }
});

Redis Implementation

For high-throughput systems, Redis provides faster idempotency checking with automatic TTL:

const Redis = require('ioredis');
const redis = new Redis();

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;
  const lockKey = `webhook:processed:${eventId}`;

  try {
    // Atomic check-and-set with 4-day TTL (longer than Stripe's 3-day retry window)
    const wasSet = await redis.set(lockKey, '1', 'EX', 345600, 'NX');

    if (!wasSet) {
      console.log(`Duplicate webhook ${eventId}, skipping`);
      return res.status(200).json({ received: true, duplicate: true });
    }

    // Process webhook
    await processStripeEvent(req.body);

    res.status(200).json({ received: true });
  } catch (error) {
    // Delete lock on processing failure to allow retry
    await redis.del(lockKey);
    console.error('Webhook processing error:', error);
    res.status(500).json({ error: 'Processing failed' });
  }
});

Redis advantages:

  • O(1) duplicate detection
  • Automatic TTL prevents unbounded growth
  • Atomic operations prevent race conditions
  • No database cleanup jobs needed

Python Implementation with Django

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.core.cache import cache
import json

@csrf_exempt
def stripe_webhook(request):
    payload = json.loads(request.body)  # Parse once, reuse below
    event_id = payload['id']
    cache_key = f'webhook_processed_{event_id}'

    # Check cache (Redis) with 4-day TTL
    if cache.get(cache_key):
        return JsonResponse({'received': True, 'duplicate': True})

    try:
        # Process webhook
        process_stripe_event(payload)

        # Mark as processed
        cache.set(cache_key, True, 345600)  # 4 days in seconds

        return JsonResponse({'received': True})
    except Exception as e:
        # Return 500 to trigger retry
        return JsonResponse({'error': str(e)}, status=500)

PHP Implementation

<?php
// Using Redis for idempotency
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$payload = json_decode(file_get_contents('php://input'), true);
$eventId = $payload['id'];
$lockKey = "webhook:processed:$eventId";

// Atomic check-and-set with 4-day TTL; NX fails if the key already exists,
// so concurrent deliveries of the same event can't both pass the check
if (!$redis->set($lockKey, '1', ['nx', 'ex' => 345600])) {
    http_response_code(200);
    echo json_encode(['received' => true, 'duplicate' => true]);
    exit;
}

try {
    // Process webhook
    processStripeEvent($payload);

    http_response_code(200);
    echo json_encode(['received' => true]);
} catch (Exception $e) {
    // Release the lock so the provider's retry can be processed
    $redis->del($lockKey);
    error_log('Webhook processing error: ' . $e->getMessage());
    http_response_code(500);
    echo json_encode(['error' => 'Processing failed']);
}

Handling Retries Gracefully

Return 200 for Successful Receipt

Respond with HTTP 200 as soon as you've received and validated the webhook, even if processing isn't complete:

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  // Immediately return 200 to prevent retry
  res.status(200).json({ received: true });

  // Process asynchronously
  processWebhookAsync(eventId, req.body).catch(error => {
    console.error('Async processing failed:', error);
    // Log to monitoring system, add to dead letter queue, etc.
  });
});

Queue-Based Pattern

For complex processing, enqueue webhooks and process them asynchronously:

const Bull = require('bull');
const webhookQueue = new Bull('webhooks', {
  redis: { host: '127.0.0.1', port: 6379 }
});

app.post('/webhooks/stripe', async (req, res) => {
  const eventId = req.body.id;

  // Check idempotency
  const isDuplicate = await redis.get(`webhook:processed:${eventId}`);
  if (isDuplicate) {
    return res.status(200).json({ received: true, duplicate: true });
  }

  // Add to queue
  await webhookQueue.add({
    eventId: eventId,
    provider: 'stripe',
    payload: req.body
  }, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2000 }
  });

  res.status(200).json({ received: true, queued: true });
});

// Worker processes queue
webhookQueue.process(async (job) => {
  const { eventId, payload } = job.data;

  // Process webhook first
  await processStripeEvent(payload);

  // Mark as processed only after success, so failed jobs stay retriable
  await redis.set(`webhook:processed:${eventId}`, '1', 'EX', 345600);
});

Benefits:

  • Webhook endpoint returns quickly, preventing timeouts
  • Failed processing can retry with backoff
  • Queue provides visibility into processing status
  • Scales independently from web servers

Duplicate Detection at Multiple Levels

Implement idempotency checks at multiple stages:

  1. Endpoint level: Prevent processing duplicate webhooks
  2. Business logic level: Prevent duplicate operations (e.g., charging twice)
  3. Database level: Use unique constraints to prevent duplicate records

async function processPaymentSucceeded(paymentIntent) {
  const idempotencyKey = paymentIntent.id;

  try {
    // Insert with idempotency constraint
    await db.query(
      `INSERT INTO payments (payment_intent_id, amount, status, created_at)
       VALUES (?, ?, ?, NOW())`,
      [idempotencyKey, paymentIntent.amount, 'succeeded']
    );

    // Update order status
    await db.query(
      `UPDATE orders SET payment_status = 'paid' WHERE payment_intent_id = ?`,
      [idempotencyKey]
    );

  } catch (error) {
    // Duplicate key error is expected and safe
    if (error.code === 'ER_DUP_ENTRY') {
      console.log('Payment already recorded');
      return;
    }
    throw error;
  }
}

Client-Side Retry Logic

When provider retries are insufficient or you need guaranteed delivery, implement your own retry mechanism:

const cron = require('node-cron');

// Run every hour
cron.schedule('0 * * * *', async () => {
  // List recent events; iterating with for-await lets stripe-node
  // auto-paginate past the default page size
  const events = stripe.events.list({
    created: { gte: Math.floor(Date.now() / 1000) - 86400 }, // Last 24 hours
    limit: 100
  });

  for await (const event of events) {
    // Check if processed
    const processed = await redis.get(`webhook:processed:${event.id}`);

    if (!processed) {
      console.log(`Missing event ${event.id}, processing now`);
      await processStripeEvent(event);
      await redis.set(`webhook:processed:${event.id}`, '1', 'EX', 345600);
    }
  }
});

When to implement:

  • Critical events (payments, account changes)
  • Providers with short retry windows
  • High-reliability requirements
  • Historical event reconciliation

Dead Letter Queues

Capture permanently failed webhooks for manual review:

async function processStripeEvent(payload) {
  try {
    // Process webhook
    await handleEvent(payload);
  } catch (error) {
    console.error('Webhook processing failed:', error);

    // After max retries, save to DLQ
    // (isUnrecoverable is your own classifier for permanent, non-retriable errors)
    if (isUnrecoverable(error)) {
      await db.query(
        `INSERT INTO webhook_dlq (event_id, provider, payload, error_message, created_at)
         VALUES (?, ?, ?, ?, NOW())`,
        [payload.id, 'stripe', JSON.stringify(payload), error.message]
      );

      // Alert team
      await sendAlert({
        title: 'Webhook permanently failed',
        eventId: payload.id,
        error: error.message
      });
    }

    throw error; // Re-throw to trigger provider retry
  }
}

DLQ table schema:

CREATE TABLE webhook_dlq (
  id INT AUTO_INCREMENT PRIMARY KEY,
  event_id VARCHAR(255) UNIQUE,
  provider VARCHAR(50),
  event_type VARCHAR(100),
  payload JSON,
  error_message TEXT,
  retry_count INT DEFAULT 0,
  last_retry_at TIMESTAMP NULL,
  resolved_at TIMESTAMP NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_provider_created (provider, created_at),
  INDEX idx_resolved (resolved_at)
);

Monitoring Retry Metrics

Track webhook reliability with these metrics:

Key metrics:

  • Success rate (successful / total deliveries)
  • Retry rate (retried / total deliveries)
  • Average time to success
  • Permanent failure rate
  • Duplicate detection rate

Implementation:

async function recordWebhookMetrics(eventId, provider, status, retryCount) {
  await db.query(
    `INSERT INTO webhook_metrics (event_id, provider, status, retry_count, timestamp)
     VALUES (?, ?, ?, ?, NOW())`,
    [eventId, provider, status, retryCount]
  );
}

// Query for dashboard
async function getWebhookStats(provider, hours = 24) {
  return await db.query(`
    SELECT
      COUNT(*) as total,
      SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successful,
      SUM(CASE WHEN status = 'duplicate' THEN 1 ELSE 0 END) as duplicates,
      SUM(CASE WHEN retry_count > 0 THEN 1 ELSE 0 END) as retried,
      AVG(retry_count) as avg_retries
    FROM webhook_metrics
    WHERE provider = ? AND timestamp > NOW() - INTERVAL ? HOUR
  `, [provider, hours]);
}

Alerting thresholds:

  • Success rate drops below 95%
  • Retry rate exceeds 10%
  • Permanent failures exceed 1%
  • Processing time exceeds 5 seconds (median)
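As a sketch, those thresholds can be evaluated directly against aggregated stats; the field names used here (permanentFailures, medianProcessingMs) are assumptions, not columns defined in the queries above:

```javascript
// Evaluate webhook stats against the alerting thresholds listed above.
// Returns an array of alert messages; an empty array means healthy.
function checkAlertThresholds(stats) {
  const { total, successful, retried, permanentFailures, medianProcessingMs } = stats;
  const alerts = [];
  if (total === 0) return alerts; // nothing delivered, nothing to alert on

  if (successful / total < 0.95) alerts.push('Success rate below 95%');
  if (retried / total > 0.10) alerts.push('Retry rate above 10%');
  if (permanentFailures / total > 0.01) alerts.push('Permanent failures above 1%');
  if (medianProcessingMs > 5000) alerts.push('Median processing time above 5s');
  return alerts;
}
```

Run this on the output of getWebhookStats (plus a latency percentile) and forward any non-empty result to your paging or chat integration.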

Troubleshooting Common Retry Issues

Issue 1: Webhook Storms

Symptom: Thousands of webhooks arriving simultaneously after a brief outage

Solution:

  • Implement rate limiting at endpoint level
  • Use queue-based processing to smooth traffic
  • Scale workers horizontally during recovery
  • Consider provider rate limit settings

const rateLimit = require('express-rate-limit');

const webhookLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100, // 100 requests per minute
  message: 'Too many webhooks, please retry',
  statusCode: 503, // Service unavailable triggers retry
});

app.post('/webhooks/stripe', webhookLimiter, handleStripeWebhook);

Issue 2: Idempotency Key Collisions

Symptom: Different events with same ID (rare but possible)

Solution: Scope keys by provider so identical IDs from different providers cannot collide:

const idempotencyKey = `${provider}:${eventId}`;
await redis.set(`webhook:processed:${idempotencyKey}`, '1', 'EX', 345600);

Avoid including the delivery timestamp in the key: retried deliveries arrive at different times, so a time-based key would defeat duplicate detection.

Issue 3: Processing Timeouts

Symptom: Webhooks timing out before processing completes

Solution:

  • Return 200 immediately, process asynchronously
  • Optimize database queries
  • Use connection pooling
  • Increase timeout limits (if provider allows)
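If some step must run inline, one way to keep it from holding the connection open is to race it against a deadline. A minimal sketch (the helper name and usage are our own, not a library API):

```javascript
// Reject a promise if it doesn't settle within `ms` milliseconds,
// so a slow dependency fails fast instead of stalling the webhook response.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage inside a handler: bound a database call to 3 seconds
// await withTimeout(db.query('SELECT ...'), 3000, 'order lookup');
```

Note the underlying operation keeps running after the timeout fires; this bounds your response time, not the work itself.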

Issue 4: Database Deadlocks

Symptom: Concurrent webhook processing causes database locks

Solution:

  • Process webhooks in order per resource (e.g., per customer)
  • Use optimistic locking
  • Implement retry with backoff for deadlocks
  • Reduce transaction scope

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processWithRetry(event, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await processEvent(event);
      return;
    } catch (error) {
      if (error.code === 'ER_LOCK_DEADLOCK' && attempt < maxRetries) {
        await sleep(Math.pow(2, attempt) * 100); // Exponential backoff: 200ms, 400ms
        continue;
      }
      throw error;
    }
  }
}

Best Practices Summary

  1. Always implement idempotency: Use event IDs to prevent duplicate processing
  2. Return 200 quickly: Respond within 5 seconds, process asynchronously if needed
  3. Use appropriate status codes: 200 for success, 5xx for retriable failures, 4xx for permanent errors
  4. Store idempotency keys longer than retry window: Keep keys for at least provider's full retry period
  5. Log everything: Record all webhook receipts, processing attempts, and failures
  6. Monitor continuously: Track success rates, retry rates, and processing times
  7. Implement dead letter queues: Capture and review permanently failed events
  8. Test retry scenarios: Simulate failures in staging to validate behavior
  9. Use queue-based processing: Scale processing independently from webhook receipt
  10. Implement alerts: Get notified when success rates drop or failures spike

Conclusion

Webhook retry logic is essential for reliable integrations. Providers automatically retry failed deliveries, but your application must handle retries gracefully with idempotency checks to prevent duplicate processing. By implementing the patterns in this guide - event ID tracking, queue-based processing, proper status codes, and dead letter queues - you'll build webhook endpoints that survive temporary failures without data loss or duplication.

The key is designing for failure from the start: webhooks will fail, retries will happen, and your code must handle both scenarios correctly. With proper idempotency, monitoring, and error handling, your webhook integrations will be production-ready and reliable.

Need help implementing secure, reliable webhook integrations? Our team specializes in building production-ready API integrations with proper error handling, monitoring, and security. Contact us for a consultation or explore our developer tools for webhook testing and debugging resources.
