On 18 November 2025 Cloudflare — the company that provides a content‑delivery network (CDN) for millions of websites — experienced one of its worst outages in years. Internet users across the globe suddenly saw “Error 5xx” pages instead of their favourite sites, including social networks, games and productivity apps. Although services began recovering after about three hours, the incident underscored how fragile the modern web can be when a single provider fails.
Timeline of the outage
The outage began around 11:20 UTC when Cloudflare’s network started returning HTTP 5xx errors for core traffic. Engineers initially suspected a large‑scale distributed denial‑of‑service (DDoS) attack and spent time investigating that possibility. By 14:30 UTC Cloudflare rolled out a fix that stopped propagating a faulty configuration file, and core traffic flows returned to normal. All services were fully restored by 17:06 UTC.
What actually happened?
Contrary to early speculation, the outage was not caused by a cyber‑attack. In an extensive post‑mortem, Cloudflare explained that a routine change to improve database permissions triggered a chain reaction that brought down key parts of its network:
Database permissions change: Engineers modified a ClickHouse database so that queries would explicitly run under individual user accounts. This change meant users could see metadata for underlying tables in the r0 database as well as the default database.
Unfiltered query duplicates: The query used to generate a “feature file” for the company’s Bot Management system did not filter by database name. After the permissions change it returned duplicate rows from both databases, and the feature file (a list of machine‑learning features used to calculate bot scores for every HTTP request) suddenly doubled in size, as the sketch after this list illustrates.
Memory limit breached: Cloudflare’s core proxy pre‑allocates memory for no more than 200 features. When the feature file contained more than 200 entries, the proxy panicked and returned 5xx errors.
Cascading failures: The oversized file was propagated to servers across Cloudflare’s global network. Some nodes were still running the old configuration, so the bad file would occasionally be generated and redistributed. This caused waves of failures and recoveries that made diagnosis tricky. Eventually every node produced the bad file and the system remained in a failing state.
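To make that chain concrete, here is a minimal sketch of the failure mode in Python. Everything in it is a hypothetical stand‑in for illustration (the fake metadata rows, column names and loader); only the 200‑feature limit and the missing database filter come from Cloudflare’s post‑mortem.

```python
# Minimal sketch of the failure chain; names and numbers other than the
# 200-feature limit are hypothetical stand-ins, not Cloudflare's real schema.

MAX_FEATURES = 200  # the core proxy pre-allocated room for at most 200 features


def fetch_feature_columns(filter_by_database: bool) -> list[str]:
    """Simulate the metadata query that builds the Bot Management feature file.

    Before the permissions change only the 'default' database was visible.
    Afterwards the same tables also appeared under 'r0', so a query that does
    not filter on database name sees every column twice.
    """
    visible = [("default", f"feature_{i}") for i in range(120)]
    visible += [("r0", f"feature_{i}") for i in range(120)]  # newly visible duplicates

    if filter_by_database:
        return [col for db, col in visible if db == "default"]
    return [col for _db, col in visible]  # no "WHERE database = 'default'" equivalent


def load_feature_file(features: list[str]) -> None:
    """Stand-in for the proxy's loader, which had a fixed-size feature table."""
    if len(features) > MAX_FEATURES:
        # This is the condition the real proxy hit, causing it to panic and
        # turn every request it handled into a 5xx error.
        raise RuntimeError(f"feature file too large: {len(features)} > {MAX_FEATURES}")
    print(f"loaded {len(features)} features")


load_feature_file(fetch_feature_columns(filter_by_database=True))  # 120 features: fine
try:
    load_feature_file(fetch_feature_columns(filter_by_database=False))  # 240 features
except RuntimeError as err:
    print(f"proxy would panic here: {err}")
```

Running the sketch shows the feature list doubling from 120 to 240 entries the moment the database filter is dropped, which is exactly the kind of silent growth that tripped the proxy’s hard limit.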
The cause is remarkably similar to previous tech outages: a small configuration change, combined with a hidden assumption (here, a query that implicitly expected to see each table’s metadata only once), cascaded into a systemic failure.
Which services were impacted?
The outage affected a wide range of Cloudflare products and the sites that depend on them:
Core CDN and security services: Visitors to Cloudflare‑hosted sites saw 5xx error pages.
Turnstile: Cloudflare’s CAPTCHA alternative failed to load, preventing many users from logging in.
Workers KV and Access: Both services returned elevated 5xx errors or failed authentications until Cloudflare deployed bypasses.
Dashboard: The Cloudflare dashboard remained online, but most users couldn’t authenticate because the Turnstile challenge was unavailable.
Email Security: Some IP‑reputation lookups and auto‑move actions were delayed.
Outside of Cloudflare, major websites including ChatGPT, X (formerly Twitter), Canva, League of Legends and Valorant experienced downtime. Even the outage‑monitoring site Downdetector went offline.
How Cloudflare responded
Once engineers realised the problem wasn’t a DDoS attack, they implemented several mitigations:
Stopped propagation of the faulty file: At 14:30 UTC they halted the generation and distribution of the oversized feature file and manually inserted a known‑good configuration (a generic version of this pattern is sketched after this list).
Forced restart of the core proxy: Restarting the proxy cleared the bad state and allowed the network to process traffic normally again.
Bypass for Workers KV and Access: At 13:05 UTC, before rolling out the main fix, Cloudflare temporarily bypassed these services by falling back to an earlier version of the core proxy. This reduced the error rate for customers dependent on those services while investigations continued.
Careful restoration: After the main fix, Cloudflare gradually restarted affected services (Turnstile, KV, Access and the dashboard) while monitoring load. The long tail of 5xx errors was due to restarting and scaling systems as traffic surged back.
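The first two mitigations follow a pattern any operations team can reuse: stop the job that produces the bad artefact, pin a known‑good copy in its place, and reload or restart the consumers so they drop the bad in‑memory state. The sketch below is a generic illustration of that pattern, not Cloudflare’s tooling; the file paths and the use of SIGHUP are assumptions.

```python
# Generic "stop the bleeding" sketch; paths and signals are illustrative assumptions.
import os
import shutil
import signal

FEATURE_FILE = "/etc/proxy/bot-features.json"           # assumed live config path
KNOWN_GOOD = "/etc/proxy/bot-features.known-good.json"  # assumed pinned-good copy


def pin_known_good_config() -> None:
    """Replace the live config atomically so readers never see a half-written file."""
    tmp_path = FEATURE_FILE + ".tmp"
    shutil.copyfile(KNOWN_GOOD, tmp_path)
    os.replace(tmp_path, FEATURE_FILE)  # atomic rename on POSIX filesystems


def reload_consumer(pid: int) -> None:
    """Ask the consuming process to re-read its configuration.

    Many servers treat SIGHUP as "reload config"; a full restart is the blunter
    option when the process is already stuck in a bad state, as the core proxy
    was during this incident.
    """
    os.kill(pid, signal.SIGHUP)
```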
Lessons Cloudflare learned
Cloudflare has publicly committed to several changes to prevent a repeat of this incident. Key takeaways from their post‑mortem include:
Harden configuration file ingestion: Treat internally generated configuration files as untrusted input by validating size and content before applying them (see the sketch after this list).
Add global kill switches: Implement more granular shutdown mechanisms so problematic modules can be disabled quickly without impacting the entire network.
Guard against error‑reporting overload: Ensure debugging or error‑reporting systems cannot consume excessive CPU and memory during a failure.
Review failure modes across modules: Perform rigorous fault‑injection testing to understand how each proxy module behaves under extreme conditions.
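The first two of these lessons translate directly into code. The sketch below is a generic illustration of validating a generated feature file before applying it, with a fallback to the last known‑good version and a module‑level kill switch; the limits, file format and names are assumptions, not Cloudflare’s implementation.

```python
# Generic sketch: treat a generated config as untrusted input, keep a kill switch.
import json

MAX_FEATURES = 200          # hard limit the consumer can actually handle
MAX_FILE_BYTES = 1_000_000  # assumed sanity bound on raw file size
BOT_MODULE_ENABLED = True   # kill switch: flip to False to disable the module globally


def validate_feature_config(raw: bytes) -> list[str]:
    """Reject oversized, malformed or duplicated feature lists before they go live."""
    if len(raw) > MAX_FILE_BYTES:
        raise ValueError(f"config too large: {len(raw)} bytes")
    features = json.loads(raw)
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        raise ValueError("config must be a JSON list of feature names")
    if len(features) != len(set(features)):
        raise ValueError("config contains duplicate features")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds the limit of {MAX_FEATURES}")
    return features


def apply_config(candidate: bytes, last_known_good: list[str]) -> list[str]:
    """Apply a new config only if it validates; otherwise keep serving the old one."""
    if not BOT_MODULE_ENABLED:
        return []  # kill switch engaged: run without bot scoring instead of failing requests
    try:
        return validate_feature_config(candidate)
    except ValueError as err:
        # In production this is where you would alert on-call rather than crash.
        print(f"rejecting new config ({err}); keeping last known-good version")
        return last_known_good


good = apply_config(json.dumps([f"feature_{i}" for i in range(120)]).encode(), [])
bad = apply_config(json.dumps([f"feature_{i}" for i in range(120)] * 2).encode(), good)
assert bad == good  # the oversized file is rejected instead of crashing the consumer
```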
These actions mirror industry best practices: avoid single points of failure, validate all inputs, and build robust failure containment.
How much did the outage cost?
Cloudflare has not published a direct cost estimate. However, analysts at Forrester estimated that the 3‑hour‑20‑minute outage could cause global economic losses of $250–300 million once downtime costs and the impact on marketplaces like Shopify and Etsy are included. That figure reflects just how intertwined the internet’s infrastructure has become: a fault in one provider can ripple across the digital economy.
What can businesses do to protect themselves?
No CDN or hosting provider can guarantee 100% uptime. The Cloudflare outage underscores the importance of resilience and contingency planning:
Use multi‑CDN or multi‑cloud architectures: Distribute traffic across more than one CDN or cloud provider so a failure in one doesn’t halt your entire site. Services like AWS CloudFront, Akamai, Fastly, Google Cloud CDN and Cloudflare can be combined with smart routing (a failover sketch follows this list).
Cache static content: Employ edge caching and static site generation so that if dynamic services fail, the most important pages (home, product listings) can still be served.
Implement graceful degradation: Design your application to fail in a user‑friendly way: display cached pages, reduce functionality, or provide offline messages rather than throwing server errors.
Monitor third‑party dependencies: Subscribe to status pages and set up automated alerts so you know when your infrastructure providers are experiencing issues.
Have an incident response plan: Practice scenarios where your primary CDN goes down. Ensure your team knows how to switch DNS records to a backup provider or enable maintenance mode.
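Several of these ideas (multi‑CDN failover, cached fallbacks, graceful degradation) combine into one small pattern: try the primary provider, fail over to a secondary, and serve a cached or reduced page if everything is down. The sketch below illustrates the pattern; the hostnames, timeout and fallback page are placeholders, not a recommendation for any particular provider.

```python
# Sketch of a multi-CDN fetch with graceful degradation; hostnames are placeholders.
import urllib.error
import urllib.request

CDN_ENDPOINTS = [
    "https://primary-cdn.example.com",    # e.g. your Cloudflare-fronted hostname
    "https://secondary-cdn.example.com",  # e.g. a second CDN or your origin directly
]
CACHED_FALLBACK = b"<html><body>We're having trouble right now. Please try again soon.</body></html>"


def fetch_with_failover(path: str, timeout: float = 3.0) -> bytes:
    """Try each provider in order; serve a cached page if every attempt fails."""
    for base in CDN_ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()  # urlopen raises HTTPError for 4xx/5xx responses
        except urllib.error.HTTPError as err:
            if err.code >= 500:
                continue  # provider is erroring (like the 5xx pages in this outage): try the next one
            raise  # a 4xx means the request itself is wrong, not the provider
        except (urllib.error.URLError, OSError):
            continue  # DNS failure, timeout or connection error: try the next provider
    return CACHED_FALLBACK  # graceful degradation instead of surfacing a raw server error


if __name__ == "__main__":
    print(fetch_with_failover("/").decode(errors="replace")[:80])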
Conclusion
The November 18 Cloudflare outage is a case study in how small configuration changes can cascade into large‑scale failures. A database permissions tweak exposed previously hidden tables, a query unexpectedly doubled the size of a machine‑learning configuration file, and a memory‑limit check caused the proxy software to panic. Within minutes, vast portions of the web were inaccessible. Cloudflare has pledged improvements, but the incident reminds us that resilience — not blind reliance on any single provider — is essential. At Datronix Tech, we design platforms with redundancy and fail‑safes, ensuring that your site keeps running even when big players stumble. Reach out to see how we can help future‑proof your online presence.