Alex Ng - Blog

The time I brought down production

Updated: 4/2024 (~2 mins)

How a quest for better SEO caused me great, completely unrelated pain


Background


This website is built with NextJS, hosted on Vercel, and uses Cloudflare for DNS (Domain Name System). The abbreviated fetch trace looks like this:

  1. Initial website fetch request
  2. Cloudflare understands and forwards the request to Vercel
  3. Vercel understands and forwards the request to the NextJS server

Preface


SEO (Search Engine Optimization) is crucial for a website because it makes the site easier to find. SEO can make or break a website: with no SEO, the site is harder to find than a needle in a haystack, and with poor SEO, it reaches the wrong audience, which renders the site useless.

In short, SEO is essential for this website, and thus, I embarked on improving the website's SEO.

The Incident


11:46 pm:

While improving this website's SEO, I ran the site through some online SEO checkers. The generated reports showed that some internal links were not redirecting correctly. Primarily, internal links to https://ngjx.org were being redirected to https://www.ngjx.org, causing a host mismatch.

Okay, no big deal, I just have to update the Cloudflare redirect rules to redirect all www.ngjx.org traffic to ngjx.org.

11:55 pm: The site goes down.

The Investigation


12:14 am: I realize the website is down and start investigating.

The site is unreachable and all requests are timing out. What is going on? The Vercel deployment is still online and the preview builds of production are still accessible. However, something is amiss: the logs do not show any requests timing out. Perhaps it is an issue at the transport layer of the OSI model?

Checking the network logs in my browser reveals the issue - an infinite redirect loop! When visiting the website, users are redirected to ngjx.org, then to www.ngjx.org, over and over until the request times out. Why is this happening?

In an epic blunder, I still had Vercel redirecting all ngjx.org traffic to www.ngjx.org, thus causing an infinite loop of redirection mayhem.

What a predicament!
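
To make the loop concrete, here is a minimal Python sketch that models the two conflicting rules as plain functions and follows them hop by hop. The function names (`cloudflare_rule`, `vercel_rule`, `follow`) are hypothetical stand-ins for illustration; they are not how Cloudflare or Vercel express redirects, and the real rules may differ in detail.

```python
from urllib.parse import urlsplit, urlunsplit


def cloudflare_rule(url):
    """Assumed model of the new Cloudflare rule: www.ngjx.org -> ngjx.org."""
    parts = urlsplit(url)
    if parts.hostname == "www.ngjx.org":
        return urlunsplit(parts._replace(netloc="ngjx.org"))
    return None  # rule does not apply


def vercel_rule(url):
    """Assumed model of the forgotten Vercel rule: ngjx.org -> www.ngjx.org."""
    parts = urlsplit(url)
    if parts.hostname == "ngjx.org":
        return urlunsplit(parts._replace(netloc="www.ngjx.org"))
    return None  # rule does not apply


def next_hop(url, rules):
    """Return the target of the first rule that redirects url, or None."""
    for rule in rules:
        target = rule(url)
        if target is not None:
            return target
    return None


def follow(url, rules, max_hops=20):
    """Follow redirects from url; return (chain, looped)."""
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(chain[-1], rules)
        if target is None:
            return chain, False  # request reaches the site
        chain.append(target)
        if target in seen:
            return chain, True  # back at a URL we already visited
        seen.add(target)
    return chain, True  # hop budget exhausted; treat it as a loop


# Rules are listed in the order a request meets them: Cloudflare, then Vercel.
chain, looped = follow("https://ngjx.org/", [cloudflare_rule, vercel_rule])
print(" -> ".join(chain), "| loop:", looped)
```

With both rules active, the chain returns to its starting URL after two hops. Dropping `vercel_rule` from the list lets a request for www.ngjx.org settle on ngjx.org after a single redirect.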

12:24 am: Production is rolled back.

12:25 am: The site is back up.

Takeaways


This 30-minute outage was avoidable. I should have reconfirmed routing rules before pushing to production.
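
One way to reconfirm routing rules is a small script that follows redirects one hop at a time and fails loudly on a loop. The sketch below is an assumption about what such a check could look like, not something this site actually runs; `head` and `trace_redirects` are hypothetical helper names, and it uses only Python's standard library.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Return the redirect chain starting at url; raise on a loop."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            return chain  # final, non-redirect response
        target = urljoin(chain[-1], location)  # resolve relative Location
        if target in chain:
            raise RuntimeError("redirect loop: " + " -> ".join(chain + [target]))
        chain.append(target)
    raise RuntimeError(f"more than {max_hops} redirects starting at {url}")


if __name__ == "__main__":
    # Check both hostnames after changing any redirect rule.
    for start in ("https://ngjx.org/", "https://www.ngjx.org/"):
        print(" -> ".join(trace_redirects(start)))
```

The `fetch` parameter is injectable so the loop detection can be exercised against canned responses without touching the network.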