CDN caching gotchas that hide fresh content

Crawlmind Engineering · 5 min read

A CDN caching gotcha is a misconfiguration in your edge cache that keeps serving an old copy of a page after you have updated it, so search and AI crawlers index stale content instead of your latest version. You publish a fix, a new statistic, a fresh product detail, and the change is live at your origin. Yet the crawler that hits your CDN gets last week's response, because the edge decided the old copy was still good enough to serve. The page you worked on stays invisible until something forces a refresh.

This is a quiet failure. Nothing errors. Your analytics look fine because human visitors often get the same stale copy without noticing. Meanwhile the version an answer engine quotes is the one from before your update. Here are five caching traps that cause it, and how to close each one.

#1. A long TTL with no purge on publish

The most common gotcha is a long edge time-to-live combined with a content management system that never tells the CDN to purge. Time-to-live determines how long a cache can serve an object without checking the origin again, according to Fastly's TTL documentation. A multi-day TTL is great for performance and crawl efficiency, but only if publishing an update also purges the affected URL.

Without that integration, a page you edited on Monday can keep serving the version from a week earlier until the TTL expires. The fix is a purge hook: when your CMS saves a page, it should call the CDN's purge API for that URL (and any listing pages that embed it). Treat "publish" and "purge" as one action, not two.
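As a minimal sketch of treating publish and purge as one action, the snippet below wires them into a single function. The `save` and `purge` callables are hypothetical stand-ins: `save` represents whatever your CMS does to persist a page, and `purge` represents a wrapper around your CDN's purge API, whose exact shape depends on the provider.

```python
from typing import Callable, Iterable, List


def urls_to_purge(page_url: str, listing_urls: Iterable[str] = ()) -> List[str]:
    """Return the edited page plus any listing pages that embed it.

    Duplicates are dropped and the original order is kept.
    """
    seen = set()
    ordered = []
    for url in [page_url, *listing_urls]:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


def publish(
    page_url: str,
    listing_urls: Iterable[str],
    save: Callable[[str], None],
    purge: Callable[[str], None],
) -> List[str]:
    """Save the page, then purge every affected URL in the same step."""
    save(page_url)  # hypothetical: persists the edit in the CMS
    purged = []
    for url in urls_to_purge(page_url, listing_urls):
        purge(url)  # hypothetical: wraps the CDN's purge API call
        purged.append(url)
    return purged
```

Because the purge lives inside `publish`, there is no code path that saves a page without also invalidating it at the edge. A production version would also need to decide what happens when the purge call fails, which this sketch leaves out.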

#2. serve-stale that outlives your edit

The stale-while-revalidate directive lets a cache serve an expired copy immediately while it fetches a fresh one in the background. It is a good latency trade for humans. The catch for crawlers is that the first request after expiry, which may well be the crawler, receives the stale body while revalidation happens asynchronously.

Two details make this worse than it sounds. First, Cloudflare only serves stale content if your origin actually includes the stale-while-revalidate directive in the Cache-Control header, per Cloudflare's revalidation docs, so the behavior is opt-in and easy to forget you turned on. Second, Google Cloud CDN keeps serving the stale entry for the number of seconds you configure past expiration while it revalidates in the background. If that window is long and a crawler arrives inside it, the crawler takes the old copy. Keep the serve-stale window short for pages that change, and reserve long windows for assets that do not.
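To make the serve-stale window concrete, here is a rough sketch that classifies a cached response by its age. It assumes a simplified model in which only `max-age` and `stale-while-revalidate` matter; real caches also consider `s-maxage`, `no-cache`, and provider-specific settings, so treat the function names and states as illustrative.

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control header into {directive: seconds or True}."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        arg = arg.strip().strip('"')
        directives[name.strip()] = int(arg) if arg.isdigit() else True
    return directives


def _seconds(directives: dict, name: str) -> int:
    """Return a directive's numeric value, or 0 if absent or valueless."""
    value = directives.get(name, 0)
    if isinstance(value, bool):
        return 0
    return value


def cache_state(cache_control: str, age_seconds: int) -> str:
    """Classify what a cache may do with a copy of the given age.

    Simplified: ignores s-maxage, no-cache, and other directives.
    """
    directives = parse_cache_control(cache_control)
    max_age = _seconds(directives, "max-age")
    stale_window = _seconds(directives, "stale-while-revalidate")
    if age_seconds < max_age:
        return "fresh"
    if age_seconds < max_age + stale_window:
        # The old body is served now; the refresh happens in the background.
        return "stale-served"
    return "revalidate-first"
```

With `max-age=600, stale-while-revalidate=86400`, a crawler arriving an hour after the copy was cached falls in the `stale-served` state: it gets the old body even though the copy expired fifty minutes earlier.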

#3. The 304 that pins a broken page in the index

Conditional requests are supposed to save everyone work. Google's crawlers send the ETag from a previous crawl in an If-None-Match header, and if it still matches, your server returns a 304 Not Modified with no body, as documented in Google's Crawling December: HTTP caching post. Googlebot then reuses its cached copy. Efficient, until the cached copy is wrong.

If your origin once served a broken response (an empty body, an error page) as a 200 OK and the crawler cached it, a later 304 tells the crawler nothing changed. That can happen even after you fix the page if the validator did not change along with the body. The crawler keeps rechecking, but each 304 confirms the broken version, so it never re-downloads the body. Fixing the underlying page is not enough on its own. You have to force a fresh 200 with a new ETag so the corrected body replaces the cached error. The same rule applies at the CDN layer: a cached bad response plus revalidation-by-validator can freeze a mistake in place.
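One way to make sure a corrected page produces a new validator is to derive the ETag from the response body itself. The sketch below assumes that approach; `etag_for` and `respond` are hypothetical names, and the If-None-Match handling is simplified to a single strong validator (real requests may carry a list of values or weak `W/` validators).

```python
import hashlib
from typing import Dict, Optional, Tuple


def etag_for(body: bytes) -> str:
    """Derive a strong ETag from the body so any content change alters it."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(
    body: bytes, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Answer a (possibly conditional) request for the current body.

    Returns (status, headers, body). A 304 carries no body.
    """
    etag = etag_for(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

If the crawler cached a broken body and sends that body's ETag back, the fixed page hashes to a different value, so the comparison fails and the server answers with a full 200 and the corrected content.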

#4. Vary headers that fragment or block your cache

The Vary response header tells caches to key entries by another request header. Used carelessly it either explodes your cache into near-duplicates or stops caching entirely. Vary: User-Agent is the classic trap: the User-Agent header has thousands of values in the wild, so many caches treat Vary: User-Agent as effectively uncacheable. Cloudflare notes that if your origin sends Vary: * or an unsupported Vary value, the response is not cached at all, per Cloudflare's cache keys documentation.

Vary: Cookie is just as damaging, because every unique session ID creates a separate cache entry and your edge cache quietly becomes per-user. For crawlers, both patterns cause inconsistency: the response a bot receives depends on a header value you did not intend to branch on, and different crawlers can land on different cached variants. Vary only on headers that genuinely change the response body, and nothing else.
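The fragmentation is easier to see with a toy cache key. This sketch assumes a simple cache that keys on the URL plus the value of every request header named in Vary; actual CDNs normalize or ignore some headers, so `cache_key` is illustrative rather than any provider's algorithm.

```python
from typing import Dict, Optional, Tuple


def cache_key(
    url: str, request_headers: Dict[str, str], vary: str
) -> Optional[Tuple[str, tuple]]:
    """Build a cache key from the URL and the headers listed in Vary.

    Returns None for "Vary: *", which this toy cache treats as uncacheable.
    """
    names = [name.strip().lower() for name in vary.split(",") if name.strip()]
    if "*" in names:
        return None
    lowered = {name.lower(): value for name, value in request_headers.items()}
    varied = tuple((name, lowered.get(name, "")) for name in sorted(names))
    return (url, varied)


# Two crawlers asking for the same page with the same encoding support.
bot_a = {"User-Agent": "ExampleBot/1.0", "Accept-Encoding": "gzip"}
bot_b = {"User-Agent": "OtherBot/2.3", "Accept-Encoding": "gzip"}

shared = cache_key("/pricing", bot_a, "Accept-Encoding") == cache_key(
    "/pricing", bot_b, "Accept-Encoding"
)
split = cache_key("/pricing", bot_a, "User-Agent") != cache_key(
    "/pricing", bot_b, "User-Agent"
)
```

Here `shared` and `split` are both true: varying on Accept-Encoding lets the two bots share one entry, while varying on User-Agent gives each bot its own entry, and those entries can hold copies of different ages.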

#5. max-age set so high that recrawls slow down

The max-age field in Cache-Control does double duty. It controls freshness for caches, and it also hints to crawlers how often to come back. Google recommends setting max-age to the number of seconds the content is expected to stay unchanged, and says it helps crawlers determine when to recrawl a URL, again from Google's HTTP caching guidance. A page with a very long max-age gets re-fetched less often.

That is exactly what you want for a logo or a versioned script. It is the wrong signal for a pricing page, a changelog, or a comparison table you update. Setting one blanket max-age across the whole site means your fast-moving pages inherit the recrawl cadence of your static assets. Segment your cache rules: long max-age for immutable assets, short max-age for pages whose content is the point.
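Segmenting cache rules can be as simple as a first-match table of URL patterns. The paths, patterns, and durations below are all assumptions chosen for illustration; substitute your own site's structure and how often each kind of page really changes.

```python
import re

# Hypothetical rules, checked in order. The first matching pattern wins.
RULES = [
    # Fingerprinted assets never change at a given URL, so cache them long.
    (
        re.compile(r"^/assets/.*\.[0-9a-f]{8,}\.(js|css|png|svg)$"),
        "public, max-age=31536000, immutable",
    ),
    # Fast-moving pages get a short lifetime so recrawls stay frequent.
    (re.compile(r"^/(pricing|changelog)(/|$)"), "public, max-age=300"),
]

# Fallback for everything else (also an assumed value).
DEFAULT_CACHE_CONTROL = "public, max-age=3600"


def cache_control_for(path: str) -> str:
    """Return the Cache-Control value for a URL path."""
    for pattern, value in RULES:
        if pattern.search(path):
            return value
    return DEFAULT_CACHE_CONTROL
```

Keeping the rules in one ordered table makes the policy easy to audit: you can read down the list and see which pages inherit which recrawl signal, instead of discovering a blanket max-age after the fact.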

#How to check whether this is happening to you

You do not need special tooling to catch most of these. Request one of your recently updated pages the way a crawler would and read the response headers. Look at Cache-Control, ETag, Last-Modified, Age, and any Vary value. Age is measured in seconds, so a large value means you are being served a copy that has been sitting at the edge for a while. Then fetch the same URL from your origin directly, bypassing the CDN, and compare the bodies. If they differ, your edge is serving stale content and the crawler is seeing the wrong version.
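Once you have the edge response and the origin response in hand, the comparison itself can be scripted. The sketch below assumes you have already fetched both (with curl, a browser, or any HTTP client) and only covers the inspection step. The function name, the finding labels, and the one-hour Age threshold are illustrative choices, not standards.

```python
from typing import Dict, List


def diagnose(
    edge_headers: Dict[str, str],
    edge_body: bytes,
    origin_body: bytes,
    max_age_ok: int = 3600,
) -> List[str]:
    """Flag signs that the edge is serving a stale or fragmented copy."""
    headers = {name.lower(): value.strip() for name, value in edge_headers.items()}
    findings = []

    # Age counts the seconds the copy has spent in the cache.
    age = headers.get("age", "")
    if age.isdigit() and int(age) > max_age_ok:
        findings.append("old-edge-copy")

    # Vary values that tend to fragment or disable caching.
    vary = {v.strip().lower() for v in headers.get("vary", "").split(",") if v.strip()}
    if vary & {"*", "user-agent", "cookie"}:
        findings.append("risky-vary")

    # A body mismatch means the crawler sees something other than the origin.
    if edge_body != origin_body:
        findings.append("stale-body")

    return findings
```

A byte-for-byte comparison will also flag pages that embed per-request values such as timestamps or CSRF tokens, so on dynamic pages you may want to compare only the content region you care about.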

The through-line across all five gotchas is the same. Performance caching and crawler freshness pull in opposite directions, and the default settings optimize for the former. Decide, per URL pattern, which pages change and which do not, then cache them differently. Static assets can live at the edge for days. The pages you want cited by AI answer engines need a short leash and a purge on every publish.

For the mechanics of how crawlers actually use these headers, Google's own HTTP caching documentation is the primary source worth reading in full.
