Entry #15 · June 29, 2026

Who actually reads your llms.txt? We logged 461,328 requests to find out

Name: AI-Native File Engagement Across Production Mirror Sites (90-day window, 461,328 requests)
Creator: ROZZ

The AI-native files at the center of today’s GEO advice draw under 1% of crawler traffic. The bots that answer users touch them least of all. The main audience turns out to be scrapers and tooling.

What we measured

Open any current guide to GEO/AEO and you will find the same checklist item: publish an llms.txt, add structured JSON feeds, hand the machines a clean, machine-readable map of your site. The premise is intuitive. Models are machines, machine-readable files are easy for machines to parse, so machines should prefer them.

We had a rare chance to test that premise against real traffic, and it does not hold.

Across a set of production AI-native sites we operate and instrument, we measured bot access for a 90-day window: 461,328 requests across 31,935 log files. We decoded them user-agent by user-agent and sorted them into more than twenty classes: the AI crawlers (training, indexing, and live citation fetchers from OpenAI, Anthropic, Perplexity, Meta, and others), the traditional search engines, generic scrapers, and human or tooling traffic.
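As a rough illustration of what user-agent classification involves, here is a minimal Python sketch. The rule list is hypothetical and far shorter than a real one: the tokens are well-known crawler names and the category labels mirror the ones used in this article, but this is not the classifier that produced the numbers. It assumes user-agents arrive URL-encoded, as they do in CloudFront logs.

```python
from urllib.parse import unquote

# Hypothetical substring rules, checked in order; the first match wins.
# (token to look for, category label). Illustrative only, not the study's rule set.
RULES = [
    ("ChatGPT-User", "AI citation"),
    ("OAI-SearchBot", "AI indexing"),
    ("PerplexityBot", "AI indexing"),
    ("GPTBot", "AI training"),
    ("ClaudeBot", "AI training"),
    ("CCBot", "AI training"),
    ("Googlebot", "search engine"),
    ("bingbot", "search engine"),
    ("curl", "scraper"),
]


def classify(raw_user_agent: str) -> tuple[str, str]:
    """Return (crawler, category) for a raw, possibly URL-encoded user-agent."""
    ua = unquote(raw_user_agent).lower()
    for token, category in RULES:
        if token.lower() in ua:
            return token, category
    # Anything unmatched falls through to a catch-all class.
    return "unclassified", "other"
```

Substring matching trusts the user-agent header as sent, so a scraper that spoofs a crawler name would be misfiled; a stricter pipeline would also verify source IP ranges.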

These sites are AI-native in that they serve exactly what the industry best practices recommend: an llms.txt, an llms-full.txt, and JSON feeds for search, pages, topics, and Q&A. We wanted to measure what is useful and what is not.

To be clear, plenty of published commentary calls llms.txt into question, and plenty recommends it. The whole GEO/AEO space is fluid: everyone wants to do it, but no one knows for sure how. So we measured on our own domains.

Finding one: for everyone, it is a rounding error

Of all 461,328 requests, just under 1% touched the AI-native files at all: llms.txt drew 0.30% and the JSON feeds 0.67%, for 0.97% combined. The other ninety-nine percent went to ordinary HTML pages, sitemaps, robots.txt, and miscellaneous paths such as redirects, assets, and scanner noise.

That pattern holds across every crawler. Here is the share of each one’s own traffic that went to llms.txt or the JSON feeds, highest first:

Crawler	Category	AI-native requests	Total requests	AI-native share
OAI-SearchBot	AI indexing	75	3,092	2.43%
GPTBot	AI training	140	7,131	1.96%
browser / tooling	human	2,604	215,079	1.21%
CCBot	AI training	24	2,321	1.03%
Baiduspider	search engine	56	7,223	0.78%
Googlebot	search engine	26	4,140	0.63%
YandexBot	search engine	8	1,500	0.53%
PerplexityBot	AI indexing	6	1,149	0.52%
Applebot	search engine	31	6,678	0.46%
generic scraper	scraper	380	88,946	0.43%
Meta	AI training	53	13,366	0.40%
Bingbot	search engine	57	15,024	0.38%
Amazonbot	AI training	72	22,774	0.32%
DuckDuckBot	search engine	7	2,206	0.32%
curl	scraper	40	20,978	0.19%
python	scraper	9	4,846	0.19%
ChatGPT-User	AI citation	21	15,092	0.14%
http-lib	scraper	14	12,171	0.12%
ByteSpider	AI training	6	5,930	0.10%
ClaudeBot	AI training	2	11,070	0.02%

The single most engaged crawler anywhere is OpenAI’s OAI-SearchBot, and it spends 2.4% of its requests on the structured files. OpenAI’s GPTBot is next at 2.0%. After that it falls off a cliff. The browser / tooling class reaches 1.2%, CCBot 1.0%, and every other bot, search engine, and scraper sits under 0.8%, most of them under half a percent. There is no crawler in the data for which the AI-native layer is anything but a trace.
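The share column is simply AI-native requests divided by that crawler's total requests. A short Python sketch, using four rows copied from the table, recomputes it; the variable and function names are ours, chosen for illustration.

```python
# (AI-native requests, total requests), copied from the table above.
COUNTS = {
    "OAI-SearchBot": (75, 3_092),
    "GPTBot": (140, 7_131),
    "ChatGPT-User": (21, 15_092),
    "ClaudeBot": (2, 11_070),
}


def share_pct(ai_native: int, total: int) -> float:
    """Percentage of a crawler's own requests that hit llms.txt or the JSON feeds."""
    return round(100 * ai_native / total, 2)


shares = {name: share_pct(*counts) for name, counts in COUNTS.items()}

# Print highest share first, matching the table's ordering.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {pct:.2f}%")
```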

So the foundational premise of these files, that bots will favor structured feeds over HTML, is not borne out by a single crawler we logged. They all run on HTML.

Finding two: the index crawlers actually use is the sitemap

Here is the part that reframes the exercise. The case for llms.txt is that machines need a clean, machine-readable index of a site instead of parsing cluttered HTML page by page. That argument is half right. Crawlers do lean on a machine-readable index. It just is not llms.txt. It is the sitemap, a standard that has existed since 2005.

Break all 461,328 requests down by what was fetched:

Path	Requests	Share
HTML (pages, Q&A, homepage)	254,752	55.2%
sitemap.xml	94,338	20.4%
robots.txt	11,082	2.4%
JSON feeds (/api/*.json)	3,094	0.67%
llms.txt	1,403	0.30%
other (redirects, assets, scanner noise)	96,659	21.0%

HTML, sitemap, and robots together are 78% of everything bots fetch. The two AI-native inventions, llms.txt and the JSON feeds, are 0.97% combined. The sitemap alone outdraws llms.txt by roughly sixty-seven to one.
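Those three figures follow directly from the path table. The Python sketch below reproduces the arithmetic from the published counts; the dictionary keys are shorthand labels we chose, not paths from the logs.

```python
# Request counts per fetched path, copied from the table above.
PATH_COUNTS = {
    "html": 254_752,
    "sitemap.xml": 94_338,
    "robots.txt": 11_082,
    "json_feeds": 3_094,
    "llms.txt": 1_403,
    "other": 96_659,
}

total = sum(PATH_COUNTS.values())

# The two AI-native file types versus the three conventional ones.
ai_native = PATH_COUNTS["json_feeds"] + PATH_COUNTS["llms.txt"]
conventional = (
    PATH_COUNTS["html"] + PATH_COUNTS["sitemap.xml"] + PATH_COUNTS["robots.txt"]
)

ai_native_pct = 100 * ai_native / total
conventional_pct = 100 * conventional / total
sitemap_ratio = PATH_COUNTS["sitemap.xml"] / PATH_COUNTS["llms.txt"]

print(f"total requests:        {total:,}")
print(f"AI-native share:       {ai_native_pct:.2f}%")
print(f"HTML+sitemap+robots:   {conventional_pct:.0f}%")
print(f"sitemap per llms.txt:  {sitemap_ratio:.0f} to 1")
```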

So the simplest reading is that llms.txt fails because it is a second index for a job the sitemap already does.

Finding three: the bots that answer users ignore it hardest

If the structured files were going to matter anywhere, you would expect it to be at the moment of citation, when a model fetches a page to answer a live user. We see the opposite. The closer a bot sits to producing a user-facing answer, the less it touches the AI-native layer.

ChatGPT-User, OpenAI’s live answer-time fetcher, made 15,092 requests in the window. Twenty-one of them were for the AI-native files. That is 0.14%. Anthropic’s Claude-User: one fetch. Anthropic’s ClaudeBot, with more than 11,000 requests, fetched llms.txt zero times and hit the JSON feeds twice, a rate of 0.02%.

These are the bots whose output users actually read. They consume your HTML and effectively skip the layer built specifically for them. And they skip the sitemap too: ChatGPT-User’s sitemap rate is 0.0%. The live fetcher does not browse an index at all, machine-readable or otherwise. It goes straight to the single page the model wants. Indexes are for the crawlers that build a map in advance, not for the agent answering in the moment, which means even the sitemap is an indexing lever, not a citation one.
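One way to check the ordering is to pool the per-crawler table by category. The Python sketch below does that with the twenty published rows. Two caveats: the pooled figures are our own aggregation, not numbers reported elsewhere in this piece, and Claude-User is absent from the table, so "AI citation" here is ChatGPT-User alone.

```python
from collections import defaultdict

# (crawler, category, AI-native requests, total requests), from the crawler table.
ROWS = [
    ("OAI-SearchBot", "AI indexing", 75, 3_092),
    ("GPTBot", "AI training", 140, 7_131),
    ("browser / tooling", "human", 2_604, 215_079),
    ("CCBot", "AI training", 24, 2_321),
    ("Baiduspider", "search engine", 56, 7_223),
    ("Googlebot", "search engine", 26, 4_140),
    ("YandexBot", "search engine", 8, 1_500),
    ("PerplexityBot", "AI indexing", 6, 1_149),
    ("Applebot", "search engine", 31, 6_678),
    ("generic scraper", "scraper", 380, 88_946),
    ("Meta", "AI training", 53, 13_366),
    ("Bingbot", "search engine", 57, 15_024),
    ("Amazonbot", "AI training", 72, 22_774),
    ("DuckDuckBot", "search engine", 7, 2_206),
    ("curl", "scraper", 40, 20_978),
    ("python", "scraper", 9, 4_846),
    ("ChatGPT-User", "AI citation", 21, 15_092),
    ("http-lib", "scraper", 14, 12_171),
    ("ByteSpider", "AI training", 6, 5_930),
    ("ClaudeBot", "AI training", 2, 11_070),
]


def share_by_category(rows):
    """Pool counts per category and return {category: AI-native share in percent}."""
    sums = defaultdict(lambda: [0, 0])
    for _crawler, category, ai_native, total in rows:
        sums[category][0] += ai_native
        sums[category][1] += total
    return {cat: 100 * hits / total for cat, (hits, total) in sums.items()}


by_category = share_by_category(ROWS)
for category, pct in sorted(by_category.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<14} {pct:.2f}%")
```

Pooling weights each category toward its highest-volume crawler, so it is a coarse summary rather than a replacement for the per-crawler rows.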

What this does and does not show

The scope matters, because this is easy to over-read.

What the data shows is narrow and solid: the machine-format add-on files, llms.txt and JSON feeds, earn negligible engagement from every category of crawler and near-zero from the ones that generate answers. If your plan is to bolt these files onto a site and expect the citation pipeline to consume them, the pipeline is not consuming them.

What the data does not show, and what we are not claiming here, is anything about HTML quality, content structure, or whether a dedicated AI site beats an ordinary one. Every site serves HTML; the fact that bots read HTML is not an argument for any particular kind of site. Those are separate questions for a separate piece.

One fair caveat for skeptics. We are measuring fetch volume, not influence. In principle a model could read llms.txt once and let it shape a whole answer, so low volume is not the same as zero value. But the volume for the answer-producing bots is so low, 0.14% and below, that even a generous influence-per-fetch assumption leaves the layer touching a vanishing slice of what those bots do. We cannot see inside the models. We can only see what they request, and what they request is HTML.
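To make the influence-per-fetch argument concrete, here is a small Python sketch. The weighting model is entirely hypothetical: it assumes each AI-native fetch counts for `weight` ordinary fetches, and the tenfold value in the example is an arbitrary choice, not something measured.

```python
def weighted_share(ai_native: int, total: int, weight: float) -> float:
    """Fraction of weighted fetch volume held by AI-native fetches.

    `weight` is a hypothetical multiplier: how many ordinary fetches one
    AI-native fetch is assumed to be worth. weight=1 gives the raw share.
    """
    other = total - ai_native
    weighted_hits = ai_native * weight
    return weighted_hits / (weighted_hits + other)


# ChatGPT-User counts from this article; the 10x weight is an assumption.
raw = weighted_share(21, 15_092, weight=1)
boosted = weighted_share(21, 15_092, weight=10)
print(f"raw share:  {raw:.2%}")
print(f"10x weight: {boosted:.2%}")
```

Readers who think a single llms.txt read is worth far more can raise `weight` and see where the conclusion would start to change.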

Caching is not hiding the number, either. These are CDN edge logs, which record cache hits as well as origin fetches, so a low llms.txt count is a real low count and not traffic quietly absorbed before it reaches the log. One scoping note in the other direction: these are requests to the AI-native sites we serve. If a site also published llms.txt from its own origin, fetches there would not appear here. This is a large, representative sample of how bots treat the files, not a census of the entire web.
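For readers who want to run the same check on their own logs, the Python sketch below tallies llms.txt requests by edge result type. It assumes the CloudFront standard log layout: tab-separated rows preceded by a `#Fields:` header naming columns such as `cs-uri-stem` and `x-edge-result-type`. The function names are ours, and matching only the exact path `/llms.txt` is an assumption about how the file is served.

```python
from collections import Counter


def parse_cloudfront(lines):
    """Yield one dict per request from CloudFront standard-log lines.

    Column names are taken from the log's own '#Fields:' header, so no
    column positions are hard-coded.
    """
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


def llms_result_types(lines):
    """Count /llms.txt requests by edge result type (for example Hit or Miss)."""
    counts = Counter()
    for row in parse_cloudfront(lines):
        # Assumption: the file lives at exactly /llms.txt.
        if row.get("cs-uri-stem") == "/llms.txt":
            counts[row.get("x-edge-result-type", "unknown")] += 1
    return counts
```

If the tally contains hits as well as misses, cached responses are being logged, and the total across result types is the true request count for the file.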

A note on our own position: we build and serve these files on the sites we instrument, which is exactly why we can measure them at this resolution. This finding partly critiques our own product surface, and we would rather publish the measurement than the marketing. Whatever the value of an AI-native site turns out to be, it is not the llms.txt.

The takeaway

The AI-native add-on layer does not pay off. Across 461,328 requests and more than twenty crawler types, the files at the center of the standard GEO checklist drew under 1% of traffic, were ignored hardest by the bots that answer users, and found their main audience among scrapers and tooling. The machines you built them for are reading your pages instead.

There is a constructive reading underneath the negative one. Crawlers are not allergic to structure; they consume a machine-readable index of your site all day. It is the sitemap, and it gets sixty-seven times the traffic of llms.txt. So the effort that goes into llms.txt and JSON feeds is better spent on the two things every crawler actually ingests: the HTML pages themselves, and a clean, complete sitemap.
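As a starting point for the sitemap side of that advice, here is a minimal Python sketch that emits a sitemap in the sitemaps.org 0.9 format using only the standard library. The `build_sitemap` helper and the example.com URLs are illustrative assumptions; a production generator would also need to handle the protocol's per-file URL limit and sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> bytes:
    """Build sitemap XML from (absolute_url, lastmod) pairs.

    lastmod is a 'YYYY-MM-DD' string, or None to omit the element.
    Returns UTF-8 bytes including the XML declaration.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Placeholder pages; replace with the site's real URL inventory.
xml_bytes = build_sitemap([
    ("https://example.com/", "2026-06-29"),
    ("https://example.com/qa/what-is-geo", None),
])
print(xml_bytes.decode("utf-8"))
```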

An AI site does not just produce AI-discovery files. That is a small part of it. It produces optimally formatted, rewritten HTML pages organized in a new sitemap with semantic groupings, alongside lots of actual user-generated Q&A. That, in our logs, is what is actually fetched.

→ Data source: CloudFront edge access logs from the production AI-native sites ROZZ operates, covering a 90-day window. 461,328 requests over 31,935 log files, classified by decoded user-agent into 20+ crawler classes and by fetched path. Fetch volume only (cache hits and origin fetches); no claim about per-fetch influence.