robots.nxt documentation

What is robots.nxt?

Purpose, concepts, and functionality.

robots.nxt is an affirmative access control method designed to replace robots.txt. It is focused on controlling access to web pages by AI user-agents.

Unlike most access control methods, robots.nxt does not concern itself with what human users are doing, only with automated systems.

When mature, robots.nxt will act as the third content monetization strategy, after ads and subscriptions. Ads and subscriptions do not work for bots in general or AI agents in particular. This new strategy is required for an internet where AI user-agents outnumber humans and website content is primarily accessed and consumed by bots.

Concepts

Every robots.nxt user is a member of an organization. Every organization has sites that it manages. Each site serves its content to user-agents, and each user-agent request is evaluated against the access control policies implemented in robots.nxt.
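As a rough sketch of this data model (the names and fields below are illustrative, not the actual robots.nxt schema), in Python:

    from dataclasses import dataclass, field

    @dataclass
    class Site:
        """A website whose incoming requests robots.nxt evaluates."""
        hostname: str
        policies: list[dict] = field(default_factory=list)  # access control rules

    @dataclass
    class Organization:
        """Every robots.nxt user belongs to an organization that manages sites."""
        name: str
        users: list[str] = field(default_factory=list)
        sites: list[Site] = field(default_factory=list)

    # Hypothetical example: one organization managing one site
    acme = Organization(name="Acme Publishing", users=["editor@acme.example"])
    acme.sites.append(Site(hostname="news.acme.example"))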

robots.nxt segments user-agent traffic into humans and bots. Human traffic passes through without modification, while bot traffic is routed into an access control system.

The access control system is implemented as a series of rules imposed on every request from a bot. This is the origin of the name: robots.nxt takes the traditional access control requests of "robots.txt" and turns them into an affirmative system that actively enforces rules on bots. With robots.txt, websites make good-faith requests of bots, and bots comply voluntarily. With robots.nxt, the bot has no choice; it cannot access the website except in the ways permitted by robots.nxt.
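A minimal sketch of this fork, assuming a simple rule shape of bot identity, path prefix, and action (illustrative only, not the robots.nxt internals):

    def handle(is_bot: bool, bot_identity: str | None, path: str, rules: list[dict]) -> str:
        """Decide whether to serve a request or deny it."""
        if not is_bot:
            return "serve"  # human traffic passes through untouched
        for rule in rules:
            if rule["bot"] in ("*", bot_identity) and path.startswith(rule["path"]):
                if rule["action"] == "deny":
                    return "deny"  # the bot has no way around this
        return "serve"  # nothing forbids the request

    rules = [{"bot": "GPTBot", "path": "/articles/", "action": "deny"}]
    print(handle(True, "GPTBot", "/articles/2025/ai", rules))  # deny
    print(handle(False, None, "/articles/2025/ai", rules))     # serve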

How it works

User-agent classification

The access control system first examines the user-agent and determines whether it is a bot. This is done by a series of escalating rules applied against the presumed underlying identity of the agent requesting the content.

These evaluations are performed from easiest to hardest, with a bot probability score that increases at each step. First, the user-agent string is evaluated. Most "good" bots present a recognizable user-agent string, which can easily be matched against known bot lists.
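As a sketch of this cheapest check, matching a handful of well-known user-agent tokens (the patterns and scores here are illustrative, not the lists robots.nxt ships with):

    import re

    KNOWN_BOT_PATTERNS = [
        (re.compile(r"GPTBot", re.I), "GPTBot"),
        (re.compile(r"ClaudeBot", re.I), "ClaudeBot"),
        (re.compile(r"Googlebot", re.I), "Googlebot"),
        (re.compile(r"python-requests|curl|wget", re.I), "generic-script"),
    ]

    def ua_bot_score(user_agent: str) -> tuple[float, str | None]:
        """Match the User-Agent string against known bots; cheapest check first."""
        for pattern, identity in KNOWN_BOT_PATTERNS:
            if pattern.search(user_agent):
                return 0.9, identity  # a declared bot scores high immediately
        return 0.0, None  # unrecognized string: later checks will decide

    print(ua_bot_score("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # (0.9, 'GPTBot')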

Infrastructure classification

Next, we look at the infrastructure the request comes from. For example, most humans are not running Python scripts, and their requests do not originate from cloud hosts or data centers. Requests from these infrastructure types can easily be matched against known bot lists.
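A sketch of the idea, checking the client address against data-center IP ranges (the ranges below are placeholders; a real deployment would load published cloud provider ranges):

    import ipaddress

    DATACENTER_NETWORKS = [
        ipaddress.ip_network("3.0.0.0/8"),     # placeholder cloud range
        ipaddress.ip_network("34.64.0.0/10"),  # placeholder cloud range
    ]

    def infra_bot_score(client_ip: str) -> float:
        """Raise the bot probability for cloud or data-center origins."""
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in DATACENTER_NETWORKS):
            return 0.5  # data centers rarely host human browsing
        return 0.0

    print(infra_bot_score("3.14.15.92"))  # 0.5: inside a data-center range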

Behavior classification

Next, we look at how the agent behaves. A human does not make rapid-fire requests and does not iterate through a site's paths in an ordered fashion. Most humans don't click every link on a page. These behaviors are easy to match to bots, but take slightly more time to observe than simply checking the user-agent string.
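A sketch of one such behavioral signal, request rate over a sliding window (the window and thresholds are illustrative; ordered path iteration could be scored the same way by tracking the sequence of requested paths per client):

    from collections import deque
    import time

    class BehaviorScorer:
        """Flag clients that request far faster than a human could browse."""

        def __init__(self, window_seconds: float = 10.0, human_max: int = 20):
            self.window = window_seconds
            self.limit = human_max
            self.history: dict[str, deque] = {}

        def score(self, client_id: str, now: float | None = None) -> float:
            now = time.monotonic() if now is None else now
            q = self.history.setdefault(client_id, deque())
            q.append(now)
            while q and now - q[0] > self.window:  # drop stale timestamps
                q.popleft()
            return 0.6 if len(q) > self.limit else 0.0

    scorer = BehaviorScorer()
    for i in range(25):  # simulate 25 requests arriving within one second
        s = scorer.score("198.51.100.7", now=i / 25)
    print(s)  # 0.6: far faster than human browsing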

Other methods

We have several other methods to identify bots, and these methods will change over time. We aren't interested in an exhaustive, universal solution. We want a lightweight, low-impact method that works for bots and their parent organizations, which have a commercial and legal interest in good behavior and in respecting the owners and operators of the websites that provide valuable content to their bots.

Not exhaustive

We do not believe our method will capture all bots, and we do not believe that is possible. A bad-faith actor can readily develop a bot to circumvent these methods, but that bot will be restricted to relatively slow, human-like behavior, which mitigates its costs and impact on the website owner.

Not optional

Because we intercept requests and modify responses based on the presumed identity of the system behind the user-agent, compliance is neither optional nor a matter of good faith. Bots cannot circumvent these rules except by behaving like humans.

Identity sharing

We aggregate imputed identities from all of the bots we observe across the entire robots.nxt user base. Detecting a bot on one site operated by one of our users populates that identity across every site managed by robots.nxt, increasing the rate of bot identification and sharing the benefit with everyone using the system.
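In spirit, the shared store behaves like this sketch (a toy in-process version; the real system is a shared service across all robots.nxt installations):

    class SharedIdentityStore:
        """Toy version of cross-site identity sharing."""

        def __init__(self):
            self._identities: dict[str, str] = {}  # fingerprint -> bot identity

        def report(self, fingerprint: str, identity: str) -> None:
            """One site identifies a bot; every site benefits."""
            self._identities[fingerprint] = identity

        def lookup(self, fingerprint: str) -> str | None:
            """Any site checks the shared store before classifying from scratch."""
            return self._identities.get(fingerprint)

    store = SharedIdentityStore()
    store.report("ip:3.14.15.92|ua:python-requests", "unnamed-scraper")
    # A different site sees the same fingerprint and reuses the identity:
    print(store.lookup("ip:3.14.15.92|ua:python-requests"))  # unnamed-scraper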

Bots and corporations do not have a right to privacy, and website operators have the right to know who is requesting content from their sites. Bots and corporations request content for one purpose: commercial use. Websites are not obligated to freely support the commercial benefit of bots and their corporate parents, or to remain blind to how their content is being used for that benefit.

Human users do not have identities imputed, so no human activity is shared between robots.nxt users.

Standard rules

A basic installation of robots.nxt will import a standard ruleset for bot access. These rules are designed to be a good default for most websites, and can be modified by the website operator to better suit their needs.
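A hypothetical default ruleset, to give a feel for the shape of the standard rules (the actual defaults shipped by robots.nxt may differ):

    STANDARD_RULES = [
        {"bot": "*",         "path": "/",        "action": "deny"},   # default-deny for bots
        {"bot": "Googlebot", "path": "/",        "action": "allow"},  # search indexing allowed
        {"bot": "GPTBot",    "path": "/public/", "action": "allow"},  # AI agents: public pages only
    ]

    def evaluate(bot: str, path: str, rules=STANDARD_RULES) -> str:
        """Last matching rule wins, so operators can append overrides."""
        decision = "allow"
        for rule in rules:
            if rule["bot"] in ("*", bot) and path.startswith(rule["path"]):
                decision = rule["action"]
        return decision

    print(evaluate("GPTBot", "/public/post-1"))  # allow
    print(evaluate("GPTBot", "/members/area"))   # deny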

robots.nxt can read your existing robots.txt file and import the rules into the robots.nxt system. This allows you to gradually transition from robots.txt to robots.nxt, and gives you the opportunity to review and modify the rules as needed.
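A minimal sketch of such an importer, handling only User-agent, Allow, and Disallow lines (a real importer would cover more of the robots.txt grammar, such as Crawl-delay and blank-line group boundaries):

    def import_robots_txt(text: str) -> list[dict]:
        """Translate robots.txt directives into robots.nxt-style rules."""
        rules, agents, last_was_agent = [], ["*"], False
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if ":" not in line:
                continue
            key, value = (part.strip() for part in line.split(":", 1))
            key = key.lower()
            if key == "user-agent":
                # consecutive User-agent lines form one group
                agents = agents + [value] if last_was_agent else [value]
                last_was_agent = True
            else:
                last_was_agent = False
                if key in ("allow", "disallow") and value:
                    action = "allow" if key == "allow" else "deny"
                    rules += [{"bot": a, "path": value, "action": action} for a in agents]
        return rules

    print(import_robots_txt("User-agent: GPTBot\nDisallow: /private/"))
    # [{'bot': 'GPTBot', 'path': '/private/', 'action': 'deny'}]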
