Robots.txt just turned 30 – cue the existential crisis! Like many hitting the big 3-0, it’s wondering if it’s still relevant in today’s world of AI and advanced search algorithms.
Spoiler alert: It definitely is!
Let’s take a look at how this file still plays a key role in managing how search engines crawl your site, how to leverage it correctly, and common pitfalls to avoid.
What Is A Robots.txt File?
A robots.txt file provides crawlers like Googlebot and Bingbot with guidelines for crawling your site. Like a map or directory at the entrance of a museum, it acts as a set of instructions at the entrance of the website, including details on:
- Which crawlers are and aren’t allowed to enter.
- Any restricted areas (pages) that shouldn’t be crawled.
- Priority pages to crawl – via the XML sitemap declaration.
Its primary function is to manage crawler access to certain areas of a website by specifying which parts of the site are “off-limits.” This helps ensure that crawlers focus on the most relevant content rather than wasting the crawl budget on low-value content.
While a robots.txt guides crawlers, it’s important to note that not all bots follow its instructions, especially malicious ones. But for most legitimate search engines, adhering to robots.txt directives is standard practice.
What Is Included In A Robots.txt File?
Robots.txt files consist of lines of directives for search engine crawlers and other bots.
Valid lines in a robots.txt file consist of a field, a colon, and a value.
Robots.txt files also commonly include blank lines to improve readability and comments to help website owners keep track of directives.
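For illustration, here is a minimal, hypothetical robots.txt showing a comment, two field-colon-value lines, a blank line, and a sitemap declaration:
#Comments start with a hash and are ignored by crawlers
user-agent: *
disallow: /do-not-enter

sitemap: https://www.example.com/sitemap.xml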
Image from author, November 2024
To get a better understanding of what is typically included in a robots.txt file and how different sites leverage it, I looked at robots.txt files for 60 domains with a high share of voice across health, financial services, retail, and high-tech.
Excluding comments and blank lines, the average number of lines across the 60 robots.txt files was 152.
Large publishers and aggregators, such as hotels.com, forbes.com, and nytimes.com, typically had longer files, while hospitals like pennmedicine.org and hopkinsmedicine.com typically had shorter files. Retail sites’ robots.txt files typically fall close to the average of 152.
All sites analyzed include the user-agent and disallow fields within their robots.txt files, and 77% of sites included a sitemap declaration pointing to the site’s sitemap.
Fields leveraged less often were allow (used by 60% of sites) and crawl-delay (used by 20% of sites).
| Field | % of Sites Leveraging |
| --- | --- |
| user-agent | 100% |
| disallow | 100% |
| sitemap | 77% |
| allow | 60% |
| crawl-delay | 20% |
Robots.txt Syntax
Now that we’ve covered what types of fields are typically included in a robots.txt, we can dive deeper into what each one means and how to use it.
For more information on robots.txt syntax and how it is interpreted by Google, check out Google’s robots.txt documentation.
User-Agent
The user-agent field specifies which crawler the directives (disallow, allow) apply to. You can use the user-agent field to create rules that apply to specific bots/crawlers or use a wildcard to indicate rules that apply to all crawlers.
For example, the below syntax indicates that any of the following directives only apply to Googlebot.
user-agent: Googlebot
If you want to create rules that apply to all crawlers, you can use a wildcard instead of naming a specific crawler.
user-agent: *
You can include multiple user-agent fields within your robots.txt to provide specific rules for different crawlers or groups of crawlers, for example:
user-agent: *
#Rules here would apply to all crawlers
user-agent: Googlebot
#Rules here would only apply to Googlebot
user-agent: otherbot1
user-agent: otherbot2
user-agent: otherbot3
#Rules here would apply to otherbot1, otherbot2, and otherbot3
Disallow And Allow
The disallow field specifies paths that designated crawlers should not access. The allow field specifies paths that designated crawlers can access.
Because Googlebot and other crawlers will assume they can access any URLs that aren’t specifically disallowed, many sites keep it simple and only specify which paths should not be accessed using the disallow field.
For example, the below syntax would tell all crawlers not to access URLs matching the path /do-not-enter.
user-agent: *
disallow: /do-not-enter
#All crawlers are blocked from crawling pages with the path /do-not-enter
If you’re using both allow and disallow fields within your robots.txt, make sure to read the section on order of precedence for rules in Google’s documentation.
Generally, in the case of conflicting rules, Google will use the more specific rule.
For example, in the below case, Google won’t crawl pages with the path /do-not-enter because the disallow rule is more specific than the allow rule.
user-agent: *
allow: /
disallow: /do-not-enter
If neither rule is more specific, Google will default to using the less restrictive rule.
In the case below, Google would crawl pages with the path /do-not-enter because the allow rule is less restrictive than the disallow rule.
user-agent: *
allow: /do-not-enter
disallow: /do-not-enter
Note that if there is no path specified for the allow or disallow fields, the rule will be ignored.
user-agent: *
disallow:
This is very different from only including a forward slash (/) as the value for the disallow field, which would match the root domain and any lower-level URL (translation: every page on your site).
If you want your site to show up in search results, make sure you don’t have the following code. It will block all search engines from crawling all pages on your site.
user-agent: *
disallow: /
This might seem obvious, but believe me, I’ve seen it happen.
URL Paths
URL paths are the portion of the URL after the protocol, subdomain, and domain, beginning with a forward slash (/). For the example URL https://www.example.com/guides/technical/robots-txt, the path would be /guides/technical/robots-txt.
Image from author, November 2024
URL paths are case-sensitive, so be sure to double-check that the use of uppercase and lowercase letters in the robots.txt aligns with the intended URL path.
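For example, the following hypothetical rule would block /Guides but leave /guides crawlable:
user-agent: *
disallow: /Guides
#Blocks https://www.example.com/Guides but not https://www.example.com/guides because the case differs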
Special Characters
Google, Bing, and other major search engines also support a limited number of special characters to help match URL paths.
A special character is a symbol that has a unique function or meaning instead of just representing a regular letter or number. Special characters supported by Google in robots.txt are:
- Asterisk (*) – matches 0 or more instances of any character.
- Dollar sign ($) – designates the end of the URL.
To illustrate how these special characters work, assume we have a small site with the following URLs:
- https://www.example.com/
- https://www.example.com/search
- https://www.example.com/guides
- https://www.example.com/guides/technical
- https://www.example.com/guides/technical/robots-txt
- https://www.example.com/guides/technical/robots-txt.pdf
- https://www.example.com/guides/technical/xml-sitemaps
- https://www.example.com/guides/technical/xml-sitemaps.pdf
- https://www.example.com/guides/content
- https://www.example.com/guides/content/on-page-optimization
- https://www.example.com/guides/content/on-page-optimization.pdf
Example Scenario 1: Block Site Search Results
A common use of robots.txt is to block internal site search results, as these pages typically aren’t valuable for organic search results.
For this example, assume that when users conduct a search on https://www.example.com/search, their query is appended to the URL.
If a user searched “xml sitemap guide,” the new URL for the search results page would be https://www.example.com/search?search-query=xml-sitemap-guide.
When you specify a URL path in the robots.txt, it matches any URLs with that path, not just the exact URL. So, to block both of the URLs above, using a wildcard isn’t necessary.
The following rule would match both https://www.example.com/search and https://www.example.com/search?search-query=xml-sitemap-guide.
user-agent: *
disallow: /search
#All crawlers are blocked from crawling pages with the path /search
If a wildcard (*) were added, the results would be the same.
user-agent: *
disallow: /search*
#All crawlers are blocked from crawling pages with the path /search
Example Scenario 2: Block PDF files
In some cases, you may want to use the robots.txt file to block specific types of files.
Imagine the site decided to create PDF versions of each guide to make it easy for users to print. The result is two URLs with exactly the same content, so the site owner may want to block search engines from crawling the PDF versions of each guide.
In this case, using a wildcard (*) would be helpful to match URLs where the path starts with /guides/ and ends with .pdf, but the characters in between vary.
user-agent: *
disallow: /guides/*.pdf
#All crawlers are blocked from crawling pages with URL paths that contain: /guides/, 0 or more instances of any character, and .pdf
The above directive would prevent search engines from crawling the following URLs:
- https://www.example.com/guides/technical/robots-txt.pdf
- https://www.example.com/guides/technical/xml-sitemaps.pdf
- https://www.example.com/guides/content/on-page-optimization.pdf
Example Scenario 3: Block Category Pages
For the last example, assume the site created category pages for technical and content guides to make it easier for users to browse content in the future.
However, since the site only has three guides published right now, these pages aren’t providing much value to users or search engines.
The site owner may want to temporarily prevent search engines from crawling the category pages only (e.g., https://www.example.com/guides/technical), not the guides within each category (e.g., https://www.example.com/guides/technical/robots-txt).
To accomplish this, we can leverage “$” to designate the end of the URL path.
user-agent: *
disallow: /guides/technical$
disallow: /guides/content$
#All crawlers are blocked from crawling pages with URL paths that end with /guides/technical and /guides/content
The above syntax would prevent the following URLs from being crawled:
- https://www.example.com/guides/technical
- https://www.example.com/guides/content
While allowing search engines to crawl:
- https://www.example.com/guides/technical/robots-txt
- https://www.example.com/guides/technical/xml-sitemaps
- https://www.example.com/guides/content/on-page-optimization
Sitemap
The sitemap field is used to provide search engines with a link to one or more XML sitemaps.
While not required, it’s a best practice to include XML sitemaps within the robots.txt file to provide search engines with a list of priority URLs to crawl.
The value of the sitemap field should be an absolute URL (e.g., https://www.example.com/sitemap.xml), not a relative URL (e.g., /sitemap.xml). If you have multiple XML sitemaps, you can include multiple sitemap fields.
Example robots.txt with a single XML sitemap:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap.xml
Example robots.txt with multiple XML sitemaps:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap-1.xml
sitemap: https://www.example.com/sitemap-2.xml
sitemap: https://www.example.com/sitemap-3.xml
Crawl-Delay
As mentioned above, 20% of sites also include the crawl-delay field within their robots.txt file.
The crawl-delay field tells bots how fast they can crawl the site and is typically used to slow down crawling to avoid overloading servers.
The value for crawl-delay is the number of seconds crawlers should wait before requesting a new page. The below rule would tell the specified crawler to wait five seconds after each request before requesting another URL.
user-agent: FastCrawlingBot
crawl-delay: 5
Google has stated that it does not support the crawl-delay field, and it will be ignored.
Other major search engines like Bing and Yahoo respect crawl-delay directives for their web crawlers.
| Search Engine | Primary user-agent for search | Respects crawl-delay? |
| --- | --- | --- |
| Google | Googlebot | No |
| Bing | Bingbot | Yes |
| Yahoo | Slurp | Yes |
| Yandex | YandexBot | Yes |
| Baidu | Baiduspider | No |
Sites most commonly include crawl-delay directives for all user agents (using user-agent: *), search engine crawlers mentioned above that respect crawl-delay, and crawlers for SEO tools like AhrefsBot and SemrushBot.
The number of seconds crawlers were instructed to wait before requesting another URL ranged from one second to 20 seconds, but crawl-delay values of 5 seconds and 10 seconds were the most common across the 60 sites analyzed.
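As a sketch of that common pattern (the bot names are real, the delay value is illustrative), a site might group the SEO tool crawlers and assign a single delay:
user-agent: AhrefsBot
user-agent: SemrushBot
crawl-delay: 10
#Asks Ahrefs’ and Semrush’s crawlers to wait 10 seconds between requests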
Testing Robots.txt Files
Any time you’re creating or updating a robots.txt file, make sure to test directives, syntax, and structure before publishing.
This robots.txt Validator and Testing Tool makes this easy to do (thank you, Max Prin!).
To test a live robots.txt file, simply:
- Add the URL you want to test.
- Select your user agent.
- Choose “live.”
- Click “test.”
The below example shows that Googlebot Smartphone is allowed to crawl the tested URL.
Image from author, November 2024
If the tested URL is blocked, the tool will highlight the specific rule that prevents the selected user agent from crawling it.
Image from author, November 2024
To test new rules before they are published, switch to “Editor” and paste your rules into the text box before testing.
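If you prefer to sanity-check rules outside the browser, Python’s built-in urllib.robotparser can do a quick pass. This is a minimal sketch using the example.com URLs from earlier; swap in your own domain, user agent, and paths. Note that robotparser follows the original robots.txt specification, so Google-specific wildcard handling may not exactly match what the testing tool above reports.
from urllib import robotparser

# Point the parser at the live robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Check whether a given user agent may fetch each URL
for url in (
    "https://www.example.com/guides/technical/robots-txt",
    "https://www.example.com/search?search-query=xml-sitemap-guide",
):
    allowed = rp.can_fetch("Googlebot", url)
    print(("ALLOWED" if allowed else "BLOCKED") + ": " + url)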
Common Uses Of A Robots.txt File
While what is included in a robots.txt file varies greatly by website, analyzing 60 robots.txt files revealed some commonalities in how it is leveraged and what types of content webmasters commonly block search engines from crawling.
Preventing Search Engines From Crawling Low-Value Content
Many websites, especially large ones like ecommerce or content-heavy platforms, often generate “low-value pages” as a byproduct of features designed to improve the user experience.
For example, internal search pages and faceted navigation options (filters and sorts) help users find what they’re looking for quickly and easily.
While these features are essential for usability, they can result in duplicate or low-value URLs that aren’t valuable for search.
The robots.txt is typically leveraged to block these low-value pages from being crawled.
Common types of content blocked via the robots.txt include (a combined sketch follows this list):
- Parameterized URLs: URLs with tracking parameters, session IDs, or other dynamic variables are blocked because they often lead to the same content, which can create duplicate content issues and waste the crawl budget. Blocking these URLs ensures search engines only index the primary, clean URL.
- Filters and sorts: Blocking filter and sort URLs (e.g., product pages sorted by price or filtered by category) helps avoid indexing multiple versions of the same page. This reduces the risk of duplicate content and keeps search engines focused on the most important version of the page.
- Internal search results: Internal search result pages are often blocked because they generate content that doesn’t offer unique value. If a user’s search query is injected into the URL, page content, and meta elements, sites might even risk some inappropriate, user-generated content getting crawled and indexed (see the example screenshot in this post by Matt Tutt). Blocking them prevents this low-quality – and potentially inappropriate – content from appearing in search.
- User profiles: Profile pages may be blocked to protect privacy, reduce the crawling of low-value pages, or keep the focus on more important content, like product pages or blog posts.
- Testing, staging, or development environments: Staging, development, or test environments are often blocked to ensure that non-public content is not crawled by search engines.
- Campaign sub-folders: Landing pages created for paid media campaigns are often blocked when they aren’t relevant to a broader search audience (i.e., a direct mail landing page that prompts users to enter a redemption code).
- Checkout and confirmation pages: Checkout pages are blocked to prevent users from landing on them directly through search engines, enhancing user experience and protecting sensitive information during the transaction process.
- User-generated and sponsored content: Sponsored content or user-generated content created via reviews, questions, comments, etc., is often blocked from being crawled by search engines.
- Media files (images, videos): Media files are sometimes blocked from being crawled to conserve bandwidth and reduce the visibility of proprietary content in search engines. It ensures that only relevant web pages, not standalone files, appear in search results.
- APIs: APIs are often blocked to prevent them from being crawled or indexed because they are designed for machine-to-machine communication, not for end-user search results. Blocking APIs protects their usage and reduces unnecessary server load from bots trying to access them.
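To make the list above concrete, below is a sketch that combines several of these patterns into one file. Every path and parameter name here is hypothetical, so adjust them to your own URL structure:
user-agent: *
#Internal search results
disallow: /search
#Parameterized and faceted URLs
disallow: /*?sessionid=
disallow: /*?sort=
disallow: /*?filter=
#Checkout and confirmation pages
disallow: /checkout
#APIs
disallow: /api/
sitemap: https://www.example.com/sitemap.xml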
Blocking “Bad” Bots
Bad bots are web crawlers that engage in unwanted or malicious activities such as scraping content and, in extreme cases, looking for vulnerabilities to steal sensitive information.
Other bots without any malicious intent may still be considered “bad” if they flood websites with too many requests, overloading servers.
Additionally, webmasters may simply not want certain crawlers accessing their site because they don’t stand to gain anything from it.
For example, you may choose to block Baidu if you don’t serve customers in China and don’t want to risk requests from Baidu impacting your server.
Though some of these “bad” bots may disregard the instructions outlined in a robots.txt file, websites still commonly include rules to disallow them.
Out of the 60 robots.txt files analyzed, 100% disallowed at least one user agent from accessing all content on the site (via disallow: /).
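For example, a rule like the following would ask Baidu’s search crawler to stay away from the entire site (keeping in mind that compliance is voluntary):
user-agent: Baiduspider
disallow: /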
Blocking AI Crawlers
Across the sites analyzed, the most blocked crawler was GPTBot, with 23% of sites blocking GPTBot from crawling any content on the site.
Originality.ai’s live dashboard, which tracks how many of the top 1,000 websites are blocking specific AI web crawlers, found similar results, with 27% of the top 1,000 sites blocking GPTBot as of November 2024.
Reasons for blocking AI web crawlers may vary – from concerns over data control and privacy to simply not wanting your data used in AI training models without compensation.
The decision on whether or not to block AI bots via the robots.txt should be evaluated on a case-by-case basis.
If you don’t want your site’s content to be used to train AI but also want to maximize visibility, you’re in luck. OpenAI is transparent about how it uses GPTBot and other web crawlers.
At a minimum, sites should consider allowing OAI-SearchBot, which is used to feature and link to websites in SearchGPT – ChatGPT’s recently launched real-time search feature.
Blocking OAI-SearchBot is far less common than blocking GPTBot, with only 2.9% of the top 1,000 sites blocking the SearchGPT-focused crawler.
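As a rough sketch, blocking AI training while staying visible in SearchGPT could look like this:
user-agent: GPTBot
disallow: /

user-agent: OAI-SearchBot
allow: /
#GPTBot (used for AI training) is blocked, while OAI-SearchBot (used by SearchGPT) remains allowed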
Getting Creative
In addition to being an important tool for controlling how web crawlers access your site, the robots.txt file can also be an opportunity for sites to show their “creative” side.
While sifting through files from over 60 sites, I also came across some delightful surprises, like the playful illustrations hidden in the comments of Marriott’s and Cloudflare’s robots.txt files.
Screenshot of marriott.com/robots.txt, November 2024
Screenshot of cloudflare.com/robots.txt, November 2024
Multiple companies are even turning these files into unique recruitment tools.
TripAdvisor’s robots.txt doubles as a job posting with a clever message included in the comments:
“If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious folks such as yourself…
Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team[.]”
If you’re looking for a new career opportunity, you might want to consider browsing robots.txt files in addition to LinkedIn.
How To Audit Robots.txt
Auditing your robots.txt file is an essential part of most technical SEO audits.
Conducting a thorough robots.txt audit ensures that your file is optimized to enhance site visibility without inadvertently restricting important pages.
To audit your robots.txt file:
- Crawl the site using your preferred crawler. (I typically use Screaming Frog, but any web crawler should do the trick.)
- Filter the crawl for any pages flagged as “blocked by robots.txt.” In Screaming Frog, you can find this information by going to the Response Codes tab and filtering by “blocked by robots.txt.”
- Review the list of URLs blocked by the robots.txt to determine whether they should be blocked. Refer to the above list of common types of content blocked by robots.txt to help you determine whether the blocked URLs should be accessible to search engines.
- Open your robots.txt file and conduct further checks to make sure it follows the SEO best practices (and avoids the common pitfalls) detailed below. A scripted sitemap cross-check is also sketched after this list.
Image from author, November 2024
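One additional check worth scripting is a cross-reference between your XML sitemap and your robots.txt, since a URL that appears in the sitemap but is disallowed sends search engines conflicting signals. The sketch below uses Python’s standard library and the hypothetical example.com URLs from earlier; point it at your own sitemap and robots.txt.
from urllib import request, robotparser
import xml.etree.ElementTree as ET

# Load the live robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Pull every URL out of the XML sitemap (namespace per sitemaps.org)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
with request.urlopen("https://www.example.com/sitemap.xml") as resp:
    tree = ET.parse(resp)
urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", ns)]

# Any sitemap URL blocked by robots.txt is a conflicting signal worth reviewing
conflicts = [url for url in urls if not rp.can_fetch("Googlebot", url)]
print(str(len(conflicts)) + " of " + str(len(urls)) + " sitemap URLs are blocked by robots.txt")
for url in conflicts:
    print(url)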
Robots.txt Best Practices (And Pitfalls To Avoid)
The robots.txt is a powerful tool when used effectively, but there are some common pitfalls to steer clear of if you don’t want to unintentionally harm the site.
The following best practices will help you set yourself up for success and avoid unintentionally blocking search engines from crawling important content:
- Create a robots.txt file for each subdomain. Each subdomain on your site (e.g., blog.yoursite.com, shop.yoursite.com) should have its own robots.txt file to manage crawling rules specific to that subdomain. Search engines treat subdomains as separate sites, so a unique file ensures proper control over what content is crawled or indexed.
- Don’t block important pages on the site. Make sure priority content, such as product and service pages, contact information, and blog content, is accessible to search engines. Additionally, make sure that blocked pages aren’t preventing search engines from accessing links to content you want crawled and indexed.
- Don’t block essential resources. Blocking JavaScript (JS), CSS, or image files can prevent search engines from rendering your site correctly. Ensure that important resources required for the proper display of the site are not disallowed.
- Include a sitemap reference. Always include a reference to your sitemap in the robots.txt file. This makes it easier for search engines to find and crawl your important pages more efficiently.
- Don’t only allow specific bots to access your site. If you disallow all bots from crawling your site, except for specific search engines like Googlebot and Bingbot, you may unintentionally block bots that could benefit your site. Example bots include:
- FacebookExternalHit – used to get Open Graph protocol data.
- Googlebot-News – used for the News tab in Google Search and the Google News app.
- AdsBot-Google – used to check webpage ad quality.
- Don’t block URLs that you want removed from the index. Blocking a URL in robots.txt only prevents search engines from crawling it, not from indexing it if the URL is already known. To remove pages from the index, use other methods like the “noindex” tag or URL removal tools, ensuring they’re properly excluded from search results (see the illustration after this list).
- Don’t block Google and other major search engines from crawling your entire site. Just don’t do it.
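As a quick illustration of the index-removal point above, a page you want dropped from the index can carry a meta robots tag in its <head> or an equivalent HTTP response header, and it must remain crawlable so search engines can see the directive:
- Meta tag in the page’s <head>: <meta name="robots" content="noindex">
- Or as an HTTP response header: X-Robots-Tag: noindex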
TL;DR
- A robots.txt file guides search engine crawlers on which areas of a website to access or avoid, optimizing crawl efficiency by focusing on high-value pages.
- Key fields include “User-agent” to specify the target crawler, “Disallow” for restricted areas, and “Sitemap” for priority pages. The file can also include directives like “Allow” and “Crawl-delay.”
- Websites commonly leverage robots.txt to block internal search results, low-value pages (e.g., filters, sort options), or sensitive areas like checkout pages and APIs.
- An increasing number of websites are blocking AI crawlers like GPTBot, though this might not be the best strategy for sites looking to gain traffic from additional sources. To maximize site visibility, consider allowing OAI-SearchBot at a minimum.
- To set your site up for success, ensure each subdomain has its own robots.txt file, test directives before publishing, include an XML sitemap declaration, and avoid accidentally blocking key content.
More resources:
- 8 Common Robots.txt Issues And How To Fix Them
- 13 Steps To Boost Your Site’s Crawlability And Indexability
- The Complete Technical SEO Audit Workbook
Featured Image: Se_vector/Shutterstock