cybersecuritweb

8 Common Robots.txt Issues

by @cybersecuritweb (110), 1 month ago

Robots.txt is a useful and effective tool for telling search engine crawlers how to crawl your website, and keeping this file under control is essential to effective technical SEO. It's not all-powerful (in Google's own words, "it is not a mechanism for keeping a web page out of Google"), but it can help save your site or server from being inundated with crawler requests. If you use this crawl block on your website, make sure it is being used correctly.

This is especially crucial if you employ dynamic URLs or other techniques that produce an almost endless number of pages.

This tutorial will examine some of the most typical problems that arise with the robots.txt file, how they affect your website and search engine presence, and how to resolve them if you believe they have happened.

8 Common Robots.txt Mistakes

  1. Robots.txt Not In The Root Directory.

  2. Poor Use Of Wildcards.

  3. Noindex In Robots.txt.

  4. Blocked Scripts And Stylesheets.

  5. No Sitemap URL.

  6. Access To Development Sites.

  7. Using Absolute URLs.

  8. Deprecated & Unsupported Elements.

  1. Robots.txt Not In The Root Directory

Search robots can only discover the file if it’s in your root folder.

That’s why there should be only a forward slash between the .com (or equivalent domain) of your website, and the ‘robots.txt’ filename, in the URL of your robots.txt file.

If there’s a subfolder in there, your robots.txt file is probably not visible to the search robots, and your website is probably behaving as if there was no robots.txt file at all.
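As an illustration, assuming a hypothetical site at www.example.com, the difference looks like this:

```text
https://www.example.com/robots.txt         <- root directory: crawlers will find it
https://www.example.com/media/robots.txt   <- subfolder: crawlers will not look here
```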

  2. Poor Use Of Wildcards

Robots.txt supports two wildcard characters:

Asterisk (*) – represents zero or more instances of any valid character, like a Joker in a deck of cards.

Dollar sign ($) – denotes the end of a URL, allowing you to apply rules only to the final part of the URL, such as the filetype extension.

It’s sensible to adopt a minimalist approach to using wildcards, as they have the potential to apply restrictions to a much broader portion of your website.

It’s also relatively easy to end up blocking robot access from your entire site with a poorly placed asterisk.

Test your wildcard rules using a robots.txt testing tool to ensure they behave as expected. Be cautious with wildcard usage to prevent accidentally blocking or allowing too much.
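A minimal sketch of how the two wildcards might be used. The paths and the sessionid parameter are hypothetical, chosen only for illustration:

```text
User-agent: *
# Blocks any URL whose path ends in .pdf
Disallow: /*.pdf$
# Blocks any URL that contains ?sessionid=
Disallow: /*?sessionid=
# Too broad: this would match every URL on the site, so it is left commented out
# Disallow: /*
```

The last rule shows how a single stray asterisk can turn a narrow restriction into a site-wide block.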

  3. Noindex In Robots.txt

This one is more common on websites that are over a few years old.

Google has stopped obeying noindex rules in robots.txt files as of September 1, 2019.

If your robots.txt file was created before that date and still contains noindex instructions, you will likely see those pages indexed in Google’s search results.
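For reference, a leftover rule of this kind typically looks something like the following. The path is hypothetical:

```text
User-agent: *
# No longer obeyed by Google as of September 1, 2019
Noindex: /old-campaign/
```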

The solution to this problem is to implement an alternative “noindex” method.

One option is the robots meta tag, which you can add to the head of any webpage you want to prevent Google from indexing.

  4. Blocked Scripts And Stylesheets

It might seem logical to block crawler access to external JavaScript files and cascading style sheets (CSS).

However, remember that Googlebot needs access to CSS and JS files to “see” your HTML and PHP pages correctly.

If your pages are behaving oddly in Google’s results, or it looks like Google is not seeing them correctly, check whether you are blocking crawler access to required external files.

A simple solution to this is to remove the line from your robots.txt file that is blocking access.

Or, if you have some files you do need to block, insert an exception that restores access to the necessary CSS and JavaScript.
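One way such an exception might look, assuming a hypothetical /assets/ directory that holds both private files and the CSS and JavaScript your pages need:

```text
User-agent: *
# Block the directory as a whole
Disallow: /assets/
# Exceptions that restore access to rendering resources
Allow: /assets/*.css
Allow: /assets/*.js
```

This relies on Google applying the most specific (longest) matching rule, so the Allow lines win over the shorter Disallow. Other crawlers may resolve conflicts differently, so test the result.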

  5. No XML Sitemap URL

This is more about SEO than anything else.

You can include the URL of your XML sitemap in your robots.txt file.

Because robots.txt is the first place Googlebot looks when it crawls your website, this gives the crawler a head start in knowing the structure and main pages of your site.

While this is not strictly an error – as omitting a sitemap should not negatively affect the actual core functionality and appearance of your website in the search results – it’s still worth adding your sitemap URL to robots.txt if you want to give your SEO efforts a boost.
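A sketch of what that line might look like, assuming a hypothetical sitemap at https://www.example.com/sitemap.xml:

```text
User-agent: *
Disallow:

# The sitemap location is given as a full URL, not a relative path
Sitemap: https://www.example.com/sitemap.xml
```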

  6. Access To Development Sites

Blocking crawlers from your live website is a no-no, but so is allowing them to crawl and index your pages that are still under development.

It’s best practice to add a disallow instruction to the robots.txt file of a website under construction so the general public doesn’t see it until it’s finished.

Equally, it’s crucial to remove the disallow instruction when you launch a completed website.

Forgetting to remove this line from robots.txt is one of the most common mistakes among web developers; it can stop your entire website from being crawled and indexed correctly.

If your development site seems to be receiving real-world traffic, or your recently launched website is not performing at all well in search, look for a universal user agent disallow rule in your robots.txt file:

User-Agent: *

Disallow: /

If you see this when you shouldn’t (or don’t see it when you should), make the necessary changes to your robots.txt file and check that your website’s search appearance updates accordingly.

  7. Using Absolute URLs

While using absolute URLs in things like canonicals and hreflang is best practice, for URLs in the robots.txt, the inverse is true.

Using relative paths in the robots.txt file is the recommended approach for indicating which parts of a site should not be accessed by crawlers.

This is detailed in Google’s robots.txt documentation, which states:

A directory or page, relative to the root domain, that may be crawled by the user agent just mentioned.

When you use an absolute URL, there’s no guarantee that crawlers will interpret it as intended and that the disallow/allow rule will be followed.
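To make the contrast concrete, here is a short sketch using a hypothetical /private/ directory on a hypothetical domain:

```text
User-agent: *
# Relative path: the recommended form
Disallow: /private/
# Absolute URL: may not be interpreted as intended, so it is left commented out
# Disallow: https://www.example.com/private/
```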

  8. Deprecated & Unsupported Elements

While the guidelines for robots.txt files haven’t changed much over the years, two elements that are still often included despite limited support are:

Crawl-delay.

Noindex.

Bing supports crawl-delay but Google doesn’t, yet webmasters often specify it anyway. You used to be able to set a crawl rate in Google Search Console, but this setting was retired around the end of 2023.

In July 2019, Google announced it would stop supporting the noindex directive in robots.txt files, effective September 1, 2019. Before this date, webmasters were able to use the noindex directive in their robots.txt file.

This was never a widely supported or standardized practice, and the preferred method for noindex was to use a robots meta tag on the page or an X-Robots-Tag HTTP header at page level.
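If you do want to keep crawl-delay for the crawlers that honor it, one option is to scope it to those user agents only. The sketch below assumes a hypothetical delay value of 5; how a crawler interprets that number is for its own documentation to say:

```text
# Applies only to Bing's crawler, which supports the directive
User-agent: bingbot
Crawl-delay: 5

# Google ignores crawl-delay, so none is set here
User-agent: Googlebot
Disallow:
```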
