Search engine bots, or crawlers, use two files to crawl a website more intelligently: robots.txt and sitemap.xml. A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from the site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep pages out of Google, we should use “noindex” meta tags and also exclude the page from the sitemap. Here is how Google explains the sitemap concept.
In this post, I’m going to explain how I configured the sitemap.xml, robots.txt, and robots noindex, nofollow meta tags on this Hugo-generated website. As I mentioned before in the rss-feed post, Hugo generates the RSS feed, sitemap, and robots.txt files automatically. Still, we may want to change the default behavior and override Hugo’s templates.
Sitemap
In my case, I wanted to remove some single pages like /series and /archives, as well as all tag pages and section pages. I am going to include only blog posts and a few single pages like the about page.
To achieve this, we need to create a custom sitemap template at /layouts/_default/sitemap.xml. Copy and paste Hugo’s default template, then edit it according to your needs.
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
{{ range .Data.Pages }}
<url>
<loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
<lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
<changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
<priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
</url>
{{ end }}
</urlset>
This is how I edited it. You simply need to change the {{ range .Data.Pages }} line to change how it loops through the pages:
{{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
{{ range $pages }}
As you can see above, I’ve used a parameter called sitemap_exclude to filter pages. I’ve added this boolean parameter to the front matter of the pages that I want to exclude from the sitemap.
---
sitemap_exclude: true
---
One small note here: you can’t name the parameter sitemap: true or sitemapExclude: true. The first clashes with Hugo’s built-in sitemap front-matter key, and the latter didn’t work for me, presumably because of the uppercase letter; keep it lowercase. At least, that was my experience with Hugo v0.74.
One more side note: you can set changefreq and priority in a page’s front matter, or set a default in config.toml.
---
sitemap:
  changefreq: monthly
  priority: 0.9
---
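If you prefer setting a site-wide default in config.toml instead, it should look something like this (the values here are just examples, following Hugo’s documented sitemap settings):
[sitemap]
  changefreq = "monthly"
  priority = 0.5
  filename = "sitemap.xml"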
NoIndex - NoFollow Meta Tags
To my understanding, the sitemap and the “noindex” robots meta tag complement each other. If you want to keep a page out of search results, you should add noindex, nofollow meta tags and also remove the page from the sitemap. We’ve already excluded some pages from the sitemap; now we need to add <meta name="robots" content="noindex, nofollow" /> to the <head> section of those pages.
To achieve this, you need to add a control mechanism to the head.html partial. I didn’t want to add more front-matter parameters like nofollow, noindex, or noarchive, so I’ve used the same sitemap_exclude parameter I’ve already defined above.
{{ if or (ne .Kind "page") (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}
My if statement may seem complicated because the template language is not easy to read, but it’s actually pretty simple. Let me explain.
.Kind != "page" || (.Kind == "page" && .Params.sitemap_exclude)
if .Kind is not “page”: this covers the “home”, “section”, “taxonomy”, and “term” pages.
if .Kind is “page” and sitemap_exclude is true: this applies to the pages where I’ve set the sitemap_exclude parameter manually.
In the future I may want to stop excluding “section” pages if I add content to them. In that case, I can select “section” pages simply by checking the .Kind variable: eq .Kind "section".
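As a rough sketch (not something this site uses yet), the condition could be loosened like this to leave section pages indexable; the sitemap template would need a matching tweak too, since it currently ranges over regular pages only:
{{/* noindex home, taxonomy and term pages, plus manually excluded regular pages; sections stay indexable */}}
{{ if or (and (ne .Kind "page") (ne .Kind "section")) (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}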
Robots.txt
By default, Hugo creates this file for us, but it allows every page to be crawled, and I want to disallow some. Here are the Hugo docs. Note that robots.txt doesn’t block crawlers; it just tells them “don’t crawl these areas”, and a crawler doesn’t have to obey it. Check robotstxt.org if you need to learn more.
To override the default robots.txt, we need to add this line to config.toml:
enableRobotsTXT = true
Then we can create our own robots.txt file under the /layouts folder. To check it, go to /robots.txt on your site. I disallowed a few single pages and the tags folder, and included the sitemap URL on the last line.
User-agent: *
Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html
Sitemap: {{ "sitemap.xml" | absLangURL }}
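For reference, with a hypothetical baseURL of https://example.com/, the generated /robots.txt should render to something like this:
User-agent: *
Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html
Sitemap: https://example.com/sitemap.xml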
Conclusion
We’ve added a custom robots.txt, “noindex, nofollow” meta tags, and a custom sitemap.xml to exclude pages that contain little or no content. We’ve excluded taxonomy and term pages like “tags” and “categories”, and also excluded some other single pages by setting a boolean in the page front matter.
For better SEO, we should exclude pages with thin content. Should we, though? These guys say we should improve those pages to get traffic instead of blocking them. I wouldn’t listen to me about SEO; do your own research. What I’ve explained here is how you can override Hugo’s templates.