published on 06.08.2020

Search engine bots, or crawlers, use two files to crawl a website more intelligently: robots.txt and sitemap.xml. A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from the site. It is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep pages out of Google, we should use “noindex” meta tags and also exclude the page from the sitemap. Here is how Google explains the sitemap concept.

In this post, I’m going to explain how I configured sitemap.xml, robots.txt, and the robots noindex, nofollow meta tags on this Hugo-generated website. As I mentioned before in the rss-feed post, Hugo generates the RSS feed, sitemap, and robots.txt files automatically. Yet, we may want to change the default behavior and override Hugo’s templates.

Sitemap

In my case, I wanted to remove some single pages like /series and /archives, as well as all tag pages and section pages. I am going to include only blog posts and a few single pages, like the about page.

To achieve this, we need to create a custom sitemap template at /layouts/_default/sitemap.xml. Copy and paste Hugo’s default template, then edit it according to your needs.

{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
  </url>
  {{ end }}
</urlset>

This is how I edited it: you simply change the {{ range .Data.Pages }} line to control which pages it loops through.

{{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
  {{ range $pages }}

As you can see above I’ve used a parameter called sitemap_exclude to filter pages. I’ve added this boolean parameter to the front-matter of the pages that I want to exclude from the sitemap.

---
sitemap_exclude: true
---

One small note here: you can’t write ‘sitemap: true’ or ‘sitemapExclude: true’. I think the first one conflicts with Hugo’s built-in sitemap front-matter key, and the latter doesn’t work because of the uppercase letter; the parameter name should be lowercase. Try it yourself, but it didn’t work for me, at least on Hugo v0.74.

One more side note: you can set changefreq and priority in the page’s front-matter, or set a default in config.toml.

---
sitemap:
  changefreq: monthly
  priority: 0.9
---
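If you’d rather set a site-wide default than repeat this in every page’s front-matter, Hugo reads default sitemap values from config.toml. A minimal sketch — the values here are just examples, not recommendations:

```toml
# Site-wide defaults used by the sitemap template (example values)
[sitemap]
  changefreq = "monthly"
  priority = 0.5
  filename = "sitemap.xml"
```

Per-page front-matter values, like the ones above, take precedence over these defaults.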

NoIndex - NoFollow Meta Tags

To my understanding, the sitemap and the “noindex” robots tags complement each other. If you want to tell crawlers not to index a page, you should add the noindex, nofollow meta tags and also remove the page from the sitemap. We’ve already excluded some pages from the sitemap; now we need to add <meta name="robots" content="noindex, nofollow" /> to the <head> section of those pages.

To achieve this, you need to add a control mechanism in the head.html partial. I didn’t want to add more front-matter parameters like nofollow, noindex, or noarchive, so I’ve used the same sitemap_exclude parameter I defined above.

{{ if or (ne .Kind "page") (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
  <meta name="robots" content="noindex, nofollow" />
{{ end }}

My if statement may seem complicated because the template language is not easy to read, but it’s actually pretty simple. Let me explain.

.Kind != "page" || (.Kind == "page" && .Params.sitemap_exclude)
  • if .Kind is not “page”: this will gather “home”, “section”, “taxonomy” and “term” pages.

  • if .Kind is “page” and sitemap_exclude is true: this will apply for pages I’ve set sitemap_exclude parameter manually.

In the future, I may want to stop excluding “section” pages if I add content to them. In that case, I can select “section” pages simply by checking the .Kind variable: eq .Kind "section".
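If I ever do that, the condition could be loosened with one extra check so that “section” pages are no longer marked noindex. A sketch, not something I’m using yet:

```
{{/* noindex everything that is neither a regular page nor a section,
     plus regular pages explicitly flagged in front-matter */}}
{{ if or (and (ne .Kind "page") (ne .Kind "section")) (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
  <meta name="robots" content="noindex, nofollow" />
{{ end }}
```

Remember that the sitemap template would need the matching change, so the two stay consistent.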

Robots.txt

By default, Hugo creates this file for us, but it allows every page to be crawled; I want to disallow some. Here are the Hugo docs. Note that robots.txt doesn’t block crawlers; it just tells them “don’t crawl these areas”, and a crawler doesn’t have to obey it. Check robotstxt.org if you need to learn more.

To override the default robots.txt, we need to add this line to config.toml.

enableRobotsTXT = true

Then we can create our own robots.txt file under the /layouts folder. To check it, go to /robots.txt on your site. I disallowed some single pages and the tags folder, and included the sitemap location in the last line.

User-agent: *

Disallow: /tags/
Disallow: archives.html
Disallow: series.html
Disallow: 404.html

Sitemap: {{ "sitemap.xml" | absLangURL }}
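For reference, with a hypothetical baseURL like https://example.com/, the absLangURL call resolves the relative path, and the template above would render roughly like this:

```
User-agent: *

Disallow: /tags/
Disallow: archives.html
Disallow: series.html
Disallow: 404.html

Sitemap: https://example.com/sitemap.xml
```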

Conclusion

We’ve added a custom “robots.txt”, “noindex, nofollow” meta tags, and a custom “sitemap.xml” to exclude pages that contain little or no content. We’ve excluded taxonomy and term pages like “tags” and “categories”, and also excluded some other single pages by setting a boolean in the page front-matter.

For better SEO, we should exclude pages with thin content.

Should we, though? These guys say “we should improve those pages to get traffic instead of blocking them”. I wouldn’t listen to me about SEO; do your own research. I’ve explained how you can override Hugo templates.

Published on 06.08.2020 by Mert Bakır with commit 261e231.