published on 06.08.2020

Search engine bots, or crawlers, use two files to crawl a website more intelligently: robots.txt and sitemap.xml. A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from the site. It is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep pages out of Google, we should use “noindex” meta tags and also exclude the page from the sitemap. Here is how Google explains the sitemap file concept.

In this post, I’m going to explain how I configured sitemap.xml, robots.txt, and the robots noindex, nofollow meta tags on this Hugo-generated website. As I mentioned before in the rss-feed post, Hugo generates the RSS feed, sitemap, and robots.txt files automatically. Yet, we may want to change the default behavior and override Hugo’s templates.

Sitemap

In my case, I wanted to remove some single pages like /series and /archives, plus all tag pages and section pages. I am going to include only blog posts and a few single pages like the about page.

To achieve this, we need to create a custom sitemap template at /layouts/_default/sitemap.xml: copy and paste Hugo’s default template, then edit it to suit your needs.

{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
  </url>
  {{ end }}
</urlset>

Here is how I edited it. You only need to change the {{ range .Data.Pages }} line to control which pages it loops over.

{{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
  {{ range $pages }}

As you can see above, I’ve used a parameter called sitemap_exclude to filter pages. I’ve added this boolean parameter to the front-matter of the pages that I want to exclude from the sitemap.

---
sitemap_exclude: true
---

One small note here: you can’t name the parameter ‘sitemap’ or ‘sitemapExclude’. I think the first one is a reserved keyword, and the latter doesn’t work because of the uppercase letter; it should be lowercase. At least, that’s what I found on Hugo v0.74; try it yourself.

One more side note: you can set changefreq and priority in the page’s front-matter, or set a default in config.toml.

---
sitemap:
  changefreq: monthly
  priority: 0.9
---
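
I’m not showing my own config here, but for reference, Hugo’s documented site-wide sitemap defaults in config.toml look roughly like this (the values below are just examples):

# config.toml: defaults used when a page's front-matter doesn't set them
[sitemap]
  changefreq = "monthly"
  priority = 0.5
  filename = "sitemap.xml"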

NoIndex - NoFollow Meta Tags

To my understanding, the sitemap and the “noindex” robots tag complement each other. If you want to tell crawlers not to index a page, you should add noindex, nofollow meta tags and also remove the page from the sitemap. We’ve already excluded some pages from the sitemap; now we need to add <meta name="robots" content="noindex, nofollow" /> to the <head> section of those pages.

To achieve this, you need to add a control mechanism in the head.html partial. I didn’t want to add more front-matter parameters like nofollow, noindex, noarchive, so I’ve reused the same sitemap_exclude parameter I defined above.

{{ if or (ne .Kind "page") (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
  <meta name="robots" content="noindex, nofollow" />
{{ end }}

My if statement may seem complicated because the template language is not easy to read, but the logic is pretty simple. Let me explain.

.Kind != "page" || (.Kind == "page" && .Params.sitemap_exclude)
  • if .Kind is not “page”: this will gather “home”, “section”, “taxonomy” and “term” pages.

  • if .Kind is “page” and sitemap_exclude is true: this applies to the pages where I’ve set the sitemap_exclude parameter manually.

In the future, I may want to stop excluding “section” pages if I add content to them. In that case, I can single out “section” pages simply by checking the .Kind variable with eq .Kind "section", as sketched below.
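
For example, a variant of the head.html condition that leaves section pages indexable while keeping everything else unchanged might look like this. It’s only a sketch of that future change, not what this site currently uses:

{{/* same as before, but "section" pages no longer get noindex */}}
{{ if or (and (ne .Kind "page") (ne .Kind "section")) (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
  <meta name="robots" content="noindex, nofollow" />
{{ end }}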

Robots.txt

By default, Hugo creates this file for us, but it allows every page to be crawled, and I want to disallow some. Here are the Hugo docs. Note that robots.txt doesn’t block crawlers; it just tells them “don’t crawl these areas”, and a crawler doesn’t have to obey it. Check robotstxt.org if you want to learn more.

To override the default robots.txt, we need to add this line to config.toml.

enableRobotsTXT = true

Then we can create our own robots.txt template under the /layouts folder. To check it, go to /robots.txt on your site. I disallowed a few pages and the tags folder, and added the sitemap URL on the last line.

User-agent: *

Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html

Sitemap: {{ "sitemap.xml" | absLangURL }}

Conclusion

We’ve added a custom “robots.txt”, “noindex, nofollow” meta tags, and a custom “sitemap.xml” to exclude pages that contain little or no content. We’ve excluded taxonomy and term pages like “tags” and “categories”, and also excluded some other single pages by setting a boolean in the page front-matter.

For better SEO, we should exclude pages with thin content.

Should we, though? These guys say “we should improve those pages to get traffic instead of blocking them”. Don’t take my word on SEO; do your own research. I’ve simply explained how you can override Hugo’s templates.
