Search engine bots, or crawlers, use two files to crawl a website more intelligently: robots.txt and sitemap.xml. A robots.txt file tells search engine crawlers which pages or files the crawler can or can't request from the site. It is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep pages out of Google, we should use "noindex" meta tags and also exclude the page from the sitemap. Google's documentation explains the sitemap concept in more detail.

In this post, I'm going to explain how I configured sitemap.xml, robots.txt, and the robots noindex, nofollow meta tags on this Hugo-generated website. As I mentioned before in the rss-feed post, Hugo generates the RSS feed, sitemap, and robots.txt files automatically. Still, we may want to change the default behavior and override Hugo's templates.

## Sitemap

In my case, I wanted to remove some single pages like /series and /archives, as well as all tag pages and section pages. I am going to include only blog posts and a few single pages, like the about page.

To achieve this, we need to create a custom sitemap template at /layouts/_default/sitemap.xml. Copy and paste Hugo's default template, then edit it according to your needs.

```xml
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
  </url>
  {{ end }}
</urlset>
```

This is how I edited it. You only need to change the `{{ range .Data.Pages }}` line to alter how the template loops through pages.

```
{{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
{{ range $pages }}
```

As you can see above, I've used a parameter called sitemap_exclude to filter pages. I've added this boolean parameter to the front-matter of the pages that I want to exclude from the sitemap. Note that switching from .Data.Pages to .Site.RegularPages also drops list pages (home, section, taxonomy, and term pages) from the sitemap, since RegularPages contains only regular content pages.

```yaml
---
sitemap_exclude: true
---
```
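Putting the pieces together, here is a minimal sketch of the resulting /layouts/_default/sitemap.xml; the changefreq, priority, and translation parts of Hugo's default template can stay as they were:

```xml
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  {{/* only regular pages, minus those flagged with sitemap_exclude */}}
  {{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
  {{ range $pages }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}
  </url>
  {{ end }}
</urlset>
```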

One small note here: you can't write `sitemap: true` or `sitemapExclude: true`. I think the first one is a reserved keyword, and the latter doesn't work because of the uppercase letter; the name should be lowercase. Try it; it didn't work for me, at least (Hugo v0.74).

One more side note: you can set `changefreq` and `priority` in a page's front-matter, or set site-wide defaults in config.toml.

```yaml
---
sitemap:
  changefreq: monthly
  priority: 0.9
---
```
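The site-wide defaults in config.toml look like this; these keys come from Hugo's sitemap configuration docs, and `filename` is optional:

```toml
[sitemap]
  changefreq = "monthly"
  priority = 0.5
  filename = "sitemap.xml"
```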

## NoIndex - NoFollow Meta Tags

To my understanding, the sitemap and the "noindex" robots tags complement each other. If you want to keep a page out of search results, you should add the noindex, nofollow meta tags and also remove the page from the sitemap. We've already excluded some pages from the sitemap; now we need to add `<meta name="robots" content="noindex, nofollow" />` to the `<head>` section of those pages.

To achieve this, you need to add a control mechanism to the head.html partial. I didn't want to add more front-matter parameters like nofollow, noindex, or noarchive, so I've reused the same sitemap_exclude parameter I defined above.

```
{{ if or (ne .Kind "page") (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}
```

My if statement may look complicated because the template language is not easy to read, but it's actually pretty simple. Let me explain.

```
.Kind != "page" || (.Kind == "page" && .Params.sitemap_exclude)
```
- If `.Kind` is not "page": this covers the "home", "section", "taxonomy", and "term" pages.
- If `.Kind` is "page" and `sitemap_exclude` is true: this applies to the pages where I've set the `sitemap_exclude` parameter manually.

In the future, I may want to stop excluding "section" pages if I add content to them. In that case, I can select "section" pages simply by checking the `.Kind` variable: `eq .Kind "section"`.
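For reference, here is a minimal sketch of that adjusted condition, assuming the same head.html partial and the same sitemap_exclude parameter; it keeps the meta tag on home, taxonomy, and term pages, but lets section pages be indexed:

```
{{/* noindex everything except regular pages and sections,
     plus any page flagged manually in front-matter */}}
{{ if or (in (slice "home" "taxonomy" "term") .Kind) (eq .Params.sitemap_exclude true) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}
```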

## Robots.txt

By default, Hugo creates this file for us, but it allows every page to be crawled; I want to disallow some. See the Hugo docs for the details. Keep in mind that robots.txt doesn't block crawlers: it just tells them "don't crawl these areas", and a crawler doesn't have to obey. Check robotstxt.org if you want to learn more.

To override the default robots.txt, we need to add this line to config.toml.

```toml
enableRobotsTXT = true
```

Then we can create our own robots.txt template under the /layouts folder. To check the result, go to /robots.txt on your site. I disallowed a few single pages and the tags folder, and included the sitemap location on the last line.

```
User-agent: *
Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html

Sitemap: {{ "sitemap.xml" | absLangURL }}
```
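To quickly verify the rendered output locally, assuming hugo server's default port:

```sh
hugo server
# in another terminal, fetch the generated file
curl http://localhost:1313/robots.txt
```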

## Conclusion

We've added a custom robots.txt, "noindex, nofollow" meta tags, and a custom sitemap.xml to exclude pages that contain little or no content. We've excluded taxonomy and term pages like "tags" and "categories", and excluded some other single pages by setting a boolean in the page front-matter.

For better SEO, we should exclude pages with thin content.

Should we, though? Others say we should improve those pages to get traffic instead of blocking them. I wouldn't listen to me about SEO; do your own research. What I've explained here is how to override Hugo's templates.
