Search engine bots, or crawlers, use two files to crawl a website more intelligently: robots.txt and sitemap.xml. A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from the site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep pages out of Google, we should use “noindex” meta tags and also exclude the page from the sitemap. Here is how Google explains the sitemap concept.
In this post, I’m going to explain how I configured the sitemap.xml, robots.txt, and robots noindex, nofollow meta tags on this Hugo-generated website. As I mentioned before in the rss-feed post, Hugo generates the RSS feed, sitemap, and robots.txt files automatically. Still, we may want to change the default behavior and override Hugo’s templates.
Sitemap
In my case, I wanted to remove some single pages like /series and /archives, as well as all tag pages and section pages. I am going to include only blog posts and a few single pages like the about page.
To achieve this, we need to create a custom sitemap template at /layouts/_default/sitemap.xml. Copy and paste Hugo’s default template, then edit it according to your needs.
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
{{ range .Data.Pages }}
<url>
<loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
<lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
<changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
<priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
</url>
{{ end }}
</urlset>
This is how I edited it. You simply need to change the {{ range .Data.Pages }} line to change how it loops through the pages:
{{ $pages := where $.Site.RegularPages ".Params.sitemap_exclude" "!=" true }}
{{ range $pages }}
As you can see above, I’ve used a parameter called sitemap_exclude to filter pages. I’ve added this boolean parameter to the front matter of the pages that I want to exclude from the sitemap.
---
sitemap_exclude: true
---
One small note here: you can’t name the parameter sitemap: true or sitemapExclude: true. The first clashes with Hugo’s built-in sitemap front-matter key, and the latter didn’t work for me, presumably because of the uppercase letter; keep it lowercase. At least, that was my experience with Hugo v0.74.
One more side note: you can set changefreq and priority in a page’s front matter, or set a default in config.toml.
---
sitemap:
  changefreq: monthly
  priority: 0.9
---
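If you prefer setting a site-wide default in config.toml instead, it should look something like this (the values here are just examples, following Hugo’s documented sitemap settings):
[sitemap]
  changefreq = "monthly"
  priority = 0.5
  filename = "sitemap.xml"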
NoIndex - NoFollow Meta Tags
To my understanding, the sitemap and the “noindex” robots meta tag complement each other. If you want to keep a page out of search results, you should add noindex, nofollow meta tags and also remove the page from the sitemap. We’ve already excluded some pages from the sitemap; now we need to add <meta name="robots" content="noindex, nofollow" /> to the <head> section of those pages.
To achieve this, you need to add a control mechanism to the head.html partial. I didn’t want to add more front-matter parameters like nofollow, noindex, or noarchive, so I’ve used the same sitemap_exclude parameter I’ve already defined above.
{{ if or (ne .Kind "page") (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}
My if statement may seem complicated because the template language is not easy to read, but it’s actually pretty simple. Let me explain.
.Kind != "page" || (.Kind == "page" && .Params.sitemap_exclude)
if .Kind is not “page”: this covers the “home”, “section”, “taxonomy”, and “term” pages.
if .Kind is “page” and sitemap_exclude is true: this applies to the pages where I’ve set the sitemap_exclude parameter manually.
In the future I may want to stop excluding “section” pages if I add content to them. In that case, I can select “section” pages simply by checking the .Kind variable: eq .Kind "section".
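As a rough sketch (not something this site uses yet), the condition could be loosened like this to leave section pages indexable; the sitemap template would need a matching tweak too, since it currently ranges over regular pages only:
{{/* noindex home, taxonomy and term pages, plus manually excluded regular pages; sections stay indexable */}}
{{ if or (and (ne .Kind "page") (ne .Kind "section")) (and (eq .Kind "page") (eq .Params.sitemap_exclude true)) }}
<meta name="robots" content="noindex, nofollow" />
{{ end }}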
Robots.txt
By default, Hugo creates this file for us, but it allows every page to be crawled, and I want to disallow some. Here are the Hugo docs. Note that robots.txt doesn’t block crawlers; it just tells them “don’t crawl these areas”, and a crawler doesn’t have to obey it. Check robotstxt.org if you need to learn more.
To override the default robots.txt, we need to add this line to config.toml:
enableRobotsTXT = true
Then we can create our own robots.txt file under the /layouts folder. To check it, go to /robots.txt on your site. I disallowed a few single pages and the tags folder, and included the sitemap URL on the last line.
User-agent: *
Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html
Sitemap: {{ "sitemap.xml" | absLangURL }}
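For reference, with a hypothetical baseURL of https://example.com/, the generated /robots.txt should render to something like this:
User-agent: *
Disallow: /tags/
Disallow: /archives.html
Disallow: /series.html
Disallow: /404.html
Sitemap: https://example.com/sitemap.xml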
Conclusion
We’ve added a custom robots.txt, “noindex, nofollow” meta tags, and a custom sitemap.xml to exclude pages that contain little or no content. We’ve excluded taxonomy and term pages like “tags” and “categories”, and also excluded some other single pages by setting a boolean in the page front matter.
For better SEO, we should exclude pages with thin content. Should we, though? These guys say we should improve those pages to get traffic instead of blocking them. I wouldn’t listen to me about SEO; do your own research. What I’ve explained here is how you can override Hugo’s templates.