XML Sitemaps for Modern SEO
Last updated: December 31, 2025
XML sitemaps are the primary communication channel between your website and search engine crawlers. This guide covers the complete sitemap protocol, including image/video/news extensions, Hreflang implementation, validation best practices, and how to structure sitemaps for sites with millions of URLs.
In This Guide
What is an XML sitemap and why does it matter? The strategic role of sitemaps for crawl optimization and AI search
How do I structure a valid sitemap file? XML headers, namespaces, and the four standard URL tags
Which sitemap tags does Google actually use? Understanding lastmod, changefreq, and priority in practice
How do I add images and videos to my sitemap? Image and video sitemap extensions with current best practices
How do I create a Google News sitemap? News sitemap requirements for articles published in the last 48 hours
How do I implement Hreflang in sitemaps? Internationalization for multilingual and multi-region sites
How do I scale sitemaps for large websites? Sitemap index files and segmentation strategies for millions of URLs
What are the most common sitemap errors? Validation failures, entity escaping, and how to fix them
How do I implement sitemaps in Astro? Using @astrojs/sitemap with configuration options
What is an XML sitemap and why does it matter?
An XML sitemap is a machine-readable file that tells search engines which URLs on your site you want crawled and indexed. Search engines also discover pages by following links, but this “crawl-based discovery” alone is insufficient for the scale and velocity of modern content creation.
The XML Sitemap Protocol serves as the primary communication channel between your website’s database and search engine crawlers. It’s not just a compliance checkbox—it’s a tool for:
- Crawl budget optimization — Direct crawlers to your most important pages
- Content discovery — Surface pages that might be orphaned or hidden behind JavaScript
- Freshness signals — Tell search engines when content was last updated
- Internationalization — Declare language relationships between page variants
With AI-driven search experiences (Google’s AI Overviews, Bing Copilot), sitemaps have become even more important. These experiences draw on crawled content, so accurate <lastmod> timestamps help fresh pages get recrawled sooner and reduce the chance that answers are built on stale content.
How do I structure a valid sitemap file?
A valid sitemap must be a UTF-8 encoded XML document. The root element is <urlset>, which contains namespace declarations governing the entire file.
Basic sitemap header
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- URL entries go here -->
</urlset>
Header with all extensions
To use images, video, news, or Hreflang features, declare their namespaces:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <!-- URL entries go here -->
</urlset>
Failing to declare a namespace while using its tags (e.g., using <image:image> without xmlns:image) causes validation errors.
Hard limits
| Constraint | Limit |
|---|---|
| URLs per file | 50,000 maximum |
| File size | 50 MB (52,428,800 bytes) uncompressed |
| URL length | 2,048 characters |
If your site exceeds these limits, use a Sitemap Index file to reference multiple child sitemaps.
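The limits in the table above can be checked before a file is published. The following is a minimal sketch, assuming the URLs and the generated XML are already in memory; `checkSitemapLimits` is a hypothetical helper name, not part of any library.

```javascript
// Limits taken from the table above.
const MAX_URLS_PER_FILE = 50000;
const MAX_BYTES_PER_FILE = 50 * 1024 * 1024; // 52,428,800 bytes, uncompressed
const MAX_URL_LENGTH = 2048;

// Returns a list of human-readable problems; an empty list means the file is within limits.
function checkSitemapLimits(urls, xml) {
  const problems = [];
  if (urls.length > MAX_URLS_PER_FILE) {
    problems.push(`too many URLs: ${urls.length}`);
  }
  // Measure UTF-8 bytes rather than string length, since non-ASCII characters take more than one byte.
  const bytes = new TextEncoder().encode(xml).length;
  if (bytes > MAX_BYTES_PER_FILE) {
    problems.push(`file too large: ${bytes} bytes`);
  }
  for (const url of urls) {
    if (url.length > MAX_URL_LENGTH) {
      problems.push(`URL too long (${url.length} characters): ${url.slice(0, 60)}...`);
    }
  }
  return problems;
}
```

A build step could call this for each generated file and fail the build when the returned list is not empty.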
Which sitemap tags does Google actually use?
Inside <urlset>, each URL is wrapped in a <url> element with four possible child tags.
<loc> (Required)
The absolute URL of the page. Must match the sitemap’s protocol (HTTPS sitemap = HTTPS URLs).
<url>
  <loc>https://www.example.com/page</loc>
</url>
Requirements:
- Absolute URLs only (not relative paths)
- Must be the canonical version
- No redirects, 404s, or noindex pages
<lastmod> (Optional but critical)
The date the page was last meaningfully modified, in W3C Datetime format.
<lastmod>2025-12-31T15:30:00+00:00</lastmod>
Google uses <lastmod> only when it proves consistently and verifiably accurate. If you claim a page was updated but its content is unchanged, search engines may stop trusting your <lastmod> values. Only update this when content actually changes.
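One way to keep <lastmod> honest is to store a content fingerprint with each page and change the date only when the fingerprint changes. This is a sketch under stated assumptions: the `{ hash, lastmod }` record shape and the `resolveLastmod` name are hypothetical, and how the hash is computed is left to your build.

```javascript
// previous: { hash, lastmod } saved from the last build, or undefined for a new page (hypothetical shape).
// currentHash: fingerprint of the page's meaningful content, computed however your build prefers.
// now: the Date to record when the content really changed.
function resolveLastmod(previous, currentHash, now) {
  if (previous && previous.hash === currentHash) {
    // Unchanged content keeps its existing date.
    return previous.lastmod;
  }
  // New or changed content gets the current time in ISO 8601 form, which is valid W3C Datetime.
  return now.toISOString();
}
```

Hashing only the main content, rather than the full HTML, avoids bumping the date when a footer or navigation element changes.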
<changefreq> (Ignored)
Suggests how often the page changes (daily, weekly, monthly, etc.). Google’s documentation states that it ignores this tag; a likely reason is that webmasters historically set everything to “daily” regardless of reality.
<priority> (Ignored)
A value from 0.0 to 1.0 indicating relative importance. Google ignores this as well and relies on its own signals, such as link analysis, rather than self-declared importance.
Recommendation: Omit <changefreq> and <priority> to save file size.
How do I add images and videos to my sitemap?
Image sitemaps
Image sitemaps help Google discover images hidden by JavaScript, lazy loading, or carousels.
Required tags:
- <image:image>: Container for a single image (up to 1,000 per URL)
- <image:loc>: Direct URL to the image file
<url>
  <loc>https://www.example.com/product-page</loc>
  <image:image>
    <image:loc>https://cdn.example.com/images/product.jpg</image:loc>
  </image:image>
</url>
The <image:loc> can point to a CDN domain—this is common for modern architectures.
Deprecated tags (May 2022): Google announced the removal of support for <image:caption>, <image:geo_location>, <image:title>, and <image:license>. These tags no longer have any effect; Google relies on the page content and image metadata instead.
Video sitemaps
Video sitemaps enable rich results (thumbnails, key moments, timestamps) in search results.
Required tags:
- <video:thumbnail_loc>: Thumbnail image URL (minimum 160×90 pixels)
- <video:title>: Video title
- <video:description>: Description (max 2,048 characters)
- <video:player_loc> or <video:content_loc>: Player URL or direct media file URL
Optional tags:
- <video:duration>: Length in seconds (1–28,800)
- <video:publication_date>: When the video was published
- <video:tag>: Keywords (up to 32 per video)
- <video:rating>: Rating from 0.0 to 5.0
Deprecated tags: <video:category>, <video:gallery_loc>, <video:price>, <video:tvshow>
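The required and optional tags listed above can be assembled into a video entry as sketched below. The `<video:video>` container comes from Google's video sitemap schema; the input object shape and the `buildVideoEntry` name are assumptions made for illustration.

```javascript
// Escapes the five characters XML reserves; the ampersand must be handled first.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// video (hypothetical shape):
// { pageUrl, thumbnailUrl, title, description, contentUrl?, playerUrl?, durationSeconds? }
function buildVideoEntry(video) {
  for (const field of ["pageUrl", "thumbnailUrl", "title", "description"]) {
    if (!video[field]) throw new Error(`missing required field: ${field}`);
  }
  if (!video.contentUrl && !video.playerUrl) {
    throw new Error("either contentUrl or playerUrl is required");
  }
  if (video.description.length > 2048) {
    throw new Error("description exceeds 2,048 characters");
  }
  const lines = [
    "<url>",
    `  <loc>${escapeXml(video.pageUrl)}</loc>`,
    "  <video:video>",
    `    <video:thumbnail_loc>${escapeXml(video.thumbnailUrl)}</video:thumbnail_loc>`,
    `    <video:title>${escapeXml(video.title)}</video:title>`,
    `    <video:description>${escapeXml(video.description)}</video:description>`,
  ];
  if (video.contentUrl) {
    lines.push(`    <video:content_loc>${escapeXml(video.contentUrl)}</video:content_loc>`);
  }
  if (video.playerUrl) {
    lines.push(`    <video:player_loc>${escapeXml(video.playerUrl)}</video:player_loc>`);
  }
  if (video.durationSeconds !== undefined) {
    const d = video.durationSeconds;
    if (!Number.isInteger(d) || d < 1 || d > 28800) {
      throw new Error("duration must be an integer from 1 to 28800 seconds");
    }
    lines.push(`    <video:duration>${d}</video:duration>`);
  }
  lines.push("  </video:video>", "</url>");
  return lines.join("\n");
}
```

The resulting string belongs inside a <urlset> that declares the xmlns:video namespace.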
How do I create a Google News sitemap?
News sitemaps are specifically for articles published in the last 48 hours. They function as a “breaking news” feed for Google. After 48 hours, URLs should be removed from the News sitemap (but remain in your standard sitemap).
Required tags
<url>
  <loc>https://www.example.com/news/article</loc>
  <news:news>
    <news:publication>
      <news:name>The Daily Times</news:name>
      <news:language>en</news:language>
    </news:publication>
    <news:publication_date>2025-12-31T10:00:00+00:00</news:publication_date>
    <news:title>Breaking News Headline</news:title>
  </news:news>
</url>
Critical: The <news:name> must exactly match your name in Google News Publisher Center. “The Daily Times” ≠ “Daily Times”.
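The 48-hour window can be enforced when the News sitemap is generated. A minimal sketch, assuming each article record carries an ISO 8601 `publishedAt` string; the record shape and the `selectNewsArticles` name are hypothetical.

```javascript
const NEWS_WINDOW_MS = 48 * 60 * 60 * 1000; // 48 hours in milliseconds

// articles: [{ url, title, publishedAt }], where publishedAt is an ISO 8601 string (hypothetical shape).
// now: the Date the sitemap is being generated.
function selectNewsArticles(articles, now) {
  return articles.filter((article) => {
    const ageMs = now.getTime() - new Date(article.publishedAt).getTime();
    // Keep articles that are already published and no older than 48 hours.
    return ageMs >= 0 && ageMs <= NEWS_WINDOW_MS;
  });
}
```

Articles that fall outside the window drop out of the News sitemap on the next build while staying in the standard sitemap.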
How do I implement Hreflang in sitemaps?
For multilingual or multi-region sites, Hreflang prevents duplicate content issues and ensures users see the correct language version.
Implementation rules
- Self-referencing — Every page must list itself as an alternate
- Bi-directional — If Page A links to Page B, Page B must link back to Page A
- X-default — Specify a fallback for users whose language isn’t matched
<url>
  <loc>https://www.example.com/english-page</loc>
  <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/english-page"/>
  <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/deutsch-page"/>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://www.example.com/"/>
</url>
Implementing Hreflang in sitemaps (rather than HTML <link> tags) reduces page weight and centralizes language management.
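Because the annotations must be self-referencing and bi-directional, every variant needs its own <url> entry carrying the full set of alternates. The sketch below generates that set from one list of variants; `buildHreflangEntries` and the variant object shape are assumptions for illustration.

```javascript
// Escapes the five characters XML reserves; the ampersand must be handled first.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// variants: [{ hreflang, href }] describing every language version of one page (hypothetical shape).
function buildHreflangEntries(variants) {
  // The same block of alternates is repeated in every entry, which makes the
  // annotations self-referencing and bi-directional by construction.
  const links = variants
    .map((v) => `  <xhtml:link rel="alternate" hreflang="${escapeXml(v.hreflang)}" href="${escapeXml(v.href)}"/>`)
    .join("\n");
  // Emit one <url> per distinct address, even if two hreflang values share a URL.
  const uniqueHrefs = [...new Set(variants.map((v) => v.href))];
  return uniqueHrefs
    .map((href) => `<url>\n  <loc>${escapeXml(href)}</loc>\n${links}\n</url>`)
    .join("\n");
}

const hreflangXml = buildHreflangEntries([
  { hreflang: "en", href: "https://www.example.com/english-page" },
  { hreflang: "de", href: "https://www.example.com/deutsch-page" },
  { hreflang: "x-default", href: "https://www.example.com/" },
]);
```

Generating all entries from a single list means a variant cannot be added to one page's alternates without also appearing on the others.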
How do I scale sitemaps for large websites?
Sitemap Index files
When you exceed 50,000 URLs or 50MB, use a Sitemap Index—a parent file that lists child sitemaps:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2025-10-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2025-09-21</lastmod>
  </sitemap>
</sitemapindex>
Segmentation strategies
Don’t split sitemaps arbitrarily (sitemap1.xml, sitemap2.xml). Instead, segment by category to enable debugging in Google Search Console:
| Strategy | Example Files | Benefit |
|---|---|---|
| By page type | sitemap-products.xml, sitemap-blog.xml | Identify if indexation issues affect specific templates |
| By freshness | sitemap-news.xml, sitemap-archive-2024.xml | Ensure fresh content is crawled frequently |
| By media | sitemap-video.xml | Track Rich Results performance separately |
Mega-sites (millions of URLs)
A single Sitemap Index can reference up to 50,000 sitemaps. Since each sitemap holds 50,000 URLs, one Index file supports 2.5 billion URLs—sufficient for virtually any website.
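Splitting a large segment into 50,000-URL files and listing them in an index can be sketched as follows. The file naming pattern (`sitemap-<segment>-<n>.xml`) and the helper names `planSegment` and `buildSitemapIndex` are assumptions, chosen to keep segment names visible in Google Search Console.

```javascript
const MAX_URLS_PER_SITEMAP = 50000;
const MAX_SITEMAPS_PER_INDEX = 50000;

// Escapes the five characters XML reserves; the ampersand must be handled first.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Splits one segment's URLs into files named after the segment, e.g. sitemap-products-1.xml.
function planSegment(baseUrl, segment, urls) {
  const files = [];
  for (let i = 0; i < urls.length; i += MAX_URLS_PER_SITEMAP) {
    files.push({
      loc: `${baseUrl}/sitemap-${segment}-${files.length + 1}.xml`,
      urls: urls.slice(i, i + MAX_URLS_PER_SITEMAP),
    });
  }
  return files;
}

// Builds the parent index that references every child sitemap.
function buildSitemapIndex(files) {
  if (files.length > MAX_SITEMAPS_PER_INDEX) {
    throw new Error(`an index can reference at most ${MAX_SITEMAPS_PER_INDEX} sitemaps`);
  }
  const entries = files.map(
    (file) => `  <sitemap>\n    <loc>${escapeXml(file.loc)}</loc>\n  </sitemap>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}
```

For example, a "products" segment with 120,000 URLs yields three files holding 50,000, 50,000, and 20,000 URLs, all listed in one index.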
What are the most common sitemap errors?
Entity escaping
XML requires special characters to be escaped. Unescaped characters, especially ampersands in query strings, are among the most common causes of sitemap failures.
| Character | Escape Code |
|---|---|
| & | &amp; |
| ' | &apos; |
| " | &quot; |
| > | &gt; |
| < | &lt; |
Wrong: https://example.com/product?id=1&sort=asc
Correct: https://example.com/product?id=1&amp;sort=asc
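Escaping is easiest to get right when every value passes through one helper before it is written into the file. A minimal sketch; `escapeXml` is a hypothetical helper name, and it assumes its input has not been escaped already.

```javascript
// Replaces the five characters XML reserves with their entity references.
// The ampersand is replaced first so that the entities added afterwards are not escaped twice.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

const escapedLoc = escapeXml("https://example.com/product?id=1&sort=asc");
// escapedLoc is "https://example.com/product?id=1&amp;sort=asc"
```

Running the helper twice on the same value would double-escape it, so apply it once, at the point where the XML is written.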
Namespace declaration errors
Using extension tags without defining the namespace in the header causes parsing failures. Always declare xmlns:image, xmlns:video, etc.
Mixed signals
Including URLs that are blocked by robots.txt or tagged with noindex creates conflicting signals. Your sitemap should only contain canonical, indexable, 200 OK URLs.
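The rule above can be applied as a filter before URLs reach the sitemap. This sketch assumes you already have a crawl or CMS record per page; the record fields and the `isSitemapEligible` name are hypothetical.

```javascript
// page (hypothetical record): { url, status, noindex, blockedByRobots, canonical }
function isSitemapEligible(page) {
  return (
    page.status === 200 &&      // no redirects, 404s, or server errors
    !page.noindex &&            // not excluded by a noindex directive
    !page.blockedByRobots &&    // not disallowed in robots.txt
    page.canonical === page.url // the page is its own canonical version
  );
}

const crawledPages = [
  { url: "https://www.example.com/a", status: 200, noindex: false, blockedByRobots: false, canonical: "https://www.example.com/a" },
  { url: "https://www.example.com/b", status: 301, noindex: false, blockedByRobots: false, canonical: "https://www.example.com/b" },
  { url: "https://www.example.com/c", status: 200, noindex: true, blockedByRobots: false, canonical: "https://www.example.com/c" },
  { url: "https://www.example.com/d?ref=x", status: 200, noindex: false, blockedByRobots: false, canonical: "https://www.example.com/d" },
];
const sitemapUrls = crawledPages.filter(isSitemapEligible).map((page) => page.url);
```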
Compression and submission
- Use gzip compression (.xml.gz) to reduce bandwidth
- Submit sitemaps via robots.txt: Sitemap: https://example.com/sitemap.xml
- Or submit through Google Search Console
Note: Google deprecated the sitemap ping endpoint (google.com/ping?sitemap=...) in June 2023. It now returns 404.
How do I implement sitemaps in Astro?
Astro provides an official @astrojs/sitemap integration that automatically generates sitemaps at build time.
Installation
npx astro add sitemap
Configuration
In astro.config.mjs, you must set your site URL:
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://www.example.com',
  integrations: [sitemap()],
});
Output files
The integration generates:
- sitemap-index.xml: Links to all numbered sitemap files
- sitemap-0.xml: Contains your page URLs
For extremely large sites, additional files (sitemap-1.xml, etc.) are created automatically.
Configuration options
sitemap({
  // Base name for generated files (default: 'sitemap', which yields sitemap-index.xml and sitemap-0.xml)
  filenameBase: 'sitemap',
  // Maximum entries per file (default: 45000)
  entryLimit: 10000,
  // Exclude unused namespaces for smaller files
  namespaces: {
    news: false,
    video: false,
  },
  // Transform entries before writing
  serialize(item) {
    // Add lastmod to all entries. Caution: this stamps the build time on every page,
    // so prefer a real per-page modification date where one is available.
    item.lastmod = new Date().toISOString();
    return item;
  },
})
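Since <lastmod> should change only when content does, the serialize option can read dates from a lookup instead of using the build time. This is a sketch, not the integration's prescribed approach: `lastModifiedByUrl` and `serializeWithRealLastmod` are hypothetical names, and it assumes `item.url` holds the page's absolute URL.

```javascript
// Hypothetical map of URL to the last real content change,
// for example exported from a CMS or derived from git history.
const lastModifiedByUrl = new Map([
  ["https://www.example.com/blog/first-post/", "2025-10-01T09:00:00.000Z"],
]);

// Shaped so it can be passed as the serialize option of sitemap().
function serializeWithRealLastmod(item) {
  const lastmod = lastModifiedByUrl.get(item.url);
  // Leave lastmod unset when no real date is known rather than stamping the build time.
  return lastmod ? { ...item, lastmod } : item;
}
```

It would be wired in as sitemap({ serialize: serializeWithRealLastmod }) in astro.config.mjs.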
Filtering pages
Use the filter option to exclude pages:
sitemap({
  filter: (page) => !page.includes('/private/'),
})
Sources
- sitemaps.org Protocol Specification
- Google: Build and Submit a Sitemap
- Google: Image Sitemaps
- Google: Video Sitemaps
- Google: News Sitemaps
- Google: Sitemap Index Files
- @astrojs/sitemap Documentation
- Yoast: Lastmod in XML Sitemaps