Definition

Crawling is the process by which search engine robots traverse a site's pages by following links, to discover and analyze content. It is the step that precedes any indexing: an uncrawled page is an invisible page.

Crawling is step zero of SEO. Before a page can appear in search results, it must be discovered and analyzed by search engine robots: primarily Googlebot, but also Bingbot, Applebot, and the crawlers of LLM providers (GPTBot, PerplexityBot). This process determines a site's entire organic and AI visibility.

How Googlebot explores a site

Googlebot starts from a set of known pages, then follows every hyperlink to discover new URLs. It analyzes HTML content, HTTP headers, robots directives, and structured data. It also revisits known pages to detect updates, at a frequency that tends to increase with the site's authority and publication cadence.
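This discovery loop can be sketched in a few lines. The link extractor below is a minimal illustration (Python standard library only, with placeholder example.com URLs): it pulls the href of every anchor out of a fetched page and resolves it into an absolute URL ready to join the crawl frontier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page's URL
                    self.links.append(urljoin(self.base_url, value))

# Simulate one fetched page: a real crawler would request the HTML,
# then queue each discovered URL for its own visit.
html = '<a href="/pricing">Pricing</a> <a href="https://example.com/blog">Blog</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(html)
print(parser.links)  # ['https://example.com/pricing', 'https://example.com/blog']
```

A production crawler adds a visited set, politeness delays, and robots.txt checks on top of this loop, but the core mechanism is exactly this: parse, resolve, enqueue.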

The most common crawl blockers

Several configurations silently block crawling without teams realizing it: an overly broad Disallow: / directive in robots.txt, meta noindex tags mistakenly applied to important pages, an overly deep architecture (pages more than four clicks from the homepage), redirect chains that slow down exploration, or an exhausted crawl budget on high-volume sites. Each of these issues directly harms indexing and visibility.
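The first of these blockers is easy to reproduce with Python's built-in robots.txt parser. The rules below are hypothetical, but they show how Disallow: / shuts out every page while a properly scoped directive blocks only the intended section.

```python
from urllib.robotparser import RobotFileParser

# An overly broad "Disallow: /" blocks every crawler from every page.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))  # False

# A scoped rule blocks only the /admin/ section, leaving the rest crawlable.
rp_scoped = RobotFileParser()
rp_scoped.parse([
    "User-agent: *",
    "Disallow: /admin/",
])
print(rp_scoped.can_fetch("Googlebot", "https://example.com/pricing"))      # True
print(rp_scoped.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
```

Running this kind of check in CI against the production robots.txt is a cheap guard against the "silent Disallow" scenario.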

Crawl budget: a critical concept for large sites

Google allocates each site a crawl budget defined by two factors: the crawl rate limit (a ceiling on requests to the server) and crawl demand (the perceived popularity of the site's pages). On a site with thousands of pages, poor crawl budget management leads Google to spend time on low-value pages at the expense of strategic ones. The goal is therefore to make priority pages as accessible and well-linked as possible.

Crawling and LLMs: a new dimension

LLMs have their own robots (GPTBot, PerplexityBot, ClaudeBot, Google-Extended). The robots.txt file allows you to authorize or block them individually. The llms.txt file, not yet standardized, is emerging as a complementary convention to indicate to LLMs which sections of the site are available as sources.
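As an illustration, a robots.txt can target each of these user agents individually. The policy below is purely illustrative, not a recommendation: it blocks GPTBot and PerplexityBot while explicitly allowing the other crawlers named above.

```
# Block two LLM crawlers entirely
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Explicitly allow the other LLM-related crawlers
User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

# Default rule for all remaining crawlers (Googlebot, Bingbot, ...)
User-agent: *
Allow: /
```

Note that well-behaved bots honor these rules voluntarily; robots.txt is a convention, not an enforcement mechanism.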

Crawling vs. indexing

Crawling is exploration: the robot discovers and analyzes the page. Indexing is the decision: Google decides whether or not to add the page to its database. A page can be crawled without being indexed — if Google judges its content low-value, duplicated, or if a noindex directive is present. These are two distinct steps with their own optimization levers.
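The noindex case can be checked mechanically. Here is a minimal sketch (standard library only, hypothetical HTML) that flags a robots meta tag carrying a noindex directive — the marker that tells Google to crawl the page but keep it out of the index.

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags <meta name="robots" content="...noindex..."> in a page's HTML."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = (d.get("name") or "").lower()
            content = (d.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
detector = NoindexDetector()
detector.feed(page)
print(detector.noindex)  # True: crawlable, but excluded from the index
```

A site-wide sweep with this kind of detector catches the "noindex mistakenly applied to important pages" blocker mentioned earlier.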

Tools for monitoring crawling

The URL Inspection tool in Google Search Console shows the last crawl date, indexing status, and any detected issues for a given page. For systematic analysis, tools like Screaming Frog or Sitebulb simulate a full crawl and identify blocked pages, redirect chains, and orphan pages.