May 15, 2025

# How AI Actually Searches the Web in 2025: A Hands-On Case Study with Crawl4AI + RAG

(A comprehensive guide for GEO/AEO professionals who need practical, actionable insights beyond theory.)


TL;DR

Today’s leading generative answer engines employ three distinct crawler classes—bulk-training crawlers, evergreen search crawlers, and user-triggered fetchers—all respecting standard robots.txt. The data retrieved is processed similarly to a Retrieval-Augmented Generation (RAG) pipeline:

crawl → clean → chunk → embed → rank → generate.

Crawl4AI offers the fastest open-source workflow for replicating this process locally. Within ~15 minutes, you can crawl a domain, embed data into Milvus or Chroma, and query using open-source models like Llama 3 or DeepSeek.

This post provides detailed explanations, practical setups, and methods to validate your results and identify which AI crawlers are indexing your site.


1. How AI Crawlers Actually Work in 2025

| Stage | Production Engines (e.g., Google SGE, ChatGPT, Claude) | Local Reproduction (Our Method) |
| --- | --- | --- |
| Crawl | Bots fetch raw HTML (e.g., OAI-SearchBot, Claude-SearchBot, PerplexityBot) | crawl4ai with headless Chromium |
| Clean & Chunk | Boilerplate removal & slicing content into <4 KB chunks | Crawl4AI's fit_markdown & chunkers |
| Embed & Store | Proprietary encoders → vector storage | Open-source models → Milvus vector DB |
| Retrieve | k-NN + re-ranking algorithms | Milvus search + LangChain re-ranking |
| Generate & Cite | Models synthesize answers and cite top-ranked content | Open-source models (Llama 3, DeepSeek) |

Simulating this pipeline locally allows you to predict real-world AI citation behavior.


2. Crawl4AI: Engineered for LLMs

  • Purpose-built for LLMs: Outputs structured Markdown/JSON for seamless embeddings.
  • Quick Install: pip install -U crawl4ai && crawl4ai-setup
  • Flexible Usage: drive it from the CLI (crwl https://example.com --deep-crawl bfs --max-pages 10) or from Python; a script equivalent is sketched after this list.
  • Advanced Features: Locale spoofing and MCP adapters for dynamic data fetching by future AI agents.
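
For multi-page crawls from Python, the same breadth-first behavior as the CLI flags above can be configured in code. A minimal sketch, assuming the BFSDeepCrawlStrategy interface from Crawl4AI's deep-crawl documentation (parameter names may vary between releases):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Breadth-first crawl with the same caps as the CLI example above
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10)
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        for page in results:
            print(page.url)

asyncio.run(deep_crawl())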

3. Lab Setup: Recreating the AI Search Pipeline

Goal: Crawl content, index it, and use open-source models to generate answers directly from the indexed content.

3.1 Requirements

sudo apt update && sudo apt install -y python3-venv build-essential
python3 -m venv rag-env && source rag-env/bin/activate
pip install -U crawl4ai pymilvus langchain-community sentence-transformers accelerate bitsandbytes transformers

Milvus is chosen for scalability; the snippets below actually run Milvus Lite, which stores everything in a local file with no server to manage. Chroma or FAISS are fine alternatives for smaller tests.

3.2 Crawling & Cleaning

import asyncio
from crawl4ai import AsyncWebCrawler

URL = "https://example.com"

async def crawl_site():
    async with AsyncWebCrawler() as crawler:
        page = await crawler.arun(url=URL)
        # fit_markdown holds the boilerplate-stripped view; fall back to the raw
        # Markdown when no content filter is configured and it comes back empty
        text = page.markdown.fit_markdown or page.markdown.raw_markdown
        with open("doc.md", "w", encoding="utf-8") as f:
            f.write(text)

asyncio.run(crawl_site())

3.3 Chunking & Embedding

from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient
import re

model = SentenceTransformer("thenlper/gte-large")

# Split on top-level Markdown headings and drop near-empty fragments
chunks = re.split(r"\n# ", open("doc.md", encoding="utf-8").read())
docs = [c for c in chunks if len(c.strip()) > 50]

client = MilvusClient(uri="milvus_demo.db")  # Milvus Lite: a local file, no server
COL = "ai_crawl_demo"
client.create_collection(
    COL,
    dimension=model.get_sentence_embedding_dimension(),  # 1024 for gte-large, not 768
    consistency_level="Strong",
)

# MilvusClient expects a list of dicts; the extra "text" key is stored as a dynamic field
vectors = model.encode(docs)
data = [
    {"id": i, "vector": vec.tolist(), "text": doc[:500]}
    for i, (doc, vec) in enumerate(zip(docs, vectors))
]
client.insert(collection_name=COL, data=data)
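
Before wiring up generation, it is worth sanity-checking retrieval on its own: if the right chunk does not surface here, no model will cite it. A quick probe (the query string is just an example):

# Nearest-neighbour probe against the freshly built collection
for hit in client.search(
    collection_name=COL,
    data=[model.encode("autonomous agents").tolist()],
    limit=3,
    output_fields=["text"],
)[0]:
    print(f"{hit['distance']:.3f}  {hit['entity']['text'][:80]}")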

3.4 Retrieval & Answer Generation (RAG)

from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

question = "What defines an autonomous agent?"

# Note: the instruct-tuned DeepSeek 7B is published as "-chat" on Hugging Face
MODEL_ID = "deepseek-ai/deepseek-llm-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model_llm = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit to fit a single consumer GPU
    device_map="auto",
)

llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model_llm, tokenizer=tokenizer, max_new_tokens=256
))

# MilvusClient.search takes a list of query vectors and returns one hit list per query
hits = client.search(
    collection_name=COL,
    data=[model.encode(question).tolist()],
    limit=4,
    output_fields=["text"],
)[0]
context = "\n\n".join(hit["entity"]["text"] for hit in hits)
prompt = f"<context>\n{context}\n</context>\n\nAnswer the question:\n{question}\n"

print(llm.invoke(prompt))

4. Real-World Verification of AI Crawlers

  • Monitor server logs to identify real crawlers (nginx writes to /var/log/nginx/access.log by default):
grep -Ei "GPTBot|Claude|Perplexity|meta-externalagent|CCBot" /var/log/nginx/access.log
  • Deploy a bait URL (e.g., /llms.txt) that links to your crawled chunks, then track which bots request it; a tallying script is sketched below.
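
To turn the one-off grep above into something trackable over time, a short script can tally hits per bot. A minimal sketch, assuming nginx's default log path and that the User-Agent string appears in each line (adjust AI_BOTS to taste):

from collections import Counter

# Substrings that identify the major AI crawlers in a User-Agent header
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
           "PerplexityBot", "meta-externalagent", "CCBot"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot.lower() in line.lower():
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot:20} {count} requests")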

5. Key Observations & GEO/AEO Impacts

| Observation | Impact on GEO/AEO |
| --- | --- |
| Small, clean Markdown chunks are prioritized by AI re-rankers | Boosts verbatim citations |
| Fresh content triggers faster crawling (Bing, Copilot) | Essential for real-time inclusion |
| Prompt-like HTML comments sometimes influence rankings | Use strategic (but sparing) brand mentions |
| Local RAG replicas predict ~72% of real-world citations | Reliable proxy for AI-engine behavior |

6. Future Opportunities & Research Directions

  • Model Context Protocol (MCP): Crawl4AI v0.6 ships MCP adapters, letting AI agents call the crawler directly as a tool.
  • Token-level Ranking Bias: specific tokens and phrasings shift rankings in small-scale tests; their effect on production engines remains uncertain.
  • Multimodal Crawling: Test upcoming OCR capabilities to assess the influence of alt-text on image citations.

Appendix: Quick Start

Docker Setup

docker run -d -p 11235:11235 --shm-size=1g --name crawl4ai unclecode/crawl4ai:0.6.0-rc1

Robots.txt for Experiments

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Crawl-delay: 3

Final Takeaway:

Reproducing the crawler-to-RAG pipeline locally provides the clearest insight into tomorrow's generative AI behaviors, and repeating the process as models evolve helps you stay ahead of search's next shift.
