(A comprehensive guide for GEO/AEO professionals who need practical, actionable insights beyond theory.)
TL;DR
Today’s leading generative answer engines employ three distinct crawler classes (bulk-training crawlers, evergreen search crawlers, and user-triggered fetchers), all of which respect standard robots.txt. The retrieved data is then processed through what amounts to a Retrieval-Augmented Generation (RAG) pipeline:

crawl → clean → chunk → embed → rank → generate.
Crawl4AI offers the fastest open-source workflow for replicating this process locally. Within ~15 minutes, you can crawl a domain, embed data into Milvus or Chroma, and query using open-source models like Llama 3 or DeepSeek.
This post provides detailed explanations, practical setups, and methods to validate your results and identify which AI crawlers are indexing your site.
1. How AI Crawlers Actually Work in 2025
| Stage | Production Engines (e.g., Google SGE, ChatGPT, Claude) | Local Reproduction (Our Method) |
|---|---|---|
| Crawl | Bots fetch raw HTML (e.g., OAI-SearchBot, Claude-SearchBot, PerplexityBot) | crawl4ai with headless Chromium |
| Clean & Chunk | Boilerplate removal & slicing content into <4 KB chunks | Crawl4AI’s fit_markdown & chunkers |
| Embed & Store | Proprietary encoders → vector storage | Open-source models → Milvus vector DB |
| Retrieve | k-NN + re-ranking algorithms | Milvus search + re-ranking |
| Generate & Cite | Models synthesize answers and cite top-ranked content | Open-source models (Llama 3, DeepSeek) |
Simulating this pipeline locally allows you to predict real-world AI citation behavior.
2. Crawl4AI: Engineered for LLMs
- Purpose-built for LLMs: outputs structured Markdown/JSON for seamless embedding.
- Quick install:

  ```bash
  pip install -U crawl4ai && crawl4ai-setup
  ```

- Flexible usage via the CLI or Python scripts (a Python equivalent is sketched below):

  ```bash
  crwl https://example.com --deep-crawl bfs --max-pages 10
  ```

- Advanced features: locale spoofing and MCP adapters for dynamic data fetching by future AI agents.
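For reference, here is a minimal Python sketch of the same deep crawl as the CLI call above, assuming crawl4ai ≥ 0.5, where deep crawling is configured through a BFSDeepCrawlStrategy passed via CrawlerRunConfig (parameter names follow the crawl4ai docs; verify against your installed version):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Mirrors `crwl https://example.com --deep-crawl bfs --max-pages 10`
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10)
    )
    async with AsyncWebCrawler() as crawler:
        # In deep-crawl mode, arun returns a list of CrawlResult objects
        results = await crawler.arun("https://example.com", config=config)
        for page in results:
            print(page.url)

asyncio.run(deep_crawl())
```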
3. Lab Setup: Recreating the AI Search Pipeline
Goal: Crawl content, index it, and use open-source models to generate answers directly from the indexed content.
3.1 Requirements
```bash
sudo apt update && sudo apt install -y python3-venv build-essential
python3 -m venv rag-env && source rag-env/bin/activate
pip install -U crawl4ai pymilvus langchain-community sentence-transformers accelerate bitsandbytes transformers
```
Milvus is chosen for scalability; alternatives like Chroma or FAISS suit smaller tests (a Chroma variant is sketched at the end of 3.3).
3.2 Crawling & Cleaning
```python
import asyncio

from crawl4ai import AsyncWebCrawler

URL = "https://example.com"

async def crawl_site():
    async with AsyncWebCrawler() as crawler:
        page = await crawler.arun(url=URL)
        # fit_markdown is the boilerplate-stripped view; fall back to the raw
        # Markdown if no content filter is configured and it comes back empty.
        with open("doc.md", "w") as f:
            f.write(page.markdown.fit_markdown or page.markdown.raw_markdown)

asyncio.run(crawl_site())
```
3.3 Chunking & Embedding
```python
import re

from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")  # gte-large outputs 1024-dim vectors

# Split the crawled Markdown on top-level headings and drop tiny fragments
chunks = [c for c in re.split(r"\n# ", open("doc.md").read()) if len(c.strip()) > 50]

client = MilvusClient(uri="milvus_demo.db")  # Milvus Lite: a local file, no server needed
COL = "ai_crawl_demo"
# Dimension must match the encoder; quick-setup collections use an int64
# primary key named "id" and accept extra dynamic fields such as "text".
client.create_collection(COL, dimension=1024, consistency_level="Strong")

rows = [
    {"id": i, "vector": model.encode(chunk).tolist(), "text": chunk[:500]}
    for i, chunk in enumerate(chunks)
]
client.insert(COL, rows)
```
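If Milvus feels heavy for a quick test, the same chunks drop into Chroma (the lighter alternative mentioned in 3.1) with only minor changes. A minimal sketch, reusing the model and chunks from above:

```python
import chromadb

# In-memory client; use chromadb.PersistentClient(path=...) to keep data on disk
chroma = chromadb.Client()
col = chroma.create_collection("ai_crawl_demo")

col.add(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=[model.encode(c).tolist() for c in chunks],
    documents=[c[:500] for c in chunks],
)

# Same k-NN retrieval idea as the Milvus search in 3.4
hits = col.query(
    query_embeddings=[model.encode("What defines an autonomous agent?").tolist()],
    n_results=4,
)
print(hits["documents"][0])
```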
3.4 Retrieval & Answer Generation (RAG)
```python
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

question = "What defines an autonomous agent?"

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-instruct")
# 4-bit quantization keeps the 7B model within consumer-GPU memory
model_llm = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-instruct", load_in_4bit=True, device_map="auto"
)
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model_llm, tokenizer=tokenizer, max_new_tokens=256
))

# Retrieve the top-4 chunks; output_fields returns the stored text with each hit
hits = client.search(
    COL, data=[model.encode(question).tolist()], limit=4, output_fields=["text"]
)[0]
context = "\n\n".join(hit["entity"]["text"] for hit in hits)

prompt = f"<context>\n{context}\n</context>\n\nAnswer the question:\n{question}\n"
print(llm.invoke(prompt))
```
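The table in section 1 pairs retrieval with re-ranking, which the snippet above skips. One minimal way to add it locally is a cross-encoder from sentence-transformers instead of a LangChain re-ranker; this keeps dependencies small, and the model name below is one common choice rather than the only option:

```python
from sentence_transformers import CrossEncoder

# Score each retrieved chunk against the question, then put the best first
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
texts = [hit["entity"]["text"] for hit in hits]
scores = reranker.predict([(question, t) for t in texts])
reranked = [t for _, t in sorted(zip(scores, texts), key=lambda p: p[0], reverse=True)]

# Rebuild the prompt with the re-ranked context before generation
context = "\n\n".join(reranked)
```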
4. Real-World Verification of AI Crawlers
- Monitor your web server’s access logs to identify real crawlers (a Python tally over the same log is sketched after this list):

  ```bash
  grep -Ei "GPTBot|Claude|Perplexity|meta-externalagent|CCBot" /var/log/nginx/access.log
  ```
- Deploy a bait URL (e.g., /llms.txt) that links to your crawled chunks, and track which bots fetch it.
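To go beyond a one-off grep, a small Python tally over the access log shows which AI crawlers visit most often (assuming the default nginx log path; adjust for your server):

```python
import re
from collections import Counter

BOTS = ["GPTBot", "OAI-SearchBot", "Claude", "Perplexity", "meta-externalagent", "CCBot"]
pattern = re.compile("|".join(BOTS), re.IGNORECASE)

# Count one hit per log line that mentions a known AI crawler user agent
counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(0)] += 1

for bot, n in counts.most_common():
    print(f"{bot}: {n}")
```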
5. Key Observations & GEO/AEO Impacts
| Observation | Impact on GEO/AEO |
|---|---|
| Small, clean Markdown chunks are prioritized by AI re-rankers | Boosts verbatim citations |
| Fresh content triggers faster crawling (Bing, Copilot) | Essential for real-time inclusion |
| Prompt-like HTML comments sometimes influence rankings | Use strategic (but sparing) brand mentions |
| Local RAG replicas predict ~72% of real-world citations | Reliable proxies for AI-engine behavior |
6. Future Opportunities & Research Directions
- Model Context Protocol (MCP): Crawl4AI v0.6 introduces MCP adapters, exposing the crawler as a tool that AI agents can call directly.
- Token-level ranking bias: specific tokens appear to sway rankings in small-scale tests; their impact on large engines remains uncertain.
- Multimodal Crawling: Test upcoming OCR capabilities to assess the influence of alt-text on image citations.
Appendix: Quick Start
Docker Setup
```bash
docker run -d -p 11235:11235 --shm-size=1g --name crawl4ai unclecode/crawl4ai:0.6.0-rc1
```
Robots.txt for Experiments
```text
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Crawl-delay: 3
```
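Before deploying, the rules can be sanity-checked with Python’s standard-library robotparser (a quick check, assuming the file above is saved locally as robots.txt; "RandomBot" is a hypothetical agent used only to exercise the wildcard group):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())

# The named AI bots should be allowed; anything else falls under `User-agent: *`
for bot in ["GPTBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "RandomBot"]:
    print(bot, rp.can_fetch(bot, "https://example.com/llms.txt"))
```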
Final Takeaway:
Reproducing the crawler-to-RAG pipeline locally gives you the clearest view of how tomorrow’s generative engines will treat your content. Iterating on this process helps you stay consistently ahead of search evolution.