Pure LLM approach

How would you go about this problem?

Given some topic x, you want to extract data y from a list of website base URLs. Is there any built-in functionality for this? If not, how do you solve it?

I have attempted crawling entire sites and one-shot prompting the aggregated content to an LLM, given a context window of 1 million tokens or higher. It seems to work okay, but I'm positive there are techniques to strip tags / unrelated metadata from each URL scraped within every site.

Then there's the two-step approach: crawl all links up to a fixed max_pages. But since I am building an LLM approach that is language agnostic, I cannot rely on keywords for heuristics. I literally have to crawl all links along with the data around each href, feed that into an LLM to determine what is relevant, and then crawl those targeted URLs.
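One way to sketch the second step of that approach: once your Playwright crawl has collected each href plus its surrounding text, you can pack them into a compact numbered prompt and map the model's reply back to URLs. The link shape (`href` + `context`), the prompt wording, and the `topic` parameter are assumptions for illustration, not a Crawlee API.

```javascript
// Sketch: turn raw <a> data into a compact, language-agnostic prompt
// for an LLM to triage relevant links. The link objects are assumed
// to come from your own Playwright crawl.
function buildLinkTriagePrompt(topic, links, maxLinks = 50) {
  const numbered = links
    .slice(0, maxLinks)
    .map((l, i) => `${i + 1}. ${l.href} - "${l.context.trim().slice(0, 120)}"`)
    .join('\n');
  return [
    `Topic: ${topic}`,
    `Below is a numbered list of links with surrounding text.`,
    `Return only the numbers of links likely relevant to the topic, comma-separated.`,
    '',
    numbered,
  ].join('\n');
}

// Map a reply like "1, 3" back to concrete URLs for the next crawl round.
function parseSelectedLinks(reply, links) {
  return reply
    .split(/[,\s]+/)
    .map(Number)
    .filter((n) => Number.isInteger(n) && n >= 1 && n <= links.length)
    .map((n) => links[n - 1].href);
}
```

Because the model only sees numbers and short context snippets, this keeps token usage low regardless of the page language.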

FYI: I'm using the JS version with Playwright.
Solution
Yeah, Crawlee doesn’t have a built-in way to strip irrelevant stuff like headers or ads automatically. You’re not missing anything — cleanup is still a manual step.

You can use libraries like readability or unfluff to extract the main content, or filter DOM sections manually (e.g. removing .footer, .nav, etc.). For trickier cases, you can even use the LLM to clean up pages before extraction.
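As a minimal sketch of that manual cleanup, here is a dependency-free pass that drops tags which rarely carry main content before the HTML goes to the LLM. Regex-based HTML handling is fragile (it can over-match, e.g. `<nav` also matching `<navigation`), so treat this as a rough first cut; readability or unfluff are more robust in practice. The tag list is an assumption you would tune per site.

```javascript
// Tags that usually hold boilerplate rather than main content.
const NOISE_TAGS = ['script', 'style', 'nav', 'footer', 'header', 'aside', 'iframe'];

function stripNoise(html) {
  let out = html;
  for (const tag of NOISE_TAGS) {
    // Remove whole <tag>...</tag> blocks, case-insensitively and non-greedily.
    out = out.replace(new RegExp(`<${tag}[\\s\\S]*?</${tag}>`, 'gi'), '');
  }
  // Collapse leftover whitespace to shrink the token count.
  return out.replace(/\s{2,}/g, ' ').trim();
}
```

Shrinking the HTML this way directly reduces what you pay per page in that one-shot large-context prompt.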

Embedding-based filtering is also a nice option if you want to skip irrelevant pages before sending to the LLM, but it adds complexity. You're on the right track — it's just about fine-tuning the cleanup now.
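The embedding-based filtering idea can be sketched like this: embed the topic once, embed each page (or its cleaned text), and keep only pages above a similarity threshold before any LLM call. How you produce the embeddings (an API, a local model) is left open; the vectors, the page shape, and the 0.75 threshold below are illustrative assumptions.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Keep only pages whose embedding is close enough to the topic's.
// `pages` is assumed to be [{ url, embedding }, ...].
function filterByTopic(topicVec, pages, threshold = 0.75) {
  return pages.filter((p) => cosineSimilarity(topicVec, p.embedding) >= threshold);
}
```

Since embeddings work across languages, this fits the language-agnostic constraint better than keyword heuristics would.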