Rói · 3mo ago

Pure LLM approach

How would you go about this problem? Given topic x, you want to extract data y from a list of website base URLs. Is there any built-in functionality for this? If not, how do you solve it?

I have attempted crawling entire sites and one-shot prompting the aggregated content to an LLM with a context window of 1 million tokens or more. That seems to work okay, but I'm positive there are techniques to strip tags and unrelated metadata from each URL scraped within every site.

Then there's the two-step approach: crawl all links with a fixed max_pages. But since I'm building an LLM approach that is language agnostic, I can't rely on keywords for heuristics. I literally have to crawl all links with the data around each href, feed that into an LLM to determine what is relevant, and then crawl those targeted URLs.

FYI: using the JS version with Playwright.
thenetaji · 3mo ago
To solve this, I’d crawl each base URL, collect all internal links with some context (like surrounding text or headings), and then use the LLM to decide which ones are actually relevant to your topic. Once you have those filtered links, go back and scrape just those pages. Clean them up (remove headers, footers, etc.), and feed chunks into the LLM to extract only the info you need. There’s no built-in solution, but combining Playwright for crawling and an LLM for relevance filtering + extraction works well — especially when you can’t rely on keywords or language.
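A rough sketch of that discovery phase with Crawlee's PlaywrightCrawler. The selector list and the 200-character context window are just examples to tune, and filterRelevantLinks stands in for your own LLM call (not shown):

```ts
// Discovery phase sketch: collect internal links plus nearby text for LLM relevance filtering.
// Assumptions: selectors and limits are illustrative; filterRelevantLinks() is your own LLM step.
import { PlaywrightCrawler, Dataset } from 'crawlee';

const discoveryCrawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 50, // the fixed max_pages from the two-step approach
  async requestHandler({ page, request }) {
    // Grab every link together with its anchor text and a bit of surrounding context,
    // so the LLM has something to judge relevance on besides the raw href.
    const links = await page.$$eval('a[href]', (anchors) =>
      anchors.map((a) => ({
        href: (a as HTMLAnchorElement).href,
        text: a.textContent?.trim() ?? '',
        context: a.closest('p, li, section, article')?.textContent?.trim().slice(0, 200) ?? '',
      })),
    );
    await Dataset.pushData({ url: request.loadedUrl, links });
  },
});

await discoveryCrawler.run(['https://example.com']);

// Step 2 (not shown): read the dataset back, ask the LLM which hrefs match the topic,
// then run a second crawler over only those URLs for the actual extraction.
```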
Rói (OP) · 3mo ago
I actually tried splitting up the pages, but found that the context never reaches a million tokens anyway. And your exact proposal of doing an initial discovery phase is the approach that has worked best so far. Just wanted to know if I was missing something. I was browsing through the docs, but is there no trivial way to strip irrelevant data from pages so I can feed LLM-friendly data into the model? Crawlee JS + Playwright + LLM prompt extraction. I do need the header and footer, but not for all the pages, so there I could save some redundant data. I think right now it's mostly about the cleanup phase, and I was wondering if Crawlee had some option to automatically filter unrelated stuff from the HTML, or even some embedding model that lets it navigate and include/exclude pages on the fly.
Solution
thenetaji · 3mo ago
Yeah, Crawlee doesn’t have a built-in way to strip irrelevant stuff like headers or ads automatically. You’re not missing anything — cleanup is still a manual step. You can use libraries like readability or unfluff to extract the main content, or filter DOM sections manually (like removing .footer, .nav, etc.). For trickier cases, you can even use the LLM to clean up pages before extraction. Embedding-based filtering is also a nice option if you want to skip irrelevant pages before sending to the LLM, but it adds complexity. You're on the right track — it's just about fine-tuning the cleanup now.
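For the cleanup step, here's a minimal sketch using jsdom + @mozilla/readability, with the manual .footer/.nav pruning mentioned above shown as an example. The selector list is an assumption you'd tune per site:

```ts
// Cleanup sketch: prune obvious page chrome in the browser, then run Readability on the
// remaining HTML to get LLM-friendly text. The selector list below is only an example.
import { JSDOM } from 'jsdom';
import { Readability } from '@mozilla/readability';

export function extractMainContent(html: string, url: string): string {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  // Fall back to the raw body text if Readability can't find a main article.
  return (article?.textContent ?? dom.window.document.body.textContent ?? '').trim();
}

// Inside the extraction crawler's requestHandler, something like:
//   await page.evaluate(() => {
//     document
//       .querySelectorAll('nav, footer, header, script, style, .footer, .nav')
//       .forEach((el) => el.remove());
//   });
//   const text = extractMainContent(await page.content(), request.loadedUrl ?? request.url);
//   // ...then chunk `text` and send it to the LLM for extraction.
```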
Rói (OP) · 3mo ago
Thanks @thenetaji, much appreciated!
