Miro · 3mo ago

enqueue_links does not find any links

Hello, I've run into a weird issue where `enqueue_links` does not find any links on a webpage, specifically https://nanlab.tech, no matter which strategy I choose. I also tried `extract_links`, which finds all links with the `all` strategy, but with `same-origin` and `same-hostname` no links are extracted, and the `same-domain` strategy raises an error. I am using the latest version of Crawlee for Python (0.6.10), and for scraping I am using Playwright. Any idea what the issue might be? Here is the handler:

```python
@self.crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:  # type: ignore
    text = await context.page.content()
    self._data[context.request.url.strip()] = {
        "html": text,
        "timestamp": (
            datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        ),
    }
    await asyncio.sleep(self._sleep_between_requests)
    links = await context.extract_links()
    print("---------------------------------------------------", len(links), links)
    await context.enqueue_links(exclude=[self._blocked_extensions])
```

I am also setting `max_requests` to 100 and `max_crawl_depth` to 2 when creating the crawler.
Mantisus · 3mo ago
Hey, @Miro. The `same-origin` and `same-hostname` strategies probably find nothing because the site redirects from https://nanlab.tech/ to https://www.nanlab.tech/. Those are different hostnames, so extracting no links is the correct behavior. `enqueue_links` uses the same link-extraction logic as `extract_links`, so I'd assume the problem there is in your `exclude` parameter.
> with strategy same-domain there is an error.
That one is a bug, thank you for reporting it.
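The hostname mismatch is easy to see with the standard library alone (a quick illustration of why the strategies reject the links, not Crawlee code):

```python
from urllib.parse import urlparse

start = urlparse("https://nanlab.tech/")
final = urlparse("https://www.nanlab.tech/")

# same-hostname / same-origin style checks compare each candidate link's
# hostname against the loaded page's hostname; the redirect changes it
print(start.hostname)                    # nanlab.tech
print(final.hostname)                    # www.nanlab.tech
print(start.hostname == final.hostname)  # False
```

Since the crawl starts at the pre-redirect URL, every link on the rendered page points at the `www.` hostname and gets filtered out.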
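On the `exclude` side: the handler passes `[self._blocked_extensions]`, a single-element list wrapping the whole collection, where a list with one pattern per entry is presumably intended. A rough stdlib-only sketch of per-extension glob patterns (the extension list here is hypothetical, and `fnmatch` merely stands in for whatever matching Crawlee applies):

```python
from fnmatch import fnmatch

# Hypothetical blocked-extension list (an assumption, not from the thread)
blocked_extensions = [".pdf", ".zip", ".png"]

# One glob pattern per extension, instead of one entry holding the whole list
patterns = [f"**/*{ext}" for ext in blocked_extensions]

url = "https://www.nanlab.tech/assets/brochure.pdf"
print(any(fnmatch(url, p) for p in patterns))  # True: URL would be excluded

page = "https://www.nanlab.tech/about"
print(any(fnmatch(page, p) for p in patterns))  # False: URL passes through
```

Printing `self._blocked_extensions` before the `enqueue_links` call would confirm what shape it actually has.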
