enqueue_links does not find any links
Hello, I encountered a weird issue where enqueue_links does not find any links on a webpage, specifically https://nanlab.tech. No links are found regardless of which strategy I choose. I also tried extract_links, which finds all links with strategy all, but with strategies same-origin and same-hostname no links are extracted, and with strategy same-domain there is an error. I am using the latest version of crawlee for Python (0.6.10), and for scraping I am using Playwright. Any idea what might be the issue?
Here is the handler:
import asyncio
import datetime

@self.crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:  # type: ignore
    # Store the rendered HTML together with a capture timestamp.
    text = await context.page.content()
    self._data[context.request.url.strip()] = {
        "html": text,
        "timestamp": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }
    await asyncio.sleep(self._sleep_between_requests)
    links = await context.extract_links()
    print("---------------------------------------------------", len(links), links)
    await context.enqueue_links(exclude=[self._blocked_extensions])
I am also setting max_requests to 100 and max_crawl_depth to 2 when creating the crawler.
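For completeness, this is roughly how the crawler is created; a sketch assuming the 0.6.x constructor keywords max_requests_per_crawl and max_crawl_depth:

from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    max_requests_per_crawl=100,  # hard cap on processed requests
    max_crawl_depth=2,           # follow links at most two hops deep
)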
Hey, @Miro
The strategies same-origin and same-hostname probably don't work because the site redirects from https://nanlab.tech/ to https://www.nanlab.tech/. Those are different hostnames, so filtering the links out is the correct behavior.
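One way around the redirect is to start the crawl from the post-redirect hostname, so the extracted links share it. A minimal sketch, assuming the 0.6.x import paths and string-literal strategy values:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=100)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Starting from https://www.nanlab.tech/ means the on-page links
        # share the crawl's hostname, so 'same-hostname' no longer filters
        # everything out.
        await context.enqueue_links(strategy='same-hostname')

    await crawler.run(['https://www.nanlab.tech/'])


asyncio.run(main())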
enqueue_links uses the same logic to extract links as extract_links, so I can assume the problem is in your exclude parameter.
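As a sanity check on that parameter: exclude expects a flat list of Glob or compiled-regex patterns, one per entry. A hedged sketch of what I mean (the patterns here are made up):

from crawlee import Glob
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # One Glob (or re.Pattern) per entry; if self._blocked_extensions is
    # itself a list, then exclude=[self._blocked_extensions] nests it and
    # the patterns will not match the way you expect.
    await context.enqueue_links(
        exclude=[Glob('**/*.pdf'), Glob('**/*.zip')],
    )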
Regarding "with strategy same-domain there is an error": that one is a bug, thank you.