genetic-orange•3y ago
enqueueLinks with a selector doesn't work?
I'm trying to grab the next page link from: https://www.haskovo.net/news with:
But it won't work. I've checked this (and other selectors) in DevTools, and it grabs the element fine.
What am I missing?
PS: I'm just messing around, trying to get a grasp of things. I'm aware that I can grab the whole thing with Cheerio, but I want a 'proof of concept' with PlaywrightCrawler.
News - Haskovo.NET
9 Replies
Save and check the actual HTML or a screenshot. If something shows up in DevTools, that only means it is available to a regular web user, while the scraper (bot) might be blocked or asked for verification.
genetic-orangeOP•3y ago
Seems like it's there and it's not blocked
afraid-scarlet•3y ago
I don't think it works that way. enqueueLinks is a function that searches for links on the page. It looks like what you want is to find one link and enqueue it. If that is the case, you should use crawler.addRequests:
https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler#addRequests
PlaywrightCrawler | API | Crawlee
Provides a simple framework for parallel crawling of web pages using headless Chromium, Firefox and Webkit browsers with Playwright. The URLs to crawl are fed either from a static list of URLs or from a dynamic queue of URLs enabling recursive crawling of websites. Since Playwright uses headless brow...
afraid-scarlet•3y ago
You can do something like this
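The snippet from this reply was not preserved in the thread. A minimal sketch of the idea might look like the following. It assumes a hypothetical selector `a.next-page` (the selector from the question is also missing) and uses small structural stand-ins for the `page` and `crawler` objects that a `PlaywrightCrawler` request handler receives, so the block does not depend on Crawlee being installed. The helper names are made up for illustration.

```typescript
// Hypothetical selector; the one from the original question was not preserved.
const NEXT_PAGE_SELECTOR = 'a.next-page';

// Structural stand-ins for the parts of the handler context used below.
interface PageLike {
    url(): string;
    getAttribute(selector: string, name: string): Promise<string | null>;
}
interface CrawlerLike {
    addRequests(requests: string[]): Promise<unknown>;
}

// Resolve a possibly relative href against the URL of the current page.
function resolveHref(href: string, pageUrl: string): string {
    return new URL(href, pageUrl).href;
}

// Meant to be called from a requestHandler: read the href of the single
// "next page" link and add it to the queue as a new request.
async function addNextPageRequest(page: PageLike, crawler: CrawlerLike): Promise<void> {
    const href = await page.getAttribute(NEXT_PAGE_SELECTOR, 'href');
    if (href === null) return; // the matched element has no href attribute
    await crawler.addRequests([resolveHref(href, page.url())]);
}
```

Inside a real handler this would be called as `await addNextPageRequest(page, crawler)`, with both objects taken from the handler's context argument.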
genetic-orangeOP•3y ago
Yes, that works. But why wouldn't enqueueLinks work? What I'm selecting is a link, after all. Or does it want the selector to match more than one element?
afraid-scarlet•3y ago
It works if you set the strategy to All:
https://crawlee.dev/docs/upgrading/upgrading-to-v3#enqueuing-links
Maybe the Dev team will tell us why.
Upgrading to v3 | Crawlee
This page summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3.
genetic-orange•3y ago
The problem in this case is that the page is served over HTTPS, while the links on it use HTTP. By default, enqueueLinks uses the 'same-hostname' strategy, but the check actually compares the URL origin rather than the URL host, so the protocol has to match as well. I'm not sure whether that was intentional or a bug; I'll check with the team. Thanks for pointing this out!
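The difference between the two comparisons can be checked with the standard `URL` class. This is only an illustration of the distinction described above, not Crawlee's actual code; the link URL is a made-up example of an `http://` link on the same site.

```typescript
// Compare two URLs by hostname only (the protocol is ignored).
function sameHostname(a: string, b: string): boolean {
    return new URL(a).hostname === new URL(b).hostname;
}

// Compare two URLs by origin (protocol + host + port).
function sameOrigin(a: string, b: string): boolean {
    return new URL(a).origin === new URL(b).origin;
}

const pageUrl = 'https://www.haskovo.net/news';
// Hypothetical link target: same site, but with the http:// scheme.
const linkUrl = 'http://www.haskovo.net/news?page=2';

console.log(sameHostname(pageUrl, linkUrl)); // true
console.log(sameOrigin(pageUrl, linkUrl));   // false
```

A filter based on origin therefore drops the link even though the hostname is identical.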
genetic-orangeOP•3y ago
That makes more sense, thanks! 🙂
genetic-orange•3y ago
Sure thing! And yeah, we will probably update the logic here. It looks like it was not intentional, but the next release will probably already be in 2023. For now, please use the 'all' strategy.
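A sketch of that workaround, assuming the same hypothetical `a.next-page` selector (the original was not preserved). The `enqueueLinks` function from the handler context is represented by a structural stand-in that covers only the two options used here, so the block compiles without Crawlee installed; the helper name is made up.

```typescript
// Hypothetical selector; replace with the one that matches the next-page link.
const NEXT_LINK_SELECTOR = 'a.next-page';

// Subset of the enqueueLinks options used in this sketch.
interface EnqueueOptionsLike {
    selector: string;
    strategy: 'all' | 'same-hostname' | 'same-domain';
}
type EnqueueLinksLike = (options: EnqueueOptionsLike) => Promise<unknown>;

// Enqueue the links matched by the selector without any host or protocol filtering.
async function enqueueNextLinkAnyOrigin(enqueueLinks: EnqueueLinksLike): Promise<void> {
    await enqueueLinks({ selector: NEXT_LINK_SELECTOR, strategy: 'all' });
}
```

In a request handler this would be called as `await enqueueNextLinkAnyOrigin(enqueueLinks)`. Since 'all' removes the host restriction entirely, a narrow selector is what keeps the crawl from wandering off-site.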