genetic-orange
genetic-orange3y ago

enqueueLinks with a selector doesn't work?

I'm trying to grab the next page link from: https://www.haskovo.net/news with:
await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
});
But it doesn't work. I've checked this selector (and others) in DevTools, and it matches the element fine. What am I missing? PS: I'm just experimenting, trying to get the hang of things. I know I can scrape the whole thing with Cheerio, but I want a proof of concept with PlaywrightCrawler.
9 Replies
Alexey Udovydchenko
Save and check the actual HTML or a screenshot. If something is visible in DevTools, that only means it is available to a regular web user; a scraper (bot) might be blocked or asked for verification.
genetic-orange
genetic-orangeOP3y ago
Seems like it's there and it's not blocked
afraid-scarlet
afraid-scarlet3y ago
I don't think it works that way. enqueueLinks is a function that searches for links on the page. It looks like you want to find a single link and enqueue it. If that is the case, you should use crawler.addRequests: https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler#addRequests
afraid-scarlet
afraid-scarlet3y ago
You can do something like this
import { PlaywrightCrawler } from 'crawlee';

(async () => {
    const crawler = new PlaywrightCrawler({
        headless: false,

        async requestHandler({ page, request }) {
            // Read the href of the last pagination link, if there is one.
            const link = await page.$('.pagination > li:last-child > a');
            const href = link ? await link.getAttribute('href') : null;
            if (!href) return;
            // Resolve relative hrefs against the loaded page URL before enqueueing.
            const nextUrl = new URL(href, request.loadedUrl).href;
            await crawler.addRequests([nextUrl]);
        },
    });

    await crawler.run([
        'https://www.haskovo.net/news',
    ]);
})();
genetic-orange
genetic-orangeOP3y ago
Yes, that works. But why doesn't enqueueLinks work? What I'm selecting is a link, after all. Or does it need the selector to match more than one element?
afraid-scarlet
afraid-scarlet3y ago
It works if you set the strategy to All: https://crawlee.dev/docs/upgrading/upgrading-to-v3#enqueuing-links Maybe the dev team can tell us why.
// Assumes: import { EnqueueStrategy } from 'crawlee';
async requestHandler({ request, response, log, enqueueLinks }) {
    log.info(`${request.loadedUrl} Status code: ${response.status()} ${request.label ? 'Label: ' + request.label : ''}`);

    await enqueueLinks({
        selector: '.pagination li:last-child > a',
        label: 'LIST',
        // Enqueue matched links regardless of hostname or protocol.
        strategy: EnqueueStrategy.All,
    });
},
genetic-orange
genetic-orange3y ago
The problem in this case is that the page is served over HTTPS, while the links on it use HTTP. By default, enqueueLinks uses the 'same-hostname' strategy, but the check actually compares the URL origin rather than the URL host, so the protocol has to match as well. I'm not sure whether that was intentional or a bug. I'll check with the team. Thanks for pointing this out!
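To illustrate the difference between host and origin, here is a minimal sketch using Node's built-in URL class. The HTTP link is an assumed example of what a pagination href might look like, not the real one from the page, and the two helper names are hypothetical; this is not Crawlee's internal code.

```javascript
// The page was loaded over HTTPS.
const pageUrl = new URL('https://www.haskovo.net/news');
// Assumed example of a pagination link that uses plain HTTP.
const linkUrl = new URL('http://www.haskovo.net/news?page=2');

// Hypothetical helpers for illustration only, not Crawlee's internals.
const sameHostname = (a, b) => a.hostname === b.hostname;
const sameOrigin = (a, b) => a.origin === b.origin;

console.log(sameHostname(pageUrl, linkUrl)); // true: both are www.haskovo.net
console.log(sameOrigin(pageUrl, linkUrl));   // false: https:// vs http://
```

A check based on origin rejects the link even though the hostname is identical, which would explain why nothing gets enqueued with the default strategy.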
genetic-orange
genetic-orangeOP3y ago
That makes more sense, thanks! 🙂
genetic-orange
genetic-orange3y ago
Sure thing! And yes, we will probably update the logic here, since it looks like it was not intentional. The next release will probably not land until 2023, though. For now, please use the 'all' strategy.
