Apify Discord Mirror

Updated 5 months ago

enqueueLinks with a selector doesn't work?

At a glance

The community member is trying to grab the next-page link from https://www.haskovo.net/news using the enqueueLinks function of Crawlee's PlaywrightCrawler, but it isn't working. They've checked the selector in DevTools and it seems to be fine.

Other community members suggest that the issue is related to the protocol (HTTP vs. HTTPS) of the links, and that using the strategy: EnqueueStrategy.All option helps. The team behind the library acknowledges that this was likely unintentional behavior and will look into it.

The community members also provide a working alternative that uses crawler.addRequests instead of enqueueLinks.

I'm trying to grab the next page link from: https://www.haskovo.net/news with:
Plain Text
await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
})


But it doesn't work. I've checked this (and other selectors) in DevTools and it matches the element fine.

What am I missing?

PS: I'm just messing around, trying to get the hang of things. I'm aware that I can grab the whole thing with Cheerio, but I want a proof of concept with PlaywrightCrawler.
9 comments
Save and check the actual HTML or a screenshot. If something is visible in DevTools, it's available to a regular web user, while the scraper (bot) might be blocked or asked for verification.
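For example, dumping what the crawler actually received can be done from the request handler (a minimal sketch; the file names are arbitrary):
Plain Text
import { writeFile } from 'node:fs/promises';

// Inside the PlaywrightCrawler's requestHandler:
async requestHandler({ page }) {
    // Save the HTML and a screenshot exactly as the crawler sees them,
    // so they can be compared with what DevTools shows.
    await writeFile('page-dump.html', await page.content());
    await page.screenshot({ path: 'page-dump.png', fullPage: true });
},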
Seems like it's there and it's not blocked
Attachment: screenshot of https://www.haskovo.net/news
I do not think it works that way.
enqueueLinks is a function that searches for links on the page.
It looks like what you want is to find one specific link and enqueue it.
If that is the case, you should use crawler.addRequests: https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler#addRequests
You can do something like this
Plain Text
import { PlaywrightCrawler } from 'crawlee';

(async () => {
    const crawler = new PlaywrightCrawler({
        headless: false,

        async requestHandler({ page }) {
            // Find the "next page" link and read its href attribute.
            const eltHandle = await page.$('.pagination > li:last-child > a');
            const liLink = await page.evaluate((li) => [li.getAttribute('href')], eltHandle);
            // Enqueue the extracted URL directly instead of using enqueueLinks.
            await crawler.addRequests(liLink);
        },
    });

    await crawler.run([
        'https://www.haskovo.net/news',
    ]);
})();
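If the href turned out to be relative, it would also need to be resolved against the loaded URL before being enqueued. A variant of the handler that does this (just a sketch, using page.$eval and the crawler available on the crawling context):
Plain Text
async requestHandler({ page, request, crawler }) {
    // page.$eval throws if the selector matches nothing.
    const href = await page.$eval('.pagination > li:last-child > a', (a) => a.getAttribute('href'));
    if (href) {
        // Resolve a possibly relative href against the URL that was actually loaded.
        await crawler.addRequests([new URL(href, request.loadedUrl).href]);
    }
},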
Yes, that works. But why wouldn't enqueueLinks work? What I'm selecting is a link, after all. Or does it want the selector to match more than one element?
It works by specifying the strategy as All: https://crawlee.dev/docs/upgrading/upgrading-to-v3#enqueuing-links
Maybe the Dev team will tell us why.
Plain Text
// EnqueueStrategy is imported from 'crawlee'.
async requestHandler({ request, response, log, enqueueLinks }) {
    log.info(`${request.loadedUrl} Status code: ${response.status()} ${request.label ? 'Label: ' + request.label : ''}`);

    await enqueueLinks({
        selector: '.pagination li:last-child > a',
        label: 'LIST',
        strategy: EnqueueStrategy.All,
    });
},
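For completeness, a self-contained version of that approach might look like this (a sketch reusing the same start URL; top-level await assumes an ESM entry point):
Plain Text
import { PlaywrightCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, response, log, enqueueLinks }) {
        log.info(`${request.loadedUrl} Status code: ${response.status()}`);

        // EnqueueStrategy.All disables the default 'same-hostname' filtering,
        // which at the time compared full URL origins and therefore dropped
        // the HTTP links found on this HTTPS page.
        await enqueueLinks({
            selector: '.pagination li:last-child > a',
            label: 'LIST',
            strategy: EnqueueStrategy.All,
        });
    },
});

await crawler.run(['https://www.haskovo.net/news']);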
The problem in this case is that the page uses the HTTPS protocol while the links use HTTP. enqueueLinks by default uses the 'same-hostname' strategy, but the URL origin is actually compared, not the URL host, so the protocol is respected as well. Not sure if that was intentional or a bug; I will check it with the team. Thanks for pointing this out!
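A quick way to see the distinction being described (the page-2 URL below is made up for illustration):
Plain Text
// The page is served over HTTPS, but the pagination links use plain HTTP.
const pageUrl = new URL('https://www.haskovo.net/news');
const linkUrl = new URL('http://www.haskovo.net/news?page=2'); // hypothetical link

console.log(pageUrl.hostname === linkUrl.hostname); // true  -> same hostname
console.log(pageUrl.origin === linkUrl.origin);     // false -> origins differ because the protocols differ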
That makes more sense, thanks! πŸ™‚
Sure thing! And yeah, we will probably update the logic here; it looks like it was not intentional, but the next release will probably already be in 2023. For now, please use the strategy 'all'.