Apify Discord Mirror

Updated 5 months ago

enqueueLinks with a selector doesn't work?

At a glance

The community member is trying to grab the next-page link from https://www.haskovo.net/news using the enqueueLinks function of Crawlee's PlaywrightCrawler, but it isn't working. They've checked the selector in DevTools and it seems to be fine.

Other community members suggest that the issue is related to the protocol (HTTP vs. HTTPS) of the links, and that using the strategy: EnqueueStrategy.All option helps. The team behind the library acknowledges that this was likely unintentional behavior and will look into it.

The community members also provide a working alternative that uses crawler.addRequests instead of enqueueLinks.

I'm trying to grab the next page link from: https://www.haskovo.net/news with:
Plain Text
await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
})


But it doesn't work. I've checked this (and other selectors) in DevTools and it matches the element fine.

What am I missing?

PS: I'm just messing around, trying to get the hang of things. I'm aware that I can grab the whole thing with Cheerio, but I want a proof of concept with PlaywrightCrawler.
9 comments
Save and check the actual HTML or a screenshot. If something is visible in DevTools, it's available to a regular web user, while the scraper (bot) might be blocked or asked for verification.
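For example, dumping what the crawler actually received can be done from the request handler (a minimal sketch; the file names are arbitrary):
Plain Text
import { writeFile } from 'node:fs/promises';

// Inside the PlaywrightCrawler's requestHandler:
async requestHandler({ page }) {
    // Save the HTML and a screenshot exactly as the crawler sees them,
    // so they can be compared with what DevTools shows.
    await writeFile('page-dump.html', await page.content());
    await page.screenshot({ path: 'page-dump.png', fullPage: true });
},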
Seems like it's there and it's not blocked
Attachment: screenshot of https://www.haskovo.net/news
I do not think it works that way.
enqueueLinks is a function that searches for links on the page.
It looks like what you want is to find one specific link and enqueue it.
If that is the case, you should use crawler.addRequests: https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler#addRequests
You can do something like this
Plain Text
import { PlaywrightCrawler } from 'crawlee';

(async () => {
    const crawler = new PlaywrightCrawler({
        headless: false,

        async requestHandler({ page }) {
            // Find the "next page" link and read its href attribute.
            const eltHandle = await page.$('.pagination > li:last-child > a');
            const liLink = await page.evaluate((li) => [li.getAttribute('href')], eltHandle);
            // Enqueue the extracted URL directly instead of using enqueueLinks.
            await crawler.addRequests(liLink);
        },
    });

    await crawler.run([
        'https://www.haskovo.net/news',
    ]);
})();
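If the href turned out to be relative, it would also need to be resolved against the loaded URL before being enqueued. A variant of the handler that does this (just a sketch, using page.$eval and the crawler available on the crawling context):
Plain Text
async requestHandler({ page, request, crawler }) {
    // page.$eval throws if the selector matches nothing.
    const href = await page.$eval('.pagination > li:last-child > a', (a) => a.getAttribute('href'));
    if (href) {
        // Resolve a possibly relative href against the URL that was actually loaded.
        await crawler.addRequests([new URL(href, request.loadedUrl).href]);
    }
},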
Yes, that works. But why wouldn't enqueueLinks work? What I'm selecting is a link, after all. Or does it want the selector to match more than one element?
It works by specifying the strategy as All: https://crawlee.dev/docs/upgrading/upgrading-to-v3#enqueuing-links
Maybe the Dev team will tell us why.
Plain Text
// EnqueueStrategy is imported from 'crawlee'.
async requestHandler({ request, response, log, enqueueLinks }) {
    log.info(`${request.loadedUrl} Status code: ${response.status()} ${request.label ? 'Label: ' + request.label : ''}`);

    await enqueueLinks({
        selector: '.pagination li:last-child > a',
        label: 'LIST',
        strategy: EnqueueStrategy.All,
    });
},
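For completeness, a self-contained version of that approach might look like this (a sketch reusing the same start URL; top-level await assumes an ESM entry point):
Plain Text
import { PlaywrightCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, response, log, enqueueLinks }) {
        log.info(`${request.loadedUrl} Status code: ${response.status()}`);

        // EnqueueStrategy.All disables the default 'same-hostname' filtering,
        // which at the time compared full URL origins and therefore dropped
        // the HTTP links found on this HTTPS page.
        await enqueueLinks({
            selector: '.pagination li:last-child > a',
            label: 'LIST',
            strategy: EnqueueStrategy.All,
        });
    },
});

await crawler.run(['https://www.haskovo.net/news']);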
The problem in this case is that the page uses the HTTPS protocol while the links use HTTP. enqueueLinks by default uses the 'same-hostname' strategy, but the URL origin is actually compared, not the URL host, so the protocol is respected as well. Not sure if that was intentional or a bug; I will check it with the team. Thanks for pointing this out!
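A quick way to see the distinction being described (the page-2 URL below is made up for illustration):
Plain Text
// The page is served over HTTPS, but the pagination links use plain HTTP.
const pageUrl = new URL('https://www.haskovo.net/news');
const linkUrl = new URL('http://www.haskovo.net/news?page=2'); // hypothetical link

console.log(pageUrl.hostname === linkUrl.hostname); // true  -> same hostname
console.log(pageUrl.origin === linkUrl.origin);     // false -> origins differ because the protocols differ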
That makes more sense, thanks! πŸ™‚
Sure thing! And yeah, we will probably update the logic here; it looks like it was not intentional, but the next release will probably already be in 2023. For now, please use the strategy 'all'.