useful-bronze · 2y ago

Help regarding request_queue

I am building a website scraper for my users. I want to support up to x child URLs being scraped, starting from the startUrl. In some cases I am seeing duplicate links being scraped, and in some cases the number of URLs identified goes into the thousands. I want to control how URLs are enqueued into the request_queue, to avoid unnecessary costs and duplicate URLs being scraped. Here is my enqueue function:
const enqueued = await enqueueLinks({
    selector: 'a',
    transformRequestFunction: (request) => {
        const url = new URL(request.url);

        // Use just the domain and path as the unique key, ignoring query parameters and hash fragments
        request.uniqueKey = url.origin + url.pathname;

        if (request.uniqueKey.includes(startUrls[0].url)) {
            if (globalStore.capturedLinks.hasOwnProperty(request.uniqueKey)) {
                return;
            }
            globalStore.capturedLinks[request.uniqueKey] = false;
            return request;
        }
    },
});
Also, I have set the link selector to the a tag. Should I not use this in the scrape request's input?
1 Reply
Lukas Krivka · 2y ago
You will need to create a global object that tracks the number of requests enqueued per start URL. You can pass the start URL to its children via userData.
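
A minimal sketch of that approach, assuming Crawlee's CheerioCrawler; the MAX_CHILD_URLS cap and the enqueuedPerStartUrl counter are illustrative names, not part of the original code:

import { CheerioCrawler } from 'crawlee';

const MAX_CHILD_URLS = 100;        // illustrative per-start-URL cap
const enqueuedPerStartUrl = {};    // startUrl -> number of child requests enqueued so far

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // The start URL travels with every request via userData; the start URL itself falls back to its own URL.
        const startUrl = request.userData.startUrl ?? request.url;

        await enqueueLinks({
            selector: 'a',
            transformRequestFunction: (req) => {
                const url = new URL(req.url);
                // Dedupe on origin + path; the request queue drops repeated uniqueKeys on its own.
                req.uniqueKey = url.origin + url.pathname;

                // Skip enqueuing once this start URL has reached its cap.
                const count = enqueuedPerStartUrl[startUrl] ?? 0;
                if (count >= MAX_CHILD_URLS) return false;

                enqueuedPerStartUrl[startUrl] = count + 1;
                // Propagate the start URL so children keep counting against the same cap.
                req.userData = { ...req.userData, startUrl };
                return req;
            },
        });
    },
});

await crawler.run(startUrls.map((s) => ({ url: s.url, userData: { startUrl: s.url } })));

Note that the counter is approximate: it counts URLs passed to the queue, and the queue may still drop some of them as duplicates of already-enqueued uniqueKeys.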
