useful-bronze · 2y ago

Help regarding request_queue

I am building a website scraper for my users. I want to support up to x child URLs being scraped, starting from the startUrl. In some cases I am seeing duplicate links being scraped, and in some cases the number of URLs identified goes into the thousands. I want to control how URLs are enqueued into the request_queue, to avoid unnecessary costs and duplicate URLs being scraped. Here is my enqueue function:
const enqueued = await enqueueLinks({
    selector: 'a',
    transformRequestFunction: (request) => {
        const url = new URL(request.url);

        // Use just the domain and path as the unique key, ignoring query parameters and hash fragments
        request.uniqueKey = url.origin + url.pathname;

        if (request.uniqueKey.includes(startUrls[0].url)) {
            if (globalStore.capturedLinks.hasOwnProperty(request.uniqueKey)) {
                return;
            }
            globalStore.capturedLinks[request.uniqueKey] = false;
            return request;
        }
    },
});
Also, I have set the link selector to the a tag. Should I not use this in the scrape request's input?
1 Reply
Lukas Krivka · 2y ago
You will need to create a global object that tracks the number of requests enqueued per start URL. You can pass the start URL to its children via userData.
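
A minimal sketch of that approach, assuming Crawlee's CheerioCrawler; the MAX_CHILD_URLS cap and the enqueuedPerStartUrl counter are illustrative names, not part of the original code:

import { CheerioCrawler } from 'crawlee';

const MAX_CHILD_URLS = 100;        // illustrative per-start-URL cap
const enqueuedPerStartUrl = {};    // startUrl -> number of child requests enqueued so far

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // The start URL travels with every request via userData; the start URL itself falls back to its own URL.
        const startUrl = request.userData.startUrl ?? request.url;

        await enqueueLinks({
            selector: 'a',
            transformRequestFunction: (req) => {
                const url = new URL(req.url);
                // Dedupe on origin + path; the request queue drops repeated uniqueKeys on its own.
                req.uniqueKey = url.origin + url.pathname;

                // Skip enqueuing once this start URL has reached its cap.
                const count = enqueuedPerStartUrl[startUrl] ?? 0;
                if (count >= MAX_CHILD_URLS) return false;

                enqueuedPerStartUrl[startUrl] = count + 1;
                // Propagate the start URL so children keep counting against the same cap.
                req.userData = { ...req.userData, startUrl };
                return req;
            },
        });
    },
});

await crawler.run(startUrls.map((s) => ({ url: s.url, userData: { startUrl: s.url } })));

Note that the counter is approximate: it counts URLs passed to the queue, and the queue may still drop some of them as duplicates of already-enqueued uniqueKeys.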
