other-emerald
other-emerald•11mo ago

How to throttle enqueuing URLs to the next router

splitAndExecute({
    callback: async (urlBatch, batchIndex) => {
        // Log that we are enqueuing the nth batch of detail-page jobs for this job id.
        logger.info(`Linkedin/Scraper - Enqueuing ${urlBatch.length} jobs to detail page handler - Batch ${batchIndex + 1}`);
        await enqueueLinks({
            urls: urlBatch,
            label: LinkedinRouterLabels.JOB_DETAIL_PAGE,
            userData: createLinkedinRouterUserData(payload),
            waitForAllRequestsToBeAdded: false,
        });
        // Back off for progressively longer between batches.
        const minSleepTime = 2000 * (batchIndex + 1);
        const maxSleepTime = 3000 * (batchIndex + 1);
        await random_sleep(minSleepTime, maxSleepTime);
    },
    urls: jobDetailPageUrls,
    maxRequestsPerBatch: 2,
});
Hello guys. I have a router that scrapes a URL list and enqueues the URLs to the next router. However, I want to throttle the enqueuing to limit the requests sent to the website. I've tried adding the crawler configuration, but it doesn't work as intended: even with a requests-per-minute or requests-per-crawl limit, the crawler doesn't respect it. Initially I thought this was because the limit is only checked after a URL list has been enqueued, so if you enqueue a list bigger than the limit in one go, the limit never takes effect (e.g. the limit is 10 requests, and I enqueue 25 requests as a single array). So I manually split the job-URL array into multiple smaller batches, as above. However, this doesn't work either: the enqueuing definitely happens with sleep intervals in between, but the next router is still called all at once after all the batches are enqueued.
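For reference, splitAndExecute and random_sleep are not Crawlee APIs but the poster's own helpers. A minimal sketch of what they might look like, matching the call sites above (the implementation itself is an assumption):

    // Hypothetical helpers assumed by the snippet above; not part of Crawlee.
    interface SplitAndExecuteOptions {
        urls: string[];
        maxRequestsPerBatch: number;
        callback: (urlBatch: string[], batchIndex: number) => Promise<void>;
    }

    // Split `urls` into batches of at most `maxRequestsPerBatch` and run
    // `callback` on each batch sequentially.
    async function splitAndExecute({ urls, maxRequestsPerBatch, callback }: SplitAndExecuteOptions): Promise<void> {
        for (let i = 0; i * maxRequestsPerBatch < urls.length; i++) {
            const batch = urls.slice(i * maxRequestsPerBatch, (i + 1) * maxRequestsPerBatch);
            await callback(batch, i);
        }
    }

    // Sleep for a random duration between `minMs` and `maxMs` milliseconds.
    async function random_sleep(minMs: number, maxMs: number): Promise<void> {
        const ms = minMs + Math.random() * (maxMs - minMs);
        await new Promise((resolve) => setTimeout(resolve, ms));
    }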
4 Replies
Hall
Hall•11mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
other-emerald
other-emeraldOP•11mo ago
Here's my Cheerio config:
return new CheerioCrawler({
    proxyConfiguration,
    // maxRequestRetries: 1,
    maxConcurrency: 1,
    maxRequestsPerMinute: 2,
    // ! Applies across all routers (i.e. preview/detail). Useful for testing.
    // (In reality it can be exceeded because of parallel requests.)
    maxRequestsPerCrawl: 10,
    autoscaledPoolOptions: {
        desiredConcurrency: 1,
    },
    requestHandler: linkedinRouter,
});
xenial-black
xenial-black•11mo ago
Hey, if I understand correctly, you are trying to limit the frequency of requests being sent to the server, right? If so, you should enqueue all of the requests at once: with the maxRequestsPerMinute field set, CheerioCrawler will automatically limit the frequency of requests sent to the server. "Enqueuing" only adds the requests to the RequestQueue, which then automatically feeds the crawler. The maxRequestsPerMinute field does not limit the enqueuing rate, but the number of requests that are processed per minute. There is no real advantage in limiting the enqueuing process in general.
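A minimal sketch of that approach (the list-page label, the selector, and the handler bodies here are illustrative, not from the original code):

    import { CheerioCrawler, createCheerioRouter } from 'crawlee';

    const router = createCheerioRouter();

    router.addHandler('LIST_PAGE', async ({ $, enqueueLinks, log }) => {
        // Collect all detail-page URLs from the list page (selector is made up).
        const jobDetailPageUrls = $('a.job-link')
            .map((_, el) => $(el).attr('href'))
            .get();

        log.info(`Enqueuing ${jobDetailPageUrls.length} detail pages in one go`);

        // Enqueue everything at once; the RequestQueue feeds the crawler,
        // and maxRequestsPerMinute paces the actual fetches.
        await enqueueLinks({
            urls: jobDetailPageUrls,
            label: 'JOB_DETAIL_PAGE',
        });
    });

    router.addHandler('JOB_DETAIL_PAGE', async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    });

    const crawler = new CheerioCrawler({
        maxConcurrency: 1,
        // Limits how many requests are processed per minute,
        // independent of how fast they were enqueued.
        maxRequestsPerMinute: 2,
        requestHandler: router,
    });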
other-emerald
other-emeraldOP•11mo ago
Thank you @Milunnn. I had tried that already and it wasn't working for some reason, but now it's working. Thank you for the help 🙂