other-emerald
other-emerald•11mo ago

How to throttle enqueuing URLs to the next router

splitAndExecute({
    callback: async (urlBatch, batchIndex) => {
        // Log that we are enqueuing the nth batch of detail-page jobs for this job id.
        logger.info(`Linkedin/Scraper - Enqueuing ${urlBatch.length} jobs to detail page handler - Batch ${batchIndex + 1}`);
        await enqueueLinks({
            urls: urlBatch,
            label: LinkedinRouterLabels.JOB_DETAIL_PAGE,
            userData: createLinkedinRouterUserData(payload),
            waitForAllRequestsToBeAdded: false,
        });
        // Back off for progressively longer between batches.
        const minSleepTime = 2000 * (batchIndex + 1);
        const maxSleepTime = 3000 * (batchIndex + 1);
        await random_sleep(minSleepTime, maxSleepTime);
    },
    urls: jobDetailPageUrls,
    maxRequestsPerBatch: 2,
});
Hello guys. I have a router that scrapes a URL list and enqueues the URLs to the next router. However, I want to throttle the enqueuing to limit the requests sent to the website. I've tried adding the crawler configuration, but it doesn't work as intended: even with a requests-per-minute or requests-per-crawl limit, the crawler doesn't respect it. Initially I thought this was because the limit is only checked after a URL list has been enqueued, so if you enqueue a list bigger than the limit in one go, the limit never takes effect (e.g. the limit is 10 requests, and I enqueue 25 requests as a single array). So I manually split the job-URL array into multiple smaller batches, as above. However, this doesn't work either: the enqueuing definitely happens with sleep intervals in between, but the next router is still called all at once after all the batches are enqueued.
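For reference, splitAndExecute and random_sleep are not Crawlee APIs but the poster's own helpers. A minimal sketch of what they might look like, matching the call sites above (the implementation itself is an assumption):

    // Hypothetical helpers assumed by the snippet above; not part of Crawlee.
    interface SplitAndExecuteOptions {
        urls: string[];
        maxRequestsPerBatch: number;
        callback: (urlBatch: string[], batchIndex: number) => Promise<void>;
    }

    // Split `urls` into batches of at most `maxRequestsPerBatch` and run
    // `callback` on each batch sequentially.
    async function splitAndExecute({ urls, maxRequestsPerBatch, callback }: SplitAndExecuteOptions): Promise<void> {
        for (let i = 0; i * maxRequestsPerBatch < urls.length; i++) {
            const batch = urls.slice(i * maxRequestsPerBatch, (i + 1) * maxRequestsPerBatch);
            await callback(batch, i);
        }
    }

    // Sleep for a random duration between `minMs` and `maxMs` milliseconds.
    async function random_sleep(minMs: number, maxMs: number): Promise<void> {
        const ms = minMs + Math.random() * (maxMs - minMs);
        await new Promise((resolve) => setTimeout(resolve, ms));
    }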
4 Replies
Hall
Hall•11mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
other-emerald
other-emeraldOP•11mo ago
Here's my Cheerio config:
return new CheerioCrawler({
    proxyConfiguration,
    // maxRequestRetries: 1,
    maxConcurrency: 1,
    maxRequestsPerMinute: 2,
    // ! Applies across all routers (i.e. preview/detail). Useful for testing.
    // (In reality it can be exceeded because of parallel requests.)
    maxRequestsPerCrawl: 10,
    autoscaledPoolOptions: {
        desiredConcurrency: 1,
    },
    requestHandler: linkedinRouter,
});
xenial-black
xenial-black•11mo ago
Hey, if I understand correctly, you are trying to limit the frequency of requests being sent to the server, right? If so, you should enqueue all of the requests at once: with the maxRequestsPerMinute field set, CheerioCrawler will automatically limit the frequency of requests sent to the server. "Enqueuing" only adds the requests to the RequestQueue, which then automatically feeds the crawler. The maxRequestsPerMinute field does not limit the enqueuing rate, but the number of requests that are processed per minute. There is no real advantage in limiting the enqueuing process in general.
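A minimal sketch of that approach (the list-page label, the selector, and the handler bodies here are illustrative, not from the original code):

    import { CheerioCrawler, createCheerioRouter } from 'crawlee';

    const router = createCheerioRouter();

    router.addHandler('LIST_PAGE', async ({ $, enqueueLinks, log }) => {
        // Collect all detail-page URLs from the list page (selector is made up).
        const jobDetailPageUrls = $('a.job-link')
            .map((_, el) => $(el).attr('href'))
            .get();

        log.info(`Enqueuing ${jobDetailPageUrls.length} detail pages in one go`);

        // Enqueue everything at once; the RequestQueue feeds the crawler,
        // and maxRequestsPerMinute paces the actual fetches.
        await enqueueLinks({
            urls: jobDetailPageUrls,
            label: 'JOB_DETAIL_PAGE',
        });
    });

    router.addHandler('JOB_DETAIL_PAGE', async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    });

    const crawler = new CheerioCrawler({
        maxConcurrency: 1,
        // Limits how many requests are processed per minute,
        // independent of how fast they were enqueued.
        maxRequestsPerMinute: 2,
        requestHandler: router,
    });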
other-emerald
other-emeraldOP•11mo ago
Thank you @Milunnn. I had tried that already and it wasn't working for some reason, but now it's working. Thank you for the help 🙂