fascinating-indigo•2y ago
Scaling crawlers, what does 'Requests' mean
https://crawlee.dev/docs/guides/scaling-crawlers
The doc covers several options, but it never actually clarifies what the term "Requests" means. Can anyone clarify, please?
- maxConcurrency: "how many parallel requests can be run at any time."
- maxTasksPerMinute: "how many total requests can be made per minute."
- maxRequestsPerMinute: "how many total requests can be made per minute."
For all of these options the docs talk about "requests", but they never clarify what a request is:
A. Is a request the crawl of a single URL (not including sub-requests like downloading .js, .css, images, etc.)?
B. Is a request every individual download performed when fetching a URL (including .js, .css, images, etc.)?
E.g. if I set "maxRequestsPerMinute" to 3:
Do I limit Crawlee to fully crawling 3 given URLs,
or
Do I limit Crawlee to 3 requests total (e.g. 1x HTML, 1x JS, 1x CSS)?
Thanks
4 Replies
A Request object represents a URL to be scraped. That might be a pure HTTP call or a browser page opening, depending on the type of Crawler being used.
https://crawlee.dev/api/core/class/Request
A Request is usually added to the queue and then processed in the requestHandler. It is kind of the fundamental unit of Crawlee.
Downloading JS, CSS, etc. is abstracted over. In the case of Cheerio you get HTML only anyway; in the case of a browser it will be a full page load by default as part of a single request (so likely 10s to 100s of HTTP calls made by the browser).
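To make the counting concrete, here's a small self-contained TypeScript sketch (no Crawlee dependency; the names RequestLike and countRequests are made up for illustration) showing that each queued URL counts as one request, regardless of how many sub-resources the page itself pulls in:

```typescript
// Hypothetical illustration: each queued URL is ONE "Request",
// no matter how many sub-resources (.js, .css, images) the page loads.
interface RequestLike {
  url: string;
  subResources: string[]; // fetched by the browser, NOT counted by Crawlee
}

function countRequests(queue: RequestLike[]): number {
  // Limits like maxRequestsPerMinute / maxConcurrency apply to this
  // number: the URLs you enqueue, not the resources each page downloads.
  return queue.length;
}

const queue: RequestLike[] = [
  { url: "https://discord.com/", subResources: Array(43).fill("asset") },
  { url: "https://example.com/", subResources: ["app.js", "style.css"] },
];

console.log(countRequests(queue)); // 2 requests, even though 45 sub-resources load
```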
fascinating-indigoOP•2y ago
Hi @Lukas Krivka, thanks for your response. I did read the Request class docs, but it still wasn't clear to me, so I read them again.
If I understand correctly, when the docs mention a request, it's about the actual URL that is requested, not counting all the requests needed to download the page's resources (e.g. .css, .js, images, etc.).
To clarify: if I request e.g. https://discord.com/, it's counted as 1 request, correct? (It doesn't count the 43 downloads for all the resources loaded by https://discord.com/.)
E.g.:
Setting "maxRequestsPerMinute" to 10 will allow 10 discord.com URLs per minute.
If, in addition, I set "maxConcurrency" to 1, it would allow crawling 1 discord.com URL at a time.
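Under that interpretation, the two limits combine roughly like this. A configuration sketch only, assuming Crawlee is installed and using its CheerioCrawler (running it requires network access; the option names match the Crawlee docs, but treat the values as illustrative):

```typescript
import { CheerioCrawler } from 'crawlee';

// Both limits count enqueued URLs (Requests), never a page's sub-resources.
const crawler = new CheerioCrawler({
  maxRequestsPerMinute: 10, // at most 10 URLs start per minute
  maxConcurrency: 1,        // at most 1 URL is being processed at a time
  async requestHandler({ request, $ }) {
    console.log(`Crawled ${request.url}: ${$('title').text()}`);
  },
});

await crawler.run(['https://discord.com/']);
```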
To sum it up: requests are always the URLs provided by my system to Crawlee, and never the number of resources Crawlee has to download.
Correct?