fascinating-indigo•2y ago
Scaling crawlers, what does 'Requests' mean
https://crawlee.dev/docs/guides/scaling-crawlers
The doc covers several options, but it never actually clarifies what the term "Requests" means. Can anyone clarify, please?
- maxConcurrency: "how many parallel requests can be run at any time."
- maxTasksPerMinute: "how many total requests can be made per minute."
- maxRequestsPerMinute: "how many total requests can be made per minute."
For all of these options the docs talk about "requests", but they never clarify what a request is:
A. Is a request the crawl of a single URL (not including sub-requests like downloading .js, .css, images, etc.)?
B. Is a request every individual download performed when fetching a URL (including .js, .css, images, etc.)?
E.g. if I set "maxRequestsPerMinute" to 3:
Do I limit Crawlee to fully crawling 3 given URLs,
or
Do I limit Crawlee to 3 requests total (e.g. 1x HTML, 1x JS, 1x CSS)?
Thanks
4 Replies
A Request object represents a URL to be scraped. That might be a pure HTTP call or a browser page opening, depending on the type of Crawler being used.
https://crawlee.dev/api/core/class/Request
A Request is usually added to the queue and then processed in the requestHandler. It is kind of the fundamental unit of Crawlee.
Downloading JS, CSS, etc. is abstracted over. In the case of Cheerio you get HTML only anyway; in the case of a browser it will be a full page load by default as part of a single request (so likely 10s to 100s of HTTP calls made by the browser).
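To make the counting concrete, here's a small self-contained TypeScript sketch (no Crawlee dependency; the names RequestLike and countRequests are made up for illustration) showing that each queued URL counts as one request, regardless of how many sub-resources the page itself pulls in:

```typescript
// Hypothetical illustration: each queued URL is ONE "Request",
// no matter how many sub-resources (.js, .css, images) the page loads.
interface RequestLike {
  url: string;
  subResources: string[]; // fetched by the browser, NOT counted by Crawlee
}

function countRequests(queue: RequestLike[]): number {
  // Limits like maxRequestsPerMinute / maxConcurrency apply to this
  // number: the URLs you enqueue, not the resources each page downloads.
  return queue.length;
}

const queue: RequestLike[] = [
  { url: "https://discord.com/", subResources: Array(43).fill("asset") },
  { url: "https://example.com/", subResources: ["app.js", "style.css"] },
];

console.log(countRequests(queue)); // 2 requests, even though 45 sub-resources load
```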
fascinating-indigoOP•2y ago
Hi @Lukas Krivka, thanks for your response. I did read the Request class docs, but it still wasn't clear to me, so I read them again.
If I understand correctly, when the docs mention a request, it's about the actual URL that is requested, not counting all the requests needed to download the page's resources (e.g. .css, .js, images, etc.).
To clarify: if I request e.g. https://discord.com/, it's counted as 1 request, correct? (It doesn't count the 43 downloads for all the resources loaded by https://discord.com/.)
E.g.:
Setting "maxRequestsPerMinute" to 10 will allow 10 discord.com URLs per minute.
If, in addition, I set "maxConcurrency" to 1, it would allow crawling 1 discord.com URL at a time.
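Under that interpretation, the two limits combine roughly like this. A configuration sketch only, assuming Crawlee is installed and using its CheerioCrawler (running it requires network access; the option names match the Crawlee docs, but treat the values as illustrative):

```typescript
import { CheerioCrawler } from 'crawlee';

// Both limits count enqueued URLs (Requests), never a page's sub-resources.
const crawler = new CheerioCrawler({
  maxRequestsPerMinute: 10, // at most 10 URLs start per minute
  maxConcurrency: 1,        // at most 1 URL is being processed at a time
  async requestHandler({ request, $ }) {
    console.log(`Crawled ${request.url}: ${$('title').text()}`);
  },
});

await crawler.run(['https://discord.com/']);
```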
To sum it up: requests are always the URLs provided by my system to Crawlee, and never the number of resources Crawlee has to download.
Correct?