Apify Discord Mirror

Updated 5 months ago

system design of concurrent crawlers

At a glance

The community member has multiple Playwright crawlers that each work fine individually, but when run concurrently they hit memory overloads, navigation timeouts, skipped products, and crawlers ending early. Each crawler takes base URLs and scrapes them for product URLs, which are then individually scraped for product page information.

The comments suggest the community member is experiencing memory overload, with the crawlers reclaiming failed requests and timing out on navigations. The community member asks how to design the scrapers so they can run one by one, run certain stages one by one, or manage several running at once, and whether that is doable at all. They also ask whether to run the crawlers in separate terminals or on Apify, including how to manage memory, storage, and compute units, and whether to run them under the same Actor instance or as individual Actor instances.

Useful resources
I have multiple crawlers (primarily Playwright), one per site, that work completely fine on their own when I use only one crawler per site.
I have tried running these crawlers concurrently through a scrape event emitted from the server, which emits individual scrape events for each site to run each crawler.
I face a lot of memory overloads, timed-out navigations, skipping of many products, and early ending of the crawlers.
Each crawler essentially takes base URLs, scrapes them to get product URLs, and then scrapes each product URL individually to get product page info.
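A common Crawlee pattern for this two-stage flow (list pages → product pages) is a single router with labeled handlers, so list and detail requests share one queue and one autoscaled pool instead of competing crawler instances. A minimal sketch; the link selector, start URL, and extracted fields are placeholder assumptions:

```typescript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// List pages: enqueue product links into the same queue under the DETAIL label.
router.addHandler('LIST', async ({ enqueueLinks }) => {
    await enqueueLinks({
        selector: 'a[href*="/products/"]', // placeholder selector, adjust per site
        label: 'DETAIL',
    });
});

// Product pages: extract data and push it to the default dataset.
router.addHandler('DETAIL', async ({ page, request, pushData }) => {
    await pushData({
        url: request.url,
        title: await page.locator('h1').first().textContent(), // placeholder field
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

// Base URLs enter as LIST requests; everything else is enqueued from the handlers.
await crawler.run([{ url: 'https://www.example.com/collections/all', label: 'LIST' }]);
```

Keeping both stages in one crawler per site means one memory budget and one concurrency setting govern that site, which makes the multi-site tuning below much more predictable.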
3 comments
Plain Text
WARN  PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 6177 MB of 4017 MB (154%). Consider increasing available memory.
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. This crawler instance is already running, you can add more requests to it via `crawler.addRequests()`.
 {"id":"vI4UdrhFP5NVjsV","url":"https://www.tentree.com/collections/kids?page=10","retryCount":1}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.05},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tentree.com/collections/womens?page=50", waiting until "networkidle"
============================================================
 {"id":"2QcFPmYcDLgjTat","url":"https://www.tentree.com/collections/womens?page=50","retryCount":2}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"qgjcGJ50OKucpO6","url":"https://www.kleankanteen.com/collections/all/products/party-kit","retryCount":1}
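Two things stand out in these logs: memory is at 154% of the limit, and the timed-out navigation was waiting for "networkidle", which never fires on pages with continuous background traffic (analytics, long polling). A sketch of settings that address both; the numeric values are illustrative starting points, not recommendations from the thread:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Cap pages per crawler so several crawlers can't multiply unbounded.
    maxConcurrency: 5,
    // Matches the 60 s timeout already seen in the logs.
    navigationTimeoutSecs: 60,
    preNavigationHooks: [
        async (_ctx, gotoOptions) => {
            // Avoid 'networkidle': wait only for the DOM, then use explicit
            // waits for the selectors you actually need.
            if (gotoOptions) gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],
    async requestHandler({ page }) {
        // ... per-page scraping logic ...
    },
});

// The total memory budget for the Node process can be capped via an env var:
// CRAWLEE_MEMORY_MBYTES=4096 node crawler.js
```

If several crawlers run in one process, they share that memory budget, so either divide `maxConcurrency` across them or give each crawler its own process.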

How do I design the scrapers so I can run them one by one, run certain aspects one by one, or manage multiple running at once? Is that doable in the first place?
Should I run them in separate terminals, or how else should I run them?
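On the orchestration question: since each crawler's `run()` returns a promise, "one by one" is just awaiting the runs in a loop, and "a few at a time" is a small worker pool, with no separate terminals needed. A generic sketch with plain async functions standing in for the per-site crawlers; the `CrawlJob` type and the limit value are assumptions:

```typescript
type CrawlJob = () => Promise<void>;

// Run every site crawler to completion before starting the next one.
async function runSequentially(jobs: CrawlJob[]): Promise<void> {
    for (const job of jobs) {
        await job();
    }
}

// Run at most `limit` crawlers at once; each worker pulls the next
// job from the shared queue as soon as its current one finishes.
async function runWithLimit(jobs: CrawlJob[], limit: number): Promise<void> {
    const queue = [...jobs];
    const workers = Array.from({ length: Math.min(limit, queue.length) }, async () => {
        for (let job = queue.shift(); job; job = queue.shift()) {
            await job();
        }
    });
    await Promise.all(workers);
}
```

With Crawlee, each job would wrap one site's `crawler.run(startUrls)` call, so one site's browser pool is torn down before (or while only `limit - 1` others run and) the next begins, which directly limits peak memory.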
And finally, how would I run each one as an Actor on Apify? What plan would I choose, how would I manage memory, storage, compute units (CUs), etc., and should I run them under the same Actor instance or as individual Actor instances? How would you recommend I effectively manage each site crawler?
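On the Apify side, one workable pattern is one Actor per site, so memory, timeouts, and retries are tuned per site and one site's failure cannot starve the others, plus a lightweight orchestrator Actor that triggers the per-site runs. A sketch using the Apify SDK; the actor names and memory value are placeholder assumptions:

```typescript
import { Actor } from 'apify';

await Actor.init();

// Placeholder per-site actor names; each would be its own actor with its own build.
const siteActors = ['me/tentree-crawler', 'me/kleankanteen-crawler'];

for (const actorId of siteActors) {
    // Memory (in MB) is set per run, so a heavy site can get a bigger
    // allocation without raising the budget for every crawler.
    const run = await Actor.call(actorId, {}, { memory: 4096 });
    console.log(`${actorId} finished with status ${run.status}`);
}

await Actor.exit();
```

Running actors sequentially like this also keeps CU consumption flat and predictable; swapping the loop for parallel `Actor.call`s trades CU burst for wall-clock time.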