fascinating-indigo
fascinating-indigo•15mo ago

Large threaded, kubernetes scrape = Target page, context or browser has been closed

Ironically, @cryptorex just posted a similar issue (https://discord.com/channels/801163717915574323/1255531330704375828/1255531330704375828), but I wanted to provide some additional context to see if they're related.

I'm:
- Running a Node app with worker threads (usually 32 of them)
- Running multiple containers in Kubernetes

Each thread:
- Grabs 5 domains from my Postgres DB (of 5 million!)
- Loops through each domain
- Creates a new PlaywrightCrawler with uniquely named storages (to prevent collision / global deletion from crawlers in other threads)
- Queues the domain's home page
- Controllers then queue up some additional pages based on what's found on the home page
- The results are processed in real time and pushed to the database (since we don't want to wait until all 5M are complete)
- The thread-specific storages are then deleted using drop()

The problem: this works flawlessly... for about 60 minutes. After that, I get plagued with "Target page, context or browser has been closed". It first presents itself around the hour mark and then incrementally increases in frequency until I'm getting more failed records than successful (at which point I kill or restart the cluster).

What I've tried:
- browserPoolOptions like retireBrowserAfterPageCount: 100 and closeInactiveBrowserAfterSecs: 200
- await crawler.teardown(); in hopes that this would clear any sort of cache/memory that could be stacking up
- A cron to restart my cluster 🤣
- Ensuring the EBS volumes are not running out of space (they're 20GB each and seem to be ~50% full when crashing)
- Ensuring the pods have plenty of memory (running EC2s with 64GB memory and 16 CPUs / 32 threads); they seem to handle the load in the first hour just fine

I suspect there's a leak or a store not being cleared out, since it happens gradually?
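(A minimal sketch of the per-thread flow described above, assuming Crawlee's named storages and the browserPoolOptions mentioned in the post; threadId, domains, and saveToPostgres are placeholders, not the OP's actual code.)

```ts
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Placeholder for the real-time write to Postgres described above.
async function saveToPostgres(row: { url: string; title: string }): Promise<void> {
    // ...insert into the database...
}

async function crawlBatch(threadId: number, domains: string[]) {
    for (const domain of domains) {
        // Uniquely named queue so crawlers in other threads can't collide with
        // or purge this thread's storage.
        const requestQueue = await RequestQueue.open(`queue-${threadId}-${domain}`);

        const crawler = new PlaywrightCrawler({
            requestQueue,
            browserPoolOptions: {
                retireBrowserAfterPageCount: 100,
                closeInactiveBrowserAfterSecs: 200,
            },
            async requestHandler({ request, page, enqueueLinks }) {
                // Queue additional pages discovered on the home page.
                await enqueueLinks({ strategy: 'same-domain' });
                // Push results in real time rather than waiting for all 5M domains.
                await saveToPostgres({ url: request.url, title: await page.title() });
            },
        });

        await crawler.run([`https://${domain}`]);

        // Drop the thread-specific storage once the domain is done.
        await requestQueue.drop();
    }
}
```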
14 Replies
adverse-sapphire
adverse-sapphire•15mo ago
thanks @Joshua Perk for joining me in my........puzzle? haha. I'll elaborate a bit more.
* The environment is Docker running the crawler on Node.js 20+ with crawlee 3.9.2
* The Docker image is mcr.microsoft.com/playwright:v1.42.1-amd64
* Running on 12 cores / 48 GB memory / CRAWLEE_MEMORY_MBYTES=32768
* We also have named storages (key-value stores and queues)
* For us, it doesn't seem to be a one-hour mark; I've had some instances fail after 2 days and 45,000 requests, while the one yesterday failed after 10k requests and about 6 hours
* We are utilizing proxyConfiguration with about 25 proxies

Settings we're using:
maxRequestRetries: 3,
maxConcurrency: 15,
maxRequestsPerMinute: 150,
maxRequestsPerCrawl: 35000,
requestHandlerTimeoutSecs: 180,
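(For reference, a sketch of how settings like these, plus the proxy pool mentioned above, are typically passed to a PlaywrightCrawler - the proxy URLs are placeholders, not the actual configuration from this thread.)

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy list standing in for the ~25 proxies mentioned above.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy-1:8080', 'http://user:pass@proxy-2:8080'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestRetries: 3,
    maxConcurrency: 15,
    maxRequestsPerMinute: 150,
    maxRequestsPerCrawl: 35000,
    requestHandlerTimeoutSecs: 180,
    async requestHandler({ page, request }) {
        // ...collect image and page URLs here...
    },
});
```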
fascinating-indigo
fascinating-indigoOP•15mo ago
Almost identical setup! I'm going to keep trying different configurations/ideas (and will share back if I find anything). I wonder if your variation in results is a clue of any sort....

Do you store fairly consistent amounts of data in each request? If so, crashing at vastly different points would point me away from memory/storage issues and almost more towards... site-specific errors?

I'm trying to understand a bit more about how Crawlee initiates browsers / clears storage. You'd think if the page was no longer available, just that request would fail and the next one would open a fresh browser and be just fine. When you start to see the error, does it only happen once, or does it plague all the threads eventually until the process is basically useless?

Also, we're calling const crawler = new PlaywrightCrawler() inside our loop (i.e. it's not a single crawler that stays alive for the entire thread). Is that your approach too?
adverse-sapphire
adverse-sapphire•15mo ago
Do you store fairly consistent amounts of data in each request? If so, crashing at vastly different points would point me away from memory/storage issues and almost more towards... site-specific errors?
It's not stored in a database until the crawler completes. It's only image URL and page URL data. So it's stored in memory (RAM) because we are doing some deduplication logic, and then it's sent to the database (Firebase RTDB) upon crawler completion.
When you start to see the error does it only happen once or does it plague all the threads eventually until the process is basically useless?
It only happens once and crashes.
Also, we're calling const crawler = new PlaywrightCrawler() inside our loop (ie. it's not a single crawler that stays alive for the entire thread). Is that your approach too?
No, we're calling it for each site that gets submitted. It's a customer flow, so a URL is submitted and we have a process listening for new submissions. These submissions trigger the crawler function. So it's a single crawler for each URL, I would say.
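(A rough sketch of that per-submission flow - in-memory dedup, then a single write on crawler completion. submittedUrl and saveToFirebase are placeholders, not the actual code from this thread.)

```ts
import { PlaywrightCrawler } from 'crawlee';

// Placeholders for the customer-submitted URL and the Firebase RTDB write.
declare const submittedUrl: string;
declare function saveToFirebase(rows: { pageUrl: string; imageUrl: string }[]): Promise<void>;

const seen = new Set<string>();
const results: { pageUrl: string; imageUrl: string }[] = [];

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const imageUrls = await page.evaluate(() =>
            Array.from(document.querySelectorAll('img'), (img) => img.src),
        );
        for (const imageUrl of imageUrls) {
            if (seen.has(imageUrl)) continue; // dedupe in RAM, as described above
            seen.add(imageUrl);
            results.push({ pageUrl: request.url, imageUrl });
        }
    },
});

await crawler.run([submittedUrl]);
await saveToFirebase(results); // single write once the crawler finishes
```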
adverse-sapphire
adverse-sapphire•15mo ago
I'm trying to understand a bit more about how Crawlee initiates browsers / clears storage. You'd think if the page was no longer available, just that request would fail and the next one would open a fresh browser and be just fine.
Are you using postNavigationHooks and/or preNavigationHooks? We're using both. It seems to me that during one of these the browser is closed - I can't seem to pinpoint the original trigger for this error.
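(For anyone following along, a minimal sketch of what such hooks look like on a PlaywrightCrawler, with a hypothetical guard that logs when the page is already closed by the time the hook runs - illustrative only, not the code from this thread.)

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request, log }, gotoOptions) => {
            // If the browser was retired between scheduling and navigation,
            // the page may already be closed here - log it to help find the trigger.
            if (page.isClosed()) {
                log.warning(`Page already closed before navigation: ${request.url}`);
                return;
            }
            gotoOptions.waitUntil = 'domcontentloaded'; // example tweak only
        },
    ],
    postNavigationHooks: [
        async ({ page, request, log }) => {
            if (page.isClosed()) {
                log.warning(`Page closed right after navigation: ${request.url}`);
            }
        },
    ],
    async requestHandler({ page }) {
        // ...normal handler logic...
    },
});
```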
fascinating-indigo
fascinating-indigoOP•15mo ago
@Pepa J, you had asked for more context in a separate thread. Ever seen this before?
fascinating-indigo
fascinating-indigoOP•15mo ago
We're not using pre/post hooks but that's a good point... hmm....
adverse-sapphire
adverse-sapphire•15mo ago
hehe, then there is something in the requestHandler, I might say... because we recently updated our crawler logic for thoroughness. Previously, we were not doing any page.evaluate(), page.waitForTimeout(3000), or page scrolling in the requestHandler - so it might be in there, then. 🤔
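(A sketch of what a requestHandler with those calls might look like, with a try/catch around the page interactions so the failing step shows up in the logs - purely illustrative, not the actual handler from this thread.)

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log, pushData }) {
        // Every call that touches the page is a point where
        // "Target page, context or browser has been closed" can surface.
        try {
            await page.waitForTimeout(3000); // settle time, as mentioned above
            // Scroll to the bottom to trigger lazy-loaded content.
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            const imageCount = await page.evaluate(() => document.images.length);
            await pushData({ url: request.url, imageCount });
        } catch (err) {
            // Logging which step/URL failed can help pinpoint the trigger.
            log.error(`Handler failed for ${request.url}: ${(err as Error).message}`);
            throw err; // rethrow so Crawlee retries per maxRequestRetries
        }
    },
});
```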
fascinating-indigo
fascinating-indigoOP•15mo ago
Ahhhhhh I'm going to look into ours! Btw, who'd you choose for proxies?
adverse-sapphire
adverse-sapphire•15mo ago
right now we are using https://instantproxies.com/
I guess we are outta luck, @Joshua Perk? 😄
plain-purple
plain-purple•14mo ago
hey y'all, I'm running things at a similar scale - we do about a million scraped pages a month and I run into all kinds of issues.
1. Make sure you're calling await page.close() at the end of each one.
2. Make sure there are no missing awaits or things that might tell Crawlee to close the page.
3. I've had issues specifically with some Chrome flags I had enabled that would make this worse. Are you using Chrome flags, and are you using headless: new, true, or false?
btw @Joshua Perk love the concept of your company, hope it works! attribution is important.
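(For context on point 3, a sketch of where headless mode and Chrome flags get set on a PlaywrightCrawler - the flag shown is an example only, not a recommendation from this thread.)

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true, // the default when not set explicitly
            args: [
                // Example flag only - the point above is that some flags can
                // destabilize the browser and cause "target closed" errors.
                '--disable-dev-shm-usage',
            ],
        },
    },
    async requestHandler({ page, request }) {
        // ...make sure every page call here is awaited...
    },
});
```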
adverse-sapphire
adverse-sapphire•12mo ago
thanks for the details @bmax - I'm not setting headless, so I think it defaults to true, and I'm not using any Chrome flags.
Quick update: interestingly, we've updated to crawlee 3.11.0 and the crashes seem to have gone away - it hasn't crashed for weeks.
Oleg V.
Oleg V.•12mo ago
Yeah, there was some bug in old crawlee versions. It's always a good practice to use the latest version (now it's 3.11.3)
