conscious-sapphire•2y ago

retireBrowserAfterPageCount has no effect for values above roughly 30 with Playwright

Hi, I am using Crawlee with Playwright. Some pages require a login, which works well using plain Playwright page actions. I am running into a problem, though: Crawlee opens a new browser roughly every 30 pages, which wipes all cookies and local storage and therefore breaks the login. Logging in again can be difficult and is sometimes technically impossible.

I tried using browserPoolOptions to prevent the launch of a new browser, like this:

browserPoolOptions: { retireBrowserAfterPageCount: 50, },

Setting the value to 1 works as expected, but setting it to 50 or higher has no effect: a new browser is still launched after around 30 pages. Is this known behavior or a bug? It would help greatly if I could run a crawl on one browser instance for several hundred pages, even if that is not recommended. Thank you for your help! 🙂
11 Replies
adverse-sapphire•2y ago
+1, facing the same issue. I need to stay logged in for scraping to work. I don't think disabling the browser refresh is a good idea; is there a better way to persist cookies between browser refreshes? One way I could think of is to use a post-launch hook to log in, but the docs mention something about having to wait for the post-launch hook to finish before the browser controller methods will work.
conscious-sapphireOP•2y ago
I had the idea of copying the cookies and local storage entries during a crawl and applying them to each new browser once it is launched. But I fear this could cause issues in some cases and might not be the most reliable approach. Simply sticking to one browser instance for longer would be easier and more robust.
adverse-sapphire•2y ago
Yeah, or have an option to persist cookies between them, like in sessions. Saving and loading every 30-40 requests would be a huge I/O bottleneck if you are scraping 1000+ requests, like in my case.
Lukas Krivka•2y ago
This is not a bug, but perhaps a gap in the docs. The browser is retired at the LATEST after 50 requests, but it can be retired sooner if the session is retired, usually because of errors. You can change that via the crawler's sessionPoolOptions.
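A minimal sketch of what that reply points at, assuming the goal is to keep one session (and so one browser) alive for several hundred pages. The option names (sessionPoolOptions, maxPoolSize, sessionOptions.maxUsageCount, maxErrorScore, maxAgeSecs, browserPoolOptions.retireBrowserAfterPageCount) are taken from Crawlee's documentation as I understand it. The local interfaces and all numeric values are illustrative assumptions, not recommended settings.

```typescript
// Local stand-in types so the sketch compiles without Crawlee installed.
// They only mirror the handful of options discussed in this thread.
interface SessionOptionsSketch {
    maxUsageCount?: number; // uses allowed before the session is retired
    maxErrorScore?: number; // error score at which the session is retired
    maxAgeSecs?: number;    // maximum session lifetime in seconds
}

interface CrawlerOptionsSketch {
    sessionPoolOptions?: {
        maxPoolSize?: number;
        sessionOptions?: SessionOptionsSketch;
    };
    browserPoolOptions?: {
        retireBrowserAfterPageCount?: number;
    };
}

// Illustrative values only: raise the session limits together with the
// browser page count, so that an early session retirement does not retire
// the browser before retireBrowserAfterPageCount is reached.
const crawlerOptions: CrawlerOptionsSketch = {
    sessionPoolOptions: {
        maxPoolSize: 1, // assumption: a single logged-in session is enough
        sessionOptions: {
            maxUsageCount: 500,
            maxErrorScore: 10,
            maxAgeSecs: 24 * 60 * 60,
        },
    },
    browserPoolOptions: {
        retireBrowserAfterPageCount: 500,
    },
};
```

In a real project this object would be spread into the PlaywrightCrawler constructor options next to the request handler. Whether a session survives that long still depends on how many errors it accumulates.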
adverse-sapphire•2y ago
Hey, so to pass cookies along to other browser sessions, would I set persistCookiesPerSession and useSessionPool to true?
adverse-sapphire•2y ago
I saw those, but I assumed that refreshing browsers is an optimization measure. So I just wanted to know whether the way to go is to use a single session indefinitely, or whether there is a workaround that lets us share cookies between browsers, because manually setting them on every browser refresh will be a big bottleneck on longer scrapes.
conscious-sapphireOP•14mo ago
@AltairSama2 Hey 🙂 Just wondering if you found a way to persist cookies, or the browser context in general, even when new browser instances are launched?
adverse-sapphire•14mo ago
Hey, no. I had to work around this by storing the cookies and then manually adding them after every subsequent browser refresh.
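The workaround described above might look roughly like the following sketch. CookieJar, CookieSketch, CookieContext and FakeContext are hypothetical names made up for illustration. The only Playwright behavior assumed is that a BrowserContext exposes cookies() and addCookies(), which is why the helper is written against a small interface with exactly those two methods. The cookies are kept in memory, so no file I/O happens between browser refreshes, and local storage is not covered.

```typescript
// Minimal cookie shape for this sketch; real Playwright cookies carry more fields.
interface CookieSketch {
    name: string;
    value: string;
    domain?: string;
    path?: string;
}

// The subset of a browser context that the helper needs.
interface CookieContext {
    cookies(): Promise<CookieSketch[]>;
    addCookies(cookies: CookieSketch[]): Promise<void>;
}

// Hypothetical helper: remembers the cookies of a logged-in context in memory
// and re-applies them to a fresh context after a browser refresh.
class CookieJar {
    private saved: CookieSketch[] = [];

    // Take a snapshot of the context's current cookies (call after login).
    async saveFrom(context: CookieContext): Promise<void> {
        this.saved = await context.cookies();
    }

    // Copy the snapshot into another context; returns how many cookies were added.
    async restoreTo(context: CookieContext): Promise<number> {
        if (this.saved.length === 0) return 0;
        await context.addCookies(this.saved);
        return this.saved.length;
    }
}

// In-memory stand-in for a browser context, used only to exercise the sketch.
class FakeContext implements CookieContext {
    private store: CookieSketch[] = [];

    async cookies(): Promise<CookieSketch[]> {
        return [...this.store];
    }

    async addCookies(cookies: CookieSketch[]): Promise<void> {
        this.store.push(...cookies);
    }
}
```

With Crawlee, one place to call restoreTo(page.context()) would be a pre-navigation hook, so that every page in a freshly launched browser gets the cookies before it loads. That placement is an assumption on my part, not something confirmed in this thread.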
conscious-sapphireOP•14mo ago
Thanks for the quick feedback! 🙂 Did you use the browserPool's preLaunchHooks to detect this, or did you simply set the cookies on the context before loading every page?
