fascinating-indigo
fascinating-indigo2y ago

How to "store" and "retrieve" a browser on a per user basis?

I am working on a Crawlee-based crawler that performs actions on a per-user basis, so I want to keep a browser configuration for each user. I already store cookies and push them into the browser before each page is loaded on behalf of the user. But this all happens in the same browser, and I think that is throwing things off.
I run into problems where the cookies “go bad”. This could have NOTHING to do with the architecture. But it seems to me that every browser is “different”, and I think that mismatch might be causing the errors. Does anyone have thoughts on how to store browser state on a per-user basis? We also configure a sticky proxy/IP for each user where we can. Thanks!
8 Replies
fascinating-indigo
fascinating-indigoOP2y ago
Also, related to this: in our current/old system, we would open tabs for threaded execution. I'm not sure whether I need to do something similar with Crawlee, or how that would work.
fair-rose
fair-rose2y ago
For user profile isolation with the browser-based crawlers, you can use the launchContext.userDataDir option - this is basically a passthrough option for the Playwright / Puppeteer option of the same name (https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context-option-user-data-dir).
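import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: router, // `router` is your own request handler, defined elsewhere
    launchContext: {
        // Path to the folder where you want to store the per-user data.
        userDataDir: './user_data',
    },
});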
Regarding the "threaded" execution, Crawlee handles per-request concurrency automatically, so you don't really have to worry about it (it scales up and down based on the current system load).
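If you do want to put bounds on that automatic scaling, a minimal sketch could look like the following. `minConcurrency` and `maxConcurrency` are Crawlee crawler options; the values here are illustrative assumptions, and the object is meant to be spread into the `PlaywrightCrawler` constructor options.

```typescript
// Illustrative values only; tune them for your own workload.
const concurrencyOptions = {
    minConcurrency: 1, // lower bound for parallel requests
    maxConcurrency: 5, // upper bound the autoscaling will not exceed
};

// Intended use (assumes `crawlee` is installed and `router` is defined):
//   new PlaywrightCrawler({ requestHandler: router, ...concurrencyOptions });
console.log(concurrencyOptions);
```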
fascinating-indigo
fascinating-indigoOP2y ago
OK, that's super helpful. So it basically works with Playwright's options for this?
fair-rose
fair-rose2y ago
Yep, launchContext.userDataDir is just passed to Playwright afaik. You can pass more launch options to the browser (like CLI arguments) in launchContext.launchOptions (check out the TS type annotation in your IDE, it gives you all the options you can use)
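As a rough sketch of what that could look like, the helper below assembles a `launchContext` object with both `userDataDir` and `launchOptions`. The helper name `buildLaunchContext` and the specific flags are assumptions for illustration; the returned object is meant to be passed as `launchContext` to `PlaywrightCrawler`.

```typescript
// Hypothetical helper that assembles a `launchContext` for one profile folder.
function buildLaunchContext(userDataDir: string, extraArgs: string[] = []) {
    return {
        userDataDir,
        launchOptions: {
            headless: true,
            // CLI arguments forwarded to the browser process.
            args: ["--disable-gpu", ...extraArgs],
        },
    };
}

const launchContext = buildLaunchContext("./user_data", ["--lang=en-US"]);
console.log(launchContext.launchOptions.args);
```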
fascinating-indigo
fascinating-indigoOP2y ago
So, we're running in Kubernetes with multiple worker processes, so I'd need to mount this shared data directory into all of my workers so that each one can get to the correct path. We're already stuffing the browser with cookies before each retrieval and dumping them back after the page is loaded. The data directory would store the "other" stuff, I'd guess. That would help us keep things "clean" between users.
fair-rose
fair-rose2y ago
I don't think we ever tried anything like this, but yes - in theory, it should work like this 🙂 If you keep the mapping "one user = one userDataDir", you might even save yourself the hassle of injecting the cookies - the cookies are saved in the userDataDir (along with localStorage contents etc.). This also shows why you definitely shouldn't share the same userDataDir between multiple users 🙂 If you don't specify this option, Playwright generates a new ephemeral userDataDir for each script execution iirc.
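The "one user = one userDataDir" mapping could be sketched roughly like this. Everything named here (`BASE_DIR`, `userDataDirFor`, `crawlerOptionsFor`) is hypothetical, and the base path is an assumed mount point rather than anything from this thread; the returned options object is meant for `new PlaywrightCrawler(...)`, with one crawler per user.

```typescript
// Assumed mount point for the shared profile volume.
const BASE_DIR = "/data/browser-profiles";

// Hypothetical helper: maps a user ID to its own profile directory.
function userDataDirFor(userId: string, baseDir: string = BASE_DIR): string {
    // encodeURIComponent keeps path separators out of the directory name;
    // the "user-" prefix avoids special names such as "." or "..".
    return `${baseDir}/user-${encodeURIComponent(userId)}`;
}

// Builds per-user crawler options so that no two users share a profile.
function crawlerOptionsFor(userId: string) {
    return {
        launchContext: {
            userDataDir: userDataDirFor(userId),
        },
    };
}

console.log(crawlerOptionsFor("alice@example.com").launchContext.userDataDir);
```

Deriving the directory from the user ID, instead of storing a separate lookup table, means every worker computes the same path without coordination.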
fascinating-indigo
fascinating-indigoOP2y ago
OK, thanks for that. Yeah, we have initial cookies we'd need to inject, but otherwise that seems logical.
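For the initial cookies, one possible approach is to seed the browser context from a pre-navigation hook. The sketch below is an assumption about how that might be wired up: `injectInitialCookies`, `InitialCookie`, and `CookieTarget` are hypothetical names, while `addCookies` is the Playwright `BrowserContext` method and `preNavigationHooks` is the Crawlee option.

```typescript
// Minimal cookie shape; Playwright accepts more fields than shown here.
interface InitialCookie {
    name: string;
    value: string;
    domain: string;
    path: string;
}

// Structural type covering the one BrowserContext method used below.
interface CookieTarget {
    addCookies(cookies: InitialCookie[]): Promise<void>;
}

// Hypothetical helper: seeds a browser context with a user's initial cookies
// and resolves to the number of cookies that were passed along.
async function injectInitialCookies(target: CookieTarget, cookies: InitialCookie[]): Promise<number> {
    if (cookies.length === 0) return 0;
    await target.addCookies(cookies);
    return cookies.length;
}

// Intended use inside the crawler options (assumes `crawlee` is installed):
//   preNavigationHooks: [
//       async ({ page }) => { await injectInitialCookies(page.context(), initialCookies); },
//   ],
```

Note that this always writes the given cookies, so with a persistent profile it may overwrite fresher values already stored in the userDataDir; you may want to inject only on a user's first run.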
