inland-turquoise•2y ago
How to keep sessions alive while crawling?
I'm using PuppeteerCrawler to crawl a site. I am crawling both public and authenticated pages.
The site has many subdomains, and each subdomain has its own session. Each session lasts 15 minutes and is refreshed whenever a request to an authenticated page is sent with the session cookies attached. If the session cookies are attached to a request to an unauthenticated page, the session expires.
Before I start crawling, I make POST requests to the login endpoint for each subdomain and store the returned session cookies in memory in a JavaScript Map. For requests that need to be authenticated, I get the session cookies out of the map and set them in a preNavigationHook (using page.setCookie).
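For context, here's a minimal sketch of that setup (the `requiresAuth` flag on userData and the keying by hostname are just how I'd illustrate it, not necessarily exactly what you have):

```js
import { PuppeteerCrawler } from 'crawlee';

// subdomain hostname -> array of Puppeteer cookie objects returned by the login POST
const sessionCookies = new Map();

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            const { hostname } = new URL(request.url);
            const cookies = sessionCookies.get(hostname);
            // Only attach cookies on pages that need authentication,
            // since the site kills the session otherwise.
            if (cookies && request.userData.requiresAuth) {
                await page.setCookie(...cookies);
            }
        },
    ],
    async requestHandler({ page, request }) {
        // ... scrape the page
    },
});
```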
My problem is, if there are a lot of requests in the queue, some of the sessions can expire by the time the crawler gets to those requests because they have been sitting for 15+ minutes. I could check the page to see if I am actually authenticated and then retry fetching the session cookies, but I am wondering if there is a better way.
3 Replies
Hi @bobw,
If I understand it correctly, you might wanna check the `expires` attribute on the cookie in preNavigationHooks, and if it is about to expire in the next few minutes you may send the POST request on your own, using got-scraping (for example), and manage the cookies on your own.
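Roughly something like this inside the hook, assuming the Map from your setup. The login URL, request body, and the `parseSetCookieHeaders` helper are placeholders you'd swap for your own; Puppeteer's cookie `expires` is a Unix timestamp in seconds (or -1 for session cookies):

```js
import { gotScraping } from 'got-scraping';

// Re-login if the stored session is missing or expires within the next 2 minutes.
async function ensureFreshSession(hostname) {
    const cookies = sessionCookies.get(hostname);
    const soon = Date.now() / 1000 + 2 * 60;
    const stillValid = cookies?.every((c) => c.expires === -1 || c.expires > soon);
    if (stillValid) return cookies;

    // Placeholder login request – adjust the URL, body, and cookie parsing to your site.
    const response = await gotScraping({
        url: `https://${hostname}/login`,
        method: 'POST',
        json: { username: process.env.USER, password: process.env.PASS },
    });
    const fresh = parseSetCookieHeaders(response.headers['set-cookie']); // your own parser
    sessionCookies.set(hostname, fresh);
    return fresh;
}
```

Then the preNavigationHook can `await ensureFreshSession(hostname)` before calling `page.setCookie`, so the queue sitting for 15+ minutes no longer matters.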
inland-turquoiseOP•2y ago
Thank you