national-gold•3y ago
Handle a 401 in errorHandler by detecting login form and gracefully continuing if present
Hello there!
I'm working on a page crawler that can handle logging into sites and then crawling around as that user. We've had a lot of success so far with Crawlee (PuppeteerCrawler) by detecting the login form in requestHandler, logging in, and then continuing with the crawl.
Recently we were asked to support "logging in" to a simple password protection screen on a Netlify site.
On navigation, the page returns a 401 status code but still renders the password login form. Because of the 401 status code, Crawlee treats the request as failed and calls the errorHandler. Inside that error handler I'm able to detect the form and log in, but then I'm not sure how to save the crawl from that point.
I can enqueue links from the page, but the next request it tries to load gets the 401 error again. I'm guessing a little, but I think the page is closed at the end of the errorHandler and this causes me to lose my logged-in session?
Is there anything I can do to abort the error handling flow from errorHandler and let the crawl continue as normal with the same page session?
I attempted to add a code example but hit the message limit. Can try in a follow up comment.
3 Replies
national-goldOP•3y ago
Here is a simplified version of our setup.
adverse-sapphire•3y ago
Adding sessionPoolOptions: { blockedStatusCodes: [] } [1] to the crawler options may solve your problem.
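A minimal sketch of where that option goes, assuming a PuppeteerCrawler as in the question. The options are built as a plain object so the snippet runs on its own; the isPasswordWall helper is hypothetical (not part of Crawlee), and the login steps are left to your existing requestHandler logic.

```typescript
// Options to spread into `new PuppeteerCrawler({...})` from the 'crawlee' package.
const crawlerOptions = {
    sessionPoolOptions: {
        // An empty list means no status code marks the session as blocked,
        // so a 401 response should reach requestHandler instead of errorHandler.
        blockedStatusCodes: [] as number[],
    },
};

// Hypothetical helper: call it in requestHandler with response?.status()
// to decide whether the password form has to be submitted first.
function isPasswordWall(status: number | undefined): boolean {
    return status === 401;
}

// Usage sketch (not executed here; needs crawlee and a browser):
// const crawler = new PuppeteerCrawler({
//     ...crawlerOptions,
//     async requestHandler({ page, response, enqueueLinks }) {
//         if (isPasswordWall(response?.status())) { /* detect form, log in */ }
//         await enqueueLinks();
//     },
// });
```

Keep in mind that an empty list also stops 403 and 429 from being treated as blocked. If you only want to let 401 through, blockedStatusCodes: [403, 429] is probably the safer choice.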
[1] https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
national-goldOP•3y ago
This looks like it will work, thank you!
This did work for my case, thanks again!