stormy-gold•3y ago
Handle a 401 in errorHandler by detecting login form and gracefully continuing if present
Hello there!
I'm working on a page crawler that can handle logging into sites, and then crawling around as that user. We've had a lot of success so far with Crawlee (PuppeteerCrawler) by detecting the login in requestHandler, logging in, and then continuing with the crawl.
Recently we were asked to support "logging in" to a simple password protection screen on a Netlify site.
On navigation to the page, the page returns a 401 status code but renders the password login form. Because of the 401 status code, Crawlee calls the errorHandler. Inside that error handler, I'm able to detect the form and log in, but then I'm not sure how to rescue the crawl from that point.
I can enqueue links from the page, but the next request it tries to load gets the 401 error again. I'm guessing a bit, but I think the page is closed at the end of the errorHandler, and this causes me to lose my logged-in session?
Is there anything I can do to abort the error handling flow from errorHandler and let the crawl continue as normal with the same page session?
I attempted to add a code example but hit the message limit. I can try in a follow-up comment.
3 Replies
stormy-goldOP•3y ago
Here is a simplified version of our setup.
sensitive-blue•3y ago
Adding sessionPoolOptions: { blockedStatusCodes: [] } [1] to the crawler options may solve your problem.
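For context, a minimal sketch of what that option looks like in place. According to the linked docs, the session pool treats 401, 403 and 429 as "blocked" by default, so a 401 response is handled as an error before requestHandler runs; emptying the list should let the 401 page reach requestHandler, where the existing login detection can take over. The two `...Sketch` interfaces are hypothetical stand-ins so the snippet runs without Crawlee installed; in a real project the option types come from the `crawlee` package.

```typescript
// Hypothetical local types mirroring only the part of Crawlee's crawler
// options used here (assumption: real projects use the types from 'crawlee').
interface SessionPoolOptionsSketch {
    blockedStatusCodes?: number[];
}

interface CrawlerOptionsSketch {
    sessionPoolOptions?: SessionPoolOptionsSketch;
}

// An empty list means no status code is treated as "blocked" by the session
// pool, so a 401 page that renders a password form is not rejected up front.
const crawlerOptions: CrawlerOptionsSketch = {
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
};

// In an actual crawler this would be passed to the constructor, e.g.:
//   const crawler = new PuppeteerCrawler({ ...crawlerOptions, requestHandler });
```

If 403 and 429 should still be treated as blocked, passing `[403, 429]` instead of an empty array would presumably remove only 401 from the list.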
[1] https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
stormy-goldOP•3y ago
This looks like it will work, thank you!
This did work for my case, thanks again!