national-gold
national-gold3y ago

Handle a 401 in errorHandler by detecting login form and gracefully continuing if present

Hello there! I'm working on a page crawler that can log into sites and then crawl around as that user. We've had a lot of success so far with Crawlee (PuppeteerCrawler) by detecting the login form in requestHandler, logging in, and then continuing with the crawl. Recently we were asked to support "logging in" to a simple password-protection screen on a Netlify site. On navigation, the page returns a 401 status code but still renders the password login form. Because of the 401 status code, Crawlee treats the request as failed and calls errorHandler. Inside that handler I'm able to detect the form and log in, but I'm not sure how to save the crawl from that point. I can enqueue links from the page, but the next request the crawler loads gets the 401 error again. I'm guessing a little, but I think the page is closed at the end of errorHandler, which causes me to lose the logged-in session. Is there anything I can do to abort the error-handling flow from errorHandler and let the crawl continue as normal with the same page session? I attempted to add a code example but hit the message limit; I'll try in a follow-up comment.
3 Replies
national-gold
national-goldOP3y ago
Here is a simplified version of our setup.
import { PuppeteerCrawler, RequestQueue, puppeteerUtils } from 'crawlee'

// one request queue per crawled site
const queueWithName = await RequestQueue.open(url)

const launchContext = {
  userAgent: 'crawler/namedbot',
  launchOptions: {
    headless: true,
    args: ['--no-sandbox'],
    ignoreHTTPSErrors: true
  }
}

const crawler = new PuppeteerCrawler({
  launchContext,
  maxRequestsPerCrawl,
  maxConcurrency: 4,
  maxRequestRetries: 2,
  navigationTimeoutSecs: 30,
  persistCookiesPerSession: false,
  requestQueue: queueWithName,
  preNavigationHooks: [
    async ({ page }, gotoOptions) => {
      await puppeteerUtils.blockRequests(page)
      gotoOptions.waitUntil = 'domcontentloaded'
    }
  ],
  async errorHandler({ page, request, response, enqueueLinks }, error) {
    // detect login page; if present, log in
    // get links off the page and enqueueLinks
    // if no login page, continue with the normal error flow
  },
  async requestHandler({ enqueueLinks, page, request, response }) {
    // detect login page; if present, log in
    // get links off the page and enqueueLinks
  },
})

// add the first URL to the queue and start the crawl
await crawler.run([url])
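The login step both handlers share is roughly the sketch below. The selectors and the CRAWL_PASSWORD variable are placeholders for this post, not what we actually run; I'm assuming the Netlify password screen renders a single password input that submits back to the same URL.

// Rough sketch of the shared login helper, called from requestHandler
// (and currently from errorHandler for the 401 case).
// Selectors and the password source are placeholders.
async function maybeLogin(page) {
  const passwordInput = await page.$('input[type="password"]')
  if (!passwordInput) return false

  // fill in the password, submit the form, and wait for the protected page
  await passwordInput.type(process.env.CRAWL_PASSWORD)
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.click('button[type="submit"], input[type="submit"]'),
  ])
  return true
}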
adverse-sapphire
adverse-sapphire3y ago
Adding sessionPoolOptions: { blockedStatusCodes: [] } [1] to the crawler options may solve your problem. By default the session pool treats a 401 (along with 403 and 429) as a blocked session, which is why the request fails before your handlers get a chance to deal with the login form.
const crawler = new PuppeteerCrawler({
  sessionPoolOptions: { blockedStatusCodes: [] },
  // ...
});
[1] https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
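With 401 out of the blocked list, the password page should reach your requestHandler instead of errorHandler, so you can log in there with the same page and session. A minimal sketch, assuming maybeLogin is whatever form-detection-and-login logic you already have:

const crawler = new PuppeteerCrawler({
  sessionPoolOptions: { blockedStatusCodes: [] },
  async requestHandler({ page, response, enqueueLinks }) {
    // the 401 response now lands here rather than triggering errorHandler
    if (response?.status() === 401) {
      await maybeLogin(page) // placeholder for your existing login routine
    }
    await enqueueLinks()
  },
  // ...
});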
national-gold
national-goldOP3y ago
This looks like it will work, thank you! Update: it did work for my case, thanks again!
