national-gold
national-gold3y ago

Handle a 401 in errorHandler by detecting login form and gracefully continuing if present

Hello there! I'm working on a page crawler that can log into sites and then crawl around as that user. We've had a lot of success so far with Crawlee (PuppeteerCrawler) by detecting the login form in requestHandler, logging in, and then continuing with the crawl. Recently we were asked to support "logging in" to a simple password-protection screen on a Netlify site. On navigation, the page returns a 401 status code but still renders the password login form. Because of the 401 status code, Crawlee treats the request as failed and calls errorHandler. Inside that handler I'm able to detect the form and log in, but I'm not sure how to save the crawl from that point. I can enqueue links from the page, but the next request the crawler loads gets the 401 error again. I'm guessing a little, but I think the page is closed at the end of errorHandler, which causes me to lose the logged-in session. Is there anything I can do to abort the error-handling flow from errorHandler and let the crawl continue as normal with the same page session? I attempted to add a code example but hit the message limit; I'll try in a follow-up comment.
3 Replies
national-gold
national-goldOP3y ago
Here is a simplified version of our setup.
import { PuppeteerCrawler, RequestQueue, puppeteerUtils } from 'crawlee'

// one request queue per crawled site
const queueWithName = await RequestQueue.open(url)

const launchContext = {
  userAgent: 'crawler/namedbot',
  launchOptions: {
    headless: true,
    args: ['--no-sandbox'],
    ignoreHTTPSErrors: true
  }
}

const crawler = new PuppeteerCrawler({
  launchContext,
  maxRequestsPerCrawl,
  maxConcurrency: 4,
  maxRequestRetries: 2,
  navigationTimeoutSecs: 30,
  persistCookiesPerSession: false,
  requestQueue: queueWithName,
  preNavigationHooks: [
    async ({ page }, gotoOptions) => {
      await puppeteerUtils.blockRequests(page)
      gotoOptions.waitUntil = 'domcontentloaded'
    }
  ],
  async errorHandler({ page, request, response, enqueueLinks }, error) {
    // detect login page; if present, log in
    // get links off the page and enqueueLinks
    // if no login page, continue with the normal error flow
  },
  async requestHandler({ enqueueLinks, page, request, response }) {
    // detect login page; if present, log in
    // get links off the page and enqueueLinks
  },
})

// add the first URL to the queue and start the crawl
await crawler.run([url])
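The login step both handlers share is roughly the sketch below. The selectors and the CRAWL_PASSWORD variable are placeholders for this post, not what we actually run; I'm assuming the Netlify password screen renders a single password input that submits back to the same URL.

// Rough sketch of the shared login helper, called from requestHandler
// (and currently from errorHandler for the 401 case).
// Selectors and the password source are placeholders.
async function maybeLogin(page) {
  const passwordInput = await page.$('input[type="password"]')
  if (!passwordInput) return false

  // fill in the password, submit the form, and wait for the protected page
  await passwordInput.type(process.env.CRAWL_PASSWORD)
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.click('button[type="submit"], input[type="submit"]'),
  ])
  return true
}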
adverse-sapphire
adverse-sapphire3y ago
Adding sessionPoolOptions: { blockedStatusCodes: [] } [1] to the crawler options may solve your problem. By default the session pool treats a 401 (along with 403 and 429) as a blocked session, which is why the request fails before your handlers get a chance to deal with the login form.
const crawler = new PuppeteerCrawler({
  sessionPoolOptions: { blockedStatusCodes: [] },
  // ...
});
[1] https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
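With 401 out of the blocked list, the password page should reach your requestHandler instead of errorHandler, so you can log in there with the same page and session. A minimal sketch, assuming maybeLogin is whatever form-detection-and-login logic you already have:

const crawler = new PuppeteerCrawler({
  sessionPoolOptions: { blockedStatusCodes: [] },
  async requestHandler({ page, response, enqueueLinks }) {
    // the 401 response now lands here rather than triggering errorHandler
    if (response?.status() === 401) {
      await maybeLogin(page) // placeholder for your existing login routine
    }
    await enqueueLinks()
  },
  // ...
});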
national-gold
national-goldOP3y ago
This looks like it will work, thank you! Update: it did work for my case, thanks again!
