deep-jade · 6mo ago

Redirect Control

I'm trying to make a simple crawler. How do I properly control redirects? Some bad proxies sometimes redirect to an auth page. In that case I want to mark the request as failed if the redirect URL (the target) contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
5 Replies
Hall · 6mo ago
Someone will reply to you shortly. In the meantime, this might help:
Alexey Udovydchenko
Session Management | Crawlee · Build reliable crawlers. Fast.
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.
deep-jade · OP · 6mo ago
So is each request a session? Say I send 3 URLs to crawl: would this mark them all as failed once the session is marked as bad? I think I may have explained myself poorly. This still lets the page navigate to the auth login page; my question was whether it's possible to prevent a redirect on the main document and retire the session when one happens.
Alexey Udovydchenko
Sessions are defined by the session pool, so when a request is blocked, mark its session as "bad" so that other requests do not continue with that session.
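To make the session-pool point concrete, here is a minimal configuration sketch. The option names (useSessionPool, persistCookiesPerSession, sessionPoolOptions, maxPoolSize, sessionOptions, maxErrorScore, maxUsageCount) are Crawlee crawler and session options as I understand them; the values are illustrative assumptions, not recommendations from this thread. A session is shared by many requests, so marking it bad affects which session later requests pick up, not the requests themselves.

```javascript
// Sketch of crawler options, assuming Crawlee's session pool settings.
// All numbers below are illustrative; tune them for your own proxies.
const crawlerOptions = {
    useSessionPool: true,           // reuse sessions (proxy + cookies) across requests
    persistCookiesPerSession: true, // keep cookies tied to the session that got them
    sessionPoolOptions: {
        maxPoolSize: 50,            // upper bound on sessions kept in rotation
        sessionOptions: {
            // With a threshold of 1, a single session.markBad() should be
            // enough to take the session out of rotation.
            maxErrorScore: 1,
            maxUsageCount: 30,      // rotate a session after this many uses
        },
    },
};
```

The object would be spread into the crawler constructor, for example `new CheerioCrawler({ ...crawlerOptions, requestHandler })`.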
Oleg V. · 6mo ago
You can do something like this:
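// Option 1: Use the failedRequestHandler
// Note: the error is passed as the second argument, not inside the context object.
failedRequestHandler: async ({ request, session }, error) => {
    // request.loadedUrl is the final URL after redirects, when a page was loaded
    const finalUrl = request.loadedUrl ?? request.url;
    if (error.message.includes('/auth/login') || finalUrl.includes('/auth/login')) {
        console.log(`Request redirected to auth page: ${finalUrl}`);
        // Mark the proxy as bad if you're using a session pool
        // (session.retire() drops the session immediately instead)
        session?.markBad();
        // You can retry with a different proxy if needed
        // (re-adding needs a new uniqueKey, otherwise the queue treats it as a duplicate)
        // request.retryCount = 0;
        // await crawler.addRequests([request]);
    }
},

// Option 2: Handle redirects in the request handler
requestHandler: async ({ request, response, $, crawler, session }) => {
    // request.url is the URL you enqueued; request.loadedUrl is where you ended up
    const finalUrl = request.loadedUrl ?? request.url;
    // Check if we were redirected to an auth page
    if (finalUrl.includes('/auth/login')) {
        console.log(`Detected auth redirect: ${finalUrl}`);
        // Mark the session as bad
        session?.markBad();
        // Throw an error to fail this request (it is retried up to maxRequestRetries)
        throw new Error('Redirected to auth page');
    }

    // Your normal processing code if not redirected
    // ...
},

// Option 3: Use the preNavigationHooks for Playwright (page.route is a Playwright API)
preNavigationHooks: [
    async ({ request, page, session }) => {
        // Set up interception before the navigation starts
        await page.route('**', async (route) => {
            const req = route.request();
            const url = req.url();
            // Only block main-document navigations, not sub-resources or iframes
            const isMainNavigation = req.isNavigationRequest() && req.frame() === page.mainFrame();
            if (isMainNavigation && url.includes('/auth/login')) {
                console.log(`Intercepted auth redirect: ${url}`);
                // Mark the session as bad
                session?.markBad();
                // Abort the navigation. If this is the navigation started by the crawler,
                // page.goto() rejects and the request fails. Throwing inside this callback
                // would not reach the crawler, so no throw is needed here.
                await route.abort();
            } else {
                await route.continue();
            }
        });
        // Caveat: Playwright may not pass HTTP 3xx redirect hops to route handlers,
        // so keep the Option 2 check as a fallback.
    },
],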
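The snippets above test for the auth page with includes('/auth/login'), which also matches when that string only appears in a query parameter. A stricter check compares the URL path instead. This is a minimal sketch; isAuthRedirect is a hypothetical helper name and '/auth/login' is just the path from the question.

```javascript
// Hypothetical helper (not part of Crawlee): returns true only when the
// final URL's path is the auth path or sits below it.
function isAuthRedirect(finalUrl, authPath = '/auth/login') {
    if (!finalUrl) return false;
    let pathname;
    try {
        // URL is a global in Node.js and browsers; it throws on non-absolute input
        pathname = new URL(finalUrl).pathname;
    } catch {
        return false; // not an absolute URL, so there is nothing to compare
    }
    return pathname === authPath || pathname.startsWith(`${authPath}/`);
}
```

In a handler you could then write `if (isAuthRedirect(request.loadedUrl)) { ... }` in place of the substring test.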
You can also use the maxRedirects option: https://crawlee.dev/api/next/core/interface/HttpRequest#maxRedirects and the followRedirect option: https://crawlee.dev/api/next/core/interface/HttpRequest#followRedirect
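If redirect following is turned off, the 3xx response itself becomes visible, and the redirect target can be read from its Location header before anything is loaded. Here is a rough sketch of that check. redirectTarget and shouldFailEarly are hypothetical helper names, and how a 3xx response surfaces in your handler depends on the crawler class, so treat the wiring into Crawlee as an assumption.

```javascript
// Hypothetical helpers (not part of Crawlee) for inspecting an unfollowed redirect.
const REDIRECT_CODES = [301, 302, 303, 307, 308];

// Returns the absolute redirect target, or null if the response is not a redirect.
function redirectTarget(statusCode, headers, requestUrl) {
    const location = headers.location;
    if (!REDIRECT_CODES.includes(statusCode) || !location) return null;
    try {
        // Location may be relative, so resolve it against the requested URL
        return new URL(location, requestUrl).href;
    } catch {
        return null; // malformed Location header
    }
}

// True when the redirect points at the blocked path (the auth page here).
function shouldFailEarly(statusCode, headers, requestUrl, blockedPath = '/auth/login') {
    const target = redirectTarget(statusCode, headers, requestUrl);
    return target !== null && new URL(target).pathname.startsWith(blockedPath);
}
```

When shouldFailEarly returns true, you can mark or retire the session and throw, so the request fails without the auth page ever being fetched.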
