deep-jade•6mo ago
Redirect Control
I'm trying to make a simple crawler. How do I properly control redirects? Some bad proxies sometimes redirect to an auth page; in that case I want to mark the request as failed if the redirect target URL contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
5 Replies
Someone will reply to you shortly. In the meantime, this might help:
Session Management | Crawlee · Build reliable crawlers. Fast.
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.
deep-jadeOP•6mo ago
So each request is a session? Say I send 3 URLs to crawl, would this mark them all as failed once the session is marked as bad? I think I might have explained myself incorrectly. This still lets the page navigate to the auth/login page; my question was whether it's possible to prevent a redirect on the main document and retire the session in case it happens.
Sessions are defined by the session pool. When a request gets blocked, mark its session as "bad" so subsequent requests don't keep using the blocked session.
You can do something like this:
You can also use the maxRedirects option: https://crawlee.dev/api/next/core/interface/HttpRequest#maxRedirects
And followRedirect: https://crawlee.dev/api/next/core/interface/HttpRequest#followRedirect