extended-salmon · 5mo ago

Handling of 4xx and 5xx in default handler (Python)

I built a crawler for crawling websites and am now trying to add functionality to also handle error pages/links (4xx and 5xx). I wasn't able to find any documentation about that. So the question is: is this supported, and if so, in what direction should I look?
6 Replies
Hall · 5mo ago
Someone will reply to you shortly. In the meantime, this might help:
This post was marked as solved by rast42.
Mantisus · 5mo ago
Hey @rast42. Standard Crawlee has its own behavior for handling error statuses:
- 5xx causes a retry
- 403, 429 and 401 cause session rotation, if sessions are used
- other 4xx are marked as failed without retries

If you want to handle any statuses yourself, you can use ignore_http_error_status_codes.
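A minimal sketch of that approach, assuming a recent Crawlee for Python version where HttpCrawler is exported from crawlee.crawlers and the response status is available as context.http_response.status_code (the URL is hypothetical):

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    # Pass 404 through to the default handler instead of marking
    # the request as failed.
    crawler = HttpCrawler(ignore_http_error_status_codes=[404])

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        status = context.http_response.status_code
        if status == 404:
            context.log.info(f'Dead link: {context.request.url}')
        else:
            context.log.info(f'OK ({status}): {context.request.url}')

    await crawler.run(['https://crawlee.dev/this-page-does-not-exist'])


if __name__ == '__main__':
    asyncio.run(main())
```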
sunny-green · 5mo ago
Is it necessary to include all the codes in this setting, or can we set it to ignore all codes?
Mantisus · 5mo ago
You need to include them all. Something like:
list(range(400, 600))
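Plugged into the constructor, that would look roughly like this (a sketch, assuming the same HttpCrawler as above):

```python
from crawlee.crawlers import HttpCrawler

# Treat every 4xx/5xx status as a normal response, so nothing is
# auto-retried or marked as failed based on status code alone.
crawler = HttpCrawler(ignore_http_error_status_codes=list(range(400, 600)))
```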
sunny-green · 5mo ago
Crazy. Is there no better solution for overriding the error handling?
Mantisus · 5mo ago
Could you give examples of the kind of behavior you want to achieve? Perhaps error_handler is better for your case: https://crawlee.dev/python/api/class/BasicCrawler#error_handler
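For reference, a minimal sketch of error_handler from the linked BasicCrawler API; it runs after a request fails, before any retry. The failed_request_handler shown alongside it, which fires once retries are exhausted, comes from the same API class. Handler bodies and URL are illustrative assumptions:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # Runs after a request fails, before it is retried.
    @crawler.error_handler
    async def on_error(context, error):
        context.log.warning(f'{context.request.url} failed: {error!r}')

    # Runs once a request has exhausted all of its retries.
    @crawler.failed_request_handler
    async def on_failed(context, error):
        context.log.error(f'Giving up on {context.request.url}: {error!r}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```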
