passive-yellow
passive-yellow3y ago

Mark session as bad when request times out or proxy responds with 502

I'm using CheerioCrawler and I'd like to mark sessions as bad when the request either times out or there's a proxy error. Those cases trigger an error before reaching requestHandler and the request is added back to the queue without me having the opportunity to mark the session. Is there a hook somewhere that I can use? Or should I override _requestFunctionErrorHandler?
16 Replies
fair-rose
fair-rose3y ago
I would like to know this as well
fascinating-indigo
fascinating-indigo3y ago
You can mark a session as bad by calling session.markBad() inside errorHandler. errorHandler runs every time a request fails, before it is retried, whereas failedRequestHandler runs only once, after the request has reached its max retries.
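import { CheerioCrawler } from 'crawlee';

// proxyConfiguration and router are assumed to be defined elsewhere.
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    // Runs on every failed attempt, before the request is retried.
    errorHandler: ({ session }) => {
        session.markBad();
    },
});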
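The handler above marks the session bad on every failure. For the case in the original question (only timeouts and proxy 502s), a minimal sketch could inspect the error, which Crawlee passes to errorHandler as the second argument. The names isTimeoutOrProxyError and markBadOnProxyFailure are hypothetical, and the message patterns are assumptions rather than strings guaranteed by Crawlee or a proxy provider, so check them against the errors you actually see.

```javascript
// Hypothetical helper: the message patterns below are assumptions,
// not strings guaranteed by Crawlee or by any proxy provider.
function isTimeoutOrProxyError(error) {
    const message = String(error && error.message ? error.message : error);
    return /timed out|timeout|\b502\b|bad gateway/i.test(message);
}

// Intended to be passed as `errorHandler` in the crawler options.
// It receives the crawling context first and the error second.
function markBadOnProxyFailure({ session }, error) {
    // `session` may be undefined if the session pool is not in use.
    if (session && isTimeoutOrProxyError(error)) {
        session.markBad();
    }
}

// Wiring, with the same options as in the snippet above:
// const crawler = new CheerioCrawler({
//     proxyConfiguration,
//     requestHandler: router,
//     errorHandler: markBadOnProxyFailure,
// });
```

Keeping the check in a separate function makes it easy to test with plain objects, without starting a crawler.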
But if you just want a session to be thrown away if it fails once, you can do this instead in the sessionPoolOptions:
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    sessionPoolOptions: {
        sessionOptions: {
            maxErrorScore: 1,
        },
    },
});
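To see why maxErrorScore: 1 discards a session after a single failure, here is a toy model of the scoring. ToySession is a hypothetical stand-in, not Crawlee's Session class, and the defaults used (maxErrorScore of 3, errorScoreDecrement of 0.5) reflect my understanding of the documented defaults, so treat them as assumptions.

```javascript
// Simplified stand-in for a session's error scoring.
// This is NOT Crawlee's actual Session implementation.
class ToySession {
    constructor({ maxErrorScore = 3, errorScoreDecrement = 0.5 } = {}) {
        this.maxErrorScore = maxErrorScore;
        this.errorScoreDecrement = errorScoreDecrement;
        this.errorScore = 0;
    }

    // Each bad mark adds one point to the error score.
    markBad() {
        this.errorScore += 1;
    }

    // Each good mark reduces the score, never going below zero.
    markGood() {
        this.errorScore = Math.max(0, this.errorScore - this.errorScoreDecrement);
    }

    // The session stops being usable once the score reaches the limit.
    isBlocked() {
        return this.errorScore >= this.maxErrorScore;
    }
}

const strict = new ToySession({ maxErrorScore: 1 });
strict.markBad();
console.log(strict.isBlocked()); // true

const lenient = new ToySession();
lenient.markBad();
console.log(lenient.isBlocked()); // false
```

With the lower limit, one bad mark is enough to block the session, while the assumed default tolerates a couple of failures first.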
passive-yellow
passive-yellowOP3y ago
Amazing, thank you @thek1tten, I didn't know about errorHandler.
One more question: how can I access the error in errorHandler? Is it passed as a parameter?
All good, I found my answer in the docs!
@thek1tten, can I prevent the request from being retried, depending on the error, from within errorHandler?
fascinating-indigo
fascinating-indigo3y ago
[Image attachment not preserved]
fascinating-indigo
fascinating-indigo3y ago
This should work
passive-yellow
passive-yellowOP3y ago
So I've tried that, but without success: the request still ends up being retried.
Is there any other way to prevent a retry? Maybe throwing a NonRetryableError?
passive-yellow
passive-yellowOP3y ago
As the logs show, I print the request right after setting request.noRetry to true in errorHandler, and the request is still retried right after.
[Image attachment: log output, not preserved]
fascinating-indigo
fascinating-indigo3y ago
Hmm, that means it’s going off of the old value and reassigning it here does nothing. Let me look into it.
passive-yellow
passive-yellowOP3y ago
Thanks!
fascinating-indigo
fascinating-indigo3y ago
This feature doesn’t seem to exist yet. I’m making a PR on Crawlee’s GitHub to fix this
passive-yellow
passive-yellowOP3y ago
Thank you @thek1tten let me know if there is a link to the issue that I can follow
fascinating-indigo
fascinating-indigo3y ago
GitHub
feat(basic-crawler): allow request skipping by mstephen19 · Pull Re...
See this Discord post to fully understand the use case: https://discord.com/channels/801163717915574323/1019936393235017769 Didn't want to make big changes to existing code so kept the else sta...
fascinating-indigo
fascinating-indigo3y ago
@fab8203 It was merged with master
passive-yellow
passive-yellowOP3y ago
Thank you for the follow up @thek1tten
Lukas Krivka
Lukas Krivka3y ago
Btw: retiring sessions and not retrying requests are two completely different concepts. Request and Session are separate objects that may be connected temporarily (for example, performing a Request using a given Session).
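As a rough sketch of keeping those two decisions separate in one errorHandler: the session is marked bad when the failure looks proxy-related, and the request is flagged with noRetry when retrying would not help. This assumes a Crawlee version that includes the pull request linked above, so that request.noRetry set inside errorHandler is respected. The names classifyFailure and handleFailure and the message patterns are hypothetical.

```javascript
// Hypothetical classification rules; the patterns are assumptions and
// should be adapted to the error messages you actually observe.
function classifyFailure(error) {
    const message = String(error && error.message ? error.message : error);
    return {
        // Timeout or proxy trouble: the session is the suspect.
        sessionIsSuspect: /timed out|\b502\b/i.test(message),
        // Permanent failure: retrying the same request would not help.
        skipRetry: /\b404\b|\b410\b/.test(message),
    };
}

// Intended to be passed as `errorHandler` in the crawler options.
function handleFailure({ request, session }, error) {
    const { sessionIsSuspect, skipRetry } = classifyFailure(error);

    // Session decision: affects which session later requests will use.
    if (sessionIsSuspect && session) {
        session.markBad();
    }

    // Request decision: affects only whether this request is retried.
    if (skipRetry) {
        request.noRetry = true;
    }
}
```

Either branch can fire without the other: a 502 penalizes the session but leaves the request eligible for a retry, while a 404 stops the retries without touching the session.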
