Apify Discord Mirror


requestHandler timed out

At a glance

The community member has a large scraper that covers over 200k pages and takes approximately 12 hours, but after 6 hours all requests start failing with a "requestHandler timed out after 30 seconds" error. Other community members suggest this is likely because the website is blocking the scraper due to the large number of requests, or because the proxy pool has been exhausted. They recommend slower scraping (less concurrency or added delays), as well as try/catch blocks and screenshots to investigate the timeout errors.

The community members discuss various approaches to handling the timeout errors, such as implementing a custom errorHandler to mark bad proxies and checking for specific error types like "TimeoutError". They also mention that the crawlee library has an automatic mechanism to drop blocked proxies, but it may not be triggered in this case since the error is a navigation timeout rather than an HTTP status error.

The community member eventually resolves the issue by aggressively cycling through proxies, though they still see some failed requests, which they suspect are related to the proxies.

Hello,
I have quite a big scraper; it goes over 200k pages and will take approximately 12 hours, but after 6 hours, for some reason, all the requests are getting this "requestHandler timed out after 30 seconds" error.

I don't think increasing the requestHandler timeout will solve it; maybe there is something else wrong that I don't get?
15 comments
That is most probably because the website started to block you once you had burnt through all the proxies, or it is just overloaded.
I would try slower scraping (less concurrency or some delays).
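A minimal sketch of that kind of throttling with crawlee's built-in crawler options (the numbers here are only illustrative, not tuned recommendations):

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Limit how many pages are processed in parallel.
    maxConcurrency: 10,
    // Cap the overall request rate so the target site is hit more gently.
    maxRequestsPerMinute: 120,
    requestHandler: async ({ page }) => {
        // ... scraping logic ...
    },
});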
The proxy pool is kind of huge because I pay per traffic
I don't think it blocked a few hundred thousand IPs.
I will try slower
I'm using incognito windows and each request gets its own proxy, so the IPs repeat very rarely.
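For context, that kind of setup might look roughly like this in crawlee (a sketch; the ProxyConfiguration URL is a placeholder for whatever the pay-per-traffic provider hands out):

JavaScript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder endpoint; pay-per-traffic providers typically expose a rotating gateway URL.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    launchContext: {
        // Open every page in its own incognito context so cookies and cache are not shared.
        useIncognitoPages: true,
    },
    requestHandler: async ({ page }) => {
        // ... scraping logic ...
    },
});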
Also, maybe try to add a try/catch and take a screenshot in case of a timeout, to check what is happening.
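A rough sketch of that try/catch-plus-screenshot idea inside the requestHandler (note it only catches errors thrown inside the handler itself; the key naming and full-page option are just illustrative):

JavaScript
import { PuppeteerCrawler, KeyValueStore } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // ... normal scraping logic ...
        } catch (error) {
            // Capture what the page looked like at the moment of failure.
            const screenshot = await page.screenshot({ fullPage: true });
            const store = await KeyValueStore.open();
            await store.setValue(`error-${Date.now()}`, screenshot, { contentType: 'image/png' });
            log.error(`Handler failed for ${request.url}: ${error.message}`);
            throw error; // rethrow so crawlee still retries the request
        }
    },
});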
Great idea!
thank you.
Will try it a bit later; right now I'm stuck with a different project.
Hey, how do I handle these timeout errors? I have tried using a try/catch and nested my entire route handler inside it, but it's still not triggered and I keep getting these timeout errors:
Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.
Do I need to do this in the postNavigationHooks? The use case is to rotate IPs in my proxy if a request times out.
There is an automatic mechanism in crawlee that stops using proxies that are being blocked, but since your requests end with a timeout instead of an HTTP status error, it might not be triggered.

I suggest you implement your own errorHandler, and in case your request ends due to a timeout during navigation, you can call session.markBad() (see https://docs.apify.com/sdk/js/docs/guides/session-management ).

Another possibility is that the mechanism already works, but all the proxies from your proxy pool were already used and blocked, so rotating to new ones doesn't change much.
Hey, yeah, that's the issue. I have been trying to handle this; the only problem is that it's a navigation error and not a request one, so routes won't handle it. The only way I figured I can handle it is in postNavigation hooks, but those only take the crawling context as an argument, and I'm not sure how to check for this specific error where the navigation itself is taking too long.
Did you try to set up your own errorHandler?
JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // ...
    // Runs whenever a request fails, before crawlee retries it.
    errorHandler: async ({ page, session, log }, error) => {
        // inspect `error` here, e.g. log it and mark the session as bad
    },
    requestHandler: async ({ session, page }) => {
        // ...
    },
});
No, but I'll check it out, thanks. I have a custom logging solution but not an error handler.
Just one thing: since crawlee is not exporting TimeoutError anywhere, do I have to manually check for it, like error.name === "TimeoutError"? And if I do add my own errorHandler, crawlee's default settings won't get affected, right? Though from the documentation it seems this hook is specifically exposed to us for explicitly modifying the request object before we retry it.
What do you mean by crawlee's default settings? I believe adding the condition for the error name and calling session.markBad() should improve the situation; of course, you can always test more possible scenarios.
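For reference, a minimal sketch of such an errorHandler (the exact error check is an assumption; match it against the errors you actually see in your logs):

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ session, request, log }, error) => {
        // Navigation/handler timeouts don't come with a blocked HTTP status,
        // so flag the session (and the proxy tied to it) manually.
        if (error.name === 'TimeoutError' || /timed? out/i.test(error.message)) {
            log.warning(`Timeout on ${request.url}, marking the session as bad.`);
            session?.markBad();
        }
    },
    requestHandler: async ({ session, page }) => {
        // ... scraping logic ...
    },
});

markBad() only raises the session's error score; once it crosses the pool's threshold, crawlee retires that session and rotates to a different proxy on the retry.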
Thanks, this is what I ended up doing: aggressively cycling through proxies. There are still some failed requests, but I'm guessing that's on the proxy's side now.
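For completeness, aggressive cycling can be approximated in crawlee by retiring sessions and browsers quickly; the thresholds below are made-up examples rather than recommended values:

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    sessionPoolOptions: {
        sessionOptions: {
            // Retire a session after a single error and after only a few uses.
            maxErrorScore: 1,
            maxUsageCount: 5,
        },
    },
    browserPoolOptions: {
        // Throw each browser away after a handful of pages so it relaunches with a fresh proxy.
        retireBrowserAfterPageCount: 5,
    },
    requestHandler: async ({ page }) => {
        // ... scraping logic ...
    },
});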