zac•2mo ago

Avoiding Crawler Detection

Hi folks, I have a Crawlee Playwright script that runs locally without being blocked. However, when I deploy and run it on Apify, I get blocked. Do you have any suggestions on how to avoid this? Here's the error I'm seeing in the Apify console:

WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.

Here's my proxy configuration:

{
    useApifyProxy: true,
    groups: ['RESIDENTIAL'],
    countryCode: 'US'
}

8 Replies

azzouzana•2mo ago

Try using residential proxies. Oh, I see that you're already using residential proxies. Does it fail from the first call, or does it work at first and then start failing? Are you using proxies locally? Is it running inside Docker both locally and on Apify? Have you tried your own proxies?

zacOP•2mo ago

It gets one, then fails after that. Locally I'm not using proxies and I'm not using Docker. I haven't tried my own proxies.

zacOP•2mo ago

The way this crawler works is that it first grabs a bunch of ids, then tries to access the details of each id by assembling the URL where the details are located (e.g. mywebsite.com/resources/[id]). Grabbing the ids works, and getting the details of the first id works, but every URL after that fails.
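For readers following along, the two-step flow described above could be sketched roughly like this. It is only an illustration: the helper name buildDetailRequests, the 'DETAIL' label, and the https:// scheme are assumptions, and mywebsite.com is the placeholder domain from the post.

```typescript
// Hypothetical shape of one queued detail-page request.
interface DetailRequest {
    url: string;
    label: string;
    userData: { id: string };
}

// Assumption: the site is served over https at the placeholder domain from the post.
const BASE_URL = 'https://mywebsite.com/resources';

// Hypothetical helper: turns the scraped ids into request objects for the detail pages.
function buildDetailRequests(ids: string[]): DetailRequest[] {
    // Drop duplicate ids so each detail page is queued only once.
    const uniqueIds = ids.filter((id, index) => ids.indexOf(id) === index);
    return uniqueIds.map((id) => ({
        // Encode the id in case it contains characters that are not URL-safe.
        url: BASE_URL + '/' + encodeURIComponent(id),
        label: 'DETAIL',
        userData: { id },
    }));
}

const detailRequests = buildDetailRequests(['101', '102', '101']);
console.log(detailRequests.map((request) => request.url));
```

Objects of this shape (url, label, userData) can be handed to Crawlee's crawler.addRequests(). Keeping the id in userData makes it easier to see in the logs exactly which detail pages time out.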

BytePulse Labs•2mo ago

You are out of luck if all of the proxy IPs are blocked, but if it's just some captcha protection, you can try to solve it with a third-party service: https://docs.apify.com/academy/anti-scraping/techniques/captchas

azzouzana•2mo ago

What happens if you use the same proxies locally and on Apify? It's always best to have the same setup locally and in production, since that makes debugging easier. Running the scraper in Docker would help tremendously with that. Also, is the site using a specific anti-bot service?

zacOP•2mo ago

it's not captchas. I think it's just refusing to load the page for those IPs. I haven't tried using the apify proxies locally. I'll give that a shot!
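One way to try the same Apify proxies locally is to build the proxy URL by hand and pass it to the local crawler. The sketch below assumes the Apify Proxy connection-string convention of putting the groups and country into the username (groups-...,country-...) and connecting to proxy.apify.com:8000. Check that format against the Apify Proxy docs before relying on it. The helper name buildApifyProxyUrl and the placeholder password are made up for illustration.

```typescript
// Same shape as the proxy configuration posted at the top of the thread.
interface ApifyProxyInput {
    useApifyProxy: boolean;
    groups: string[];
    countryCode: string;
}

// Hypothetical helper. Assumption: the username encodes the proxy parameters as
// "groups-<A>+<B>,country-<CC>" and the proxy listens on proxy.apify.com:8000.
function buildApifyProxyUrl(input: ApifyProxyInput, password: string): string {
    const username = 'groups-' + input.groups.join('+') + ',country-' + input.countryCode;
    // Encode the password in case it contains characters that are special in URLs.
    return 'http://' + username + ':' + encodeURIComponent(password) + '@proxy.apify.com:8000';
}

const proxyUrl = buildApifyProxyUrl(
    { useApifyProxy: true, groups: ['RESIDENTIAL'], countryCode: 'US' },
    'my-proxy-password', // placeholder; use the proxy password from your Apify account
);
console.log(proxyUrl);
```

The resulting URL can be given to Crawlee via new ProxyConfiguration({ proxyUrls: [proxyUrl] }). If the local run then fails in the same way as the run on the platform, the proxy IPs are the likely cause rather than the platform environment.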

Matous•3w ago

Hey @zac, sorry for reaching out so late... Is your issue still open? If it is, I can help you with it. For now, here are some general facts:

- If you use Crawlee's crawler with an Apify proxy configuration, the proxy IPs should be used the same way locally and on the platform. Some IPs may be blacklisted, but residential proxies should generally work (try allowing some retries).
- Your error message says "Navigation timed out", meaning your request took too long (we don't know whether it got blocked or something else went wrong). You can extend this timeout with navigationTimeoutSecs in the crawler's initial options.
- The run on the platform is different mainly because your fingerprints will be different. You can find some anti-blocking measures in the Apify docs: https://docs.apify.com/academy/anti-scraping
- Everything also depends on the crawler you are using (Cheerio vs. Playwright vs...).

If you provide more details, I can help you with the solution.
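As a rough illustration of the timeout and retry suggestions above, the crawler options might look something like this. The option names are Crawlee crawler options, but the values are arbitrary starting points rather than recommendations from this thread. It is written as a plain object so the sketch runs without any dependencies.

```typescript
// Assumed starting values; tune them for the target site.
const crawlerOptions = {
    // The error in the original post shows the default 60-second navigation timeout.
    navigationTimeoutSecs: 120,
    // Allow more retries so a blocked or slow proxy IP can be swapped for another.
    maxRequestRetries: 5,
    // Sessions tie cookies to a proxy IP and let badly behaving ones be rotated out.
    useSessionPool: true,
    persistCookiesPerSession: true,
};

console.log(crawlerOptions);
```

In a real project these would be spread into the PlaywrightCrawler constructor together with the proxyConfiguration and the requestHandler. A longer timeout only helps if the pages are slow. If requests still time out at 120 seconds, blocking is the more likely explanation.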
