exotic-emerald
exotic-emerald•17mo ago

Requests timing out - best practices?

Hello everyone! I'm trying to scrape a grocery store website and I'm running into some difficulties. I'm using Playwright/Crawlee and running on the Apify platform. Any assistance would be greatly appreciated! I have a huge number of URLs to use as starting points for my scrape, and I'm initiating the crawl with something like this (note: startUrls is an array containing several hundred URLs):
await crawler.run(startUrls);
Then, in the router.addDefaultHandler callback, I scroll through each page and enqueue more links. What I'm trying to do is quite extensive and I expect the scrape to take many hours. When I run the scraper, it works well up to a point, but then I start getting more and more errors like:
- "PlaywrightCrawler: Reclaiming failed request back to the list or queue. requestHandler timed out after 30 seconds"
- WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.
- "Reclaiming failed request back to the list or queue. page.goto: net::ERR_SOCKET_NOT_CONNECTED"
- "PlaywrightCrawler: Reclaiming failed request back to the list or queue. requestHandler timed out after 30 seconds"
- WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.
- "Reclaiming failed request back to the list or queue. page.goto: net::ERR_SOCKET_NOT_CONNECTED"
And eventually, the entire thing grinds to a halt with something like:
2024-05-05T14:49:15.152Z /home/myuser/node_modules/playwright-core/lib/server/chromium/crPage.js:492
2024-05-05T14:49:15.156Z this._firstNonInitialNavigationCommittedReject(new Error('Page closed'));
2024-05-05T14:49:15.158Z ^
2024-05-05T14:49:15.160Z
2024-05-05T14:49:15.162Z Error: Page closed
2024-05-05T14:49:15.164Z at FrameSession.dispose (/home/myuser/node_modules/playwright-core/lib/server/chromium/crPage.js:492:52)
[To be continued...]
5 Replies
exotic-emerald
exotic-emeraldOP•17mo ago
[Continued from above] I'm wondering if there are some best practices I'm missing here. It seems like I'm being throttled by the website. I tried switching my proxy to residential (which I do have a subscription for) and it doesn't seem to help, unfortunately. I'm reproducing the code below in case I'm doing something wrong.
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
// `router` is created elsewhere in the Actor with createPlaywrightRouter().

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

// Note: these proxy URLs are never passed to the crawler below
// (and newUrl() returns a Promise in SDK v3, so they would need an await).
const proxyUrl = proxyConfiguration?.newUrl();
const proxyUrl2 = proxyConfiguration?.newUrl();
const proxyUrl3 = proxyConfiguration?.newUrl();

const crawler = new PlaywrightCrawler({
    // https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions
    proxyConfiguration,
    requestHandler: router,
    requestHandlerTimeoutSecs: 30,
    maxRequestRetries: 4,
});
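For what it's worth, one thing I could add to confirm the residential proxy is actually being used is logging proxyInfo from the crawling context in a pre-navigation hook. Just a sketch; the log wording is mine:

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: router,
    requestHandlerTimeoutSecs: 30,
    maxRequestRetries: 4,
    preNavigationHooks: [
        async ({ request, proxyInfo, log }) => {
            // proxyInfo is populated by Crawlee whenever proxyConfiguration is set.
            log.info(`Navigating to ${request.url} via proxy ${proxyInfo?.url}`);
        },
    ],
});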
lemurio
lemurio•17mo ago
It looks like the request handler is timing out; try increasing the timeout using requestHandlerTimeoutSecs. You could also take a look at infiniteScroll, which might be helpful in your case.
playwrightUtils | API | Crawlee
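Roughly, a sketch of both suggestions combined. The timeout values and the 'a[href]' selector are just placeholders to adapt, and proxyConfiguration is the one defined earlier in the thread:

import { PlaywrightCrawler, playwrightUtils, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ page, enqueueLinks }) => {
    // Scroll until no new content loads (or the placeholder 30 s budget runs out).
    await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });
    await enqueueLinks({ selector: 'a[href]' }); // placeholder selector
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: router,
    // More headroom than the 30 s that is currently timing out.
    requestHandlerTimeoutSecs: 120,
    // Navigation has its own budget (the "Navigation timed out after 60 seconds" warning).
    navigationTimeoutSecs: 120,
    maxRequestRetries: 4,
});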
exotic-emerald
exotic-emeraldOP•16mo ago
I appreciate that, but why would 30 seconds not be enough to load a basic webpage? I am afraid some kind of throttling is going on.
lemurio
lemurio•16mo ago
It could be many factors; maybe try switching to datacenter proxies to see if that helps.
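A minimal sketch of that switch, assuming the default Apify datacenter pool; the concurrency limits are just an extra guess in case the site is rate-limiting:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

const proxyConfiguration = await Actor.createProxyConfiguration({
    // Omitting `groups` uses Apify's automatic datacenter proxies instead of RESIDENTIAL.
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: router,
    // Optional: ease off the target site in case the errors come from rate limiting.
    maxConcurrency: 5,
    maxRequestsPerMinute: 60,
});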
