Apify Discord Mirror

Updated 11 months ago

Reduce time between "PlaywrightCrawler: Starting the crawler." and the "requestHandler"

At a glance

The community member is experiencing a long delay between the "PlaywrightCrawler: Starting the crawler." log and the actual request being handled in their crawler. They suspect this could be related to the time it takes to connect to the proxy server or something else. The comments suggest several potential solutions:

- Profile the code to identify bottlenecks

- Increase the run's memory, set a proper "waitUntil" event, and use the blockRequests() utility function to block static assets and speed up the process

- Consider using the Cheerio crawler for higher performance

- The community member is already blocking static assets (except JavaScript) to enable rendering, and they plan to try using the "domcontentloaded" option instead of "load" to avoid timeouts due to the website's many scripts.

The community member also mentions that they are using page.route to intercept API requests and extract a bearer token, which may be disabling browser caching and causing additional delays. They are currently using Firefox instead of Chrome to reduce the chances of the website's WAF protection detecting their crawler.

My crawler has a long delay between the "PlaywrightCrawler: Starting the crawler." log and the actual request being handled. Could this be related to the time it takes to connect to the proxy server, or could it be something else?
4 comments
You can try profiling your code to see where the bottlenecks are.
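Before reaching for a full profiler, a few timestamps are often enough to see where the gap is. The sketch below is a hypothetical helper (createStopwatch is not part of Crawlee) and assumes Node 16+ where performance is a global:

```javascript
// Hypothetical helper (not a Crawlee API): records named marks and reports
// the milliseconds elapsed between consecutive marks.
// Assumes Node 16+, where `performance` is available as a global.
function createStopwatch(now = () => performance.now()) {
    const marks = [];
    return {
        mark(label) {
            marks.push({ label, at: now() });
        },
        report() {
            // Each entry is the time spent getting from the previous mark to this one.
            return marks.slice(1).map((m, i) => ({
                from: marks[i].label,
                to: m.label,
                ms: m.at - marks[i].at,
            }));
        },
    };
}
```

One way to use it: call mark('run-start') right before crawler.run(), mark('pre-navigation') inside a preNavigationHooks entry, and mark('handler') at the top of requestHandler, then log report(). Roughly speaking, the first interval should cover startup and browser launch, and the second should cover navigation, which is where proxy latency and page load would show up.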
What do you mean by "long delay"?

You are using a browser, so the speed is affected by many factors, including the run's memory setting, resource loading, and the time required to render the data.

  • You can increase the run's memory;
  • Try setting a proper "waitUntil" event and using the blockRequests() utility function:
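    preNavigationHooks: [
        // The first argument is the crawling context; gotoOptions is the second.
        async ({ blockRequests }, gotoOptions) => {
            // Block common static assets before navigation starts.
            await blockRequests();
            // Resolve navigation at DOMContentLoaded instead of waiting for "load".
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ]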

blockRequests:
https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests


  • Finally, you can try the Cheerio crawler for higher performance.
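To make the "waitUntil" and blockRequests() suggestions above easier to reuse, here is a minimal sketch of a hook factory. makePreNavigationHook and the '.mp4' pattern are hypothetical names for illustration; blockRequests on the crawling context and its extraUrlPatterns option are the parts that come from Crawlee. One caveat, as far as I know: blockRequests() talks to the browser over the Chrome DevTools Protocol, so it is only expected to have an effect in Chromium-based browsers.

```javascript
// Hypothetical factory (not a Crawlee API): builds a pre-navigation hook.
function makePreNavigationHook({ waitUntil = 'domcontentloaded', extraUrlPatterns = [] } = {}) {
    // Crawlee calls each hook with (crawlingContext, gotoOptions).
    return async ({ blockRequests }, gotoOptions) => {
        // Keeps the default blocked patterns and appends the extra ones.
        await blockRequests({ extraUrlPatterns });
        // Controls which page event ends the navigation wait.
        gotoOptions.waitUntil = waitUntil;
    };
}
```

It would then be passed as preNavigationHooks: [makePreNavigationHook({ extraUrlPatterns: ['.mp4'] })] in the PlaywrightCrawler options.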
By "long delay" I mean that it sometimes seems to take a long time before the crawler actually starts processing the requests. I'm already blocking the static assets (except the JS), because I need JS enabled to render the page.

I'll try domcontentloaded, because with the load option the page sometimes times out: the website has a lot of scripts to load :/

One more thing that might be costing time: I'm currently using page.route to intercept some of the API requests and extract the bearer token from them, but according to the docs, using page.route disables browser caching. Since the site has a lot of JS to load, I'd like to be able to cache it, but I haven't been able to achieve that either.
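If the token only needs to be read and the requests don't need to be modified, one possible alternative is to listen for Playwright's 'request' event instead of routing, since a passive listener does not intercept anything and so should not affect caching. A minimal sketch under that assumption: captureBearerToken, urlFragment, and timeoutMs are hypothetical names, and it assumes the token is sent in an Authorization header that request.headers() exposes.

```javascript
// Hypothetical helper: resolves with the first bearer token seen on a request
// whose URL contains `urlFragment`. It only listens; it never intercepts.
function captureBearerToken(page, urlFragment, timeoutMs = 30000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            page.off('request', onRequest);
            reject(new Error(`No bearer token seen within ${timeoutMs} ms`));
        }, timeoutMs);

        function onRequest(request) {
            if (!request.url().includes(urlFragment)) return;
            // Playwright reports header names in lower case.
            const auth = request.headers()['authorization'];
            if (!auth || !/^bearer /i.test(auth)) return;
            clearTimeout(timer);
            page.off('request', onRequest);
            // Strip the "Bearer " prefix and return only the token.
            resolve(auth.slice('bearer '.length));
        }

        page.on('request', onRequest);
    });
}
```

The listener has to be registered before navigation starts (for example in a pre-navigation hook), otherwise the API call may already have gone out by the time it is attached.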

Right now I'm using Firefox instead of Chrome, as it reduces the chances of the website's WAF protection detecting the crawler.