Apify Discord Mirror

Updated 5 months ago

Crawlee Playwright Access to Network requests

At a glance

The community member wants to access the network requests made during a crawl in order to store image URLs. They are currently using page.$$eval, but it misses images on sites that use lazy loading or other techniques to embed image URLs. Another community member suggests adding a pre-navigation hook (preNavigationHooks) that sets a listener on the page's requests and captures the URLs of image resources. The original poster then shares the implementation they settled on, based on that suggestion.

cryptorex

Hello,

Is there a method to access the “network” requests that are sent during the crawl?

I’m trying to store image URLs and am currently using page.$$eval. However, sites vary in how they embed image URLs.

For example, lazy-loaded images are fetched through network requests, which I can see in the DevTools Network tab.

Any way to access this and store it?

Please let me know if my question isn’t clear.

Thanks!

2 comments

Pepa J

Hello,
Add a hook to the crawler's preNavigationHooks. In that pre-navigation hook, set a listener on the page's requests:

Plain Text

  // Log the URL of every request that Playwright classifies as an image.
  page.on('request', (req) => {
        if (req.resourceType() === 'image') {
            console.log(req.url());
        }
  });
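To make the idea concrete, here is a minimal sketch of the same listener wrapped in a helper that can be tried without launching a browser. The names attachImageListener, onImage, and fakePage are hypothetical and only for illustration; the only Playwright calls assumed are page.on('request', ...), req.resourceType(), and req.url(), as used in the snippet above.

```javascript
// Sketch: report the URL of every image request a page makes.
// `attachImageListener` and `onImage` are hypothetical names, not Crawlee or Playwright APIs.
// `page` is expected to be a Playwright Page; anything exposing on('request', fn) works here.
function attachImageListener(page, onImage) {
    page.on('request', (req) => {
        // resourceType() is how Playwright classifies the request (image, script, xhr, ...).
        if (req.resourceType() === 'image') {
            onImage(req.url());
        }
    });
}

// Stand-in for a Playwright page, used only to demo the helper without a browser.
const fakePage = {
    handlers: [],
    on(event, fn) { if (event === 'request') this.handlers.push(fn); },
    emit(req) { this.handlers.forEach((fn) => fn(req)); },
};

const imageUrls = [];
attachImageListener(fakePage, (url) => imageUrls.push(url));
fakePage.emit({ resourceType: () => 'image', url: () => 'https://example.com/a.png' });
fakePage.emit({ resourceType: () => 'script', url: () => 'https://example.com/app.js' });
console.log(imageUrls); // only the .png URL is collected
```

In a real crawler the helper would presumably be called from inside a pre-navigation hook with the page from the crawling context, so the listener is in place before navigation starts and lazy-loaded images are not missed.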

cryptorex

Thanks for your quick reply!

For anyone else, based on Pepa J's guidance I'm going with this:

Plain Text

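      // Note: `cb` (a results array) and `excludedImgUrls` (a pattern of URLs to skip)
      // are not shown here; they are presumably defined elsewhere in the poster's code.
      preNavigationHooks: [
        async (crawlingContext) => {
          const { page, request } = crawlingContext;
          // Record image requests made by the page, including lazy-loaded ones.
          page.on('request', (pageobj) => {
            const requestUrl = pageobj.url();
            if (pageobj.resourceType() === 'image' && requestUrl.match(/\.(webp|bmp|tif?f|png|jpe?g|gif|svg)$/i)) {
              if (requestUrl.match(excludedImgUrls) == null && requestUrl.length > 0) {
                cb.push({ imgurl: requestUrl, pageurl: request.url });
              }
            }
          });
        },
      ]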
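One thing worth checking with the extension test above: the regex is anchored with $ against the full URL, so an image served as photo.jpg?w=200 would not match. The sketch below is one possible variation, not the poster's code, that tests only the URL path and also skips duplicates. The names hasImageExtension, makeImageCollector, and excludedPattern are hypothetical, and the example URLs are made up.

```javascript
// Sketch: filter and store image URLs, ignoring query strings and duplicates.
// All names below are hypothetical and for illustration only.
const IMAGE_EXT = /\.(webp|bmp|tiff?|png|jpe?g|gif|svg)$/i;

function hasImageExtension(rawUrl) {
    try {
        // Test only the path, so "photo.jpg?w=200" still counts as a .jpg.
        return IMAGE_EXT.test(new URL(rawUrl).pathname);
    } catch {
        return false; // not a parseable absolute URL (for example, an empty string)
    }
}

function makeImageCollector(excludedPattern) {
    const seenKeys = new Set(); // de-duplicates the same image requested twice on one page
    const results = [];
    return {
        results,
        add(imgurl, pageurl) {
            if (!hasImageExtension(imgurl)) return;
            if (excludedPattern && excludedPattern.test(imgurl)) return;
            const key = `${pageurl} ${imgurl}`;
            if (seenKeys.has(key)) return;
            seenKeys.add(key);
            results.push({ imgurl, pageurl });
        },
    };
}

const collector = makeImageCollector(/tracking-pixel/);
collector.add('https://cdn.example.com/photo.jpg?w=200', 'https://example.com/');
collector.add('https://cdn.example.com/photo.jpg?w=200', 'https://example.com/'); // duplicate, skipped
collector.add('https://cdn.example.com/tracking-pixel.gif', 'https://example.com/'); // excluded
console.log(collector.results.length); // 1
```

Inside the request listener, collector.add(pageobj.url(), request.url) would take the place of the nested if checks and the cb.push call. Whether dropping the extension check entirely and relying on resourceType() alone is preferable depends on the sites being crawled.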
