Apify Discord Mirror

cryptorex
Offline, last seen 5 months ago
Joined August 30, 2024
Hello,

I've tried a lot to resolve this issue, from changing memory and concurrency to requests per minute, but I can't work out why it happens at random. I can't tell if I'm missing an await anywhere. More importantly, I'm not sure how to access the context with the current crawlee lib, i.e.: https://docs.apify.com/academy/node-js/how_to_fix_target-closed

Any guidance would be most helpful. My concern is that we're already looking to offload our crawlers to the Apify platform (already testing), so we want to get this working in our own environment first.

Currently on "crawlee": "3.9.2"

Code for reference attached.
4 comments
Hello,

Is there a method to access the “network” requests that are sent during the crawl?

I'm trying to store image URLs, currently via page.$$eval, however there are variations in how certain sites embed image URLs.

For example, lazy-loaded images trigger network requests that I can see in the DevTools Network tab.

Any way to access this and store it?
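
Roughly what I mean, as a sketch inside the requestHandler (the response listener is the part I'm unsure about and am assuming is the right hook):

Plain Text
// What I do today: read <img> src attributes from the DOM.
const imgSrcs = await page.$$eval('img', els => els.map(el => el.src));

// What I think I need for lazy-loaded images: watch the network instead.
// (Assumption on my part that Playwright's page.on('response') is the right hook.)
page.on('response', (response) => {
    const type = response.headers()['content-type'] || '';
    if (type.startsWith('image/')) {
        // store response.url() somewhere, e.g. push it to an array or a Dataset
    }
});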

Please let me know if my question isn’t clear.

Thanks!
2 comments
Hello Team,

I'm trying to crawl a page that has lazy-loaded images (loaded on scroll) and, on the first page, a button wired to a JS event that expands the list of "posts" on the page.

I'm trying to use the code below, however, it seems like the request queue never gets filled; the stats show 'requestsTotal: 0'.

Plain Text
async requestHandler({ request, page, enqueueLinks, enqueueLinksByClickingElements, infiniteScroll, log }) {

    // Extract links from the current page
    // and add them to the crawling queue.
    await enqueueLinks({
        ...<snip>
    });

    // Click the "Load More" button and enqueue the pages it reveals.
    await enqueueLinksByClickingElements({
        page,
        selector: '.js-page-load',
        requestQueue: rQueue,
    });

    // Scroll so lazy-loaded content gets a chance to appear.
    await infiniteScroll(page, { timeoutSecs: 3000, waitForSecs: 1000 });
}

I'm trying to target this:
Plain Text
<button class="pagination-load js-page-load" data-href="/page/2/">Load More <span></span></button>

On the first page... so far it seems like this button only exists on the first page.

I'm also using preNavigationHooks to read the network requests and store image links only. I don't know if this code should be in the preNavigationHooks instead; not sure. A simplified version of that hook is below. Thanks for your help as always.
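
This is only a sketch; the imageLinks collector is my own variable, and I'm assuming that attaching page.on('response') inside a preNavigationHook is a valid way to see those requests:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

// imageLinks is my own collector array, not a crawlee API.
const imageLinks = [];

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Attach the listener before navigation so lazy-loaded image
            // requests are captured as they happen.
            page.on('response', (response) => {
                const type = response.headers()['content-type'] || '';
                if (type.startsWith('image/')) imageLinks.push(response.url());
            });
        },
    ],
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});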
2 comments
Hello,

I'm currently researching methods to exclude URLs with query strings, for example: https://domain[.]com/path?query1=test&query2=test2

I've tried hooking into the enqueueLinks options like:

Plain Text
await enqueueLinks({ regexps: [new RegExp('^' + websiteURL + '[^?]+')] });

However, those URLs still seem to get through, because regexps isn't really an exclusion filter; it only defines what is allowed to match.

I"m using PlayrightCrawler via crawlee, but I think this would just be something I can do across all crawler engines. Please let me know of how I might achieve this or guide me to more research. Thanks Team!
3 comments
Hello, first some code:

crawl function
Plain Text
async function crawl(jobId, websiteURL, cb) {

    const crawler = new crawlee.PlaywrightCrawler({
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log }) {

            const element = await page.$$eval('img', as => as.map(a => a.src));
            if (element.length > 0) {
                for (const img of element) {
                    if (cb.indexOf(img) === -1) {
                        cb.push(img);
                    }
                }
            }

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        sessionPoolOptions: { persistStateKey: jobId, persistStateKeyValueStoreId: jobId },
    });

    await crawler.run([websiteURL]);
    await crawler.teardown();

    return cb;
}


setInterval calls this function
Plain Text
 
async function fetchImagesUrls(uid, jobId, websiteURL) {
    console.log("Fetching images...");

    const results = await crawl(jobId, websiteURL, cb = []);
    console.log(results);

    return results;
}


Background: I'm calling fetchImagesUrls from a setInterval callback to simulate a 'cron job'. I purposely have setInterval pick up Job #1 (details are fetched from a DB), and once Job #1 starts, I make Job #2 available for processing.

Behavior: Job #1 and Job #2 are now running from two different calls; however, their results get mixed into each other.

I've tried useState() and my own callback (as shown here). Is there a way to isolate each new call to its own result set?

I understand I might be missing something regarding JS fundamentals, but some guidance would be much appreciated. Thanks!
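
For what it's worth, one isolation idea I've been looking at (just a sketch; it assumes that giving each job its own local results array and its own named RequestQueue keeps concurrent runs from sharing state, which I haven't verified):

Plain Text
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

async function crawl(jobId, websiteURL) {
    // Each job gets its own results array and its own named queue.
    const images = [];
    const requestQueue = await RequestQueue.open(jobId);

    const crawler = new PlaywrightCrawler({
        requestQueue,
        async requestHandler({ page, enqueueLinks }) {
            const srcs = await page.$$eval('img', els => els.map(el => el.src));
            for (const src of srcs) {
                if (!images.includes(src)) images.push(src);
            }
            await enqueueLinks();
        },
    });

    await crawler.run([websiteURL]);
    await crawler.teardown();

    return images;
}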
6 comments
Hello, this shouldn't take long.

Am I reading correctly (and I have tested this) that returning results via a promise or a callback isn't an option with this SDK (crawlee with new PlaywrightCrawler(), for example)?

Can we only write to Datasets and retrieve them later for use?
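
To be concrete, this is the pattern I mean (a sketch using the default Dataset; the example URL is just a placeholder):

Plain Text
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        // Results go into the Dataset rather than being returned to the caller.
        await Dataset.pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run(['https://example.com']);

// Retrieved later, after the run finishes.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(items);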
9 comments