Apify Discord Mirror

cryptorex
Offline, last seen 5 months ago
Joined August 30, 2024
Hello,

I've tried a lot to resolve this issue, from changing memory and concurrency to requests per minute, but I can't work out why it happens at random. I can't tell if I'm missing an await anywhere. More importantly, I'm not sure how to access the context with the current crawlee lib, i.e.: https://docs.apify.com/academy/node-js/how_to_fix_target-closed

Any guidance would be most helpful. My concern is that we're already looking to offload our crawlers to the Apify platform (already testing), so we want to get this working in our own environment first.

Currently on "crawlee": "3.9.2"

Code for reference attached.
4 comments
Hello,

Is there a method to access the “network” requests that are sent during the crawl?

I'm trying to store image URLs, currently via page.$$eval, however there are variations in how certain sites embed image URLs.

For example, lazy-loaded images trigger network requests that I can see in the DevTools Network tab.

Any way to access this and store it?
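
Roughly what I mean, as a sketch inside the requestHandler (the response listener is the part I'm unsure about and am assuming is the right hook):

Plain Text
// What I do today: read <img> src attributes from the DOM.
const imgSrcs = await page.$$eval('img', els => els.map(el => el.src));

// What I think I need for lazy-loaded images: watch the network instead.
// (Assumption on my part that Playwright's page.on('response') is the right hook.)
page.on('response', (response) => {
    const type = response.headers()['content-type'] || '';
    if (type.startsWith('image/')) {
        // store response.url() somewhere, e.g. push it to an array or a Dataset
    }
});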

Please let me know if my question isn’t clear.

Thanks!
2 comments
Hello Team,

I'm trying to crawl a page that has lazy-loaded images (loaded on scroll) and, on the first page, a button wired to a JS event that expands the list of "posts" on the page.

I'm trying to use the code below, however, it seems like the request queue never gets filled; the stats show 'requestsTotal: 0'.

Plain Text
async requestHandler({ request, page, enqueueLinks, enqueueLinksByClickingElements, infiniteScroll, log }) {

    // Extract links from the current page
    // and add them to the crawling queue.
    await enqueueLinks({
        ...<snip>
    });

    // Click the "Load More" button and enqueue the pages it reveals.
    await enqueueLinksByClickingElements({
        page,
        selector: '.js-page-load',
        requestQueue: rQueue,
    });

    // Scroll so lazy-loaded content gets a chance to appear.
    await infiniteScroll(page, { timeoutSecs: 3000, waitForSecs: 1000 });
}

I'm trying to target this:
Plain Text
<button class="pagination-load js-page-load" data-href="/page/2/">Load More <span></span></button>

On the first page... so far it seems like this button only exists on the first page.

I'm also using preNavigationHooks to read the network requests and store image links only. I don't know if this code should be in the preNavigationHooks instead; not sure. A simplified version of that hook is below. Thanks for your help as always.
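
This is only a sketch; the imageLinks collector is my own variable, and I'm assuming that attaching page.on('response') inside a preNavigationHook is a valid way to see those requests:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

// imageLinks is my own collector array, not a crawlee API.
const imageLinks = [];

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Attach the listener before navigation so lazy-loaded image
            // requests are captured as they happen.
            page.on('response', (response) => {
                const type = response.headers()['content-type'] || '';
                if (type.startsWith('image/')) imageLinks.push(response.url());
            });
        },
    ],
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});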
2 comments
Hello,

I'm currently researching methods to exclude URLs with query strings, for example: https://domain[.]com/path?query1=test&query2=test2

I've tried hooking into the enqueueLinks options like:

Plain Text
await enqueueLinks({ regexps: [new RegExp('^' + websiteURL + '[^?]+')] });

However, those URLs still seem to get through, because regexps isn't really an exclusion filter; it only defines what is allowed to match.

I"m using PlayrightCrawler via crawlee, but I think this would just be something I can do across all crawler engines. Please let me know of how I might achieve this or guide me to more research. Thanks Team!
3 comments
Hello, first some code:

crawl function
Plain Text
async function crawl(jobId, websiteURL, cb) {

    const crawler = new crawlee.PlaywrightCrawler({
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log }) {

            const element = await page.$$eval('img', as => as.map(a => a.src));
            if (element.length > 0) {
                for (const img of element) {
                    if (cb.indexOf(img) === -1) {
                        cb.push(img);
                    }
                }
            }

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        sessionPoolOptions: { persistStateKey: jobId, persistStateKeyValueStoreId: jobId },
    });

    await crawler.run([websiteURL]);
    await crawler.teardown();

    return cb;
}


setInterval calls this function
Plain Text
 
async function fetchImagesUrls(uid, jobId, websiteURL) {
    console.log("Fetching images...");

    const results = await crawl(jobId, websiteURL, cb = []);
    console.log(results);

    return results;
}


Background: I'm calling fetchImagesUrls from a setInterval callback to simulate a 'cron job'. I purposely have setInterval pick up Job #1 (details are fetched from a DB), and once Job #1 starts, I make Job #2 available for processing.

Behavior: Job #1 and Job #2 are now running from two different calls; however, their results get mixed into each other.

I've tried useState() and my own callback (as shown here). Is there a way to isolate each new call to its own result set?

I understand I might be missing something regarding JS fundamentals, but some guidance would be much appreciated. Thanks!
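
For what it's worth, one isolation idea I've been looking at (just a sketch; it assumes that giving each job its own local results array and its own named RequestQueue keeps concurrent runs from sharing state, which I haven't verified):

Plain Text
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

async function crawl(jobId, websiteURL) {
    // Each job gets its own results array and its own named queue.
    const images = [];
    const requestQueue = await RequestQueue.open(jobId);

    const crawler = new PlaywrightCrawler({
        requestQueue,
        async requestHandler({ page, enqueueLinks }) {
            const srcs = await page.$$eval('img', els => els.map(el => el.src));
            for (const src of srcs) {
                if (!images.includes(src)) images.push(src);
            }
            await enqueueLinks();
        },
    });

    await crawler.run([websiteURL]);
    await crawler.teardown();

    return images;
}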
6 comments
Hello, this shouldn't take long.

Am I reading correctly (and I have tested this) that returning results via a promise or a callback isn't an option with this SDK (crawlee with new PlaywrightCrawler(), for example)?

Can we only write to Datasets and retrieve them later for use?
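
To be concrete, this is the pattern I mean (a sketch using the default Dataset; the example URL is just a placeholder):

Plain Text
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        // Results go into the Dataset rather than being returned to the caller.
        await Dataset.pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run(['https://example.com']);

// Retrieved later, after the run finishes.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(items);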
9 comments