Apify Discord Mirror


How can I wait to process further logic until all requests from a batch are processed

At a glance
The community member shares a code snippet and describes an issue with the crawling process: the processBatch function enqueues all the batches and calls processResults before the route handler has run, so userData.results does not exist yet. They are unsure whether to move the result-saving logic into the route handler, or whether there is a way to pause processBatch, let the route handler run, and then resume with processResults. In the comments, the community member posts a pseudo-algorithm of the expected behavior.
Hi

I have this code:
Plain Text
  async processBatch(batch) {
    // requests: {
    //   url: string;
    //   userData: CrawlerUserData;
    // }[]
    const requests = this.generateRequests(batch)
    await this.crawler.addRequests(requests)

    return this.processResults(requests)
  }
...
  async processResults(requests) {
    ...
    for (const request of requests) {
      const userData = request.userData as CrawlerUserData
      if (userData.error) {
        this.statistics.incrementErrors()
        continue
      }

      if (userData.results) {
        ...
        await this.saveResults(userData)
      }
    }

    return batchResults
  }


and this is my route handler:

Plain Text
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, log }) => {
  const userData = request.userData as CrawlerUserData
  try {
    await page.waitForLoadState('networkidle', { timeout: 5000 })

    const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)

    await analyzer.analyze(page) // executing callback

    userData.results = analyzer.results
    // Do I need to save the results here?
  } catch (error) {
    ...
  } finally {
    // Instead of closing the page, reset it for the next use
    await page.evaluate(() => window.stop())
    await page.setContent('<html></html>')
  }
})


The problem is that the whole body of processBatch runs to completion first: all batches are added to the requestQueue and processResults executes immediately, at which point it has no data because userData.results has not been created yet. So what I want to know is: do I need to move my logic for saving results to the DB into the route handler, or is there some way to pause this function, let the route handler run, and then resume with processResults?
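
Two details matter here, hedged since the full code is not shown. First, crawler.addRequests() resolves once the requests are enqueued, not once they are processed, which is why processResults runs before any results exist. Second, requests are serialized into the request queue, so the userData object the route handler mutates is a copy, not the object held in the requests array inside processBatch. A common pattern is therefore to persist results inside the handler, e.g. to the default dataset (a minimal sketch, assuming Crawlee v3, where the handler context exposes pushData):

Plain Text
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, pushData }) => {
  const userData = request.userData as CrawlerUserData
  await page.waitForLoadState('networkidle', { timeout: 5000 })

  const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)
  await analyzer.analyze(page)

  // Persist results here: mutations to request.userData are not visible
  // on the request objects held by processBatch, because the queue
  // stores a serialized copy of each request.
  await pushData({ url: request.url, results: analyzer.results })
})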

In a reply below I will paste a pseudo-algorithm of what I expect:
1 comment
Plain Text
  async processBatch () {
    1. generateRequests
    2. crawler.addRequests()
    3. await completion of the route default handler for all requests
    4. processResults
  }
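
One way to realize step 3 (a minimal sketch, not a confirmed solution): crawler.run(requests) resolves only after every request has been handled, unlike addRequests(). Assuming the route handler persists results with pushData as sketched above, and that the router lives in './routes.js' (a hypothetical path), processBatch could look like:

Plain Text
import { PlaywrightCrawler, Dataset } from 'crawlee'
import { router } from './routes.js'

const crawler = new PlaywrightCrawler({ requestHandler: router })

async function processBatch(batch: string[]) {
  // 1. generateRequests
  const requests = batch.map((url) => ({ url, userData: {} }))

  // 2 + 3. run() enqueues the requests and resolves only once the
  // default handler has finished with all of them. Note: re-running
  // one crawler instance per batch requires a Crawlee version that
  // supports repeated run() calls; otherwise create a fresh crawler
  // per batch (or pass all batches to a single run) instead.
  await crawler.run(requests)

  // 4. processResults: read back what the handler pushed.
  const dataset = await Dataset.open()
  const { items } = await dataset.getData()
  return items
}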