3 replies

Running the same crawler in parallel

👨‍💻Web-Scraping

I've got a config file like

jobs:
  - name: "amazon.de"
    enabled: true

    crawler:
      id: "test"
      enabled: true
      config:
        urls:
          - "https://example.com"

  - name: "amazon.fr"
    enabled: true

    crawler:
      id: "test"
      enabled: true
      config:
        urls:
          - "https://example.com"

this config is processed via p-queue and, depending on the crawler id, I want to run a specific crawler, e.g.:

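import { PlaywrightCrawler } from "crawlee";
import { z } from "zod";

// presumably declared at module level, so every handler call shares this one instance
let crawler: PlaywrightCrawler | undefined;

export const testCrawler = createCrawler({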
  id: "test",

  configSchema: z.object({
    urls: z.array(z.string()),
  }),

  handler: async ({ urls }) => {
    if (!crawler) {
      crawler = new PlaywrightCrawler({
        async requestHandler({ request, log }) {
          log.info(`Processing: ${request.url}`);
        },
      });
    }
    await crawler.run(urls);
  },
});

different sites might use the same crawler but with a different requestHandler. currently when running this I get

This crawler instance is already running, you can add more requests to it via crawler.addRequests()

so it's not possible to spawn multiple crawlers of the same type at the same time? would kinda mess up my mental model (and the current impl) a bit. if so, I guess I need to "collect" all data before running the actual crawler? since different crawler "definitions" (e.g. testCrawler) require different configurations, this could get messy

Solution

Hey, there is a slight problem: your code first checks whether a crawler already exists (if (!crawler)) and creates one if not. But if there already is one, it still calls await crawler.run(urls) on that same instance. That is the issue: you can have multiple crawlers, but you can't have multiple crawler.run calls on the same crawler at the same time.
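
To make the "one crawler per run" idea concrete, here is a minimal sketch, not the asker's actual code. CrawlerLike, StubCrawler, runJob and makeCrawler are hypothetical names introduced only for illustration; StubCrawler is a stand-in that imitates the "already running" guard so the example runs without Crawlee or a browser. With Crawlee, the factory would presumably return new PlaywrightCrawler({ requestHandler }) instead.

```typescript
// Minimal shape shared by a real crawler and the stand-in below (hypothetical).
interface CrawlerLike {
  run(urls: string[]): Promise<void>;
}

// Hypothetical stand-in that refuses a second run() while one is in progress.
class StubCrawler implements CrawlerLike {
  private running = false;
  private handle: (url: string) => void;

  constructor(handle: (url: string) => void) {
    this.handle = handle;
  }

  async run(urls: string[]): Promise<void> {
    if (this.running) {
      throw new Error("This crawler instance is already running");
    }
    this.running = true;
    try {
      for (const url of urls) {
        await Promise.resolve(); // yield, as real async work would
        this.handle(url);
      }
    } finally {
      this.running = false;
    }
  }
}

// Build a new crawler for every job instead of caching one in a shared variable.
async function runJob(
  createCrawler: () => CrawlerLike,
  urls: string[],
): Promise<void> {
  const crawler = createCrawler();
  await crawler.run(urls);
}

const processed: string[] = [];

// Each site gets its own factory, and therefore its own handler.
const makeCrawler = (site: string) => () =>
  new StubCrawler((url) => {
    processed.push(`${site}: ${url}`);
  });

// Two jobs with the same crawler id, started concurrently.
const jobsDone = Promise.all([
  runJob(makeCrawler("amazon.de"), ["https://example.com"]),
  runJob(makeCrawler("amazon.fr"), ["https://example.com"]),
]);
```

Because each job builds its own instance, no instance ever sees two overlapping run calls, and each site can pass its own handler through the factory. One caveat to verify against the Crawlee docs: as far as I know, crawlers started in the same process use the same default request queue unless each is given its own, so two jobs crawling the same URL may need separate queues.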


Apify & Crawlee • 2mo ago


nehalist
