dual-salmon
Apify & Crawlee • 2y ago
6 replies

Error when crawling download link

Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link, and the download link changes every time you visit the page. When I navigate to the download URL manually it works as expected (the file downloads and the tab closes). When I navigate to it with my Playwright crawler, however, I get a 403 error saying "HMAC mismatch", but strangely the file still downloads (I confirmed this by finding the downloaded file in my temp storage cache). I'm not sure if this is some kind of anti-scraping functionality, but if so, why would the file still download?

Here is my Crawlee setup. Since the response is a 403, my request handler never gets called:

  import { Configuration, createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';
  import { MemoryStorage } from '@crawlee/memory-storage';
  import { chromium } from 'playwright-extra';
  import stealthPlugin from 'puppeteer-extra-plugin-stealth';

  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      // First argument is the crawling context, not the request
      errorHandler: (_context, error) => {
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
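For reference, here is a rough sketch of the alternative I've been considering (this is an assumption, not my actual handler): instead of enqueueing the signed download URL, click the button on the article page and capture the file through Playwright's download event. The selector and save path below are placeholders:

```typescript
router.addHandler(requestLabels.ARTICLE, async ({ page, log }) => {
  // Start waiting for the download BEFORE clicking, to avoid a race
  // where the download fires before the listener is attached.
  const downloadPromise = page.waitForEvent('download');
  await page.click('a.download-button'); // hypothetical selector
  const download = await downloadPromise;

  // suggestedFilename() is derived from the server's response headers.
  const fileName = download.suggestedFilename();
  await download.saveAs(`./storage/downloads/${fileName}`);
  log.info(`Saved ${fileName}`);
});
```

The idea is that the signed URL may be single-use (consumed by the browser's own request, hence the HMAC mismatch on the crawler's second fetch), so letting the page trigger the download and catching it via the download event avoids ever re-requesting the URL.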