dual-salmon
Apify & Crawlee • 2y ago
6 replies

Error when crawling download link

Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link, and the download link changes every time you visit the page. When I navigate to the download URL manually it works as expected (the file downloads and the tab closes). When I navigate to it with my Playwright crawler, however, I get a 403 error saying "HMAC mismatch", but strangely the file still downloads (I confirmed this by finding the downloaded file in my temp storage cache). I'm not sure if this is some kind of anti-scraping functionality, but if so, why would the file still download?

Here is my Crawlee setup. Since the response is a 403, my request handler never gets called:

  import { Configuration, createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';
  import { MemoryStorage } from '@crawlee/memory-storage';
  import { chromium } from 'playwright-extra';
  import stealthPlugin from 'puppeteer-extra-plugin-stealth';

  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      // First argument is the crawling context, not the request
      errorHandler: (_context, error) => {
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
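For reference, here is a rough sketch of the alternative I've been considering (this is an assumption, not my actual handler): instead of enqueueing the signed download URL, click the button on the article page and capture the file through Playwright's download event. The selector and save path below are placeholders:

```typescript
router.addHandler(requestLabels.ARTICLE, async ({ page, log }) => {
  // Start waiting for the download BEFORE clicking, to avoid a race
  // where the download fires before the listener is attached.
  const downloadPromise = page.waitForEvent('download');
  await page.click('a.download-button'); // hypothetical selector
  const download = await downloadPromise;

  // suggestedFilename() is derived from the server's response headers.
  const fileName = download.suggestedFilename();
  await download.saveAs(`./storage/downloads/${fileName}`);
  log.info(`Saved ${fileName}`);
});
```

The idea is that the signed URL may be single-use (consumed by the browser's own request, hence the HMAC mismatch on the crawler's second fetch), so letting the page trigger the download and catching it via the download event avoids ever re-requesting the URL.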