rare-sapphire•15mo ago
Error when crawling download link
Hi all,
I'm trying to crawl a website that has PDFs to download across different pages.
An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci
On that page there is a button with a download link. The download link changes every time you visit the page. When I navigate to the download URL manually it works as expected (the file downloads and the tab closes). When I try to navigate to it with PuppeteerCrawler, however, I get a 403 error saying "HMAC mismatch", but strangely the file still downloads (I confirmed this by finding the download cache in my temp storage). I'm not sure if this is some kind of anti-scraping functionality, but if so, why would it still download?
Here is my Crawlee config. Since it is a 403, my handler never gets called.
5 Replies
rare-sapphireOP•15mo ago
Seems to be a cookie issue.
rare-sapphireOP•15mo ago
Not a cookie issue; that was just because when I tested the link in another browser, the cookie obviously didn't match.
It seems to be this issue, where a navigation turns into a download and Chromium throws its toys out of the pram:
https://github.com/microsoft/playwright-java/issues/541
rare-sapphireOP•15mo ago
I think I have a solution. It isn't perfect, but I was able to intercept the download in a preNavigationHook.
I set the max pool size to 1 to ensure that the cookie was picked up on the previous navigation before downloading the file. The hook awaits the response, checks the disposition header and that it is the initial download and not something like a secondary image, sets a flag in the user data, and downloads the file. The trouble is that there is a potential race condition between the flag being set and the net::ERR_ABORTED error being thrown.
Any advice would be appreciated!
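For anyone reading along, the check described above (disposition header plus "is this the initial request and not a secondary resource") could look roughly like the sketch below. This is not the code from the hook itself; `ResponseInfo`, `isAttachment` and `isInitialDownload` are hypothetical names made up for illustration, and it assumes header names arrive lower-cased, which is how Puppeteer's `response.headers()` reports them.

```typescript
// Hypothetical helper types and functions for illustration only;
// they are not part of Crawlee or Puppeteer.
type HeaderMap = Record<string, string | undefined>;

interface ResponseInfo {
  url: string;        // URL the response came from
  headers: HeaderMap; // assumed to use lower-cased header names
}

// True when the Content-Disposition header marks the response as a file download.
function isAttachment(headers: HeaderMap): boolean {
  const disposition = headers["content-disposition"] ?? "";
  return /^\s*attachment\b/i.test(disposition);
}

// True only for the response to the navigation request itself,
// so secondary resources such as images are ignored.
function isInitialDownload(response: ResponseInfo, requestUrl: string): boolean {
  return response.url === requestUrl && isAttachment(response.headers);
}
```

Inside a hook, the inputs would presumably come from `response.url()` and `response.headers()` on the Puppeteer response, with `request.url` as the second argument.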
This should avoid the race condition
Thank you for your description of the problem and solution.