abstract-brown, 2y ago

preNavigationHook needs to listen to the response from the network and change gotoOptions.

Hey y'all, I'm trying to check whether the response is application/pdf. If it is, navigation should time out immediately and, ideally, the request should be skipped.
async (crawlingContext, gotoOptions) => {
    const { page, request, crawler } = crawlingContext
    const queue = await crawler.getRequestQueue()
    const crawler_dto = request.userData.crawler_dto

    if (!request.url.endsWith('.pdf')) {
        gotoOptions.waitUntil = 'networkidle2'
        gotoOptions.timeout = 20000
        await page.setBypassCSP(true)
        await page.setExtraHTTPHeaders({
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        })
        await page.setViewport({ width: 1440, height: 900 })
    }

    page.on('response', async (page_response) => {
        if (page_response.headers()['content-type'] === 'application/pdf') {
            gotoOptions.timeout = 1
        }
    })
},
7 Replies
fascinating-indigo, 2y ago
preNavigationHooks are executed before sending the request, and you cannot directly listen to the response within this hook. Instead, you have a couple of options:
1. Listen to the response in requestHandler: You can handle the response within the requestHandler function, which is called after the request has been sent but before the response is processed.
2. Access the response in postNavigationHook: Alternatively, if you need to access the response after it has been received, you can do so in the postNavigationHook. This hook is called after the navigation has occurred and the response has been received.
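A minimal sketch of the second option, assuming the crawler passes the navigation `response` to post-navigation hooks on the crawling context. The function name `flagPdfResponse` and the `isPdf` key on `request.userData` are made up for illustration. Note that this hook only runs if navigation finishes without timing out.

```javascript
// Hypothetical post-navigation hook: records whether the main response was a PDF.
// Assumes `response` exposes a Puppeteer-style headers() method with lower-case keys.
async function flagPdfResponse({ request, response }) {
    const contentType = response?.headers()['content-type'] ?? '';
    // Prefix match, so "application/pdf; charset=binary" is also detected.
    request.userData.isPdf = contentType.toLowerCase().startsWith('application/pdf');
}

// Possible wiring (not executed here):
// new PuppeteerCrawler({ postNavigationHooks: [flagPdfResponse], ... })
```

The request handler could then check `request.userData.isPdf` and return early.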
abstract-brown (OP), 2y ago
@Hamza thanks for the response. The real problem is that the timeout is 20000 ms, and it elapses before I know the response is application/pdf (from the network), so router.addDefaultHandler doesn't get called for 20 seconds... or at all, since the request times out (it's an iframe-type PDF)? Here is the URL: https://www.taosnm.gov/DocumentCenter/View/3685/Site-Threshold-Assessment-29-PDF (the URL does not end in .pdf, so you can't technically tell it's a PDF until the network response arrives).
fascinating-indigo, 2y ago
Try this:
import { NonRetryableError } from 'crawlee';

preNavigationHooks: [
    async ({ page }) => {
        page.on('response', async (page_response) => {
            if (page_response.headers()['content-type'] === 'application/pdf') {
                throw new NonRetryableError('PDFs are not supported');
            }
        });
    },
]
Lukas Krivka, 2y ago
@Hamza This would crash the process: you cannot throw inside the page.on event handler because nothing awaits it. I thought you could use await page.waitForResponse instead and throw after it, but that would just get stuck because the hook runs before navigation. I think the way to go is to set gotoOptions.waitUntil to 'domcontentloaded' and handle the response type in requestHandler.
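A rough sketch of that suggestion, assuming the request handler receives the main navigation `response` on its context. The names `isPdfResponse`, `fastNavigationHook`, and `handlePage` are hypothetical, and the prefix match is a small change from the strict `===` comparison used earlier in the thread.

```javascript
// Returns true when a Puppeteer-style response reports a PDF content type.
// A prefix match also catches values such as "application/pdf; charset=binary".
function isPdfResponse(response) {
    if (!response) return false;
    const contentType = response.headers()['content-type'] ?? '';
    return contentType.toLowerCase().startsWith('application/pdf');
}

// Hypothetical pre-navigation hook: stop waiting once the DOM has been parsed.
async function fastNavigationHook(_crawlingContext, gotoOptions) {
    gotoOptions.waitUntil = 'domcontentloaded';
    gotoOptions.timeout = 20000; // milliseconds
}

// Hypothetical request handler: return early for PDFs, scrape everything else.
async function handlePage({ request, response, log }) {
    if (isPdfResponse(response)) {
        log.info(`Skipping PDF: ${request.url}`);
        return { skipped: true };
    }
    // ... normal scraping would go here ...
    return { skipped: false };
}

// Possible wiring (not executed here):
// new PuppeteerCrawler({ preNavigationHooks: [fastNavigationHook], requestHandler: handlePage })
```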
abstract-brown (OP), 2y ago
@Lukas Krivka that's what I ended up doing last night, but then I end up with some pages that don't load properly, because I should be using networkidle2. Also, you can't really await page.on('response'), so by the time you get application/pdf you might already be halfway through scraping that "pdf" page. Wait, wtf... now I'm not even getting that response (for the link above); it's only returning the favicon.ico response! So confused. Ahh, it's because the request already happened by the time execution reaches the default handler.
Lukas Krivka, 2y ago
You would have to do the networkidle2 in requestHandler. There is no way to stop the page navigation in the middle.
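One way this could look, as a sketch rather than a definitive implementation: navigate with 'domcontentloaded', check the content type first, and only then wait for the network to settle. It assumes Puppeteer's `page.waitForNetworkIdle`, which by default waits for zero in-flight requests, so it behaves more like networkidle0 than networkidle2. Whether your installed version accepts an option to allow two open connections is something to verify. The handler name and return values are hypothetical.

```javascript
// Hypothetical request handler, assuming navigation used waitUntil: 'domcontentloaded'.
async function handleAfterDomContentLoaded({ page, request, response, log }) {
    const contentType = response?.headers()['content-type'] ?? '';
    if (contentType.toLowerCase().startsWith('application/pdf')) {
        // PDFs are detected before any long wait, so they do not cost 20 seconds.
        log.info(`Skipping PDF: ${request.url}`);
        return 'skipped-pdf';
    }

    // Only HTML pages pay for the network-idle wait.
    // idleTime and timeout are in milliseconds.
    await page.waitForNetworkIdle({ idleTime: 500, timeout: 20000 });

    // ... scraping of the fully loaded page would go here ...
    return 'scraped';
}
```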
