typical-coral · 12mo ago

How to make sure all external requests have been awaited and intercepted?

I'm scraping pages of a website as part of a content migration. Some of those pages make a few POST requests to Algolia (3-4 requests) on the client side, and I need to intercept those requests because I need some data that is sent in the request body. One important thing to note is that I don't know in advance which pages make the requests and which don't. Because of that, I need a way to wait for all the external requests FOR EACH PAGE and only start scraping the page HTML after that. That way, if I've waited for all the requests and still haven't intercepted an Algolia request, it means that specific page didn't make one.

I created a solution that seemed to be working at first. However, after crawling the pages a few times, I noticed that sometimes the Algolia data wouldn't show up in the dataset for a few pages, even though I could confirm in the browser that those pages do make the Algolia request. So my guess is that it finishes scraping the page HTML before intercepting that Algolia request (??). Ideally, it would only start scraping the HTML AFTER all the external requests have finished.

I used Puppeteer because I found addInterceptRequestHandler in the docs, but I could use Playwright if that's easier. Can someone here help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long)
3 Replies
Hall · 12mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
typical-coral (OP) · 12mo ago
Well, I found a simpler solution for this specific need... Instead of trying to intercept the requests, I just grabbed the script elements used to make the request and read the data from them. Something like:
const data = await page.$eval('html', (html) => {
  const script = Array.from(html.querySelectorAll('script')).find((s) => s.src.includes('my-script-name.js'));
  const field = script?.getAttribute('attribute-name');
  return field;
});
However, I'd still be interested to know how I can pause my crawling until all the external requests are made so that I can handle another need I have.
Alexey Udovydchenko
Like this: https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer#implementation-in-crawlee - intercept the requests in preNavigationHooks and add the relevant "waitFor" call in the handler function.
How to optimize Puppeteer by caching responses | Apify Academy
Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code.
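To make that concrete, here is a minimal sketch of the suggested approach, assuming Crawlee's PuppeteerCrawler. Interception is attached in a preNavigationHooks entry (plain Puppeteer request interception is used here; the addInterceptRequestHandler helper from the linked article could be swapped in), and the request handler waits for the network to go idle before reading the page. The 'algolia' URL match, the timings, and the dataset fields are placeholders to adapt:

import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        // Attach the interception before navigation so early requests aren't missed.
        async ({ page, request }) => {
            request.userData.algoliaBodies = [];
            await page.setRequestInterception(true);
            page.on('request', (intercepted) => {
                // Placeholder match: adjust to the real Algolia endpoint.
                if (intercepted.method() === 'POST' && intercepted.url().includes('algolia')) {
                    request.userData.algoliaBodies.push(intercepted.postData());
                }
                intercepted.continue();
            });
        },
    ],
    requestHandler: async ({ page, request }) => {
        // The "waitFor" step: let client-side requests settle before scraping the HTML.
        // Timings are placeholders; tune them to the site.
        await page.waitForNetworkIdle({ idleTime: 1000, timeout: 30000 }).catch(() => {});

        await Dataset.pushData({
            url: request.url,
            title: await page.title(),
            algoliaBodies: request.userData.algoliaBodies,
        });
    },
});

await crawler.run(['https://example.com/some-page']);

With this setup, a page that never calls Algolia simply reaches network idle (or hits the timeout) with algoliaBodies left empty, which distinguishes "no request made" from "request missed".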
