Crawler for SPAs (Single Page Application)

Hi all!

My goal is to scrape a website composed of SPAs (Single Page Applications), and it looks like the existing browser crawlers (e.g. PlaywrightCrawler and PuppeteerCrawler) are not a good fit, since each request is processed in a new page, which wastes resources.

What I need is to open one browser page and execute multiple XHR / fetch requests to their unofficial API until I get blocked, then open a new browser page and continue until all requests have been processed.
Note that I need a browser to pass fingerprint checks and to use the website's internal library to digitally sign each request to their unofficial API.

I'm thinking of solving this by writing a SinglePageBrowserCrawler that extends BasicCrawler and works similarly to BrowserCrawler, but manages browser pages differently.
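To make the idea concrete, here is a minimal sketch of the rotation loop I have in mind, independent of any crawler class. It assumes a hypothetical `openPage` factory that yields an object with `fetch(url)` (which would run the signed request inside the real browser page) and `close()`; both names are placeholders, not Crawlee or Playwright APIs.

```javascript
// Process all URLs through a single page, rotating to a fresh page
// whenever the current one gets blocked. `openPage` is a hypothetical
// factory supplied by the caller.
async function crawlWithPageRotation(urls, openPage) {
  const results = [];
  const queue = [...urls];
  let page = await openPage();
  let pagesOpened = 1;
  while (queue.length > 0) {
    const url = queue[0];
    try {
      // In the real crawler this would be a signed in-page fetch.
      results.push(await page.fetch(url));
      queue.shift(); // dequeue only on success, so blocked URLs are retried
    } catch (err) {
      if (err.message !== 'blocked') throw err;
      await page.close(); // blocked: discard this page and open a new one
      page = await openPage();
      pagesOpened += 1;
    }
  }
  await page.close();
  return { results, pagesOpened };
}

// Mock factory for local testing: each page "blocks" after a fixed
// number of successful requests, simulating the anti-bot limit.
function makeMockOpenPage(limitPerPage) {
  return async () => {
    let used = 0;
    return {
      fetch: async (url) => {
        if (used >= limitPerPage) throw new Error('blocked');
        used += 1;
        return `ok:${url}`;
      },
      close: async () => {},
    };
  };
}
```

With the mock, crawling 7 URLs on pages that block after 3 requests completes all 7 requests across 3 pages, which is the behavior I'd want the crawler to guarantee.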

Is this a good idea? Is there a better way to do it?

Thanks in advance for your feedback!